I recently spent some time researching the proactive disclosure of federal grants and contributions in Canada; here are some of my key findings about how the data is presented:
Each department publishes its own G&C data (i.e. there is no single repository)
The actual path to the endpoint data varies considerably across the landscape
Despite this variation, the path generally follows the form: Proactive Disclosure Overview → List of Quarters → List of all G&Cs Issued → Detailed G&C data
Endpoint data is presented uniformly in an HTML table across the entire domain
Variables include: Recipient Name, Location (City, Province), Date (YYYY-MM-DD), Value ($123,456.78), Purpose (free text), and Comments (free text, often blank)
In addition to the data in the table, the quarter and issuing department can be collected from the webpage itself
All data exists in both official languages
All of the data is already in the public domain
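The uniform presentation described above implies a simple, stable record structure, even though the disclosure pages never define a machine-readable schema. A minimal sketch of that structure in Python (the field names and normalization helpers are my own, inferred from the variables listed above):

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

@dataclass
class GCRecord:
    """One grant or contribution, as disclosed in the endpoint HTML table."""
    recipient: str
    city: str
    province: str
    awarded: date      # disclosed as YYYY-MM-DD
    value: Decimal     # disclosed as $123,456.78
    purpose: str       # free text
    comments: str = "" # free text, often blank
    department: str = ""  # collected from the page, not the table
    quarter: str = ""     # collected from the page, not the table

def parse_value(raw: str) -> Decimal:
    """Normalize a disclosed dollar figure like '$123,456.78'."""
    return Decimal(raw.replace("$", "").replace(",", ""))

def parse_date(raw: str) -> date:
    """Disclosed dates already follow ISO 8601 (YYYY-MM-DD)."""
    return date.fromisoformat(raw)
```

For instance, `parse_value("$123,456.78")` yields a `Decimal` suitable for aggregation, which plain string values are not.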
From Data to a Dataset
While disclosure is clearly important, it may no longer be sufficient. There is a clear public appetite not only for data points that can be observed, but for datasets that can be worked with. What I think I’ve stumbled upon is a core challenge that those at the forefront of open data have no doubt already encountered, namely: how do we assemble public information that is already available into something more useful? In other words, how do we meet the demands of today’s civil society?
To be honest, I didn’t have the answer (or the expertise), so I started to consult broadly with the developer community, looking for a technological solution. I showed them a map of the data and articulated the goal of a single unified dataset. What I found was that the geographic dispersion of the data, coupled with its sheer volume, makes unification a challenge. Furthermore, while they all agree that unification is possible, they also agree that there is a human intelligence component in the collection, that a good AI would reduce but not eliminate that component, and that even small changes to, or inconsistencies across, the landscape could mean hours of recalibrating the program that assembles the data.
After numerous conversations with experts in the field, I’ve come to the conclusion that it may in fact be far easier (and more cost-effective) to amend the way the government publishes the data to the web than to try to assemble it from the web as it is currently published. What I am less certain of is the best way for the organization to go about doing that. My gut reaction is that we could reduce the work burden significantly by moving away from publishing a separate webpage for every grant or contribution awarded (the current model) and instead publish a single comma-separated value (CSV) file from which that information could not only be gleaned, but mashed up and republished. My assumption is that publishing a single feed at the department level wouldn’t entail too much additional work, given that the data must already be consolidated for quarterly publication. In other words, someone inside the organization already has all the data flow to or through them before it hits the web.
After data consolidation at the departmental level, departments could simply syndicate their data set to the newly minted data.gc.ca data portal where they could be assembled into a single government-wide data set, which, in my opinion, is where things get much more interesting.
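Assuming each department published a quarterly CSV with a common header, the portal-side assembly could be almost mechanical: concatenate the feeds and stamp each row with its source department. A sketch, with an illustrative header of my own devising (an actual schema would have to be set by the portal):

```python
import csv
import io

# Illustrative column set; a real schema would be fixed by the data portal.
FIELDS = ["recipient", "city", "province", "date", "value", "purpose", "comments"]

def consolidate(feeds):
    """Merge {department_name: csv_text} into one government-wide CSV,
    adding a 'department' column so each row stays traceable to its source."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["department"] + FIELDS)
    writer.writeheader()
    for department, csv_text in feeds.items():
        for row in csv.DictReader(io.StringIO(csv_text)):
            row["department"] = department
            writer.writerow(row)
    return out.getvalue()
```

The point of the sketch is less the code than the contrast: once the feeds share a header, unification is a few lines, whereas scraping thousands of per-grant webpages requires the fragile, constantly recalibrated program described above.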
Is opening the data sufficient?
When governments provide data to citizens, do they also have a responsibility to ensure data literacy and to provide tools through which shared data can be used by the citizenry? I’ve spoken to people on both sides of the fence, and the question is not easily answered. Naysayers are quick to cite the costs of providing tools and managing ongoing support as justification for their position, whereas proponents are quick to steer the conversation to vulnerable stakeholder populations who aren’t likely to have the expertise required to do anything with the data provided.
In its most basic form, government agencies have long relied on private business (e.g. search engines like Google) to ensure that people can find their data. However, in a world where government data is not just read, but mashed up, analyzed and republished, search could be seen as falling short. More broadly, we find that departments like Statistics Canada have long offered online data manipulation via their CANSIM tables, which allow interested parties to extract a modicum of specificity from large datasets, at a cost. Conversely, other departments, such as Human Resources and Skills Development Canada, offer free data-centric services like the Working in Canada Tool.
Two very different mandates, approaches and uses of public data; yet both operate in the ever-expanding space between government data and citizens.
My position (in case you are wondering) is to bypass the two arguments above entirely by highlighting the importance of understanding how government data is being used. Simply publishing raw data in a CSV file and making it available for download from a departmental website means there is no sure way to tell how it was used, modified or redistributed; there is also no way to ensure that applications built on the data are using the most recent version of that data. This makes engagement around the data difficult, hinders the government’s ability to improve future data offerings, and could lead to unintentional public misinformation via third-party developers. If government agencies want to engage the citizenry around their data offerings and mitigate misinformation risks, they need to make it easy to link, embed, email, share and socialize their data into devices, machines, programs and websites, because this is where the truly transformational opportunities will be.
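One lightweight way to mitigate the stale-copy problem is to publish a version fingerprint alongside each data release, so a third-party application can cheaply check whether it still holds the latest data before republishing it. A sketch of the idea (the fingerprinting scheme here is my own suggestion, not an existing government practice):

```python
import hashlib

def dataset_fingerprint(csv_text):
    """A short, stable identifier for one published release of a dataset.
    Any change to the data produces a different fingerprint."""
    return hashlib.sha256(csv_text.encode("utf-8")).hexdigest()[:16]

def is_stale(local_fingerprint, published_fingerprint):
    """A downstream app compares its cached fingerprint against the one
    currently published next to the download link."""
    return local_fingerprint != published_fingerprint
```

This doesn’t reveal how the data is being used, but it does give every downstream developer a cheap, unambiguous answer to “am I serving the current data?”, which is the first step toward reducing third-party misinformation.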
Case in point: a new model for Grants and Contributions
I want to walk you through a hypothetical, albeit entirely possible, alternative service delivery model using the Grants and Contributions example.
Imagine for a minute that I am (as a private citizen) interested in community development in a northern community and am seeking government assistance. I hop on the department’s website, dive into their G&C data offering and start to poke around. Imagine that the interface allows me to plug in some demographic details about the community within which I live as well as to input some details about the project I want to undertake (e.g. community infrastructure). Now imagine that the system returns all of the community infrastructure grants awarded by the government to communities that share similar demography to that of my own community. That data set is suddenly incredibly useful. It provides me with the names of applicants, their geographical locations, and project overviews. Armed with this information I could reach out to them, learn from them, and build a better application.
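The lookup imagined above amounts to a filtered query over the unified dataset. A rough sketch, with made-up field names and a deliberately crude notion of “similar demography” (community population within a band of the user’s own):

```python
def similar_grants(grants, category, population, tolerance=0.5):
    """Return grants in the given category awarded to communities whose
    population falls within ±tolerance (as a fraction) of the target.
    A real service would weigh more demographic variables than this."""
    lo, hi = population * (1 - tolerance), population * (1 + tolerance)
    return [
        g for g in grants
        if g["category"] == category
        and lo <= g["community_population"] <= hi
    ]
```

Even this toy version shows why the unified dataset matters: the query spans every issuing department at once, something no per-department webpage can offer the applicant.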
Now, imagine that I am the public servant on the receiving end of that application. I’m more likely to be reviewing an application that has some rigour behind it. Furthermore, if it includes evidence garnered from the dataset, I can easily verify the validity of the supporting documentation by diving into the dataset myself, or by using it to locate the richer case files that are produced internally through the process.
In the end, this could save citizens time and money, bolster evidence-based decision making by the government agency, and forge deeper connections between grant recipients by making it easier for them to connect with one another and share information about the process.
Why this is so important
To date the open data landscape has been largely defined by app competitions and hackathons, and while these things are good for the ecosystem, they can’t sustain it alone. This is precisely why I think we need to start thinking more aggressively about how old government data could give rise to new service delivery models.