Opening data is a complex, sometimes messy, undertaking. It’s a technical issue, but like the old Reese’s commercial, there’s no right way to get it done.
Simply understanding the language and process can help lead to a workable approach for opening data in your city or agency.
To help clarify some of the technical aspects of opening data, we recently sat down with the Director of Analytics and Performance Management for the City of Chicago Tom Schenk, Open Data Program Manager for Chicago Jon Levy, and master data plumber and 2013 CfA fellow Dave Guarino.
We’ll go over some lingo and processes in a second, but if you only get this far know that the talent you need to run an open data initiative probably exists in your department or city. Identifying the right people for the job means you need to know what to look for and you need to know enough to identify when your in the dark.
Laying the base for open data – ETL
The basic components of the problem of publishing data are often described in the enterprise world by the acronym “ETL.” The acronym is casually thrown around the data warehousing world, it’s often not used explicitly in the context of governments opening data. ETL stands for three steps in the process of data publishing; Extract, Transform, and Load.
It is the process of taking data from one place, making it useful, and exposing it to the world. Figuring out how to solve these pieces of the problem is core to creating a sustainable and enduring open data system.
- Step 1: Extract – simply means getting data from an original source system where updates are stored and new data gets added, like a 311 or police database.
- Step 2: Transform – involves making changes or additions to the data that make it easier to use, for example taking a parcel identifier and making it into a street address.
- Step 3: Load – getting that transformed data into an easily-accessed place, in our case on an open data site or portal.
Many people who can wield these tools are probably already inside your city and can be found in a data warehousing team, a GIS unit, or a business intelligence operation. Getting an open data site up is a good start, however without the internal human capital and a focus on automation, it can become more burdensome than beneficial.
Diving in – Techniques for Opening Data
ETL describes the structure of the data problem which can be approached in a number of ways. We highlight three general categories of tools:
- Programmatic Approaches – writing code.
- Graphical UGI tools – developed to be user-friendly with drag and drop process
- The inevitable ‘others’
Programmatic Tools and Techniques:
- A lightweight approach for a proof of concept or pilot: just have a data page for the city – not a portal – with links to downloadable datasets. Tap someone in your organization that has programming knowledge like unix or Linux. Have them set up nightly cron (chronological – a scheduler tool) jobs that open a connection to the database with a python call, creating a csv of the desired data and automatically throw it onto your shared server. Then drop a link to its location on the public facing web page.
- Use Python, it’s readable language with a wealth of tools and online databases. csvkit is a great command-line tool for working with CSV files. Many GIS teams are already using Python, so you may have a GIS unit that can be an asset for the open data team.
- SQL and Java are, of course, great general purpose tools for writing code to handle data.
Graphical/GUI Tools: These are generally applied when the programming techniques need to scale and non-programmers (such as those with more analysis backgrounds) might be integrated into the open data process.
- Pentaho Data Integration (Kettle) is a free and open source product with available paid technical support and training.
- SafeFME Data Integration is a paid service. Free trials periods are available for testing workflow or using to prove your concept.
- While we have heard the most of the previous two some others include, Informatica and Talend Open Studio
The inevitable ‘others’:
- OpenRefine (previously google refine) is good for cleaning and exploring data. Its use is optimal for one-off or infrequent data transformations, since it is not designed for automation or reuse.
- Excel and VBA are skills that a lot of people in government, particularly with public policy and economics backgrounds have. It’s possible to write a VBA Macro script to retrieve data from a shared drive.
- Windows Task Scheduler is an easy way to automate processes in a Windows environment (from VBA code to using Socrata’s DataSync tool to upload a CSV to an open data site).
- Your agency may also already have licenses for tools that are not being used, which can be applied to opening data. For example, IBM Cognos is a business intelligence/reporting tool that can tap into many databases and pull out data by dumping everything into a CSV.
Lastly, as Dave points out, remember that the ETL processes is most effective when it is automated. So, in the words of Ron Popeil, “Set it and forget it!”
If you would like to dive deeper into this discussion join us for our session on day three of this year’s Code for America Summit next month – So You’re Ready to Open Data: Now What?
Additional Relevant Resources:
- Sunlight Foundation on Data Plumbing
- Pentaho Tutorial
- From Table to Dataset, Chicago example
- Dave’s original data plumbing post