The last decade has seen extensive discussion and activity around the democratic and economic potential of government standardizing or publishing its information as open data. Politicians from across the spectrum have extolled the accountability benefits of open data, passing legislation and issuing executive orders. Organizations have promoted open data as a key ingredient for government to innovate in performing its missions. And advocates have encouraged open data as a way to help the economy, with McKinsey estimating that opening up more data could result in over $3 trillion of economic benefits.
While the open data movement has achieved some significant successes, from the DATA Act to data.gov, we have not come close to living up to the potential or even the current rhetoric. Much valuable data remains within governments’ walls, especially anything that smacks of personally identifiable or sensitive business information, or where government procured a system in a way that prevents data sharing. The government also provides much of its open data only as downloadable spreadsheets, rather than through application programming interfaces (APIs) that allow for greater use. In fact, some of the most valuable public information, such as financial regulatory filings, is not machine-readable at all. And much of the data that has been released is not really being used to improve the private, nonprofit or government sectors.
Only by honestly recognizing the obstacles can we devise a plan that might live up to open data’s promise. From my experience in both government-wide and individual agencies’ open data efforts, I have seen the following core obstacles:
- Risk Aversion: The decision on whether and how to standardize or publish a government dataset has all the ingredients of a standard principal-agent problem in economics. The principals (here, the public, legislators and to some extent executive branch leaders) want data to be open and will reap the societal or reputational benefits of whatever comes from releasing new data. However, the decision of whether to standardize or open up data is made by agents (here, usually some combination of program managers, information technology professionals and lawyers). The agents get little direct benefit from release. But they bear the substantial cost of the hard work of standardization, and they risk their reputation, added stress or even termination if the data they release is publicized as inaccurate, embarrasses the program, or compromises privacy, national security or business interests. As a result, the agents err far more on the side of keeping data closed than their principals want.
- Binary Approach: The discussion of open data has often been presented in binary terms: data is open (meaning that, at a minimum, it is publicly available in a standardized format for download on a website) or it is not. This type of thinking leaves much data with no access at all, because it takes off the table intermediate options that could provide most of the upside at less cost or risk. The experience of statistical agencies suggests that intermediate options could lead to greater access even for sensitive data. For example, the Centers for Medicare & Medicaid Services allows companies to apply for limited, secure access to transaction data so that they can develop innovative products to improve health outcomes or reduce health spending.
- Technical: Adopting a data standard or releasing a dataset is time-consuming technical work that ranges from cleaning the data to deciding on privacy protections. Governments are increasingly focused on trying to ensure sufficient technological expertise in-house. However, in most places technologically skilled employees are still a bottleneck, not only for open data efforts but also for many competing priorities, from technology modernizations to digital applications. The skills needed to appropriately release more sensitive datasets are even more specialized, requiring people who understand advanced statistical and cryptographic approaches such as synthetic data and secure multi-party computation. Usually, the subject-matter experts who control whether a given dataset will be opened do not have this expertise, which is understandable because it was not historically a necessary or even useful trait.
- Unclear What Data to Focus On: Just as releasing data requires a rare combination of subject-matter and technical expertise, so does figuring out which datasets to prioritize. Seeing how government data could be used requires imagination from people with varied perspectives. No matter how fantastic they are, 30-year government veterans cannot always predict what data might be transformative in the hands of others or when linked with other datasets. (This is not an insult; 30-year private sector executives are unlikely to predict what could be transformative in the government.) This is even truer when deciding the details of how data should be shared. Would Veterans Health Administration data still be useful to outsiders if demographic information were stripped out and you could not link together all of a patient’s visits? What if the government released very granular data that it had statistically “fudged” a little in order to protect privacy?
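The last question can be made concrete. The following is only a minimal sketch, and it assumes (the discussion above does not name a method) that the “fudging” means adding random Laplace noise to published counts, which is the basic mechanism behind differential privacy. The clinic names, the counts and the `epsilon` value are all invented for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent Exponential(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def noisy_counts(counts, epsilon, seed=None):
    """Return a copy of `counts` with Laplace noise added to each value.

    Assumption: each person contributes to at most one count, so the
    noise scale is 1 / epsilon. A smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


# Hypothetical visit counts by clinic; names and numbers are made up.
visits = {"clinic_a": 120, "clinic_b": 45, "clinic_c": 8}
released = noisy_counts(visits, epsilon=0.5, seed=42)
for clinic, value in released.items():
    print(clinic, round(value, 1))
```

The tradeoff is visible in the single parameter: lowering `epsilon` protects individuals more but distorts the released numbers more, and small counts are distorted proportionally more than large ones. Whether the result is still useful is exactly the kind of question outside users are better placed to answer than the agency alone.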
Open data will never be as pervasive or transformational as promised without addressing these obstacles. One potential answer is to centralize more decision-making and technical power, rather than having these decisions and actions made in the thousands of offices that currently “own” the data. At the General Services Administration, we created a Chief Data Officer position to serve this role. Several other agencies have done the same, and Congress is currently considering legislation to require every agency to do so. However, one could go much further across the government. Congress or the Office of Management and Budget could create a commission made up of representatives from across the political spectrum, the privacy field, technology backgrounds, the private sector, nonprofits and academia. This group of repeat players would develop the relevant expertise and make decisions or recommendations on whether and how to open up data. Their composition and process could also provide some lower-case and upper-case political cover when some decisions inevitably prove wrong.
Additionally, the open data community must help agencies understand what data would be most useful, and under what conditions. The government does not have the management or technical bandwidth to release all potentially open data, so prioritization is key. Outside groups also need to help the government determine the next-best alternative when full openness is not possible. Some agencies have invited such input, as with the Demand-Driven Open Data effort at the Department of Health and Human Services, but such efforts must go deeper and wider across government. Understanding outsiders’ perspectives will help governments optimize the tradeoff between releasing data and minimizing risk and cost.
Only by giving the government a guide and taking the risk off individual employees will governments ever be able to release all the data that our society, democracy and economy need.