Step 1: The quickest and easiest way to make data available on the Internet is to publish the data in its raw form (e.g., an XML file of polling data from past elections). However, the data should be well-structured. Structure allows others to successfully make automated use of the data. Well-known formats or structures include XML, RDF and CSV. Formats that only allow the data to be seen, rather than extracted (for example, pictures of the data), are not useful and should be avoided.
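To illustrate what "automated use" of well-structured raw data can look like, here is a minimal Python sketch. The CSV content, column names, and district names are hypothetical and are not drawn from any real dataset; the point is only that a third party can total the figures in a few lines, which would not be possible with a picture of the same table.

```python
import csv
import io

# Hypothetical raw data, as a district might publish it in a CSV file.
# The column names and figures are invented for illustration only.
RAW_CSV = """district,category,amount_usd
Example District A,Instruction,1200000
Example District A,Transportation,150000
Example District B,Instruction,900000
"""


def total_by_district(csv_text):
    """Sum the amount_usd column for each district."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        district = row["district"]
        totals[district] = totals.get(district, 0) + int(row["amount_usd"])
    return totals


if __name__ == "__main__":
    for district, total in total_by_district(RAW_CSV).items():
        print(district, total)
```

In practice the text would be read from the published file rather than from an inline string.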
Step 2: Create an online catalog of the raw data (complete with documentation) so people can discover what has been posted.
These raw datasets should be reliably structured and documented; otherwise, their usefulness is negligible. Most governments already have mechanisms in place to create and store data (e.g., Excel, Word, and other software-specific file formats).
Posting raw data, with an online catalog, is a great starting point, and reflects the next-step evolution of the Internet – “website as fileserver”.
Step 3: Make the data both human- and machine-readable:
- enrich your existing (X)HTML resources with semantics, metadata, and identifiers;
- encode the data using open and industry standards – especially XML – or create your own standards based on your vocabulary;
- make your data human-readable by either converting to (X)HTML, or by using real-time transformations through CSS or XSLT. Remember to follow accessibility requirements;
- use permanent, patterned, and/or discoverable “Cool URIs”;
- allow for electronic citations in the form of standardized hyperlinks (anchor/id links or XLinks/XPointers).
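The list above can be sketched in code. The following Python example assumes a small, hypothetical XML vocabulary (the `expenditures`, `item`, `label`, and `amount` element names are invented here, not taken from any standard) and converts it into an HTML table. Each row carries an `id` attribute, so a citation can link directly to a single figure with a fragment such as `#instruction`. It is one possible approach, not the method the W3C document prescribes.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical machine-readable source; element names are assumptions.
XML_DATA = """<expenditures district="Example District A" year="2010">
  <item id="instruction"><label>Instruction</label><amount>1200000</amount></item>
  <item id="transportation"><label>Transportation</label><amount>150000</amount></item>
</expenditures>"""


def to_html_table(xml_text):
    """Render the XML as a human-readable HTML table with citable row ids."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall("item"):
        rows.append(
            '<tr id="{}"><td>{}</td><td>{}</td></tr>'.format(
                escape(item.get("id")),
                escape(item.findtext("label")),
                escape(item.findtext("amount")),
            )
        )
    caption = escape("{} ({})".format(root.get("district"), root.get("year")))
    return "<table><caption>{}</caption>\n{}\n</table>".format(
        caption, "\n".join(rows)
    )


if __name__ == "__main__":
    print(to_html_table(XML_DATA))
```

The same XML file stays available for machines, while the generated table serves human readers.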
These steps will help the public to easily find, use, cite and understand the data. The data catalog should explain any rules or regulations that must be followed in the use of the dataset. Also, the data catalog itself is considered “data” and should be published as structured data, so that third parties can extract data about the datasets. Thoroughly document the parts of the web page, using valid XHTML, and choose easily patterned and discoverable URLs for the pages. Also syndicate the data for the catalog (using formats such as RSS) to quickly and easily advertise new datasets upon publication.
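As a rough sketch of syndicating the catalog, the Python example below builds a minimal RSS 2.0 feed from a list of catalog entries. The dataset titles, the `data.example.org` URLs, and the `catalog_to_rss` function name are all hypothetical; a real catalog would likely also carry publication dates and licensing information.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog entries; titles and URLs are placeholders.
CATALOG = [
    {
        "title": "District budget 2010",
        "link": "https://data.example.org/datasets/budget-2010",
        "description": "Adopted budget by category, CSV and XML.",
    },
    {
        "title": "District expenditures 2009",
        "link": "https://data.example.org/datasets/expenditures-2009",
        "description": "Actual expenditures by category, CSV and XML.",
    },
]


def catalog_to_rss(entries, feed_title, feed_link):
    """Build a minimal RSS 2.0 document announcing the catalog's datasets."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Newly published datasets"
    for entry in entries:
        item = ET.SubElement(channel, "item")
        for field in ("title", "link", "description"):
            ET.SubElement(item, field).text = entry[field]
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(catalog_to_rss(CATALOG, "Example data catalog", "https://data.example.org/"))
```

Because the feed is itself structured data, third parties can poll it to learn about new datasets without scraping the catalog page.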
Note: this is a cross-post from my blog,
The following is a letter I sent to members of the Colorado General Assembly regarding HB10-1036, which calls for school districts to publish financial data on-line. That is a good thing, but how the data is published is important. This issue is of particular importance in light of a recent Denver Post article on school district spending. The bill will be up for hearing on Thursday, March 11, 2010. The hearing starts at 1:30 pm in Senate Committee Room 354, and the bill is the second item on the Senate Education committee’s calendar. You may listen to the hearing here.
My issue is with the fiscal analysis and its assumption that schools will (or should?) publish the district financial data in PDF. I am not trying to pick on PDF; my concern is with the PDF creation process. (As with many things, the problem is a user issue, not a technology issue.) Please read my letter and, if you can, contact the committee members (list here) and the Senate sponsor, Senator Chris Romer.
– – – – – – – –
Dear Senator or Representative:
I am writing you today in support of HB10-1036, the Public School Financial Transparency Act. However, I would encourage the Colorado General Assembly to more explicitly recommend the use of open standards based technologies when publishing government data. This bill is a good step in furthering financial transparency and in increasing public accessibility to financial data. As the bill declares, all Coloradans have an interest in knowing how moneys are being expended in the pursuit of quality public education. A critical issue for public accessibility to financial data is how the data is published on the Web. The fiscal note to HB10-1036 states that “it is assumed that financial documents can be electronically converted to portable document format (PDF) . . . and posted to online at minimal cost.” This statement is correct in both respects. There is no doubt that providing PDF versions of these documents on-line would be a step in the right direction and would give citizens access to information that is not easily accessible today. Furthermore, PDF can be one of the most flexible and capable human-readable electronic formats ever devised. However, in many cases the process of creating PDFs limits the usefulness of the data they contain. Therefore, publishing PDFs to the exclusion of other formats limits the value of government data.
I support this bill’s intended goal of giving citizens access to public information that would otherwise be relatively inaccessible. However, open government data and transparency are about more than accessibility. In fact, the W3C e-Government Interest Group’s (e-Gov IG) draft document on “Publishing Open Government Data” states that “sharing government data enables greater transparency; delivers more efficient public services; and encourages greater public and commercial use and re-use of government information.” What PDF provides in accessibility it can lack in usability and re-usability. That is why PDF-only publication, or a strong reliance on PDF versions of government data, should be augmented with other formats. I do not wish to belabor the pitfalls of publishing open government data in PDF. Instead, I want to share what steps can be taken to provide complete openness and transparency to government information.
The W3C, the Sunlight Foundation, and other open government advocates recommend that governments use open standards based technologies when publishing data. Furthermore, in some cases the data or information that is converted to PDF is already in an open format, such as XML. The W3C e-Gov IG’s Publishing Open Government Data document makes the following initial recommendations for publishing government data:
The ultimate goal is to make any data published by government both human and machine readable. Machine readability is important because it allows interested parties to more easily parse the data. Furthermore, machine readability is important because it helps to create opportunities for citizens and organizations to develop new and creative tools to give the data even greater value.
The use of PDF in government and in the private sector is persistent. Therefore, when a PDF is created, it is highly advisable to include metadata, file attachments, and other features that add value to the document and make the data in the PDF more machine readable. If PDF is going to be the dominant form of publication, then the creation process should aim for greater interoperability to further the goal of usability and re-usability.
Again, I support this legislation’s goal of creating transparency and openness in public school finances. However, I would strongly encourage the Colorado General Assembly to more explicitly recommend the use of open standards based technologies when publishing any government data.