Both IT gurus and their non-specialist colleagues need the ability to collect, analyze and act on data, but there’s a significant obstacle to that level of access: Most data exists in unconventional, unstructured formats. In fact, up to 90% of the world’s data is unstructured, found in PDFs, emails, Microsoft Word documents, social media threads, photos, videos, sensor readings, audio files and the like. Because each type of unstructured data stores information differently, it doesn’t integrate well with the others, and certainly not with structured formats such as Microsoft Excel spreadsheets.
When systems expect to work with data in a certain format, a mix of incompatible structures forces employees to perform extra, time-consuming steps — parsing, cleaning, converting or manually extracting data, for example. That creates friction, and manual efforts are more prone to mistakes than automated alternatives.
“Data is often locked in the different formats, with no access control for [people who] protect the data,” explains Rebecca Cai, Hawaii’s Chief Data Officer. “That creates a challenge to share the data and to perform some analysis across departments, because there are a lot of business problems that require cross-departmental data.”
Standardization and automation can help solve the problem, however. Three practices are a good place to start:
1 — Use a Small Set of Agreed-Upon Formats
Defining a few preferred formats will make data easier to access and analyze. For example, Comma-Separated Values (CSV) files, which use plain-text formatting, are compatible with numerous software applications and are simple to create, edit and transfer. The JavaScript Object Notation (JSON) interchange format is a text-based, human-readable way to move data between web clients and web servers.
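To make the idea concrete, here is a minimal sketch of moving the same records from CSV to JSON using only Python’s standard library. The sample rows and the field names (permit_id, department, status) are hypothetical, chosen purely for illustration.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text (with a header row) into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps(list(reader), indent=2)


# Hypothetical sample data, for illustration only.
sample = "permit_id,department,status\n101,Health,open\n102,Transportation,closed\n"
print(csv_to_json(sample))
```

Note that CSV carries no type information, so every value arrives in the JSON output as a string; an agency that needs numbers or dates would add an explicit conversion step.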
2 — Automate Data Transfers
Data pipeline tools that follow the extract, transform, load (ETL) or extract, load, transform (ELT) pattern automatically convert different inputs into a single data structure. ETL reshapes the data before loading it into the destination system; ELT loads the raw data first and transforms it there. Other solutions include application programming interfaces (APIs), which are defined interfaces that let applications, platforms and services exchange data with one another.
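As a rough illustration of the ETL pattern, the sketch below pulls records from a CSV source and a JSON source that use different field names, maps both onto one shape, and loads the result into an in-memory SQLite table. The inputs, the field names and the staff table are all assumptions made up for this example; a production pipeline would typically rely on a dedicated tool rather than a hand-written script.

```python
import csv
import io
import json
import sqlite3

# Hypothetical inputs: the same kind of record arriving in two formats.
CSV_INPUT = "Name,Dept\nAna Lee,Health\n"
JSON_INPUT = '[{"full_name": "Ben Kim", "department": "Parks"}]'


def extract_transform():
    """Read both inputs and map them onto one (name, department) shape."""
    rows = []
    for rec in csv.DictReader(io.StringIO(CSV_INPUT)):
        rows.append((rec["Name"].strip(), rec["Dept"].strip()))
    for rec in json.loads(JSON_INPUT):
        rows.append((rec["full_name"].strip(), rec["department"].strip()))
    return rows


def load(rows):
    """Load the unified rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staff (name TEXT, department TEXT)")
    conn.executemany("INSERT INTO staff VALUES (?, ?)", rows)
    conn.commit()
    return conn


conn = load(extract_transform())
print(conn.execute("SELECT name, department FROM staff ORDER BY name").fetchall())
```

Once both sources share one table, a single query can answer questions that previously required opening each file separately.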
3 — Create Clear Rules for Organizing Data
Such rules include guidelines for labeling and storing data, along with rules for metadata, which is data about the data (e.g., its author, file size, and dates of access and modification). Metadata helps agencies track their information. And keeping a data catalog (essentially a library catalog of an organization’s data) gives employees a window into the agency’s assets.
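A data catalog can start very small. The sketch below, which assumes nothing beyond Python’s standard library, walks a folder and records basic metadata for each file. The catalog_entry and build_catalog helpers, the field names and the “Records Office” author are hypothetical; the author is passed in by hand because file systems do not record it in a portable way.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def catalog_entry(path: Path, author: str) -> dict:
    """Build one catalog record from file-system metadata."""
    info = path.stat()
    return {
        "name": path.name,
        "format": path.suffix.lstrip(".").lower(),
        "author": author,  # supplied by the caller, not read from the file
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


def build_catalog(folder: Path, author: str) -> list:
    """Catalog every regular file directly inside `folder`, sorted by name."""
    return [catalog_entry(p, author) for p in sorted(folder.iterdir()) if p.is_file()]


# Demo with a throwaway folder and a hypothetical file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "permits.csv").write_text("permit_id,status\n101,open\n")
    catalog = build_catalog(Path(tmp), author="Records Office")
    print(json.dumps(catalog, indent=2))
```

Because the catalog itself is written out as JSON, it follows the same agreed-upon formats as the data it describes.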
A version of this article appears in our new guide “How to Make Gov Data Accessible and AI-Ready.”
