Tools to Clean Your Data

GovLoop’s online training addressed the concept of “bad data” with a panel of experts:

Bobby Caudill, Global Industry Marketing, U.S. Public Sector, Informatica
Murthy Mathiprakasam, Principal Product Marketing Manager, Informatica
Lori Walsh, Chief, Center for Risk and Quantitative Analysis, U.S. Securities and Exchange Commission

Titled “Beware Bad Data,” this online training highlighted how modern technology and tools can help government agencies transform their bad data into the knowledge essential to informing policy.

Throughout the training, the panel returned repeatedly to the tools used to turn bad data good. Which tool fits depends largely on the data problem and on whether you need to fix the data or work around it.

Here is a deeper dive into those tools based on the phase of the data analysis process:

Phase 1: Ingestion

Informatica offers a full range of tools and services to reduce the time it takes to bring in data. For example, Vibe Data Stream is a tool applicable to different web services such as Amazon Kinesis and Machine Data. The tool is “an industrial strength, automated data ingestion engine that connects real-time data from a wide variety of sources for processing.”

Phase 2: Storage/Processing

Walsh and her team use Software as a Service (SaaS) tools to fix data that might have inconsistent formatting. Delivered from the cloud, SaaS is scalable and flexible and can save time and money.

Additionally, Walsh uses basic data improvement techniques to create comparable data. For example, in a large list of cities, some entries might be written as “New York City” while others are written as “New York, NY”. These fields must be standardized before any patterns can be identified.
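The city example can be sketched in a few lines of Python. This is only an illustration of the idea, not the method Walsh's team uses: the lookup table, its entries, and the function names normalize_key and standardize_city are all assumptions made for the example.

```python
import re

# Hypothetical lookup table: each known variant (lowercased, punctuation
# removed) maps to one canonical spelling. The entries are illustrative only.
CANONICAL_CITIES = {
    "new york city": "New York, NY",
    "new york ny": "New York, NY",
    "nyc": "New York, NY",
}


def normalize_key(raw: str) -> str:
    """Lowercase the text, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    return " ".join(cleaned.split())


def standardize_city(raw: str) -> str:
    """Return the canonical spelling, or the trimmed input if it is unknown."""
    return CANONICAL_CITIES.get(normalize_key(raw), raw.strip())


if __name__ == "__main__":
    for value in ["New York City", "new york, NY ", " Boston "]:
        print(repr(value), "becomes", repr(standardize_city(value)))
```

Unknown values pass through unchanged apart from trimming, so the table can grow gradually as new variants turn up in the data.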

Informatica also offers different types of transformation techniques to develop alternate data visuals for each unique need.

Phase 3: Moving/Securing Data

Mitigating the risks of moving data requires the careful use of tools so that all information is protected. Informatica’s data migration methodology is useful in reducing risks while maintaining the overall integrity of the data.

The core products discussed included Data Quality, a tool that delivers data quality in a unified platform to all key stakeholders, and Data Masking, which ensures that sensitive data is hidden from people who are not authorized to view it.
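To make the masking idea concrete, here is a minimal sketch in Python. It does not represent how Informatica's Data Masking product works; the function names mask_value and view_record, the field names, and the rule of keeping the last four characters visible are assumptions chosen for illustration.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of a sensitive string."""
    if visible <= 0 or len(value) <= visible:
        # Nothing can safely be shown, so mask every character.
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


def view_record(record: dict, sensitive_fields: set, authorized: bool) -> dict:
    """Return a copy of the record, masking sensitive fields for unauthorized viewers."""
    if authorized:
        return dict(record)
    return {
        key: mask_value(str(val)) if key in sensitive_fields else val
        for key, val in record.items()
    }


if __name__ == "__main__":
    # Made-up sample record, for illustration only.
    record = {"name": "A. Smith", "ssn": "123-45-6789"}
    print(view_record(record, {"ssn"}, authorized=False))
    print(view_record(record, {"ssn"}, authorized=True))
```

The original record is never modified; each viewer receives a copy, which keeps the unmasked values in one place.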

Phase 4: Analysis

Walsh uses the macro-building capabilities in Excel spreadsheets to create repeatable processes across data sets. When you receive a data set and don’t necessarily know where it came from, you can automate the cleaning process. As a result, the process becomes more efficient and the results more consistent.
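The same principle of a repeatable cleaning process can be shown outside Excel. The sketch below uses Python rather than the Excel macros Walsh described, and the three cleaning steps and all function names are assumptions picked for the example; the point is only that a fixed, ordered list of steps is applied the same way to every incoming data set.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]
Step = Callable[[List[Row]], List[Row]]


def trim_whitespace(rows: List[Row]) -> List[Row]:
    """Strip leading and trailing spaces from every value."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]


def drop_blank_rows(rows: List[Row]) -> List[Row]:
    """Remove rows in which every value is empty."""
    return [row for row in rows if any(row.values())]


def drop_duplicates(rows: List[Row]) -> List[Row]:
    """Keep only the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# The repeatable process: the same steps, in the same order, every time.
CLEANING_STEPS: List[Step] = [trim_whitespace, drop_blank_rows, drop_duplicates]


def clean(rows: List[Row], steps: List[Step] = CLEANING_STEPS) -> List[Row]:
    """Apply each cleaning step in order and return the cleaned rows."""
    for step in steps:
        rows = step(rows)
    return rows


if __name__ == "__main__":
    raw = [{"city": " Boston "}, {"city": "Boston"}, {"city": "  "}]
    print(clean(raw))
```

Because the steps live in one list, adding a new rule means appending one function rather than editing each data set by hand.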

To reduce the impact of not-so-good data on their results, Walsh’s team also uses visualization and text-based tools that search more deeply, which avoids the need to clean data intensively. Systems such as Palantir and i2 support robust searches that don’t rely on pristine, high-quality data.
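A search that tolerates imperfect data can be illustrated with approximate string matching. The sketch below uses Python's standard difflib module as a stand-in; it says nothing about how Palantir or i2 work internally, and the function name fuzzy_search, the sample names, and the 0.6 similarity cutoff are assumptions for the example.

```python
import difflib
from typing import List


def fuzzy_search(query: str, records: List[str],
                 limit: int = 3, cutoff: float = 0.6) -> List[str]:
    """Return up to `limit` records that approximately match the query.

    Matching ignores case, so misspelled or inconsistently capitalized
    entries can still be found without cleaning the records first.
    """
    # Map each lowercased form back to the first original spelling seen.
    lookup = {}
    for record in records:
        lookup.setdefault(record.lower(), record)
    matches = difflib.get_close_matches(
        query.lower(), list(lookup), n=limit, cutoff=cutoff
    )
    return [lookup[match] for match in matches]


if __name__ == "__main__":
    # Made-up sample names, for illustration only.
    names = ["John Smith", "Jane Doe"]
    print(fuzzy_search("Jon Smith", names))
```

Raising the cutoff makes the search stricter, while lowering it returns more candidates at the cost of more false matches.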

To learn more about how you can turn bad data good, listen to the whole online training on-demand! Also, check out the complete training recap!

