
Is Big Data the Same as Dirty Data?

Conversations on Big Data is a series of discussions, designed by the Partnership for Public Service and the IBM Center for The Business of Government, about using analytics in creative and interesting ways; its aim is to broaden perspectives on quantitative analytics. Wikipedia defines Big Data as “…an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using traditional data processing applications.”[1] Other sources support that definition: “Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.”[2] Big data “size is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.”[3] Those who have been working with analytics for many years may see an application of Moore’s Law in the evolution of applications that can now generate data on this scale. Yet those same analytics users have long dealt with the challenges of defining and implementing data governance standards to ensure, among other things, a high level of data quality and integrity. How do we ensure the veracity needed for credible analytics while at the same time leveraging Big Data, which by definition is a massive amount of raw data?

According to the officials who have participated in our Conversations, dirty data is better than no data. Dean Silverman of the Office of Compliance Analytics at the IRS says that “there’s an endless amount of data and if you don’t know where you’re going, any road will take you there.” In other words, simply taking the time to decipher and understand the structure of your data can provide insight and lead to deeper, more structured analytics. And if you need to address a specific issue or problem, Dean says that, too, “starts with the data.” The data governance best practices mentioned above, which were developed to maintain the quality of structured data, can be extended to unstructured data such as PDF documents and image formats. Hadoop, an open source framework developed to support Big Data, is used to capture unstructured data and “mine” it for patterns that can be used to build a data “map.” That map can provide the basis for integrating the data into an enterprise model and potentially for determining correlations and relationships. Over time, Big Data tools store the history of those patterns and can continuously apply them as a huge volume of variable data in a variety of formats continues to stream into the organization at high velocity. Variations in those patterns can be detected while the data stream is in motion, offering opportunities to influence behavior in real time. The patterns detected within the raw data itself provide the standards for evaluation.

The insight required to apply analytics to Big Data is acquired through processes similar to more traditional business intelligence methods. First you have to acquire the data and store it, so that you can analyze it and propose a working model of its structure. Once the data is acquired and modeled, it can be managed according to data governance best practices. That working model will include a history of the patterns in the data, which can be applied in real time to those data sources and streams. As variations occur, the working model is refined until the user community is confident that it fully understands the model. Then the data can be transformed and integrated into enterprise repositories. The working model of a data stream can also be used to identify real-time opportunities to influence behavior by initiating predefined actions whenever specific variations in the patterns occur.
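To make the idea of a working model concrete, here is a minimal Python sketch of one way such a model could watch a numeric stream, flag variations, and trigger a predefined action. Everything in it is an assumption made for illustration: the names `WorkingModel` and `monitor`, the choice of a running mean and standard deviation as the "pattern," and the threshold of three standard deviations. Real Big Data tools use far richer models than this.

```python
import math
from typing import Callable, Iterable, List


class WorkingModel:
    """Hypothetical working model: a running mean and variance of a stream."""

    def __init__(self, threshold: float = 3.0, warmup: int = 5) -> None:
        self.threshold = threshold  # assumed cutoff, in standard deviations
        self.warmup = warmup        # observations needed before flagging
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0              # running sum of squared deviations

    def is_variation(self, value: float) -> bool:
        """Return True if the value departs from the pattern seen so far."""
        if self.count < self.warmup:
            return False
        std = math.sqrt(self._m2 / (self.count - 1))
        if std == 0.0:
            return value != self.mean
        return abs(value - self.mean) / std > self.threshold

    def refine(self, value: float) -> None:
        """Fold a new observation into the model (Welford's update)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)


def monitor(
    stream: Iterable[float],
    model: WorkingModel,
    action: Callable[[float], None],
) -> List[float]:
    """Check each value against the model, act on variations, then refine."""
    flagged = []
    for value in stream:
        if model.is_variation(value):
            action(value)  # the predefined action, supplied by the caller
            flagged.append(value)
        model.refine(value)
    return flagged


if __name__ == "__main__":
    readings = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
    monitor(readings, WorkingModel(), lambda v: print("variation:", v))
```

In this sketch every value, including a flagged one, is folded back into the model, which mirrors the point that the working model keeps being refined as variations occur. Whether outliers should update the baseline is a design choice that depends on the data.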

Visualization tools give analysts access to Big Data, letting them take note of the variations and refine their data and logic models accordingly. This is an extension of the same principles that financial services companies have been using for over a decade to promote greater use of credit cards: they find patterns in customers’ spending habits and use those patterns to drive marketing campaigns. Changes in the patterns indicate opportunities; to understand the opportunities and communicate the benefits they can provide, you first need to understand the data and the patterns.

In conclusion, for all its Volume, Variability, Variety, and Velocity, we discovered that leveraging Big Data requires good, old-fashioned data management best practices. Before you can effectively use data as a resource, you must understand it. Data profiling is not a new concept, but it is critical when it comes to Big Data, because at that scale profiling cannot be done manually. Automated data profiling gives Big Data analysts information that improves their maps and working models of the data. Other Big Data tools continuously scrutinize incoming data streams, refining the models and furthering understanding of the content. As understanding grows, data management standards for Big Data can be developed and applied to ensure that Big Data is intelligently integrated into traditional data repositories at an enterprise level while simultaneously leveraging its power in real time.
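As a rough illustration of what automated data profiling can report, the Python sketch below summarizes each field in a set of records: how many values are missing, how many distinct values appear, and which data types show up. The function name `profile`, the sample records, and the rule that `None` and empty strings count as missing are all assumptions made for this example; they are not taken from any particular profiling tool.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def profile(records: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Summarize every field that appears in any record."""
    rows = list(records)
    fields = sorted({name for row in rows for name in row})
    report: Dict[str, Dict[str, Any]] = {}
    for name in fields:
        values = [row.get(name) for row in rows]
        # Assumption: None and empty strings both count as missing.
        present = [v for v in values if v is not None and v != ""]
        report[name] = {
            "rows": len(rows),
            "missing": len(values) - len(present),
            "distinct": len(set(map(repr, present))),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report


if __name__ == "__main__":
    sample = [
        {"id": 1, "state": "VA", "amount": 10.5},
        {"id": 2, "state": "", "amount": "n/a"},
        {"id": 3, "amount": 7.0},
    ]
    for field, stats in profile(sample).items():
        print(field, stats)
```

Even on three records, a summary like this surfaces the kind of "dirty" detail an analyst would want to know before modeling: a field that is mostly missing, and a numeric field that sometimes arrives as text.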

Check out past Conversations on Big Data for more tips and insights on using big data to improve your organization’s mission effectiveness.

[1] Wikipedia

[2] Snijders, C., Matzat, U., & Reips, U.-D. (2012). ‘Big Data’: Big gaps of knowledge in the field of Internet science. International Journal of Internet Science, 7, 1-5. http://www.ijis.net/ijis7_1/ijis7_1_editorial.html

[3] Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, Samee Ullah Khan, The rise of “big data” on cloud computing: Review and open research issues, Information Systems, Volume 47, January 2015, Pages 98-115, ISSN 0306-4379, http://dx.doi.org/10.1016/j.is.2014.07.006
