On Monday, December 5, Bob Gourley went on the Enterprise CIO Forum to explain Big Data and why it matters. First, he defined Big Data simply as the data your organization cannot currently analyze. Though some technologists give more precise definitions, this one sums up the challenge enterprises now face. If you can deal with all of your data now, you don’t have a Big Data problem, but as soon as you have more data than you can manage well enough to find the answers you need in time to use them, you need a Big Data solution. Structured data in relational databases can also be Big Data, but what we’re really talking about is information whose type and volume exceed what traditional methods can handle. New solutions include MapReduce, originally developed at Google to analyze and index the entire Internet, and Hadoop, an open source project built on those methods.
We see Big Data solutions daily through tools such as Twitter and LinkedIn, which analyze massive amounts of information from user accounts and actions to perform searches and generate content in real time. Twitter, for example, looks at millions of tweets fast enough to determine which topics are currently trending, and LinkedIn can analyze your network and profile to suggest people you may want to connect with.
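To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python of how hashtag counts might be computed in the map, shuffle, and reduce style. It is an illustration only, not Twitter’s actual method or Hadoop’s API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `trending`) and the sample tweets are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(tweet):
    """Emit a (hashtag, 1) pair for every hashtag found in one tweet."""
    pairs = []
    for word in tweet.split():
        tag = word.lower().rstrip(".,!?")  # normalize case, drop trailing punctuation
        if tag.startswith("#") and len(tag) > 1:
            pairs.append((tag, 1))
    return pairs


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def trending(tweets, top_n=2):
    """Run the three phases in sequence and return the most frequent hashtags."""
    mapped = chain.from_iterable(map_phase(t) for t in tweets)
    grouped = shuffle_phase(mapped)
    counts = [reduce_phase(k, v) for k, v in grouped.items()]
    # Sort by count (descending), then by hashtag so ties come out in a stable order.
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical sample data, for illustration only.
sample_tweets = [
    "Learning #hadoop today",
    "#Hadoop and #bigdata go together",
    "More #bigdata talk, plus #hadoop",
    "Lunch time",
]

if __name__ == "__main__":
    print(trending(sample_tweets))
```

In a real cluster the map and reduce steps would run in parallel on many machines, each handling a slice of the data, and the framework would handle the grouping in between. That separation is what lets the same simple logic scale to millions of records.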
Big Data doesn’t have one tool or even a toolset; it has a framework. For example, there is a growing and evolving ecosystem around Hadoop, shepherded by Cloudera, that includes a variety of capabilities for structured and unstructured data. Hadoop itself allows the use of commodity hardware to efficiently store and process massive amounts of unstructured data. According to Gourley, Hadoop is the essence of the current Big Data phenomenon, though there are other niche solutions out there. At the moment, government, finance, energy, and science are all turning to the Hadoop family for their Big Data solutions. Hadoop, formally known as Apache Hadoop, is an open source project managed by the Apache Software Foundation, and is combined with software such as HBase, Hive, and Flume in distributions such as the popular Cloudera’s Distribution including Apache Hadoop.
Big Data has created a “Cambrian Explosion” of capabilities and uses. For example, by analyzing social media and messages, organizations have distilled members’ “digital characters” to find criminals, rogue traders, and unusual behaviors. Other use cases include detecting cyber attacks and improving internal search results and recommendations to clients, as the federal government’s USASearch does.
Initially, major IT companies were cautiously exploring Big Data solutions, but now many have jumped on the Hadoop bandwagon. Microsoft showed great agility when it recently abandoned its proprietary software Dryad to contribute to the open source Hadoop community, and many major companies, such as SGI, IBM, EMC, and Dell, have their own Hadoop distributions. Users now have the choice to download Cloudera’s distribution and use its management tools to configure and run Hadoop, or to purchase an incredibly powerful piece of hardware from SGI that comes with Hadoop already configured and guaranteed to run. That same firm will sell you training and services while providing patches, making it a good option for an enterprise CIO.
In the next year, expect to see the continuing evolution of management tools for Hadoop clusters. Currently, Cloudera’s configuration manager is the dominant tool, and it will continue to evolve as well. Dozens of firms are also beginning to provide applications on top of Hadoop, which will allow analysts to work with Big Data directly and quickly, without the help of the IT department. And as Hadoop grows more prevalent, now is the time to go out and get training.