Perhaps The Most Important Big Data Presentation From Hadoop World

By Bob Gourley

“Somewhere, something incredible is waiting to be known,” said Carl Sagan. This inspiring quote is a tremendous mantra for those in the Big Data movement, and it served as the central organizing theme of Mike Olson’s keynote at the 2012 Hadoop World/Strata Conference in NYC. Mike highlighted this quote in his motivational keynote because of its perpetual truth. No matter what we learn, we are always on the edge of new knowledge. There is always something else, which makes this a great time to be in this industry. We are poised on the edge of great discoveries for industry and society.

Mike then hit on some examples of how Big Data analysis is transforming society and industry today:

– The Large Hadron Collider at CERN, which generates 27 terabytes of useful data per day, is giving tremendous insights into how the universe works, including the recent discovery of the Higgs boson, the particle that gives mass to matter. Researchers globally have access to this data for their own research. For example, the University of Nebraska-Lincoln has captured this data in a Hadoop cluster and makes it available to researchers around the world. Discoveries will continue.

– Healthcare is full of incredible Big Data use cases and success stories. Many of those come from activities around the human genome. New research has indicated that our genetic makeup is not the only biological determinant that makes us human. Humans have about 1 trillion cells, all containing the same genome, and it is incredibly important to continue researching them. But new research shows that over 10 trillion other cells (mostly bacteria) ride on the human body and affect how our bodies function and even which genes are triggered. This world of epigenetics forces us to ask continually bigger questions in our search for more knowledge, and it is helping us better understand the onset and progress of disease and find new treatments. This includes breakthrough research by the National Cancer Institute, for example. All the breakthroughs in these fields are connected to Hadoop-based projects.

– New predictive models on social matters and policy issues, including analytical models and model-enabled analysis, are also making a difference in the world. Much of this is based on the Hadoop family of capabilities.

– What if we could better measure how we generate and consume electricity, and then understand and act upon those measurements?

The above are a few of the many use cases already in place that rely on these capabilities.

Far more is being done with current Big Data capabilities, and new, more powerful capabilities are coming fast. Community-focused companies like Cloudera and the open source community itself are moving towards simpler, more unified, more efficient Big Data platforms, with a vision of a single repository and comprehensive analytics that can explore it. Capabilities are also now being announced that make it possible to work over Big Data at speed.

The Hadoop framework includes many tools centered around Hadoop, but at its core is the ability to run MapReduce jobs. Although the output of those MapReduce jobs can be put in HDFS and queried in real time, producing that output takes time. Current MapReduce jobs are not real time by any means, so it is hard to ask new questions of Big Data and rapidly iterate on them. In the opinion of many, it is just too slow and in need of serious improvement.
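To make the batch nature of MapReduce concrete, here is a minimal single-machine sketch of the map, shuffle and reduce phases, written in plain Python as a word count. It is an illustration only, not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, run_job) and the sample records are assumptions made up for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(records):
    # The shuffle cannot finish until the map has seen every record,
    # and the reduce cannot start until the shuffle is complete.
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical input, standing in for files stored in HDFS.
counts = run_job(["big data big questions", "big answers"])
print(counts)
```

Even in this toy form, every new question means another full pass over all of the input, which is the latency problem the paragraph above describes.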

This leads to the big announcement Mike Olson and Cloudera made in this video: a capability known as Impala. Cloudera Impala is designed from the ground up with a powerful goal in mind: enabling you to work over big data at the speed of thought. Cloudera has been working on this for two years. It is a 100% open source real-time query engine, fully integrated with Hadoop. It allows you to take advantage of MapReduce and ask speed-of-thought queries of your data. This is a new way to get at your data. And the data doesn’t move: Impala runs on your data in your Hadoop cluster.

Impala has been in a private beta with several customers. One, for example, is a Fortune 500 company specializing in agriculture and genomics, which needed this capability to better automate its data-driven R&D decisions and reduce time to market from years to months. Result: better ways to feed humanity, now and in the future. This is seriously good. Cloudera, Hadoop and Impala have knocked down silos in this crucial area.

Mike ended his keynote by emphasizing that we need to keep asking big (and bigger) questions.

The rest of the conference included many detailed presentations on ways the community is asking and answering these big questions. We will report on more of them here.

[If the embedded video does not show, watch it on YouTube.]