The Future of Hadoop in Bioinformatics

Earlier, I wrote on the use of Hadoop in the exciting, evolving field of Bioinformatics. I have since had the pleasure of speaking with Dr. Ron Taylor of Pacific Northwest National Laboratory, the author of “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics”, on what’s changed in the half-year since its publication and what’s to come.

As Dr. Taylor expected, Hadoop and its “ecosystem,” including MapReduce, are the dominant open source Big Data solution for next generation DNA sequencing analysis. This is currently the sub-field generating the most data and requiring the most computationally expensive analysis. For example, de novo assembly pieces together tens of millions of short reads (which may be only 50 bases long on ABI SOLiD sequencers). To do so, every read needs to be compared to the others, which scales in proportion to n log n, where n is the number of reads. Even assuming reads that are 100 base pairs in length and a human genome of 3 billion base pairs, that is about 30 million reads, and since the base-10 logarithm of 30 million is roughly 7.5, analyzing an entire human genome will take about 7.5 times longer than if it scaled linearly. By dividing the task up across a Hadoop cluster, the analysis will be faster and, unlike other high performance computing alternatives, it can run on regular commodity servers that are much cheaper than custom supercomputers. This, combined with the savings from using open source software, ease of use due to seamless scaling, and the strength of the Hadoop community, makes Hadoop and related software the parallelization solution of choice in next generation sequencing.

In other areas, however, traditional HPC is still more common and Hadoop has not yet caught on. Dr. Taylor believes that this will change in the next year to 18 months, driven by two trends: the growth of Hadoop, its software, and its services, and the evolution of bioinformatics itself.
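As a quick aside on the assembly example above, the 7.5 figure can be checked with a few lines of Python. This is only a sketch of the arithmetic, under two assumptions of mine: that the genome is covered exactly once by non-overlapping reads, and that the logarithm is taken in base 10 (a base-2 logarithm would give a factor closer to 25). The function names are illustrative.

```python
import math

GENOME_BASES = 3_000_000_000  # approximate human genome size used in the text
READ_LENGTH = 100             # read length assumed in the text


def read_count(genome_bases: int, read_length: int) -> int:
    """Reads needed to cover the genome once (assumes no overlap, 1x coverage)."""
    return genome_bases // read_length


def nlogn_overhead(n: int) -> float:
    """Ratio of n*log10(n) work to linear work n, which is simply log10(n)."""
    return math.log10(n)


n = read_count(GENOME_BASES, READ_LENGTH)
print(n)                             # 30000000
print(round(nlogn_overhead(n), 1))   # 7.5
```

Real sequencing runs cover the genome many times over, so the actual number of reads, and with it the overhead, would be higher than this back-of-the-envelope figure.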

The growth of Hadoop, software, and services:

The initial delay in the adoption of Hadoop for Big Data was mostly due to a lack of information and inertia within the community. Those researchers who knew about Hadoop still saw it as untested and were concerned about stability issues. As Hadoop gains exposure and grows more stable through patches and new releases, researchers will become more comfortable using it. Also, new Hadoop-related software will extend its applicability to other areas of bioinformatics. Dr. Taylor gave the example of Mahout, a Hadoop machine learning library that can be used for classification (the automatic labeling of data) and clustering (forming groups of similar data within a larger set), both useful in bioinformatics. The Hadoop and MapReduce paradigm is also being explored for automated reasoning and rule engines, which have tremendous potential. IBM’s Watson, which competed on Jeopardy!, has already used Hadoop to pre-process large unstructured datasets for automated reasoning.
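To make the idea of clustering on MapReduce a little more concrete, here is a minimal sketch of k-means written as a map step and a reduce step. It is plain Python run on a single machine, not Mahout's API or Dr. Taylor's method. The function names and the toy two-dimensional data points are my own assumptions, chosen only to show how the work splits into pieces that a cluster could run in parallel.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit a (centroid_index, point) pair for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_phase(pairs, centroids):
    """Reduce: replace each centroid with the mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    new_centroids = list(centroids)  # a centroid with no points stays where it is
    for idx, members in groups.items():
        new_centroids[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds and return the final centroids."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(points, centroids), centroids)
    return centroids


# Hypothetical toy data: two obvious groups of measurements.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
initial = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans(points, initial))  # [(1.0, 1.5), (9.5, 9.0)]
```

On a real cluster the map step would run on many machines at once, each handling its own slice of the data, and the reduce step would combine their results. That division of labor is what lets this kind of analysis scale to data sets too large for one server.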

The community around Hadoop is also developing, increasing researcher confidence. Already, helpful users, the wealth of related software, and growing availability of support make Hadoop the open-source solution of choice, and new related projects are on the way. Services for larger and more complex deployments are also growing, with Cloudera as the leading provider. Dr. Taylor expects that as projects upgrade their clusters and new clusters come online, more and more will be running Cloudera’s Distribution including Apache Hadoop (CDH), which is free to download, open source, and simplified.

The evolution of bioinformatics:

Currently, Hadoop is used mostly in next generation sequencing because that’s where most of the Big Data is generated. As techniques advance, however, other fields are performing complex analytics on ever-expanding data sets, requiring innovative data solutions. New work on subjects like clustering, classification, and microarrays, which represent a tremendous amount of biological information in two-dimensional arrays, is creating a need for parallelized analysis. High-throughput expression data for genes, proteins, and metabolites is also used for topological network analysis, the inference of biological networks not yet mapped out, and can benefit from Big Data analysis. As Hadoop and software in its ecosystem, like Mahout and HBase, develop these capabilities and researchers develop tests, algorithms, and applications, scientists in bioinformatics will turn more and more to Hadoop to solve new problems. Dr. Taylor predicts an explosion of papers in the next year on applying Hadoop to bioinformatics in novel ways, which will both further the spread of Hadoop and advance the field of bioinformatics.

New projects are also developing that will require Hadoop, such as the Department of Energy’s knowledgebase. The DoE is working to build a predictive understanding of biological systems behavior by using microbial and plant genetic data, high-throughput analysis, modeling, and simulation, with the goal of solving energy and environmental problems. To do so, they are constructing a knowledgebase, a clustered cyberinfrastructure containing data, organizational methods, standards, analysis tools, and interfaces. The DoE knowledgebase will employ Hadoop and cloud computing to provide the bioinformatics community a freely available computational environment. Use and discovery within this space will both continue to advance bioinformatics and encourage the use of Hadoop.