Hadoop for Bioinformatics

Bioinformatics is the application of computer science, in the form of statistics and analytics, to molecular biology. This exciting field is bringing about great breakthroughs, especially in genetics, where computers and algorithms are being used to map genomes. Advances in this field show promise in helping us understand life and are producing knowledge of direct relevance to our future. Researchers are learning how to fight disease, how to tailor cancer treatments to individuals, and how to address many other health-related problems. Bioinformatics research is also being used in fields as diverse as energy research (for example, how to produce fuels from algae) and food production (why are some seeds more resilient than others?). Advances in bioinformatics, however, bring great challenges in processing, storing, and analyzing data. This is a Big Data challenge worthy of study by computer scientists of all disciplines.

The human DNA sequence is roughly 3 billion base pairs long and there are over 58,000 proteins on record, making bioinformatics very computationally expensive. As techniques and equipment improve, researchers are generating exponentially more data, which must be filtered and turned into usable information before scientists can do their work. This step is usually a source of delay. DNA sequencing labs can produce over 100 terabytes of data a week, straining the storage space and processing power of the sequencing community. Simply throwing more computing power at the problem does not work, because algorithms that were not designed for such masses of data do not scale well.

That’s where Hadoop comes in. Consider the search for a match for a certain protein in a process like docking, which means fitting one protein into another. MapReduce can distribute the 58,000 possible proteins across a cluster in the cloud. Researchers then supply the query protein and a regular matching algorithm, and the job is split so that it scales better and the best results come back faster. Any algorithm that runs on a single machine can be used with MapReduce in this way, provided the work divides into independent pieces, and the results can arrive far faster, in some cases by orders of magnitude. Hadoop began to be used in bioinformatics in May 2009 with the introduction of CloudBurst, a parallel read-mapping algorithm optimized for mapping next-generation sequence data to the human genome and other reference genomes. Developed for Hadoop, it aligns short “reads” of DNA so that they can be compared, a difficult task due to insertions and other variation in the genetic sequence. As in the protein example, CloudBurst scales well and can run on clusters rented cheaply through cloud computing.
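The protein-matching example above can be sketched in a few lines of Python. This is only an illustration of the map and reduce steps running in a single process, not Hadoop code and not how any particular tool is implemented. The database entries, the function names, and the toy similarity score (a count of shared 3-letter substrings) are all assumptions made for the sketch; a real pipeline would plug in a proper matching algorithm and let Hadoop run the map step on many machines.

```python
from functools import reduce

# Hypothetical mini "database" of (name, amino-acid sequence) records.
# A real database would hold tens of thousands of entries.
DATABASE = [
    ("P1", "MKTAYIAKQR"),
    ("P2", "GAVLIPFMWK"),
    ("P3", "MKTAYLLKQR"),
    ("P4", "QRQISFVKSH"),
]


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score(query, candidate, k=3):
    """Toy similarity: number of distinct k-mers shared by both sequences.
    Stands in for the 'regular matching algorithm' mentioned in the text."""
    return len(kmers(query, k) & kmers(candidate, k))


def partition(records, n_parts):
    """Split the database into n_parts slices, one per (simulated) worker."""
    return [records[i::n_parts] for i in range(n_parts)]


def map_partition(query, part):
    """Map step: score only this slice and return its local best (score, name).
    An empty slice yields the sentinel (-1, None)."""
    scored = ((score(query, seq), name) for name, seq in part)
    return max(scored, key=lambda pair: pair[0], default=(-1, None))


def reduce_best(a, b):
    """Reduce step: keep whichever local result has the higher score."""
    return a if a[0] >= b[0] else b


def best_match(query, database, n_parts=4):
    """Run the map step on every slice, then reduce to one global best."""
    local_results = [map_partition(query, p) for p in partition(database, n_parts)]
    return reduce(reduce_best, local_results)


if __name__ == "__main__":
    print(best_match("MKTAYIAK", DATABASE, n_parts=2))
```

Because each slice is scored independently, the answer does not depend on how many slices are used. That independence is what allows the map step to be spread across a cluster.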

Since the introduction of CloudBurst, Hadoop has taken off in the bioinformatics community. Crossbow, a software pipeline for whole-genome resequencing analysis, uses Hadoop to compress more than 1,000 hours of computation into only a few hours. Researchers at Indiana University have compared Hadoop and MapReduce favorably with other solutions across several bioinformatics applications, predicting that their influence in the field will continue to grow.

Hadoop deployments that require enterprise-grade availability and interoperability with other systems can use Cloudera’s Distribution of Hadoop (CDH), a simplified, open system made up of the most useful components of the Hadoop ecosystem and available as a free download. For example, the Department of Energy’s Kandinsky, a 68-node cluster at Oak Ridge National Laboratory, runs CDH as an exploratory environment for developing a large biological knowledgebase. Because Cloudera provides support, training, and management apps for Hadoop, its role in bioinformatics is likely to grow as Hadoop becomes even more prevalent and more complex projects are undertaken.

For the bioinformatics community, Hadoop is cost-effective, flexible, and relatively easy to use. In the end, however, the real advantage is in allowing more innovation. As projects become cheaper and faster, more hypotheses can be tested and algorithms developed, lowering the cost of experimentation and the constraints on researchers. Through the introduction of Hadoop, discoveries can be made that revolutionize our understanding of biology, health, and the natural world.