There are many great use cases for Apache Hadoop, the open source framework for scalable, reliable, and distributed computing on commodity hardware, built around the Hadoop Distributed File System (HDFS) and MapReduce. Hadoop delivers search engine results, sequences genomes, and indexes entire libraries of text, but the Million Monkeys Project by Jesse Anderson may be the easiest use case to understand and the most fun.
The project was inspired by the Infinite Monkey Theorem which, in its simplest and most popular form, states that a million monkeys with a million typewriters will, by randomly hitting the keys, eventually recreate the works of Shakespeare. The idea is that, though on any single attempt the chance of a monkey typing a sonnet is essentially zero, over infinitely many attempts it becomes almost certain. Anderson wanted to try this for himself, but he didn’t have a million monkeys, a million typewriters, or infinite time and resources, so instead he used his home computer, Amazon’s Elastic Compute Cloud, and Hadoop to achieve the same results.
Anderson first generated a million virtual monkeys on Amazon’s EC2, which were really pseudorandom number generators that would produce strings of 9 random characters. Anderson had to find a very efficient and reliable pseudorandom number generator because, at that scale, creating the strings was one of the most computationally expensive steps in the process, and he eventually settled on Sean Luke’s Mersenne Twister. Next, he compared each generated string to the entirety of Shakespeare’s work and, if he found the string anywhere, he would mark it in almost real time, creating what he calls “performance art with monkeys and computers.” Comparing a 9-character string with every continuous run of 9 letters in all of William Shakespeare’s 38 works is no small task, and Anderson used a Bloom Filter to reduce CPU usage by 20-30%. The Bloom Filter works by hashing the monkey’s string and checking the hash against a file containing the hashes and offsets of all of Shakespeare. Since hashes are shorter and simpler than the strings, this check is much faster, but because more than one string can produce a given hash, matching hashes don’t guarantee matching strings. If a hash match is found, the strings are then compared character by character.
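The pre-check idea can be sketched in miniature (a hypothetical illustration, not Anderson's actual code; the filter size and text are made up): every 9-letter substring of the text sets one bit in a bit array, and a candidate string is compared exactly only when its hash hits a set bit.

```python
# Sketch of a single-hash Bloom-filter pre-check: the bit array cheaply
# rules out most candidates; exact comparison runs only on possible hits.
FILTER_BITS = 1 << 20  # size of the bit array (illustrative choice)

def build_filter(text, length=9):
    bits = bytearray(FILTER_BITS // 8)
    substrings = set()  # exact substrings, standing in for hashes+offsets
    for i in range(len(text) - length + 1):
        chunk = text[i:i + length]
        substrings.add(chunk)
        h = hash(chunk) % FILTER_BITS
        bits[h // 8] |= 1 << (h % 8)
    return bits, substrings

def maybe_contains(bits, candidate):
    # A clear bit means "definitely absent"; a set bit means "maybe present".
    h = hash(candidate) % FILTER_BITS
    return bool(bits[h // 8] & (1 << (h % 8)))

def is_match(bits, substrings, candidate):
    # Cheap filter first; exact comparison only on a possible hit.
    return maybe_contains(bits, candidate) and candidate in substrings

text = "tobeornottobethatisthequestion"
bits, subs = build_filter(text)
print(is_match(bits, subs, "tobeornot"))   # True: appears in the text
print(is_match(bits, subs, "zzzzzzzzz"))   # False
```

The payoff is the short-circuit: when `maybe_contains` returns False, which it does for the vast majority of random strings, the expensive exact comparison is skipped entirely.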
The project took 1.5 months, generated 7.5 trillion character groups, and checked them against 5.5 trillion (5,429,503,678,976) possible combinations. The project concluded on October 6 when the last work, The Taming Of The Shrew, was completed. Normally, such a massive task would be out of the reach of one man without a team of computer scientists and supercomputers, but because Hadoop was able to break the overwhelming job into little segments running in parallel on servers in Amazon’s cloud, Jesse Anderson managed to do it himself on commodity hardware. Though the Million Monkeys Project was mostly for fun, it shares many similarities with other serious use cases for Hadoop. DNA sequencing, for example, involves matching short reads of a few dozen base pairs against a full genome of millions or billions of pairs. Just like with the monkeys, the job gets much more manageable when broken down into smaller segments and, since commodity hardware and open source software spare research budgets, Hadoop has become a dominant tool in the sequencing community.
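The divide-and-conquer pattern that makes this tractable can be sketched in miniature (hypothetical code, not the project's): each "map" task independently checks its own batch of candidate strings against the text, and a "reduce" step merges the matches, just as Hadoop fans batches out across many machines and combines their results.

```python
# Toy MapReduce: independent map tasks find matches in their batches;
# a reduce step merges the per-batch results into one list.
from functools import reduce

def map_task(batch, substrings):
    # Each map task checks only its own batch of candidate strings.
    return [s for s in batch if s in substrings]

def reduce_task(a, b):
    # Merge the matches found by two map tasks.
    return a + b

text = "tobeornottobethatisthequestion"
substrings = {text[i:i + 9] for i in range(len(text) - 8)}

batches = [
    ["tobeornot", "aaaaaaaaa"],   # batch for worker 1
    ["ttobethat", "bbbbbbbbb"],   # batch for worker 2
]
matches = reduce(reduce_task,
                 (map_task(b, substrings) for b in batches), [])
print(matches)   # ['tobeornot', 'ttobethat']
```

Because no map task depends on any other, the batches can run on as many machines as the cluster offers, which is exactly the property Hadoop exploits.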
- What Is Hadoop? Here is a 101 with Mike Olson (ctovision.com)
- Quickstart Guide: Stand up your cloud-based servers with Amazon Web Services EC2 (ctovision.com)
- Hadoop for Bioinformatics (ctovision.com)