Common Hadoopable Problems

If you’re reading this, you probably already know about Apache Hadoop, a popular data storage and analysis platform. Hadoop can inexpensively store any type of information from any source on commodity hardware and run fast, distributed analysis in parallel across the servers of a Hadoop cluster. It’s powerful, agile, scalable, and, thanks to replication, resilient to hardware and system failures.

But when should you use Hadoop? The white paper “Ten Common Hadoopable Problems: Real-World Hadoop Use Cases” by Cloudera, a leading provider of Hadoop-based software and services, explains that Hadoop is well suited to large quantities of complex data that require new kinds of analysis.

Some examples of what Hadoop can do are:

  • Find love: To create matches, a dating site needed to run descriptions, survey results, demographics, web activity, and past successes and failures through complex scoring and matching algorithms for all of its thousands of clients. The data it had to analyze was often unstructured and came from a variety of sources. As new subscribers joined the service, the system also had to scale to accommodate the rapidly expanding number of possible pairings. The company switched from its original, custom-made system, which could not keep up, to Hadoop, which allowed for more data and more complex, evolving compatibility models. Hadoop offers similar advantages to other recommendation engines, such as those that match web ads to site visitors or products to shoppers.
  • Make the smart grid smarter: A large power company used Hadoop to store and analyze sensor data from its smart grid and individual generators to monitor network performance and help prevent power outages. This required examining a massive amount of data in real time and storing it for forensic analysis after an outage to determine what went wrong. Not only was the company able to stream this data from all of its sensors and perform continuous analysis in a Hadoop cluster, but it was able to do so cheaply enough that it could afford to keep long-term historical data. While most Hadoop users don’t need to worry about generators and power grids, similar systems can be used to maintain data centers with hundreds or thousands of servers.
  • Fight crime: Hadoop has been used extensively to detect fraud and abuse online. An anti-virus company uses Hadoop to store its large library of malware signatures and then uses that library to compare, detect, and identify new or emerging threats. A major email provider has implemented a Hadoop cluster to weed out spam and identify spammers by analyzing all emails in real time. Many online vendors use Hadoop to store web logs and track activity, IP addresses, and user locations to detect fraud. Hadoop works here because it can cheaply store the massive amounts of complex data generated through online activity and analyze it fast enough to spot abuse.
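
To make the web-log example a little more concrete, here is a minimal sketch of the map, shuffle, and reduce pattern that this kind of Hadoop analysis is typically built on. It runs in a single Python process and does not use Hadoop itself. The log format, the sample data, the function names, and the threshold of three failed logins are all assumptions made up for illustration, not details from the white paper.

```python
from collections import defaultdict

# Hypothetical log format (an assumption): "<ip> <user> <status>"
SAMPLE_LOGS = [
    "10.0.0.1 alice OK",
    "10.0.0.2 bob FAIL",
    "10.0.0.2 carol FAIL",
    "10.0.0.2 dave FAIL",
    "10.0.0.3 erin FAIL",
]


def map_phase(line):
    """Map step: emit an (ip, 1) pair for every failed login."""
    ip, _user, status = line.split()
    if status == "FAIL":
        yield ip, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key (here, the IP address)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(ip, counts):
    """Reduce step: total the failed logins seen for one IP address."""
    return ip, sum(counts)


def suspicious_ips(lines, threshold=3):
    """Return, in sorted order, the IPs with at least `threshold` failed logins."""
    pairs = (pair for line in lines for pair in map_phase(line))
    totals = dict(reduce_phase(ip, counts) for ip, counts in shuffle(pairs).items())
    return sorted(ip for ip, total in totals.items() if total >= threshold)


if __name__ == "__main__":
    print(suspicious_ips(SAMPLE_LOGS))
```

On a real cluster, the map and reduce steps would run in parallel on many servers, each working on its own slice of the logs, which is what lets the same simple logic keep up with very large volumes of data.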

