Geo-based Content Processing Using HBase

Another great use case for Hadoop and HBase is explained in this video from the Chicago Data Summit on April 26, in which NAVTEQ shows how HBase can be used to process geographical content on a massive scale. NAVTEQ is “the leading map provider” for location-enabled devices such as vehicle navigation systems and smartphone maps. They provide the data that powers the navigation systems of companies like Garmin, Nokia, and BMW, and that data goes well beyond maps to include real-time traffic information and location data. Delivering it requires pulling constantly changing information from over 80,000 sources, a Big Data challenge.

NAVTEQ already had a system in place to deal with this mass of data but, as chief architect Ravi Veeramachaneni explains, they faced some major problems. The system was inefficient and expensive to scale because of the cost of Oracle licenses. Their content needed to be available in real time, which meant updating and delivering it much faster than before while supporting customers with both connected and disconnected devices. They also needed to decouple content from maps and make it more flexible, so that they could quickly add new content providers and support both structured and unstructured data. Their content is large and complex: hundreds of millions of content records, hundreds of providers plus community input, an average of 120 attributes per record (and as many as 400), and over 270 classifications of content. This content is sparse, unstructured, provided in multiple formats, and constantly growing.

To solve these Big Data problems, NAVTEQ turned to Hadoop and HBase. The system they designed would handle spikes in content by scaling out efficiently, use flexible business rules to add new providers on the fly, and maintain high content quality by corroborating multiple sources to verify information. NAVTEQ resolved to use open source whenever possible, but for peace of mind they turned to Cloudera for commercial support of the open source Hadoop ecosystem.

HBase was a natural choice. It scales well because it runs on top of Hadoop. It does not store null values, so the hundreds of attributes that may not apply to a given record take up no disk space and no disk input/output time. It also supports unstructured data. HBase has built-in versioning, which keeps timestamped versions of each value and helps show what content has changed and what is current. HBase also has a well-developed community around it, including Cloudera, which NAVTEQ signed to support their project after a few rocky months of trying to manage their clusters on their own.
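To make the sparse-storage and versioning points concrete, here is a minimal Python sketch. It is a toy in-memory model, not the HBase client API and not NAVTEQ's implementation: the class `SparseVersionedTable`, its method names, the `max_versions` parameter, and the sample row key and attributes are all hypothetical, chosen only for illustration. The idea it shows is that only attributes that actually have a value are stored, and that each value keeps a short timestamped history.

```python
import time
from collections import defaultdict


class SparseVersionedTable:
    """Toy model of a sparse, versioned column store (illustrative, not HBase)."""

    def __init__(self, max_versions=3):
        # Hypothetical limit on how many versions of a cell are retained.
        self.max_versions = max_versions
        # row key -> column name -> list of (timestamp, value), newest first
        self._rows = defaultdict(dict)

    def put(self, row, column, value, timestamp=None):
        """Store a new version of one cell; absent columns are never created."""
        ts = time.time() if timestamp is None else timestamp
        versions = self._rows[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)
        del versions[self.max_versions:]  # drop the oldest versions

    def get(self, row, column):
        """Return the newest value, or None if the cell was never written."""
        versions = self._rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def history(self, row, column):
        """Return retained (timestamp, value) pairs, newest first."""
        return list(self._rows.get(row, {}).get(column, []))

    def stored_cells(self, row):
        """Count the cell versions physically held for a row."""
        return sum(len(v) for v in self._rows.get(row, {}).values())


# Example: a record could have hundreds of possible attributes,
# but only the two that are actually set occupy any storage.
table = SparseVersionedTable(max_versions=2)
table.put("poi-001", "name", "Cafe Uno", timestamp=1)
table.put("poi-001", "phone", "555-0100", timestamp=1)
table.put("poi-001", "name", "Cafe Uno Chicago", timestamp=2)
```

In this sketch, asking for an attribute that was never written (say, a hypothetical `wifi` column) simply returns `None` without anything having been stored for it, while `history("poi-001", "name")` shows both the current and the previous name.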

Yet even with all of HBase’s useful features and Cloudera’s help, managing and improving NAVTEQ’s Big Data has been a challenge, filled with successes, mistakes, and lessons learned. Veeramachaneni devotes a good part of the talk to the problems and stumbling blocks he has encountered and how to overcome them. For example, he advises against running big HBase projects on virtual machines, and he notes that some stability issues were solved by newer versions of HBase and of Cloudera’s Distribution Including Apache Hadoop, now at version CDH3. That portion alone makes Veeramachaneni’s lecture valuable for any enterprise looking to jump into Hadoop and HBase for their Big Data: they can learn from NAVTEQ’s experience and save themselves some headaches along the way.