Context on HBase

HBase is an open source database with some very special features you should take note of. This post provides some context on HBase. Most of this information was gleaned from the great reference material at Cloudera.com, which has become the go-to site for learning about Big Data approaches, especially capabilities related to Apache Hadoop.

What is HBase?

  • It is a database that is the open source implementation of Google’s “BigTable” database.
  • It is a “sparse, distributed, persistent multidimensional sorted map”
  • It is indexed by a row key, column key and a timestamp.
  • Users store data in rows in labeled tables. A row has a sortable key and some number of columns. The table is stored sparsely, so rows in the same table can have widely varying columns, if the user desires.
  • The key/value pairs in HBase are kept sorted by key. The order is byte-wise lexicographic, which matches alphabetical order for plain ASCII keys.
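
To make the "sorted map" description above more concrete, here is a minimal in-memory sketch in Python. It is not the HBase API: the class name SparseSortedTable, its methods, and the sample keys and values are all made up for illustration. It only assumes what the bullets say, namely that cells are addressed by row key, column key and timestamp, that rows may have different columns, and that row keys stay sorted.

```python
from bisect import bisect_left, insort


class SparseSortedTable:
    """Toy model of a (row key, column key, timestamp) -> value map (hypothetical, not HBase code)."""

    def __init__(self):
        self._row_keys = []   # row keys, kept in sorted order
        self._rows = {}       # row key -> {column key -> {timestamp -> value}}

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            insort(self._row_keys, row)   # keep row keys sorted on insert
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the value with the newest timestamp, or None if the cell is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row key, sorted column keys) for start_row <= row key < stop_row."""
        i = bisect_left(self._row_keys, start_row)
        while i < len(self._row_keys) and self._row_keys[i] < stop_row:
            key = self._row_keys[i]
            yield key, sorted(self._rows[key])
            i += 1


table = SparseSortedTable()
table.put(b"user#100", b"info:name", 1, b"Ada")
table.put(b"user#100", b"info:email", 1, b"ada@example.com")
table.put(b"user#200", b"info:name", 1, b"Grace")   # no email column: rows are sparse
table.put(b"user#100", b"info:name", 2, b"Ada L.")  # newer version of the same cell
table.put(b"page#001", b"stats:views", 1, b"42")

latest_name = table.get(b"user#100", b"info:name")
user_rows = list(table.scan(b"user#", b"user$"))  # "$" is the byte right after "#"
```

Because the row keys are sorted, all rows sharing the "user#" prefix sit next to each other, so the range scan only walks that slice of the table. This is why row key design matters so much in this kind of store.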

What does it do?

  • HBase lets you store and retrieve data. Like other new data systems it is designed to do that at scale.
  • It is an open source project of the Apache Software Foundation. It began as part of the Hadoop project and is now a top-level Apache project in its own right.
  • It is designed to work very well with other Hadoop capabilities, like the Hadoop Distributed Filesystem (HDFS). This means it has the same great reliability/cost benefits/scaling capabilities as HDFS.
  • It is a column-oriented database. This is much more efficient for many new types of data retrieval. A column-oriented database still identifies each row by a key, but it stores and reads data by column rather than by whole row. So if a query needs only a few columns from a table with lots of rows (some databases can have huge numbers of rows), it can really speed things up.
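
The difference between the two layouts can be sketched in a few lines of Python. This is an illustration of the general idea only, using made-up sample records. HBase itself groups columns into column families rather than storing every column separately, so it should not be read as a picture of HBase's on-disk format.

```python
# Three records with the same three fields (hypothetical sample data).
records = [
    {"id": 1, "city": "Oslo", "views": 10},
    {"id": 2, "city": "Lima", "views": 25},
    {"id": 3, "city": "Oslo", "views": 7},
]

# Row-oriented layout: all fields of one record sit together.
row_layout = [(r["id"], r["city"], r["views"]) for r in records]

# Column-oriented layout: all values of one field sit together.
column_layout = {
    field: [r[field] for r in records] for field in ("id", "city", "views")
}

# Summing "views" in the row layout walks over every full record...
total_from_rows = sum(row[2] for row in row_layout)

# ...while the column layout reads only the one list it needs.
total_from_columns = sum(column_layout["views"])
```

Both totals are the same. The point is how much unrelated data has to be read to compute them, which grows with the number of rows and columns.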

Does it scale?

  • Facebook uses it for their messaging platform. So those dynamic messages you see when you log in are all being driven by HBase, for you and almost a billion other people. So yes, this scales.

Key Features:

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables.
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
  • Easy to use Java API for client access.
  • Block cache and Bloom Filters for real-time queries.
  • Query predicate push down via server side Filters.
  • Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
  • Extensible JRuby-based (JIRB) shell.
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.
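
As a rough illustration of the sharding feature in the list above: HBase splits a table into regions, each covering a contiguous range of row keys and served by a RegionServer. The Python sketch below shows only the lookup idea. The split points, server names and the locate_region function are hypothetical, and the real client handles this through its own metadata lookups.

```python
from bisect import bisect_right

# Hypothetical split points: each region serves keys from its start key
# (inclusive) up to the next region's start key (exclusive).
region_start_keys = [b"", b"g", b"n", b"t"]
region_servers = ["server-1", "server-2", "server-3", "server-1"]


def locate_region(row_key):
    """Return (region index, server name) for the region whose range holds row_key."""
    index = bisect_right(region_start_keys, row_key) - 1
    return index, region_servers[index]


first = locate_region(b"apple")    # falls in the first region, before b"g"
boundary = locate_region(b"g")     # a start key belongs to the region it starts
last = locate_region(b"zebra")     # past the final split point
```

Since the ranges are sorted, finding the right region is a binary search over the start keys. Splitting a region that has grown too large just adds one more start key to the list.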

When will you want to use HBase?

  • It is not a silver bullet. It is great for many things, but not everything. It is NOT optimized for transactional applications or relational analytics. It is also not a substitute for HDFS when doing large batch MapReduce.
  • Use it if your application has a variable schema where each row is different.
  • Consider it if your data is stored in collections.
  • Use it if you have use cases similar to those in the list below.

Some use cases, from http://wiki.apache.org/hadoop/Hbase/PoweredBy

  • Adobe – We currently have about 30 nodes running HDFS, Hadoop and HBase in clusters ranging from 5 to 14 nodes on both production and development. We plan a deployment on an 80 nodes cluster. We are using HBase in several areas from social services to structured data and processing for internal use. We constantly write data to HBase and run mapreduce jobs to process then store it back to HBase or external systems. Our production cluster has been running since Oct 2008.
  • Caree.rs – Accelerated hiring platform for HiTech companies. We use HBase and Hadoop for all aspects of our backend – job and company data storage, analytics processing, machine learning algorithms for our hire recommendation engine. Our live production site is directly served from HBase. We use cascading for running offline data processing jobs.
  • Explorys uses an HBase cluster containing over a billion anonymized clinical records, to enable subscribers to search and analyze patient populations, treatment protocols, and clinical outcomes.
  • Facebook uses HBase to power their Messages infrastructure.
  • Filmweb is a film web portal with a large dataset of films, persons and movie-related entities. We have just started a small cluster of 3 HBase nodes to handle our web cache persistency layer. We plan to increase the cluster size, and also to start migrating some of the data from our databases which have some demanding scalability requirements.
  • Flurry provides mobile application analytics. We use HBase and Hadoop for all of our analytics processing, and serve all of our live requests directly out of HBase on our 50 node production cluster with tens of billions of rows over several tables.
  • GumGum is an In-Image Advertising Platform. We use HBase on a 15-node Amazon EC2 High-CPU Extra Large (c1.xlarge) cluster for both real-time data and analytics. Our production cluster has been running since June 2010.
  • Infolinks – Infolinks is an In-Text ad provider. We use HBase to process advertisement selection and user events for our In-Text ad network. The reports generated from HBase are used as feedback for our production system to optimize ad selection.
  • Kalooga is a discovery service for image galleries. We use Hadoop, HBase and Pig on a 20-node cluster for our crawling, analysis and events processing.
  • Lily is an open source content repository, backed by HBase and SOLR from Outerthought – scalable content applications.
  • Mahalo, “…the world’s first human-powered search engine”. All the markup that powers the wiki is stored in HBase. It’s been in use for a few months now. MediaWiki – the same software that powers Wikipedia – has version/revision control. Mahalo’s in-house editors produce a lot of revisions per day, which was not working well in an RDBMS. An HBase-based solution for this was built and tested, and the data migrated out of MySQL and into HBase. Right now it’s at something like 6 million items in HBase. The upload tool runs every hour from a shell script to back up that data, and on 6 nodes takes about 5-10 minutes to run – and does not slow down production at all.
  • Meetup is on a mission to help the world’s people self-organize into local groups. We use Hadoop and HBase to power a site-wide, real-time activity feed system for all of our members and groups. Group activity is written directly to HBase, and indexed per member, with the member’s custom feed served directly from HBase for incoming requests. We’re running HBase 0.20.0 on an 11-node cluster.
  • Mendeley – We are creating a platform for researchers to collaborate and share their research online. HBase is helping us to create the world’s largest research paper collection and is being used to store all our raw imported data. We use a lot of map reduce jobs to process these papers into pages displayed on the site. We also use HBase with Pig to do analytics and produce the article statistics shown on the web site. You can find out more about how we use HBase in these slides [http://www.slideshare.net/danharvey/hbase-at-mendeley].
  • Ning uses HBase to store and serve the results of processing user events and log files, which allows us to provide near-real time analytics and reporting. We use a small cluster of commodity machines with 4 cores and 16GB of RAM per machine to handle all our analytics and reporting needs.
  • OpenLogic stores all the world’s Open Source packages, versions, files, and lines of code in HBase for both near-real-time access and analytical purposes. The production cluster has well over 100TB of disk spread across nodes with 32GB+ RAM and dual-quad or dual-hex core CPU’s.
  • Openplaces is a search engine for travel that uses HBase to store terabytes of web pages and travel-related entity records (countries, cities, hotels, etc.). We have dozens of MapReduce jobs that crunch data on a daily basis. We use a 20-node cluster for development, a 40-node cluster for offline production processing and an EC2 cluster for the live web site.
  • Powerset (a Microsoft company) uses HBase to store raw documents. We have a ~110 node hadoop cluster running DFS, mapreduce, and hbase. In our wikipedia hbase table, we have one row for each wikipedia page (~2.5M pages and climbing). We use this as input to our indexing jobs, which are run in hadoop mapreduce. Uploading the entire wikipedia dump to our cluster takes a couple hours. Scanning the table inside mapreduce is very fast — the latency is in the noise compared to everything else we do.
  • ReadPath uses HBase to store several hundred million RSS items and dictionary for its RSS newsreader. Readpath is currently running on an 8 node cluster.
  • resu.me – Career network for the net generation. We use HBase and Hadoop for all aspects of our backend – user and resume data storage, analytics processing, machine learning algorithms for our job recommendation engine. Our live production site is directly served from HBase. We use cascading for running offline data processing jobs.
  • Runa Inc. offers a SaaS that enables online merchants to offer dynamic per-consumer, per-product promotions embedded in their website. To implement this we collect the click streams of all their visitors to determine along with the rules of the merchant what promotion to offer the visitor at different points of their browsing the Merchant website. So we have lots of data and have to do lots of off-line and real-time analytics. HBase is the core for us. We also use Clojure and our own open sourced distributed processing framework, Swarmiji. The HBase Community has been key to our forward movement with HBase. We’re looking for experienced developers to join us to help make things go even faster!
  • Sematext runs Search Analytics, a service that uses HBase to store search activity and MapReduce to produce reports showing user search behaviour and experience.
  • Sematext runs Scalable Performance Monitoring (SPM), a service that uses HBase to store performance data over time, crunch it with the help of MapReduce, and display it in a visually rich browser-based UI. Interestingly, SPM features SPM for HBase, which is specifically designed to monitor all HBase performance metrics.
  • SocialMedia uses HBase to store and process user events which allows us to provide near-realtime user metrics and reporting. HBase forms the heart of our Advertising Network data storage and management system. We use HBase as a data source and sink for both realtime request cycle queries and as a backend for mapreduce analysis.
  • Streamy is a recently launched realtime social news site. We use HBase for all of our data storage, query, and analysis needs, replacing an existing SQL-based system. This includes hundreds of millions of documents, sparse matrices, logs, and everything else once done in the relational system. We perform significant in-memory caching of query results similar to a traditional Memcached/SQL setup as well as other external components to perform joining and sorting. We also run thousands of daily MapReduce jobs using HBase tables for log analysis, attention data processing, and feed crawling. HBase has helped us scale and distribute in ways we could not otherwise, and the community has provided consistent and invaluable assistance.
  • StumbleUpon and Su.pr use HBase as a real-time data storage and analytics platform. Serving directly out of HBase, various site features and statistics are kept up to date in real time. We also use HBase as a map-reduce data source to overcome traditional query speed limits in MySQL.
  • SubRecord Project is an Open Source project that is using HBase as a repository of records (persisted map-like data) for the aspects it provides like logging, tracing or metrics. HBase and Lucene index both constitute a repo/storage for this platform.
  • Shopping Engine at Tokenizer is a web crawler; it uses HBase to store URLs and Outlinks (AnchorText + LinkedURL): more than a billion. It was initially designed as a Nutch-Hadoop extension, then (due to a very specific ‘shopping’ scenario) moved to SOLR + MySQL (InnoDB) (tens of thousands of queries per second), and now to HBase. HBase is significantly faster due to: no need for huge transaction logs, a column-oriented design that exactly matches ‘lazy’ business logic, data compression, and MapReduce support. The number of mutable ‘indexes’ (a term from RDBMS) is significantly reduced because each ‘row::column’ structure is physically sorted by ‘row’. The MySQL InnoDB engine is the best DB choice for highly concurrent updates. However, the necessity to flush a block of data to the hard drive even if we changed only a few bytes is an obvious bottleneck. HBase greatly helps: the ‘delete-insert’, ‘mutable primary key’, and ‘natural primary key’ patterns, not so popular in modern DBMSs, become a big advantage with HBase.
  • Traackr uses HBase to store and serve online influencer data in real-time. We use MapReduce to frequently re-score our entire data set as we keep updating influencer metrics on a daily basis.
  • Trend Micro uses HBase as a foundation for cloud scale storage for a variety of applications. We have been developing with HBase since version 0.1 and production since version 0.20.0.
  • Twitter runs HBase across its entire Hadoop cluster. HBase provides a distributed, read/write backup of all mysql tables in Twitter’s production backend, allowing engineers to run MapReduce jobs over the data while maintaining the ability to apply periodic row updates (something that is more difficult to do with vanilla HDFS). A number of applications including people search rely on HBase internally for data generation. Additionally, the operations team uses HBase as a timeseries database for cluster-wide monitoring/performance data.
  • Udanax.org (URL shortener) uses a 10-node HBase cluster to store URLs and web log data and to respond to real-time requests on its web server. This application is now used by some Twitter clients and a number of web sites. Currently API requests are almost 30 per second and web redirection requests are about 300 per second.
  • Veoh Networks uses HBase to store and process visitor(human) and entity(non-human) profiles which are used for behavioral targeting, demographic detection, and personalization services. Our site reads this data in real-time (heavily cached) and submits updates via various batch map/reduce jobs. With 25 million unique visitors a month storing this data in a traditional RDBMS is not an option. We currently have a 24 node Hadoop/HBase cluster and our profiling system is sharing this cluster with our other Hadoop data pipeline processes.
  • VideoSurf – “The video search engine that has taught computers to see”. We’re using HBase to persist various large graphs of data and other statistics. HBase was a real win for us because it let us store substantially larger datasets without the need for manually partitioning the data, and its column-oriented nature allowed us to create schemas that were substantially more efficient for storing and retrieving data.
  • Visible Technologies – We use Hadoop, HBase, Katta, and more to collect, parse, store, and search hundreds of millions of Social Media content. We get incredibly fast throughput and very low latency on commodity hardware. HBase enables our business to exist.
  • WorldLingo – The WorldLingo Multilingual Archive. We use HBase to store millions of documents that we scan using Map/Reduce jobs to machine translate them into all or selected target languages from our set of available machine translation languages. We currently store 12 million documents but plan to eventually reach the 450 million mark. HBase allows us to scale out as we need to grow our storage capacities. Combined with Hadoop to keep the data replicated and therefore fail-safe, we have the backbone our service can rely on now and in the future. WorldLingo has been using HBase since December 2007 and is, along with a few others, one of the longest-running HBase installations. Currently we are running the latest HBase 0.20 and serving directly from it: MultilingualArchive.
  • Yahoo! uses HBase to store document fingerprints for detecting near-duplicates. We have a cluster of a few nodes that runs HDFS, mapreduce, and HBase. The table contains millions of rows. We use this for querying duplicated documents with realtime traffic.
  • HP IceWall SSO is a web-based single sign-on solution that uses HBase to store user data for authenticating users. We previously supported RDB and LDAP, and have newly added HBase support with a view to authenticating over tens of millions of users and devices.
