Matching Impedance: When to use HBase

(For the duration of this discussion, I’m going to assume you have at least heard of HBase. If not, go check it out first or you might be a little confused.)

Ever since I read the original Bigtable paper, I knew that its design was something that would befuddle a lot of developers. As an industry, we are largely educated into the world of relational databases, the ubiquitous system of tables, relationships, and SQL. On the whole, relational databases are one of the most widespread, reliable, and well-understood technologies out there. This is one reason why so many developers today are resistant to different storage technologies, such as object databases and distributed hash tables.

However, at some point, the model starts to break down. Usually there are two kinds of pain that people run into: scaling and impedance mismatch. The scaling issue usually boils down to the fact that most RDBMSs are monolithic systems designed to run on a single machine. The way you scale this type of database (MySQL, Oracle, etc.) is by adding bigger and more expensive hardware – more CPUs, RAM, and especially disks. In this regard, at least the problem is already solved: you just have to spend the money. Unfortunately, the cost of this approach grows much faster than linearly – getting a machine that can support twice as many disks costs more than twice as much money.

Impedance mismatch is a more subtle and challenging problem to get over. The problem occurs when more and more complex schemas are shoehorned into a tabular format. The traditional issue is mapping object graphs to tables and relationships and back again. One common case where this sort of problem comes to light is when your objects have a lot of possible fields but most objects don’t have a value for every field. In a traditional RDBMS, you have to have a separate column for each field and store NULLs. Essentially, you have to decide on a homogeneous set of fields for every object. Another problem is when your data is less structured than a standard RDBMS allows. If your objects have an undefined, unpredictable set of fields, you either have to make a generic field schema (Object has many Fields) or use something like RDF to represent your schema.

HBase seeks to address some of these issues. Still, there are situations where HBase is the wrong tool for the job. As a developer, you need to make sure you take the time to see beyond the hype about this technology or that and really be sure that you’re matching impedance.

When HBase Shines

One place where HBase really does well is when you have records that are very sparse. This might mean un- or semi-structured data. In any case, unlike row-oriented RDBMSs, HBase is column-oriented and stores only the cells that actually exist, so nulls cost nothing. If you have a row that only has one out of dozens of possible columns, literally only that single column is stored. This can mean huge savings in both disk space and IO read time.
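To make the sparse-row point concrete, here is a toy comparison in Python. It is only a sketch of the two shapes of data, not how either an RDBMS or HBase lays out bytes on disk; the field list and the helper names are made up for illustration.

```python
# Toy model only: the field list and both helpers are hypothetical.
ALL_FIELDS = ["name", "email", "phone", "fax", "twitter", "homepage"]


def as_fixed_row(record):
    """Row-oriented shape: one slot per declared column, None where absent."""
    return [record.get(field) for field in ALL_FIELDS]


def as_sparse_cells(row_key, record):
    """Sparse shape: one (row, column, value) entry per value that exists."""
    return [(row_key, field, value) for field, value in record.items()]


# A record that has only one of the six possible fields.
record = {"email": "ada@example.com"}
fixed_row = as_fixed_row(record)                   # six slots, five of them None
sparse_cells = as_sparse_cells("user-1", record)   # a single cell
```

The fixed row carries a placeholder for every missing field, while the sparse form carries nothing at all for them.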

Another way that HBase matches well to un- or semi-structured data is in its treatment of column families. In HBase, individual values are called cells. Cells are addressed with a row key/column family/column qualifier/timestamp tuple. However, when you define your schema, you only specify what column families you want, with the qualifier portion determined dynamically by consumers of the table at runtime. This means that you can store pretty much anything in a column family without having to know what it will be in advance. This also allows you to essentially store one-to-many relationships in a single row! Note that this is not denormalization in the traditional sense, as you aren’t storing one row per parent-child tuple. This can be very powerful – if your child entities are truly subordinate, they can be stored with their parent, eliminating all join operations.
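The addressing scheme can be sketched with a small in-memory model. This is not the HBase client API; `ToyTable`, its methods, and the example row and family names are all hypothetical. The only things it tries to show are that families are fixed up front, qualifiers are invented at write time, and child entities can live in their parent's row.

```python
import time


class ToyTable:
    """In-memory stand-in for a table; not the real HBase API."""

    def __init__(self, families):
        self.families = set(families)  # the only part fixed at schema time
        self.cells = {}                # (row, family, qualifier, timestamp) -> value

    def put(self, row, family, qualifier, value, timestamp=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if timestamp is None else timestamp
        self.cells[(row, family, qualifier, ts)] = value

    def get_family(self, row, family):
        """Return {qualifier: newest value} for one row and one family."""
        newest = {}
        for (r, f, q, ts), value in self.cells.items():
            if r == row and f == family and (q not in newest or ts > newest[q][0]):
                newest[q] = (ts, value)
        return {q: value for q, (ts, value) in newest.items()}


# One parent row holding its child entities: each comment is just another
# qualifier in the "comment" family, chosen by the writer at runtime.
posts = ToyTable(families=["info", "comment"])
posts.put("post-26", "info", "title", "When to use HBase", timestamp=1)
posts.put("post-26", "comment", "0001", "Thanks for the article!", timestamp=2)
posts.put("post-26", "comment", "0002", "Can you elaborate?", timestamp=3)
comments = posts.get_family("post-26", "comment")
```

Reading the post together with all of its comments touches a single row, which is the "no join" property described above.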

In addition to handling sparse data well, HBase is also great for versioned data. As mentioned, the timestamp is a part of the cell “coordinates”. This is handy, because HBase stores a configurable number of versions of each cell you write, and then allows you to query what the state of that cell is at different points in time. Imagine, for instance, a record of a person with a column for location. Over time, that location might change. HBase’s schema would allow you to easily store a person’s location history along with when it changed, all in the same logical place.
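The location-history example might look like the following toy model. Again, this is a sketch in plain Python rather than HBase itself: `VersionedCells`, the `max_versions` parameter, and the sample timestamps are assumptions made for illustration.

```python
from collections import defaultdict


class VersionedCells:
    """Keeps up to max_versions timestamped values per (row, column)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> [(timestamp, value), ...], newest first
        self.versions = defaultdict(list)

    def put(self, row, column, timestamp, value):
        history = self.versions[(row, column)]
        history.append((timestamp, value))
        history.sort(key=lambda pair: pair[0], reverse=True)
        del history[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, as_of=None):
        """Newest value, or the newest value written at or before as_of."""
        for timestamp, value in self.versions.get((row, column), []):
            if as_of is None or timestamp <= as_of:
                return value
        return None


people = VersionedCells(max_versions=2)
people.put("person-1", "location", 100, "Boston")
people.put("person-1", "location", 200, "Chicago")
people.put("person-1", "location", 300, "Denver")

current = people.get("person-1", "location")             # newest version
earlier = people.get("person-1", "location", as_of=250)  # state at time 250
expired = people.get("person-1", "location", as_of=150)  # oldest version was dropped
```

With two versions retained, the oldest location is no longer available, which is the trade-off the configurable version count controls.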

Finally, of course, there’s the scaling. HBase is designed to partition horizontally across tens to hundreds of commodity PCs. This is how HBase deals with the problem of adding more CPUs, RAM and disks. I don’t feel like I need to go far down the road of discussing this idea, because it seems to be the one thing everyone gets about HBase. (If you need more convincing, read the original Bigtable paper. It’s got graphs!)

When HBase Isn’t Right

I’ll just go ahead and say it: HBase isn’t right for every purpose. Sure, you could go ahead and take your problem domain and squeeze it into HBase in one way or another, but then you’d be committing the same error we’re trying to avoid by moving away from RDBMSs in the first place.

Firstly, if your data fits into a standard RDBMS without too much squeezing, chances are you don’t need HBase. That is, if a modestly expensive server loaded with MySQL fits your needs, then that’s probably what you want. Don’t make the mistake of assuming you need massive scale right off the bat.

Next, if your data model is pretty simple, you probably want to use an RDBMS. If your entities are all homogeneous, you’ll probably have an easy time of mapping your objects to tables. You also get some nice flexibility in terms of your ability to add indexes, query on non-primary-key values, do aggregations, and so on without much additional work. This is where RDBMSs shine – for decades they’ve been doing this sort of thing and doing it well, at least at lower scale. HBase, on the other hand, doesn’t allow for querying on non-primary-key values, at least directly. HBase allows get operations by primary key (the row key) and scans (think: cursor) over row ranges. (If you have both scale and need of secondary indexes, don’t worry – Lucene to the rescue! But that’s another post.)
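The access pattern described here (gets by key, scans over key ranges, nothing else) can be sketched with a sorted list of keys. `SortedRows` and the key format are hypothetical and only approximate the idea; the real client API and storage engine are not shown.

```python
import bisect


class SortedRows:
    """Rows kept in key order, supporting get-by-key and range scans only."""

    def __init__(self):
        self.keys = []  # sorted row keys
        self.rows = {}  # row key -> {column: value}

    def put(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = row

    def get(self, key):
        return self.rows.get(key)

    def scan(self, start, stop):
        """Yield (key, row) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        for key in self.keys[lo:hi]:
            yield key, self.rows[key]

    def find_by_value(self, column, value):
        """No index on values, so every row has to be examined."""
        return [key for key in self.keys if self.rows[key].get(column) == value]


# Putting the attribute you query by at the front of the row key turns
# that query into a range scan. "}" sorts immediately after "|" in ASCII.
contacts = SortedRows()
contacts.put("CA|alice", {"city": "Fresno"})
contacts.put("CA|bob", {"city": "Oakland"})
contacts.put("NY|carol", {"city": "Albany"})

in_ca = [key for key, _ in contacts.scan("CA|", "CA}")]
in_albany = contacts.find_by_value("city", "Albany")  # full pass over all rows
```

The design choice this illustrates is that the row key is the only "index" you get for free, so it has to be chosen around the queries you expect to run.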

Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter) is store large binary objects. By large, I mean tens to hundreds of megabytes each. Certainly both RDBMSs and HBase have the capabilities to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default maximum size of individual files in a region is 256MB. The closer each row gets to that limit, the more overhead you pay to host those rows. If you have to store a lot of big files, then you’re best off storing them in the local filesystem, or if you have LOTS of data, HDFS. You can still keep the metadata in an RDBMS or HBase – but do us all a favor and just keep the path in the metadata.
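A minimal sketch of the "keep the path in the metadata" advice, using the local filesystem and a plain dict standing in for the metadata store. The function names, the row key, and the choice to name files by their SHA-256 digest are assumptions for illustration, not something the post prescribes.

```python
import hashlib
import tempfile
from pathlib import Path


def store_blob(blob_dir, metadata, row_key, data):
    """Write the bytes to a file and record only its path, size, and digest."""
    digest = hashlib.sha256(data).hexdigest()
    path = Path(blob_dir) / digest
    path.write_bytes(data)
    metadata[row_key] = {"path": str(path), "size": len(data), "sha256": digest}
    return metadata[row_key]


def load_blob(metadata, row_key):
    """Follow the stored path back to the bytes."""
    return Path(metadata[row_key]["path"]).read_bytes()


metadata = {}                  # stand-in for an RDBMS or HBase table
blob_dir = tempfile.mkdtemp()  # stand-in for a local filesystem or HDFS directory
payload = b"\x00" * 1024
entry = store_blob(blob_dir, metadata, "file-0001", payload)
```

The metadata row stays a few hundred bytes no matter how large the file is, which keeps rows in the size range where the store works best.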

Conclusion

This post certainly doesn’t cover every use case and benefit or drawback of HBase, but I think it gives a pretty decent start. My hope is that people will be able to gain some insight into when they should start thinking of HBase for their applications, and also use this as a springboard for more questions about how to make use of HBase and ideas about how to make it better. So, I’ll end with a request – please, tell us what’s missing!


14 Comments

  1. Stu Hoo
    Posted March 12, 2008 at 12:25 pm | Permalink

    Thanks for the article! Subscribed.

    One other thing you could probably touch on is the overhead involved with denormalizing your data to fit it in HBase: you will be storing duplicates of your fields (as opposed to having them recorded once in a dimension table in an RDBMS).

    Therefore, going from an RDBMS to HBase will require a larger storage/cpu/memory investment to achieve the same type of performance. This is another reason to stay with a monolithic system until you are completely convinced that you will need to grow off of one machine. It is not a decision to rush into.

  2. Posted March 12, 2008 at 8:22 pm | Permalink

    @Stu: I’m not sure exactly what you mean. If you’re suggesting that because your schema is less normalized, you’ll be storing more copies of some data, then yes, that’s the case. However, data denormalization for the sake of performance is a technique that is in use in standard RDBMSs as well.

    Basically, my belief is that disk storage is cheap, but disk seeks are not. If you can save some disk seeks by precomputing the answers to questions you’ll ask at some point, then by all means you should do that, if it matches your use case.

  3. Posted March 13, 2008 at 6:48 pm | Permalink

    Bryan

    Thanks for the post. I run a vertical search engine for images http://www.searchartgalleries.com and am evaluating getting away from MySQL due to response time and scaling issues. I’m curious what your view is on HBase (or Hypertable) suitability for batch vs. realtime operations. For instance, now I use a file system to store crawled HTML and images, then at some point the batch indexer loads all the metadata into a few MySQL instances. Later this is dumped into a different server for real time user access. HBase seems ideal for the batch stuff, but I haven’t seen anyone talk it up as a good candidate for serving real time content such as text, image attributes, and user information - all the stuff people use an RDBMS for. What’s your view?

  4. Posted March 15, 2008 at 12:39 pm | Permalink

    @Logan: While it will hopefully ultimately be tuned to be fast enough for real-time access, right now it is probably only suitable for batch processing. That said, if you’re running out of room to scale MySQL now, then perhaps you would be willing to accept worse performance, but performance that would be consistent even across tens or hundreds of machines.

    Once we get over some of the more pressing issues in the 0.2 release, we will turn our attention to improving random read performance and making HBase more suitable for real-time applications.

  5. ewilde
    Posted May 15, 2008 at 1:39 am | Permalink

    considering the following schema:

    1. a table of states with 2 columns: stateid, statename

    2. a table for cities with 3 columns: cityid, cityname, stateid

    3. a table for contacts with 3 columns: contid, contname, cityid

    when moving this schema to hbase would it be better to store it in one table with 4 columns: contid, contname, cityname, statename ???

    let’s say i need to retrieve all the contacts that live in the state: LA. in a “normal” database the query would look like:

    select contname from contacts c1
    join cities c2 on c1.cityid=c2.cityid
    join states s1 on s1.stateid=c2.stateid
    where s1.stateid=[id of LA]

    assuming indexes on cities.stateid and contacts.cityid, this query would run relatively fast.

    and what if i just need the number of contacts that live in LA? what would be the hbase-equivalent to this:

    select count(*) from contacts c1
    join cities c2 on c1.cityid=c2.cityid
    join states s1 on s1.stateid=c2.stateid
    where s1.stateid=[id of LA]

    and i haven’t even mentioned SUM aggregations….

  6. Posted November 3, 2008 at 3:49 pm | Permalink

    Brian,

    Any updates on HBase real-time access? We are building a quite large reporting database. There will be about 20-30 reports long various daily, weekly, monthly data in an ad network.

    Do you recommend HBase for this sort of application?

  7. Yonatan Maman
    Posted December 17, 2008 at 9:28 am | Permalink

    you mentioned that:

    (If you have both scale and need of secondary indexes, don’t worry – Lucene to the rescue! But that’s another post.)

    can you elaborate on that ?

  8. Posted December 17, 2008 at 11:00 am | Permalink

    @Yonatan:

    I guess I should be more cautious with what I promise to write about :) . Basically, I haven’t gone down that road personally, and I think others are doing more work with HBase and Lucene that I’m not up to date on.

  9. Yonatan Maman
    Posted December 17, 2008 at 11:51 am | Permalink

    I see.
    what is the best practice (as far as you know) to handle multiple keys in an HBase table?
    I wonder what you think about the question I posted on stackoverflow
    http://stackoverflow.com/questions/375194/how-to-design-hbase-schema

    Thanks

  10. Posted January 2, 2009 at 2:55 pm | Permalink

    I need to store voice files with the typical metadata: enterprise id, person id, etc. There could be 100,000 files per day initially, with each file a few hundred KB.
    Even though the schema fits well in the RDBMS world for metadata, features such as redundancy are exciting enough to move into the HBase world.
    With an RDBMS solution one would typically need db backups as well as file system backups of voice files.
    With HBase replicating the copy to n instances, backups become obsolete, no?

  11. Posted January 5, 2009 at 12:16 pm | Permalink

    @Kumar: HBase’s use of replication in HDFS is not a backup strategy, it’s a high-availability strategy. If you are worried about losing your data for a broad variety of reasons, you should still run backups.

  12. P Marlowe
    Posted January 29, 2009 at 2:09 pm | Permalink

    Hi Bryan,

    Is HBase appropriate for data that is updated frequently? I should say, it is really appended. The reason I ask is that HDFS files are claimed to be immutable, with the usage pattern of “write-once-read-many”.

    Your thoughts?

  13. Posted January 29, 2009 at 2:14 pm | Permalink

    @P Marlowe:

    Yeah, HBase can be used to provide append-like behavior on top of HDFS. I would say, make sure that you’re working with relatively small updates though. Don’t just use HBase to append enormous objects because that’s the only “append” you can find. There are other strategies (like the Collector, detailed elsewhere on this blog) for that kind of behavior.

  14. Eitan
    Posted November 30, 2009 at 2:53 pm | Permalink

    Is it suitable for financial data collection?
    I am looking for storage for hundreds of millions of ticks, i.e. time-series data.
    It should provide very fast inserts, very fast read and should be able to behave like an array when accessing the dataset.

    Thanks

5 Trackbacks

  1. By links for 2008-04-24 « Bloggitation on April 23, 2008 at 5:32 pm

    [...] Matching Impedance: When to use HBase (tags: database sysadmin cluster hadoop hbase) [...]

  2. By HBase « Ganbatte…! on January 29, 2009 at 1:05 am

    [...] [7] Matching Impedance: When to use HBase, http://blog.rapleaf.com/dev/?p=26 [...]

  3. [...] Another excellent resource to read to get started [...]

  4. [...] HBase, Hypertable, When to use HBase. [...]

  5. [...] milliseconds, whereas reading the entire HDFS file directory took only 14188 milliseconds. This article, http://blog.rapleaf.com/dev/?p=26, says: Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that [...]
