HBase Interview on InfoQ

by Bryan Duxbury

Jim Kellerman, Michael Stack, and I recently responded to an email interview about HBase and related topics. You can find the result here.

Matching Impedance: When to use HBase

by Bryan Duxbury

(For the duration of this discussion, I’m going to assume you have at least heard of HBase. If not, go check it out first or you might be a little confused.)

Ever since I read the original Bigtable paper, I knew that its design was something that would befuddle a lot of developers. As an industry, we are largely educated into the world of relational databases, the ubiquitous system of tables, relationships, and SQL. On the whole, relational databases are one of the most widespread, reliable, and well-understood technologies out there. This is one reason why so many developers today are resistant to different storage technologies, such as object databases and distributed hash tables.

However, at some point, the model starts to break down. Usually there are two kinds of pain that people run into: scaling and impedance mismatch. The scaling issue usually boils down to the fact that most RDBMSs are monolithic, single-process systems. The way you scale this type of database (MySQL, Oracle, etc) is by adding bigger and more expensive hardware - more CPUs, RAM, and especially disks. In this regard, at least the problem is already solved: you just have to spend the money. Unfortunately, the cost of this approach does not scale nearly linearly - getting a machine that can support twice as many disks costs more than twice as much money.

Impedance mismatch is a more subtle and challenging problem to get over. The problem occurs when more and more complex schemas are shoehorned into a tabular format. The traditional issue is mapping object graphs to tables and relationships and back again. One common case where this sort of problem comes to light is when your objects have a lot of possible fields but most objects don’t have an instance of every field. In a traditional RDBMS, you have to have a separate column for each field and store NULLs. Essentially, you have to decide on a homogeneous set of fields for every object. Another problem is when your data is less structured than a standard RDBMS allows. If you will have an undefined, unpredictable set of fields for your objects, you either have to make a generic field schema (Object has many Fields) or use something like RDF to represent your schema.

HBase seeks to address some of these issues. Still, there are situations where HBase is the wrong tool for the job. As a developer, you need to make sure you take the time to see beyond the hype about this technology or that and really be sure that you’re matching impedance.

When HBase Shines

One place where HBase really does well is when you have records that are very sparse. This might mean un- or semi-structured data. In any case, unlike row-oriented RDBMSs, HBase is column-oriented, meaning that nulls are stored for free. If you have a row that only has one out of dozens of possible columns, literally only that single column is stored. This can mean huge savings in both disk space and IO read time.

Another way that HBase matches well to un- or semi-structured data is in its treatment of column families. In HBase, individual records of data are called cells. Cells are addressed with a row key/column family/cell qualifier/timestamp tuple. However, when you define your schema, you only specify what column families you want, with the qualifier portion determined dynamically by consumers of the table at runtime. This means that you can store pretty much anything in a column family without having to know what it will be in advance. This also allows you to essentially store one-to-many relationships in a single row! Note that this is not denormalization in the traditional sense, as you aren’t storing one row per parent-child tuple. This can be very powerful - if your child entities are truly subordinate, they can be stored with their parent, eliminating all join operations.

In addition to handling sparse data well, HBase is also great for versioned data. As mentioned, the timestamp is a part of the cell “coordinates”. This is handy, because HBase stores a configurable number of versions of each cell you write, and then allows you to query what the state of that cell is at different points in time. Imagine, for instance, a record of a person with a column for location. Over time, that location might change. HBase’s schema would allow you to easily store a person’s location history along with when it changed, all in the same logical place.

Finally, of course, there’s the scaling. HBase is designed to partition horizontally across tens to hundreds of commodity PCs. This is how HBase deals with the problem of adding more CPUs, RAM and disks. I don’t feel like I need to go far down the road of discussing this idea, because it seems to be the one thing everyone gets about HBase. (If you need more convincing, read the original Bigtable paper. It’s got graphs!)

When HBase Isn’t Right

I’ll just go ahead and say it: HBase isn’t right for every purpose. Sure, you could go ahead and take your problem domain and squeeze it into HBase in one way or another, but then you’d be committing the same error we’re trying avoid by moving away from RDBMSs in the first place.

Firstly, if your data fits into a standard RDBMS without too much squeezing, chances are you don’t need HBase. That is, if a modestly expensive server loaded with MySQL fits your needs, then that’s probably what you want. Don’t make the mistake of assuming you need massive scale right off the bat.

Next, if your data model is pretty simple, you probably want to use a RDBMS. If your entities are all homogeneous, you’ll probably have an easy time of mapping your objects to tables. You also get some nice flexibility in terms of your ability to add indexes, query on non-primary-key values, do aggregations, and so on without much additional work. This is where RDBMSs shine - for decades they’ve been doing this sort of thing and doing it well, at least at lower scale. HBase, on the other hand, doesn’t allow for querying on non-primary-key values, at least directly. HBase allows get operations by primary key and scans (think: cursor) over row ranges. (If you have both scale and need of secondary indexes, don’t worry - Lucene to the rescue! But that’s another post.)

Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter), is store large amounts of binary data. When I say large amounts, I mean tens to hundreds of megabytes. Certainly both RDBMSs and HBase have the capabilities to store large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default size of individual files in a region is 256MB. The closer to the region limit you make each row, the more overhead you are paying to host those rows. If you have to store a lot of big files, then you’re best off storing in the local filesystem, or if you have LOTS of data, HDFS. You can still keep the metadata in an RDBMS or HBase - but do us all a favor and just keep the path in the metadata.

Conclusion

This post certainly doesn’t cover every use case and benefit or drawback of HBase, but I think it gives a pretty decent start. My hope is that people will be able to gain some insight into when they should start thinking of HBase for their applications, and also use this as a springboard for more questions about how to make use of HBase and ideas about how to make it better. So, I’ll end with a request - please, tell us what’s missing!

Thoughts on HBase Users’ Group

by Bryan Duxbury

Chris Kline and I visited Powerset for the second HBase Users’ Group meeting. About 20 people attended - the perfect number for small breakout discussions.

Michael Stack led off with an overview of the latest HBase news - Hadoop’s move to a top level Apache project and the subsequent HBase subproject promotion, the state of HBase 0.1, and our plans for 0.2. After we talked through all that, we broke up into small groups to discuss various topics at hand.

In attendance were some guys from Xoopit, IBM, Wikia Search, France Telecom, and a few others. (I didn’t get to talk to all of them, so excuse me if I omitted your company.) People were pretty much divided into two camps: those who had tried HBase, and those who were really curious about how it was coming and would LIKE to try it.

Amongst those who had tried it, they had similar experiences to what Rapleaf and Powerset have already seen. There are some performance and scalability problems which we’re well aware of and working to fix. Too many open file handles, HDFS corruption, out-of-memory issues in region servers, etc. The good news is that people seem to be willing to work within the problems and still make progress. As a bonus, the folks from Wikia seemed to indicate that they’ve solved some of the problems with a patch they’ve cooked up.

Those who hadn’t tried HBase yet were all circling for the common reasons - they have lots of data, and no way to scale it. There was a pretty wide variety of applications that people want to run on HBase (web caches, big graphs/RDF stores, …), which is awesome, because we need people to try a ton of different patterns of access and load in order to find HBase’s weak spots. I fielded a lot of questions about the capabilities and common use patterns of HBase, which really reaffirmed for me that we need to spend a lot more time on documentation and wiki pages. We have the answers, but if people need to get face time with us to get them, we’ve already failed.

All in all, this meetup was really productive, and it bolsters my confidence that we’ll be able to produce a strong community to support the great piece of software we’re producing. Now all we need is more patches…

HBase Users’ Group at Powerset March 4th

by Bryan Duxbury

Powerset is generously hosting another HBase Users’ Group meeting on March 4th at their offices. You can see the details here.

We’re hoping that this one will be more discussion oriented, rather than just be presentations like the one we had here at Rapleaf a few months ago. Sign up quick!

Making sure Ruby Daemons die

by Chris Kline

We use the Daemons Ruby Gem for a variety of applications. It has served us well, but we found ourselves wrapping the “stop” command with a shell script that makes sure the process actually dies. This behavior is necessary for our deploy scripts which restart daemons. Thanks to the magic of Ruby, we were able to eliminate these extra scripts with a simple Daemons extension.

Before extension: stop command returns immediately, pid file is deleted and we have no clue if the process is dead.
After extension: stop command blocks until the process is dead, giving feedback along the way.

To make sure the stop command doesn’t hang indefinitely, we send the TERM signal, then send the KILL signal (kill -9) if the process hasn’t died after a configurable amount of time. To use this extension, specify :force_kill_wait in seconds as part of the Daemons options hash:

require 'daemons_extension'
Daemons.run_proc('dantalion', :force_kill_wait => 30) { ...

The implementation starts with making sure the pid file matches the UNIX ‘ps’ command. Let’s crack open the Daemons::ApplicationGroup class and redefine the find_applications method:

# We want to redefine find_applications to not rely on
# pidfiles (e.g. find application if pidfile is gone)
# We recreate the pid files if they're not there.
def find_applications(dir)
  # Find pid_files, like original implementation
  pid_files = PidFile.find_files(dir, app_name)
  @monitor = Monitor.find(dir, app_name + '_monitor')
  pid_files.reject! {|f| f =~ /_monitor.pid$/}
 
  # Find the missing pids based on the UNIX pids
  pidfile_pids = pid_files.map {|pf| PidFile.existing(pf).pid}
  missing_pids = unix_pids - pidfile_pids

  # Create pidfiles that are gone
  if missing_pids.size > 0
    puts "[daemons_ext]: #{missing_pids.size} missing pidfiles: " +
          "#{missing_pids.inspect}... creating pid file(s)."
    missing_pids.each do |pid|
      pidfile = PidFile.new(dir, app_name, multiple)
      pidfile.pid = pid # Doesn't seem to matter if it's a string or Fixnum
    end
  end

  # Now get all the pid file again
  pid_files = PidFile.find_files(dir, app_name)

  return pid_files.map {|f|
    app = Application.new(self, {}, PidFile.existing(f))
    setup_app(app)
    app
  }
end

Now we can be sure that we’ll send a signal to our process even if the pid file was initially missing. You can reference the attached code to see how unix_pids is implemented. Next, we redefine Daemons::ApplicationGroup.stop_all:

# Specify :force_kill_wait => (seconds to wait) and this method will
# block until the process is dead.  It first sends a TERM signal, then
# a KILL signal (-9) if the process hasn't died after the wait time.
def stop_all(force = false)
  @monitor.stop if @monitor
 
  wait = options[:force_kill_wait].to_i
  if wait > 0
    puts "[daemons_ext]: Killing #{app_name} with force after #{wait} secs."

    # Send term first, don't delete PID files.
    @applications.each {|a| a.send_sig('TERM')}

    begin
      started_at = Time.now
      Timeout::timeout(wait) do
        num_pids = unix_pids.size
        while num_pids > 0
          time_left = wait - (Time.now - started_at)
          puts "[daemons_ext]: Waiting #{time_left.round} secs on " +
                "#{num_pids} #{app_name}(s)..."
          sleep 1
          num_pids = unix_pids.size
        end
      end
    rescue Timeout::Error
      @applications.each {|a| a.send_sig('KILL')}
    ensure
      # Delete Pidfiles
      @applications.each {|a| a.zap!}
    end

    puts "[daemons_ext]: All #{app_name}(s) dead."
  else
    @applications.each {|a|
      if force
        begin; a.stop; rescue ::Exception; end
      else
        a.stop
      end
    }
  end
end

Now we can be sure that our process is dead when the stop command returns… at least as sure as kill -9 will kill a process (I have seen where kill -9 didn’t kill a process, but that was 8 years ago on SunOS). The extension will also work for :multiple => true. Note that because of the system calls, this extension will not work on all operating systems… also a good reason not to patch Daemons.

Download daemons_extension.rb

OpenSocial Thoughts

by Manish Shah

Recently, I attended an OpenSocial Hackathon at Six Apart to learn more about the development efforts of Google and the various app developers. I thought I would share some of my notes from the meet up.

I wasn’t sure what to expect with the API that Google is developing. It seemed like a lot of people were unsure about what was going on with it since it dropped off the map. It seemed very real and that Google did have some engineers staffed on it and they were writing some code to support it. I think it dropped off the map because they are pushing for a major release (ver 0.7) that they want people to take seriously. The intermediate releases will not last long and will change so it didn’t make sense to push it.

Overall, I wasn’t too surprised with the what I learned. OpenSocial is just a spec for a Container to support (Plaxo, Myspace, Hi5, etc) and for Gadgets to expect (Rockyou, Slide, etc). So when a Gadget developer makes a function call, they can expect that call to be supported by the container (w/ some caveats..see below). Google is just helping develop the language used between a container and a gadget. So defining how a gadget can call a function to get container’s user’s friends (requestGetFriends).

Aside from writing the API spec, Google is developing some cool open source code to support OpenSocial. One piece is a javascript sanitizer called Caja. Another is a framework for people to quickly setup a site to act as a Container which they are calling Shindig. I wouldn’t be surprised if Google later offers some hosting / computing service (similar to amazon web services) to OpenSocial developers.

The API that Google is developing is a request-based model. From what I gathered, everything a Gadget developer wants to do (get friends, send messages, navigate to friends page) they have to issue a request to the Container before doing so. The container will support the request method, but they are not obligated / required to support the functionality requested by the Gadget. So if the gadget needs to be able to see friends of a user to work optimally, but the container denies / blocks that feature, the gadget is out of luck. Also, there really isn’t a good way for Gadget developers to link users on different Containers. The same user could have the gadget installed on 2 different sites (Bebo & Hi5 say), but to the Gadget company, its 2 different people. That would suck if I racked up a ton of points or whatever and couldn’t use them transparently.

Overall it seems kind of cumbersome for widget maker to develop apps / gadgets / whatever. It seems like the power is still in the hands of the Container site. Having to request every function to see if its available seems like a nightmare. It might not, but thats my impression. If my assumptions are wrong here, I’d be happy to hear about it if you have more experience working with the OpenSocial API.

We at Rapleaf love what Google and OpenSocial are enabling. We are strong believers in open systems and open data and we are pushing to build tools to progress towards these achievable goals. OpenSocial is a good step forward for what we are all trying to accomplish. It will be interesting to see what other strides are made as a result of OpenSocial.

Ruby-HBase ORM Already?

by Bryan Duxbury

It only took a few weeks for someone to decide to start working on an ORM layer to go over the ruby-hbase gem. Here’s Quinn Slack’s announcment for Rhino. It looks pretty new, and I’m sure it will go through a lot of iterations, but I’m excited to see where this could lead. How cool would it be to have a real-time Rails website with an HBase backend?

Ruby and HBase

by Bryan Duxbury

Lately, I’ve been investigating HBase, the Bigtable-like massively parallel database that runs on top of the Hadoop distributed file system, as an alternative to the traditional MySQL. We have some very high write throughput applications that just plain suffer under MySQL’s master-slave replication, since that model still requires the data to be written to the entire cluster eventually. In contrast, a write to HBase only requires a disk write to a number of machines equal to the replication factor - usually three - a significant improvement in large clusters.

With all the benefits of HBase, it’s nearly a no-brainer to try and make use of it in our applications. Aside from the maturity of the project occasionally being an issue, one big problem was the lack of an API that’s accessible from non-Java languages. HBase, and Hadoop in general, uses an internal Java-Java RPC framework that relies on too much internal Java magic to be reversed engineered for use by other languages.

So, after a bit of discussion with the current maintainers over at Powerset, Michael Stack and Jim Kellerman, I built a tiny Java REST servlet that maps resource URIs onto the HBase client API. It’s pretty short, only a few hundred lines of Java. (Many thanks to Stack for the framework he wrote for me.) The result is that in HBase trunk, you can start up the REST servlet external to the other running HBase servers and interact with your cluster from any language! Victory!

I’ve actually taken it one step further. I created a new gem called ruby-hbase that wraps the REST interface with a class structure for easier access. You get the HBase::HTable ruby object, which has a number of methods that mimic the HBase Java client API, but Rubyized, of course. The documentation is currently fairly lacking, but you can take a look at the specs to see some use cases. Here’s a teensy bit of sample code:

require 'rubygems'
require 'ruby-hbase'

# instantiate a table with its full URL
@table = Hbase::HTable.new("http://rs1:60050/api/test_table")
# put a row into HBase
@table.put("bryan_test", {"name:first" => "bryan"})
# get an entire row out of HBase
@table.get("bryan_test") # => {"name:first" => "bryan"}
# completely delete a row
@table.delete("bryan_test")

This is very much a beta version of the gem, as there are things that aren’t implemented and probably more than a few edge cases that haven’t been addressed. I welcome the input of others as we continue to evolve this gem. A further step I’d like to see in the future of Ruby and HBase integration is a very simple ActiveRecord-like base class so you can build your models against HBase instead of a SQL database. We’ll see how that one develops.

(I recently gave a brief talk at the HBase Meetup sponsored by Rapleaf and Powerset about this topic. Here are the slides.)

Timing out Rails from Mongrel

by Chris Kline

While we’re on the timeout tip, I’d like to talk about a fail-safe timeout for mongrel rails. We’ve talked about making sure that net/http and memcache-client behave, but what about other slow actions? Since mongrel only processes one rails request at a time, other requests can start to pile up. In rare cases, those other requests would wait forever. It would be nice to make sure that each rails request takes no longer than a set amount of time.

Since there is no timeout around a rails request in mongrel, we decided to put one in. Mongrel rails uses a mutex around all rails requests (taken from the process method in mongrel-1.0.1/lib/mongrel/rails.rb line 77):

@guard.synchronize {
  @active_request_path = request.params["PATH_INFO"]
  Dispatcher.dispatch(cgi, ActionController::CgiRequest::DEFAULT...
  @active_request_path = nil
}

Our solution was to redefine the @guard instance variable to make sure the rails request timed out after the lock was obtained. We first constructed a simple TimeoutMutex class:

class TimeoutMutex < Mutex
  def initialize(timeout = 30)
    super()
    @timeout = timeout
  end

  def synchronize
    lock
    begin
      Timeout::timeout(@timeout) {yield}
    ensure
      unlock
    end
  end
end

We then popped open the RailsHandler class in mongrel and redefined @guard:

class Mongrel::Rails::RailsHandler
  cattr_accessor :rails_timeout
  alias :original_initialize :initialize
  @@rails_timeout = 30

  def initialize(dir, mime_map = {})
    original_initialize(dir, mime_map = {})
    @guard = TimeoutMutex.new(@@rails_timeout)
  end
end

The @guard mutex is also used in the reload! method, so that will also get the same timeout. Given the fact that we don’t use this method and mongrel startup reads: “HUP => reload (without restart). It might not work well.”, we decided not to worry about it.

If you want to have a different timeout value, just set Mongrel::Rails::RailsHandler.rails_timeout. Since timeout throws an exception into the thread it times out, you could end up with an exception at any point in your rails code. Most of the time, this would tell you which methods need attention. You can also catch the Timeout::Error exception in you application controller’s rescue_action_in_public method.

Stack Traces in ThreadDump: So Close!

by Bryan Duxbury

Some of you may already be familiar with ThreadDump, a tool Rapleaf has created for figuring out where all your Ruby processes’ threads are currently executing in your code. This little gem is particularly useful when your process is not responding or a server process is slow.

However, one notable limitation of ThreadDump is that it can only give you the last line being executed by each thread, not the entire stack trace of how you got there. Sometimes that’s OK, because certainly any information is better than none, and it’ll get you to the solution you need. Unfortunately, other times, there are many different code paths that will get you down to the same point, and the path is relevant to helping you understand what’s going on.

Alas, the reason that ThreadDump does not show stack traces is because of pesky memory access errors. To do its wonderful magic, ThreadDump has a C extension that can look at Ruby’s thread list and AST to figure out the last stop. There’s even pointers in the Ruby thread structure to previous FRAMEs that contain pointers to other AST NODEs that should give you the ability to produce the stack trace. However, when you attempt to access the previous FRAME of a thread other than the current thread, you get memory access violation errors. Sigh.

Here’s my theory as to why this is happening: while the last thread’s current FRAME is accessible, the rest of the stack is stored somewhere in a stack segment that has been context-switched out. Thus, the pointers I have to previous FRAMEs are really pointing to the wrong place. When a Ruby thread is context-switched in, the stack segment gets restored correctly, which is why the current thread’s stack IS accessible.

Now, I am not a C ninja, so I’m not sure if this idea even makes sense, but could it be possible to do some sort of pointer arithmetic to find my way to the real FRAME location? If so, I’m sure it would have something to do with the stack_ptr (or some other attribute) available in the thread struct. Is this even possible? Are there any C/Ruby gurus who have an idea of how to do this?