The Collector

Last night at the Rapleaf-hosted Hadoop meetup, I talked about a project we’ve created here at Rapleaf called the Collector.

Basically, Rapleaf is starting to Hadoopify our workflow, and like a lot of people out there, we’ve found the need to manage many processes writing to HDFS so that our data can be processed by Hadoop Map/Reduce. Our particular application has hundreds of Ruby processes spread across a bunch of machines all needing to write to the same place.

A very attractive approach espoused in the Google Filesystem paper is this idea of “record append” – you don’t particularly care about the order, and you can tolerate duplication, but what you really care about is speed and durability. Alas, HDFS doesn’t have a record append feature, so you’re left trying to find some approximation.

The Collector is one such approximation. It’s a service written in Java with a Thrift interface and only one meaningful method – append. The append method takes the name of a logical file (we call it a “bucket”) and a byte array record, and writes it to an HDFS file named in a particular convention. There’s only one Collector instance per host that contains data sources, so the net effect is one file per host per rotation interval.

To add durability, we have a local write-ahead log that every record gets written to before it is written to the actual HDFS file. Then, if there is some error that causes the HDFS stream to be interrupted, we can recover the already-written data by replaying the write-ahead log.

The Collector is pretty fast – it can take writes pretty much as fast as HDFS can take writes. It also scales pretty well – since it’s deployed on each host, it grows with your application.

Finally, to handle the map/reduce end of the workflow, we have custom InputFormat implementations that deal with the “bucket” as a whole so the user never has to think about all the files in it.

The Collector is not yet open source, though I believe that we will make it so soon. There’s probably 6 hours of work or so to be done to clean it up for release, but then we can stick it on SourceForge or try to contrib it directly to Hadoop. I’ll keep the blog posted with information about how this effort is going.

This entry was posted in Hadoop. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

2 Comments

  1. Cody Caughlan
    Posted October 22, 2008 at 1:36 pm | Permalink

    I would have _loved_ to attend this event, had I heard about it coming up. Do you guys have a presentation schedule or some place where these events are posted? I looked on EventBrite and couldnt find anything…

    Thanks
    /Cody

  2. Posted October 22, 2008 at 1:40 pm | Permalink

    @Cody: We publicized this event through the Hadoop mailing list.

2 Trackbacks

  1. [...] use the Consolidator all the time. It is particularly useful when used in conjunction with the Collector since the Collector tends to generate a lot of small files. In general, though, Hadoop jobs [...]

  2. [...] while working on the Collector, I noticed that we had an issue with graceful shutdown of our servers. The Collector uses a JVM [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>