Monthly Archives: June 2009

A new Cascading pipe – MultiGroupBy

Cascading is an awesome tool, but there’s a particular situation I have run into a few times where the abstractions have broken down. The situation occurs when you have multiple pipes that you need to group together on a common key, but other than the key the pipes have different fields. Let’s use the following [...]

Posted in Cascading, Hadoop, MapReduce | 1 Comment

Multiple ways of copying data out of HDFS

There are multiple ways of getting data out of HDFS onto a local machine that does not belong to the HDFS cluster. The method used really depends on the needs of the data transfer. The simplest way of getting data out of HDFS onto a non-cluster machine is to use the functions built into [...]

Posted in Hadoop, HDFS | 5 Comments

Backing up Hadoop’s HDFS

Very little information can be found on methods of backing up data on a Hadoop cluster. To be clear, by Hadoop cluster I mean HDFS. Any solution would have to take a few things into account. The first is how to keep your data in multiple places for security. There is information on how people [...]

Posted in Hadoop, HDFS | 1 Comment

Thrift Union Pattern

At Rapleaf, we use Thrift structs as the cornerstone of many of our processes. I won’t go into great detail about Thrift in general, but suffice it to say that Thrift is easy and flexible enough to be used as the primary means of storing data and communicating between our various components. One [...]

Posted in Miscellaneous, Thrift | 1 Comment