Category Archives: Hadoop

Analyzing network load in Map/Reduce

Hadoop Map/Reduce can take a heavy toll on your network. Just how heavy, though, isn’t obvious. This is an especially important consideration when you are expanding your cluster. Rapleaf recently encountered this situation, and in the process we devised a neat theoretical model for analyzing how network topology affects Map/Reduce. When does Hadoop put the [...]

Posted in Hadoop

Thrift and pseudo-RDF schemas

BackType’s Nathan Marz recently wrote a really great post about using Thrift and an RDF-like schema to get type-safe, extensible, high-performance schemas for use in Hadoop environments. He really hit the nail on the head describing the usage pattern and the positives and negatives. A variation on this approach is something we’ve been doing at [...]

Also posted in Thrift | 1 Comment

The Wrath of DrWho, or Unpredictable Hadoop Memory Usage

A while back we encountered a really peculiar problem with one of our Hadoop apps. The app did a bunch of HDFS operations before launching a small Map/Reduce job and going on to do a bunch of memory-intensive operations. It would run very happily for a few days to a week at a time, [...]

Also posted in HDFS | 7 Comments

Cycles of Doom in Batch Processing Workflows

We integrate new data into our databases via a large batch processing workflow. The execution time of this workflow directly affects the time it takes to get new data to our customers, so keeping the runtime small is of paramount importance to us. There’s an interesting effect that can happen which we’ve dubbed the “cycle [...]

Also posted in MapReduce | 10 Comments

Command-line auto completion for Hadoop DFS commands

We like to keep things simple here at Rapleaf. One small tweak we made right after we installed Hadoop was to alias ‘hadoop dfs’ to ‘hdfs’. It rolls off the fingers nicely. We are also constantly typing ‘hdfs -ls this’ or ‘hdfs -du that’. If we are not sure what this/that is, we type ‘hdfs [...]

Also posted in bash, HDFS | 16 Comments

Dead Simple MapReduce Workflow Configuration

If you use MapReduce for any real-world application, chances are your workflow consists of more than one MapReduce job. Rapleaf has workflows consisting of over one hundred jobs. Often, you need to make configuration changes to the workflow that should apply to every job. For example, you may want each job to run [...]

Also posted in MapReduce

Getting the serial terminal to work over IPMI on a Dell R410

As avid readers of the blog know, we use Hadoop a lot and talk about it quite a bit. We are in the process of expanding our Hadoop cluster and decided to go with the new Dell R410 1U machines. From talks with other Hadoop users, the sweet spot is one spindle (drive) for every 2 [...]

Posted in Hadoop

Cascading & Clojure SF Meetup

I’m happy to announce that Rapleaf will be hosting a Cascading + Clojure Meetup in San Francisco, CA on September 24th. At this meetup, we’ll cover some real-world use cases of Cascading and Clojure and provide information on how these technologies are progressing. Here is what we have in store: Bradford Cross from FlightCaster [...]

Also posted in Cascading, MapReduce, Miscellaneous | 6 Comments

Thrift Unions Part II, or How I Reduced Memory Usage by 95%

In a previous post, we discussed the Thrift Union pattern of struct definition. To quickly summarize, the benefits are simplicity, flexibility, and low disk usage, but the downside (at least in Java) is high memory usage. Well, the downside is soon to be a downside no more. Over on THRIFT-409, I have been working on [...]

Also posted in Thrift | 1 Comment

Using random numbers in Hadoop MapReduce is dangerous

If you’re using random numbers in your MapReduce jobs, you could be suffering from data loss. The cause of the data loss is subtle and happens due to Hadoop’s behavior in dealing with TaskTrackers that are lost in the middle of a job. Let’s go through an example of how the data loss can occur. [...]

Also posted in MapReduce | 11 Comments