Monthly Archives: March 2010

Thrift and pseudo-RDF schemas

BackType’s Nathan Marz recently wrote a really great post about using Thrift and an RDF-like schema to get type-safe, extensible, high-performance schemas for use in Hadoop environments. He really hit the nail on the head describing the use pattern and the positives and negatives. A variation on this approach is something we’ve been doing at [...]

Posted in Hadoop, Thrift | 1 Comment

Accelerate your test suite with Cascading 1.1

One big downside of using Cascading for our applications has been the runtime of our regression test suite. We test with quantities of data nowhere near our regular production volume, but we still end up running lots of jobs. In our experience, this ends up making our tests take a long time (in the tens [...]

Posted in Cascading | 1 Comment

Dealing with skewed key sizes in Cascading

Rapleaf indexes data from a wide variety of sources and across all different sorts of people. As a result, some of the people we analyze end up having a lot more data about them stored in our systems than others. For instance, your average person has around a hundred friends, whereas Ashton Kutcher’s Twitter account [...]

Posted in Cascading, MapReduce | 2 Comments