Monthly Archives: August 2009

Consistent Sampling using Hash Function

I learned this technique from a lecture by Ron Kohavi. The idea is that we want to create a random sample, but we want to make sure that if a value appears multiple times, it is either always sampled or never sampled. Let’s look at a specific example – let’s say we have [...]
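
As a rough illustration of the idea (not necessarily the approach the full post describes), here is a minimal Python sketch. It assumes the values are strings and that hashing each value with SHA-256 and comparing the result against the sampling rate is an acceptable way to decide membership. The names `is_sampled`, `rate`, and `salt` are made up for this example.

```python
import hashlib


def is_sampled(value: str, rate: float, salt: str = "") -> bool:
    """Decide whether `value` belongs to the sample.

    The decision depends only on the value (plus an optional salt), so a
    value that appears many times is either always kept or always dropped.
    `rate` is the target fraction of distinct values to keep, in [0, 1].
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical data: the same user shows up in several events.
events = ["alice", "bob", "alice", "carol", "bob", "alice"]
sample = [e for e in events if is_sampled(e, 0.5)]
```

Because the decision is a pure function of the value, every occurrence of "alice" lands on the same side of the cut. Changing the hypothetical `salt` argument gives a different but equally consistent sample.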

Posted in Miscellaneous | Leave a comment

Intern Experience

Last spring, as I was looking for a summer job, one of my friends back home asked if I would be interested in applying for an internship at Rapleaf. He had worked there previously and had really enjoyed it. At the time, I had no idea what was in store for me, but looking [...]

Posted in Miscellaneous | Leave a comment

Thrift Unions Part II, or How I Reduced Memory Usage by 95%

In a previous post, we discussed the Thrift Union pattern of struct definition. To quickly summarize, the benefits are simplicity, flexibility, and low disk usage, but the downside (at least in Java) is high memory usage. Well, the downside is soon to be a downside no more. Over on THRIFT-409, I have been working on [...]

Posted in Hadoop, Thrift | 1 Comment

Using random numbers in Hadoop MapReduce is dangerous

If you’re using random numbers in your MapReduce jobs, you could be suffering from data loss. The cause is subtle and stems from how Hadoop handles TaskTrackers that are lost in the middle of a job. Let’s go through an example of how the data loss can occur. [...]

Posted in Hadoop, MapReduce | 11 Comments