Using random numbers in Hadoop MapReduce is dangerous

If you’re using random numbers in your MapReduce jobs, you could be suffering from data loss.

The cause of the data loss is subtle and happens due to Hadoop’s behavior in dealing with TaskTrackers that are lost in the middle of a job. Let’s go through an example of how the data loss can occur.

Let’s take a simple job with two map tasks and two reducers. Map Task #1 processes one record of value “1″. Map Task #2 processes one record of value “2″. Each map task chooses a random number between one and two and emits (key => random#, value => record). The reducers are identity reducers, and a key-value pair with key=X goes to Reducer #X .

  1. Map Task #1 emits (key =>2, value =>1) and the task succeeds.
  2. Map Task #2 emits (key =>2, value =>2) and the task succeeds.
  3. Reduce Task #1 reads in all map outputs, emits nothing, and succeeds.
  4. Before Reduce Task #2 starts, the TaskTrackers for the Map Tasks are lost.
  5. Hadoop declares the original tasks’ map outputs as “lost”, and re-executes the map tasks on a different TaskTracker.
  6. Map Task #1 emits (key => 1, value =>1) and the task succeeds.
  7. MapTask #2 emits (key => 2, value => 2) and the task succeeds.
  8. Reduce Task #2 copies over the map outputs, emits (key=>2, value =>2) and succeeds.
  9. Our output contains one record (key=>2, value=>2) and we have lost one record.

Yikes. The fundamental problem here is that Hadoop’s fault tolerance behavior assumes that mappers always produce the same outputs. From the perspective of fault tolerance, this is a reasonable constraint – otherwise, losing a single TaskTracker would require Hadoop to re-execute every single Reduce task. Arguably, this behavior wouldn’t be “fault tolerant”, as a small localized problem has a massive effect on the job’s execution.

If you need to use random numbers in your job, the solution is to make sure the map task always generates the same random numbers. One way to do this is to pick a “global seed” for the job and distribute the number to the tasks via the JobConf. Then, when initializing each task, initialize its random number generator with the seed “global seed” + “task id”. Within the job each task will have consistent results, but random numbers will still be distinct across tasks and across runs of the job.

Random numbers can cause other kinds of unexpected behavior besides data loss, even when the random number isn’t used as the key (These examples are significantly more complex so I won’t go into them). Just remember that idempotent tasks are critical for Hadoop to produce expected results.

(Thanks to Todd Lipcon from Cloudera for pointing this out to me)

This entry was posted in Hadoop, MapReduce. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

9 Comments

  1. Posted August 14, 2009 at 10:03 am | Permalink

    This is a problem with Hadoop, not with MapReduce per se. The title is misleading, and should read “Using random numbers in Hadoop is dangerous”, right? Hadoop is not the only MapReduce implementation.

  2. nathan
    Posted August 14, 2009 at 12:01 pm | Permalink

    @CraigH: A fair point – title updated.

  3. Johno
    Posted August 18, 2009 at 12:37 pm | Permalink

    Your proposed solution will not create valid random numbers across multiple instances. Assuming that you want your random numbers to be valid across instances (e.g, have no overlapping streams, etc), you need to be using a parallel random number generator like Lecuyer. You will need to create as many streams as tasks and then assign each stream to the tasks. You can reassign streams if one fails.

  4. Posted August 21, 2009 at 12:23 pm | Permalink

    This isn’t a problem that’s unique to Hadoop. Database systems and applications built on them have had to deal with this for a long time — sources of non-determinism, like reading from the system clock or the PNG are a problem if they have side effects. If, in a database transaction, you consult the clock and branch on the result, or insert the result into a table, then rolling back the transaction and redoing it will result in a different database state at the end.

    Distributed and transactional systems generally have trouble with non-determinism. I learned this a long time ago from user-defined functions in PostgreSQL that looked at the clock. You have to consider the non-determinism, and how it will interact with rollbacks, retries, commits and restarts. You need to design the app that runs on the system to take that stuff into account.

    In this case, the solution you propose — seeding the PNG consistently — is a great answer.

  5. Arun C Murthy
    Posted August 21, 2009 at 4:36 pm | Permalink

    Hmm… the assumption in a distributed, fault-tolerant system that Hadoop Map-Reduce is that each task-attempt is *equivalent*, not *same*. That is a subtle difference. For e.g. if you use a multithreaded MapRunner you could get keys in a different order for different map-attempts, but they are ‘equivalent’ as far as the reducers are concerned.

    PS: Your title should read – Using random numbers as *keys* in Hadoop Map-Reduce… the currentl one is misleading.

  6. Posted September 22, 2009 at 2:38 am | Permalink

    Is there the way to configure Hadoop / Amazon Web Services in such a way that it does not perform same task twice (and doesn’t need to compare results) ?

  7. nathan
    Posted September 22, 2009 at 11:34 am | Permalink

    @Alexey: Not sure if I understand your question completely. Hadoop *must* perform a task again if the tasktracker is lost. Otherwise the reducers can’t acquire their inputs and the job must fail.

  8. Posted September 23, 2009 at 5:58 am | Permalink

    @nathan
    So this happens only if some task is failed due to node failure or error in users’ computations / Hadoop itself ? When the tasktracker is lost?

  9. nathan
    Posted September 23, 2009 at 4:10 pm | Permalink

    @Alexey: Specifically, it happens when a task succeeds and *then* the tasktracker is lost. Some reducers will get the output of the first successful run of the task, while other reducers will get the output of the second successful run of the task. Tasktrackers can get lost for lots of reasons – disk going bad, disk filling up, load getting too high, etc.

2 Trackbacks

  1. [...] nathan 2009-08-14 14:56:39 View post: Using random numbers in MapReduce is dangerous [...]

  2. [...] 0 and N-1. (Make sure you are using a deterministic means of selecting random numbers as discussed here.) Finally, perform your CoGroup on these pipes, specifying the join fields to be the fields that [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>