Cycles of Doom in Batch Processing Workflows

We integrate new data into our databases via a large batch processing workflow. The execution time of this workflow directly affects the time it takes to get new data to our customers, so keeping the runtime small is of paramount importance to us. There’s an interesting effect that can happen which we’ve dubbed the “cycle of doom” which is critical to understand when optimizing these batch processing workflows.

Let’s look at an example. Let’s say we have a workflow which takes 8 hours to run, and so it processes 8 hours of data each iteration. Let’s say you want to add another piece to the workflow which will add 2 hours to the processing time. So that means the workflow runtime will now be 10 hours, with each iteration processing 10 hours of data.

Wrong.

That additional workflow piece will have a secondary effect on the system. The additional workflow piece added 2 hours to process 8 hours of data. This means that next time the workflow runs, there will be 10 hours of data to process. Since the next iteration has more data, it will take longer to run. Which means the next iteration will have even more data, and so on. This is the “cycle of doom.” How long will this spin out of control? How can we determine whether this stabilizes at some point?

Obviously there’s more to this analysis, so let’s take a more formal look at this effect. First, let’s write an equation for how long it takes one iteration of a workflow to complete:

(ITERATION TIME) = OVERHEAD + (TIME TO PROCESS ONE HOUR OF DATA) * (HOURS OF DATA)

Overhead includes things like time to start jobs, loading bloom filters, fixed network latencies, etc. “Time to process one hour of data” means how much time is spent processing the actual data, minus constant factors such as job-startup time. For simplification, let’s use shorter variable names and write this equation as:

T = O + P * H

At some point, the iteration time will stabilize at a constant runtime. We can determine the stabilization point by plugging in T = H, or in English, when the amount of time it takes to run an iteration is equal to the amount of hours of data processed by the iteration. Solving for T, we get:

T = O + P * T
T = O / (1 – P)

There’s a lot of interesting results we can take from this equation. First, if the amount of time it takes to process one hour of data (minus overhead) is more than an hour, then you will hit a cycle of doom and the iteration time will be infinity.

The more interesting result is the effect of overhead. Overhead gets multiplied. So if you have enough cluster capacity to process one hour of data in half an hour, each hour of overhead will actually add 2 hours to your iteration time. If you barely have enough cluster capacity to process one hour of data (let’s say it takes 59 minutes), then each hour of overhead will add 60 hours to your iteration time. As you can see, anything you can do to decrease overhead or decrease the data processing time (like by adding more machines) can have a huge effect on your performance.

Let’s take a look at a real-world example with Rapleaf’s workflow. We recently replaced a piece of our workflow that was writing large amounts of data to a database locally and then copying that database into HDFS. The database was getting really fragmented, so it would take 10 hours to do that copy. We replaced the database with a different system that does streaming writes which got rid of the fragmentation problem. Our workflow runtime went from 30 hours to 6 hours. So, getting rid of 10 hours of overhead had a 5x improvement on iteration time, even though it only accounted for 30% of the original runtime.

Let’s plug the numbers into our equation. We end up with:

30 = O / (1 – P)
6 = (O-10) / (1 – P)

Solving, we get P = 7/12 and O = 12.5. 10 hours of overhead corresponded to 4/5 of our total overhead, which explains the 5x improvement. We also discovered our “data processing rate” to be about half hour for each hour of data.

This equation doesn’t model everything. One factor not modeled in Rapleaf’s workflow is the fact that every iteration adds data to our database, which will slightly raise query times for the next iteration. There’s also variance in the amount of data we receive each hour which changes the analysis somewhat. Even though this equation is just an approximation, we’ve found it to be a fairly close approximation and it has really helped us understand the fundamental factors affecting our iteration times.

This entry was posted in Hadoop, MapReduce. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

9 Comments

  1. MikeD
    Posted December 18, 2009 at 12:38 pm | Permalink

    How is this model affected by elastic computing (spinning up more processing machines)? Is that not helpful since bottlenecks may be storage services?

  2. Nathan Marz
    Posted December 18, 2009 at 2:28 pm | Permalink

    @MikeD: It depends on how scalable your workflow is. If your processors contend on shared resources, then you will have limited ability to reduce that “P” parameter by increasing the number of processors.

  3. markI
    Posted December 23, 2009 at 9:55 pm | Permalink

    huuu… If you increase the number of machines and diminish the size of the blocks, theoretically you can reduce a lot of overhead, no? Then again if real time is your goal a finite state machine approach might become more scalable, but that means redo your architecture without Hadoop! How much data are we talking about? 100Tb? 1Pb?

  4. James Brown
    Posted December 30, 2009 at 10:56 am | Permalink

    Why should I waste my time with math when I can keep adding cheap disks and ram?

    This article is so 1982.

  5. Aaron Anderson
    Posted January 3, 2010 at 7:37 pm | Permalink

    Enterprise disk is NOT cheap. I really hate that people think that. Even IT managers say that “disk is cheap, throw some in the mainframe… I’ve got a 1.5TB drive on my desk that cost me $150 dollars. ” When mainframe storage for an IBM series might cost upwards of 200,000 dollars for a disk pack to hold 2TB.

    This article is so 2010.

  6. Posted January 4, 2010 at 9:52 am | Permalink

    @James Brown: Disks and RAM may be cheap, but the machines to house them are not. Nor, more importantly, are the cabinets, cooling, or power. Blithely scaling up (or out) is not always the solution, no matter what all the cloud folks would like us to believe.

    @Aaron Anderson: In our application, we’re not talking about a mainframe, so we don’t have to buy enterprise disks. We have a big cluster of “commodity” machines that use standard SATA disks, so adding 2TB is not nearly the $200k investment. However, having an extra 2TB of storage is pretty much never the problem we need to solve.

  7. Trøll Åndersen
    Posted January 6, 2010 at 12:24 pm | Permalink

    @James Brown In Norway’s summer we salt your kind and leave you out in the sun for a few weeks. We are not amused by your joke involving integers and lunar cycles.

    @ Aaron Faen, why are you paying so much for disk? That sounds like the price we paid for back when I was managing the Iralalaru project in ‘02.

  8. Posted January 7, 2010 at 1:45 pm | Permalink

    LOL at hadoop being 1982 or the concept that more disk space reduces processing time. When you design/architect a system and find that your “go to solution” is throwing hardware at it, you know you have a poor design.

  9. James Brown
    Posted January 8, 2010 at 12:46 pm | Permalink

    @Jon: LOL at the concept that batch processing time is only CPU bound in any given system. In this context, adding disks does not mean I am only looking to increase storage.

    I will leave your misinterpretation of my 1982 comment as a homework exercise.

One Trackback

  1. [...] originally wrote about this effect here. I hope you find this writeup to be clearer and more extensive. [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>