One challenge we constantly face at Rapleaf is how to remain agile as we grow. In the early days, we would say “I’m going to deploy” to a handful of other engineers in the room. When we needed to move beyond that, we designed a deployment system with 3 key objectives:
- Remain agile. Engineers should be able to deploy as frequently as they need to.
- Communicate. Other engineers should know when a deploy is about to happen and when it is finished.
- Be robust. The system should enforce access controls, prevent simultaneous deploys and check for common problems.
We were able to create a simple deployment system using Screen, CruiseControl, Capistrano and Openfire (Jabber). Screen was an obvious choice when we considered all of our requirements. Before Screen, we all deployed from our local machines. This caused two problems: several people could deploy at once, and their environments could differ. Using a shared Screen session on a single machine solved both. Screen also has access controls, multiple windows and scrollback history. We have a window for each application, and our screenrc file looks something like this:
# Set status line
caption always "%{B}%t deploy %{R}REMEMBER TO 'svn up' %{B}%? @%u%?%? %{d}[%h]%?%=%D %M %d %C %A"
# command usually invoked by C-a " would also be available as C-a space
bind ' ' windowlist -b
# Create a screen window per application
screen -ln -t "app1" -h 5000 1
screen -ln -t "app2" -h 5000 2
# Allow multiple users to connect
multiuser on
# Engineers that can deploy
acladd engineer1,engineer2
CruiseControl is a great continuous integration tool. We always specify a revision when we deploy, so we can easily check CruiseControl to make sure that the revision has passed the test suite.
Capistrano is the basis for all of our application deploys, including Java apps. We identified several functions we wanted to perform in every deploy and put those into a common library. Those common functions include notifying a Jabber conference room on start and finish, running a sanity check on the revision being deployed, and dealing with our load balancer. Capistrano has many other nice features that we use, but I won’t go into those here.
We’ve been running this system for a while and really like it. We were able to meet all of our objectives with a simple design and some great open-source tools.

Pseudo-Combiners in Cascading
In order to get maximum performance from MapReduce, you need to minimize the amount of data that you have to transfer around the network. If nearly your entire input must be transferred from your mappers to your reducers, then you’ll be putting a great deal of stress on your disks and network. One thing that comes highly recommended is the use of combiners, which allow for part of the reducing to be done during the map phase in cases where you are performing associative and commutative aggregations such as counting, summing, or finding the minimum or maximum. This is especially true when you have very few group keys, which would force large numbers of tuples to be passed into a small number of reducers.
Unfortunately, while MapReduce supports combiners, Cascading does not. Instead, we decided to hack together our own solution, which we’re calling a “pseudo-combiner”. A traditional combiner maintains a buffer of tuples and does sorting and aggregation when the buffer fills up. Our pseudo-combiner maintains a map with the group field as the key and the combiner output as the value. For every tuple, we apply the combiner logic as we update that key's entry in the map. This is better for our most common use case, which involves counting billions of values in fewer than 100 categories. Since we can very easily hold all of our categories in the map, we can ensure that each mapper emits only one output per key.
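To make the saving concrete, here is a minimal sketch of the case where every category fits in the map. It is not our production code: MapSideCountSketch and its methods are hypothetical names, and plain strings stand in for Cascading tuples.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the class and method names here are hypothetical.
class MapSideCountSketch {
    // Without combining, a mapper emits one (key, 1) record per input value.
    static int recordsWithoutCombining(List<String> categories) {
        return categories.size();
    }

    // With map-side combining, a mapper emits one (key, count) record per distinct key.
    static Map<String, Long> combine(List<String> categories) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String category : categories) {
            counts.merge(category, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "a", "c", "a", "b");
        Map<String, Long> combined = combine(input);
        System.out.println(recordsWithoutCombining(input) + " records without combining");
        System.out.println(combined.size() + " records with combining: " + combined);
    }
}
```

For this six-value input, the mapper would send three records to the reducers instead of six. With billions of values in fewer than 100 categories, the same idea shrinks each mapper's output to at most one record per category.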
Our implementation uses an LRUHashMap, which is an in-house extension of LinkedHashMap. The LRUHashMap uses the LinkedHashMap to maintain a cap on how many entries are allowed in the map and evicts the oldest key-value pairs when the map grows beyond its limit. The evicted pair is made available so that we can emit the correct output for it. When all the input tuples have been read, we simply flush the contents of the map and emit the necessary tuples for them.
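A capped map of this kind could be sketched as follows. This is an assumption about the general shape, not the in-house class: LruMapSketch, its constructor parameters and the onEvict callback are hypothetical names, and the sketch assumes "oldest" means least recently accessed. It relies on LinkedHashMap's access-order mode and its removeEldestEntry hook, both of which are part of the standard library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for the in-house LRUHashMap, which is not shown in this post.
class LruMapSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;
    private final BiConsumer<K, V> onEvict;

    LruMapSketch(int maxEntries, BiConsumer<K, V> onEvict) {
        // accessOrder = true keeps entries ordered from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
        this.onEvict = onEvict;
    }

    // LinkedHashMap calls this after each insertion; returning true removes the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        if (size() > maxEntries) {
            // Hand the evicted pair to the caller so it can emit output for it.
            onEvict.accept(eldest.getKey(), eldest.getValue());
            return true;
        }
        return false;
    }
}

class LruMapDemo {
    static List<String> run() {
        List<String> evicted = new ArrayList<>();
        LruMapSketch<String, Long> map =
            new LruMapSketch<>(2, (k, v) -> evicted.add(k + "=" + v));
        map.put("a", 1L);
        map.put("b", 1L);
        map.get("a");      // touch "a" so that "b" becomes the least recently used entry
        map.put("c", 1L);  // exceeds the cap of 2, so "b" is evicted
        return evicted;
    }
}
```

In the demo, the only evicted pair is "b=1", because reading "a" moved it ahead of "b" in the access order before the third key arrived.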
The abstract class we’ve designed has three functions that should be implemented by every combiner:
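// Excerpt: the enclosing class declaration, generic over the map value type T, is not shown.
// Called the first time a key is seen; returns the initial map value for that key.
protected abstract T initialize(Tuple tuple);
// Called for each later tuple with the same key; must modify toUpdate in place.
protected abstract void update(T toUpdate, Tuple newTuple);
// Converts a map value into the tuple to emit, on eviction or on the final flush.
protected abstract Tuple getTuple(T mapValue);
}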
The initialize function is called the first time we see a key, and allows us to store the initial combiner value for the key. The update function is called whenever we see a value for a key we’ve already seen. The current value in the map and the new tuple are passed in. The final getTuple function is called whenever we need to decide what to emit for an entry in our map. This occurs on eviction and when we flush the contents of the map at the end.
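Thus, our use case of counting would look like this. Because update returns nothing, the map value has to be mutable; a boxed Long is immutable, so incrementing it inside update would have no effect on the map. This version uses an AtomicLong as the counter instead:
// Requires: import java.util.concurrent.atomic.AtomicLong;
// When we see the first tuple, we initialize the count to 1
protected AtomicLong initialize(Tuple tuple) {
  return new AtomicLong(1);
}
// On each subsequent tuple, we increment the count in place
protected void update(AtomicLong toUpdate, Tuple newTuple) {
  toUpdate.incrementAndGet();
}
// When we need to emit a tuple for the key, we emit the count stored in the map
protected Tuple getTuple(AtomicLong mapValue) {
  return new Tuple(Long.valueOf(mapValue.get()));
}
}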
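To show how the three methods fit together with eviction and the final flush, here is a self-contained sketch that runs without Cascading. Everything in it is an assumption made for illustration: PseudoCombinerSketch, CountSketch, accept, flush, getOutput and the emitted list are hypothetical names, string keys and tab-separated output lines stand in for tuples, and a plain access-ordered LinkedHashMap stands in for the LRUHashMap.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the driver logic; not the actual abstract class.
abstract class PseudoCombinerSketch<T> {
    private final int maxEntries;
    // Access-ordered, so iteration starts at the least recently used key.
    private final LinkedHashMap<String, T> map = new LinkedHashMap<>(16, 0.75f, true);
    final List<String> emitted = new ArrayList<>();

    PseudoCombinerSketch(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    protected abstract T initialize(String value);
    protected abstract void update(T toUpdate, String newValue);
    protected abstract String getOutput(T mapValue);

    // Called once per input record.
    void accept(String key, String value) {
        T current = map.get(key);
        if (current == null) {
            map.put(key, initialize(value));
            if (map.size() > maxEntries) {
                evictEldest();
            }
        } else {
            update(current, value);
        }
    }

    // Emits and removes the least recently used entry.
    private void evictEldest() {
        Iterator<Map.Entry<String, T>> it = map.entrySet().iterator();
        Map.Entry<String, T> eldest = it.next();
        emitted.add(eldest.getKey() + "\t" + getOutput(eldest.getValue()));
        it.remove();
    }

    // Called after the last input record: emits whatever is still in the map.
    void flush() {
        for (Map.Entry<String, T> e : map.entrySet()) {
            emitted.add(e.getKey() + "\t" + getOutput(e.getValue()));
        }
        map.clear();
    }
}

// Counting with a mutable counter, since update() has no return value.
class CountSketch extends PseudoCombinerSketch<AtomicLong> {
    CountSketch(int maxEntries) {
        super(maxEntries);
    }

    protected AtomicLong initialize(String value) {
        return new AtomicLong(1);
    }

    protected void update(AtomicLong toUpdate, String newValue) {
        toUpdate.incrementAndGet();
    }

    protected String getOutput(AtomicLong mapValue) {
        return Long.toString(mapValue.get());
    }
}
```

Feeding the keys a, a, b, c, a into a CountSketch with a cap of 2 and then calling flush would emit a=2 and b=1 on eviction, followed by c=1 and a=1 on flush. The key a appears twice because it was evicted before its last occurrence, which is why the reducer still has to sum the partial counts. When all keys fit under the cap, as in our fewer-than-100-categories case, each key is emitted exactly once per mapper.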
Our adoption of combiners gave us huge performance improvements, cutting one of our stats jobs from an hour down to around two minutes, which means we can run the stats hourly.