One big downside of using Cascading for our applications has been the runtime of our regression test suite. We test with quantities of data nowhere near our regular production volume, but we still end up running lots of jobs. In our experience, this ends up making our tests take a long time (in the tens of minutes), killing our ability to iterate quickly.
After looking more deeply into the issue, we discovered that it all came down to one particular polling interval. When Cascading launches a Flow, it launches one or more actual Hadoop jobs and waits until each completes before launching the next in the pipeline. The problem is that the interval at which Cascading checks for job completion is set to something like 5 seconds by default! This setting makes plenty of sense for real-world jobs, which should all be at least minutes in length; 5 seconds isn’t going to make a difference one way or another. However, in our ultra-short job scenario, it makes all the difference. Each job has to wait out at least one polling interval before Cascading notices it has finished, so if your Flow works out to 10 jobs that run serially, the fastest it could complete is 50 seconds.
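To make the arithmetic concrete, here is a minimal sketch of that lower bound. It assumes every serial job costs at least one full polling interval; the class and method names are made up for illustration and are not part of Cascading.

```java
public class PollingOverhead {
    // Lower bound on time spent waiting for a Flow whose jobs run serially,
    // assuming each job costs at least one full polling interval.
    static long minWaitMillis(int serialJobs, long pollingIntervalMillis) {
        return serialJobs * pollingIntervalMillis;
    }

    public static void main(String[] args) {
        System.out.println(minWaitMillis(10, 5000)); // prints 50000 (50 seconds)
        System.out.println(minWaitMillis(10, 100));  // prints 1000 (1 second)
    }
}
```

With the interval cut to 100 milliseconds, the same 10-job Flow would spend about a second waiting on polls instead of most of a minute.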
Initially, we customized Cascading 1.0.8 to reduce this wait time down to about 100 milliseconds. However, when we recently upgraded to Cascading 1.1, we were pleasantly surprised to find that this polling interval was now configurable. Generally, it looks something like this:
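Map<Object, Object> properties = new HashMap<Object, Object>();
properties.put("cascading.flow.job.pollinginterval", 100); // milliseconds between job-completion checks
new FlowConnector(properties).connect(...).complete();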
With a convenient way to change this parameter, the only other thing we need is a convenient way to set this value environmentally. Ideally, we’d like to leave the parameter alone during production runs and only set it low during our test suite. This is a little tricky because, unlike Hadoop, Cascading itself doesn’t provide any global configuration framework.
The solution we ended up going with was to provide a class with a static method for getting new FlowConnectors that replaces the standard constructor. This method allows the user to provide any options they need and merges in whatever the current environmental polling interval should be. It looks something like this:
import java.util.HashMap;
import java.util.Map;
import cascading.flow.FlowConnector;

public class CascadingHelper {
  // Shared defaults merged into every FlowConnector; the test suite lowers the polling interval here.
  public static final Map<Object, Object> DEFAULT_PROPERTIES = new HashMap<Object, Object>();

  // Static helper only; never instantiated.
  private CascadingHelper() {}

  public static FlowConnector getFlowConnector() {
    return new FlowConnector(DEFAULT_PROPERTIES);
  }

  public static FlowConnector getFlowConnector(Map<Object, Object> properties) {
    // Start from the defaults, then let caller-supplied properties override them.
    Map<Object, Object> combined = new HashMap<Object, Object>(DEFAULT_PROPERTIES);
    combined.putAll(properties);
    return new FlowConnector(combined);
  }
}
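One detail worth spelling out is the merge order: the defaults go in first and the caller's properties second, so anything the caller sets explicitly wins. The sketch below isolates just that merge step with plain maps so it runs without Cascading on the classpath. The class name, the merge method, and the "example.custom.option" key are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertyMergeSketch {
    // Copy the defaults first, then apply overrides so the caller's values win.
    static Map<Object, Object> merge(Map<Object, Object> defaults, Map<Object, Object> overrides) {
        Map<Object, Object> combined = new HashMap<Object, Object>(defaults);
        combined.putAll(overrides);
        return combined;
    }

    public static void main(String[] args) {
        Map<Object, Object> defaults = new HashMap<Object, Object>();
        defaults.put("cascading.flow.job.pollinginterval", 100);

        Map<Object, Object> overrides = new HashMap<Object, Object>();
        overrides.put("cascading.flow.job.pollinginterval", 5000);
        overrides.put("example.custom.option", "value"); // hypothetical key

        Map<Object, Object> combined = merge(defaults, overrides);
        System.out.println(combined.get("cascading.flow.job.pollinginterval")); // prints 5000
        System.out.println(defaults.get("cascading.flow.job.pollinginterval")); // prints 100
    }
}
```

Because the merge builds a fresh map each time, a caller's overrides never leak back into the shared defaults.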
Finally, in our test suite, we made all of our tests inherit from a common base test class that reconfigures the default polling interval in the class constructor. Voila, our test suite takes 40% less time!
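A base test class along these lines might look roughly like the sketch below. This is not our actual test code: the class names are hypothetical, and a local map stands in for CascadingHelper.DEFAULT_PROPERTIES so the example compiles without Cascading or a test framework.

```java
import java.util.HashMap;
import java.util.Map;

public class FastPollingTestSketch {
    static final String POLLING_KEY = "cascading.flow.job.pollinginterval";

    // Stand-in for CascadingHelper.DEFAULT_PROPERTIES so the sketch compiles without Cascading.
    static final Map<Object, Object> DEFAULT_PROPERTIES = new HashMap<Object, Object>();

    // Hypothetical base class: every test that extends it lowers the shared default.
    static abstract class BaseFlowTest {
        protected BaseFlowTest() {
            DEFAULT_PROPERTIES.put(POLLING_KEY, 100); // milliseconds
        }
    }

    // Hypothetical concrete test; a real one would build its Flows via the helper.
    static class WordCountFlowTest extends BaseFlowTest {}

    public static void main(String[] args) {
        new WordCountFlowTest();
        System.out.println(DEFAULT_PROPERTIES.get(POLLING_KEY)); // prints 100
    }
}
```

Putting the override in the base class constructor means a test only has to extend the base class to pick up the short interval, as long as it also gets its FlowConnectors from the helper.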
The only remaining downside is that if someone forgets to use either the FlowConnector provider or the test base class, the Flows in their tests run slowly, so constant vigilance is required. Still, making this change has caused our build to run somewhere between 2x and 3x as fast, which is just plain awesome.

Thrift and pseudo-RDF schemas
BackType’s Nathan Marz recently wrote a really great post about using Thrift and an RDF-like schema to get type-safe, extensible, high-performance schemas for use in Hadoop environments. He really hit the nail on the head describing the use pattern and the positives and negatives.
A variation on this approach is something we’ve been doing at Rapleaf for a few years now. Our heavy use of Thrift, including for this style of data storage, is what led us to contribute patches for union support, amongst many other features. I’m really glad to see that others are making as much use of these features as we are.
One thing he’s doing that we’re not is treating different types of “relations” as first-class separate data types. Instead, we have a single DataUnit class with all the possible attributes on it. This means that the information we can store about different node types isn’t as constrained. This has been suitable for us up to this point because all of our data is about individuals. I could certainly see how, as we add different kinds of nodes (like Facebook Groups or corporations) into our system, using this strategy would be a big consistency improvement.
Additionally, our data model includes an extra field of general metadata we call Pedigree. This stores the when, where, and how elements of the data we collect. It’s not a significant extension to the data model Nathan describes, but it’s an important one.
All in all, this is a great step forward. Hope to see more of this kind of stuff out of you, Nathan!