We’ve created a challenge problem based on one of the core problems we’ve had to solve in our MapReduce workflow. A word of warning – this isn’t one of those toy problems other companies put out on their careers page. This one is so hard it will make you cry.
-
Recent Posts
Popular Posts
- Rent or Own: Amazon EC2 vs. Colocation Comparison for Hadoop Clusters 27 comment(s) | 10834 view(s)
- Mysql Replication Adapter 26 comment(s) | 6656 view(s)
- Making sure Ruby Daemons die 20 comment(s) | 7353 view(s)
- Matching Impedance: When to use HBase 19 comment(s) | 22271 view(s)
- Goodbye MapReduce, Hello Cascading 17 comment(s) | 9671 view(s)
- Rapleaf Challenge Problem 12 comment(s) | 3792 view(s)
- BloomFilter 11 comment(s) | 5441 view(s)
- Using random numbers in Hadoop MapReduce is dangerous 11 comment(s) | 4028 view(s)
- Ruby and HBase 10 comment(s) | 5263 view(s)
- Cycles of Doom in Batch Processing Workflows 10 comment(s) | 2660 view(s)
Categories
- Anonymouse (1)
- Apache (1)
- bash (1)
- Cascading (6)
- Daemons (1)
- encryption (1)
- Extensions (2)
- Google (1)
- Grub (1)
- Hadoop (22)
- HBase (6)
- HDFS (4)
- Kickstart (1)
- MapReduce (9)
- mcrypt (1)
- Miscellaneous (26)
- Mongrel (2)
- Mysql (2)
- OpenSocial (1)
- Operations (1)
- Ruby (7)
- Security (2)
- Thrift (6)
- Xen (1)
Archives
- August 2010
- July 2010
- June 2010
- May 2010
- April 2010
- March 2010
- February 2010
- January 2010
- December 2009
- November 2009
- October 2009
- September 2009
- August 2009
- July 2009
- June 2009
- May 2009
- March 2009
- February 2009
- December 2008
- November 2008
- October 2008
- September 2008
- August 2008
- July 2008
- April 2008
- March 2008
- February 2008
- January 2008
- December 2007
- November 2007
- October 2007
- September 2007
- August 2007

12 Comments
A little much with the hyberbole, perhaps? The problem’s not _that_ hard.
@Kyle: I would love to see your solution
Isnt this just a distributed union-find?
Union-find is an algorithm to solve the same problem in a shared-memory setting. What makes this problem difficult is there is no shared memory in a MapReduce setting so different techniques are needed.
Hi guys, this is a really cool problem, thanks for posting it!
I have some questions:
1) I’m curious how many vertices are in the largest connected component?
Presuming your graph has n = ~1B and m = ~10B+ (undirected), and presuming all vertices are of the same general class (e.g. email addresses that are in some way associated), wouldn’t you expect 95%+ of the vertices to be in the largest component?
i.e. When the average number of links per vertex is > 1, the fraction of all vertices in the largest component should increase quickly, even in a highly clustered graph, presuming at least some ties are non-local. (eg articles by Watts, Strogatz, Newman, Kleinberg, etc)
2) Do you wish to distinguish between strongly and weakly connected components for the sake of this problem? I know you use undirected ties in the example algorithm, but the underlying data seems to be directed, which could be neat for calculating diffusion probability scores or shannon entropy over time.
3) Presuming the ultimate goal is to identify cohesive subgroups in a graph that reflect some group shared identity or pattern of homophily (eg propensity to purchase a product, donate to a political cause, etc), I believe subgroup measures would likely better identify meaningful social groups than the connected components (as the CC is so broad, homophily predictions based on it will have a lot of noise).
Therefore, I’m curious what other network structure measures you have implemented or are considering?
I think i have a solution that takes O(n) space and log(n-1) iterations. Looking at the description, RapLeaf’s algorithm takes n iterations. Is that true?
@uday: Rapleaf’s algorithm takes O(log(n)) iterations to run. The same is true of the example “suboptimal solution.”
@Jon: The graphs that Rapleaf deals with in production have a lot of components. The largest component contains a tiny fraction of all the nodes. Of course, different problems could have different graph topologies and a solution to this problem should be able to handle many different kinds of graphs.
How’s this? I think it’s O(2n) in space, and O(log n) in number of passes.
Job 1: Create two tuples per edge.
Map:
For each edge (a,b)
Let M = min(a,b)
Emit (a, M, a, b)
Emit (b, M, a, b)
Reduce:
None
Job 2: Find the smallest node ID that holds a reference to a node.
Map:
For each tuple (n, M, a, b)
Emit K = n, V = (M, a, b)
Reduce:
1. Iterate over all V
Let M’ = min (V.M, M’)
2. Iterate over all V,
Emit (V.a, M’, V.a, V.b)
Emit (V.b, M’, V.a, V.b)
Repeat Job 2 until there are no passes where M’ > M.
Job 3: Cleanup (the reduce step here is just to ensure a unique line per node)
Map:
For each tuple (n, M, a, b)
Emit K = n, V = M
Reduce:
Emit (K, min (V))
I’m interpreting the statement “Rapleaf’s algorithm takes O(log(n)) iterations to run” to implicitly allow “and each iteration processes O(m) records through map/reduce, unlike the less performant solution that processes many more records through later iterations. Therefore, a total of O(m log(n)) records are processed through many states of map/reduce, and the total run time is O(m log(m) log(n)) holding cluster size constant and considering each cluster’s loading to be big enough to require log linear sort time”.
I believe I have a solution that has more iterations than that, but for which after the first O(log(log(n)) iterations the size of the map/reduce jobs gets smaller and smaller, decreasing exponentially, so the total number of records processed through my algo is something like O(m * log(m) * log(log(n))) which seriously trumps O(m * log(m) * log(n)) .
I have submitted my solution to Nathan to referee.
-dk
I hadn’t read this blog when I wrote my solution. I therefore didn’t know that the graphs had no large connected component. My solution will perform well even if there is only a single connected component or one huge component and a second with only two nodes in it.
The reason that requires care to work well is that a solution that didn’t care about this issue might well have a job whose reducer sees a large number of records with a single key on the last iteration. This would be bad because map/reduce jobs are run on large clusters with hundreds of machines. Each machine runs the mapper code on a subset of the reducer output data, but all of the records with the same key have to go to the same reducer instance — leaving the rest of the reducer machines idle if most of the records share a common key.
-dk
Matrix Madness problem was kind of fun… As a non-CS major, it took me around three hours to figure out. It has a very simple yet non-intuitive solution. You should find it if you’re good at seeing patterns.