Rapleaf Challenge Problem

We’ve created a challenge problem based on one of the core problems we’ve had to solve in our MapReduce workflow. A word of warning – this isn’t one of those toy problems other companies put out on their careers page. This one is so hard it will make you cry.

Rapleaf Challenge Problem

This entry was posted in Hadoop, Miscellaneous. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

12 Comments

  1. Posted December 17, 2008 at 2:29 am | Permalink

    A little much with the hyberbole, perhaps? The problem’s not _that_ hard.

  2. nathan
    Posted December 17, 2008 at 12:21 pm | Permalink

    @Kyle: I would love to see your solution :)

  3. Vijay Chakravarthy
    Posted December 26, 2008 at 4:47 pm | Permalink

    Isnt this just a distributed union-find?

  4. nathan
    Posted January 6, 2009 at 4:54 pm | Permalink

    Union-find is an algorithm to solve the same problem in a shared-memory setting. What makes this problem difficult is there is no shared memory in a MapReduce setting so different techniques are needed.

  5. Jon
    Posted February 8, 2009 at 10:55 pm | Permalink

    Hi guys, this is a really cool problem, thanks for posting it!

    I have some questions:

    1) I’m curious how many vertices are in the largest connected component?

    Presuming your graph has n = ~1B and m = ~10B+ (undirected), and presuming all vertices are of the same general class (e.g. email addresses that are in some way associated), wouldn’t you expect 95%+ of the vertices to be in the largest component?

    i.e. When the average number of links per vertex is > 1, the fraction of all vertices in the largest component should increase quickly, even in a highly clustered graph, presuming at least some ties are non-local. (eg articles by Watts, Strogatz, Newman, Kleinberg, etc)

    2) Do you wish to distinguish between strongly and weakly connected components for the sake of this problem? I know you use undirected ties in the example algorithm, but the underlying data seems to be directed, which could be neat for calculating diffusion probability scores or shannon entropy over time.

    3) Presuming the ultimate goal is to identify cohesive subgroups in a graph that reflect some group shared identity or pattern of homophily (eg propensity to purchase a product, donate to a political cause, etc), I believe subgroup measures would likely better identify meaningful social groups than the connected components (as the CC is so broad, homophily predictions based on it will have a lot of noise).

    Therefore, I’m curious what other network structure measures you have implemented or are considering?

  6. uday mantripragada
    Posted February 18, 2009 at 12:49 pm | Permalink

    I think i have a solution that takes O(n) space and log(n-1) iterations. Looking at the description, RapLeaf’s algorithm takes n iterations. Is that true?

  7. nathan
    Posted February 19, 2009 at 1:00 pm | Permalink

    @uday: Rapleaf’s algorithm takes O(log(n)) iterations to run. The same is true of the example “suboptimal solution.”

  8. nathan
    Posted February 19, 2009 at 1:54 pm | Permalink

    @Jon: The graphs that Rapleaf deals with in production have a lot of components. The largest component contains a tiny fraction of all the nodes. Of course, different problems could have different graph topologies and a solution to this problem should be able to handle many different kinds of graphs.

  9. Patrick Angeles
    Posted March 21, 2009 at 10:08 am | Permalink

    How’s this? I think it’s O(2n) in space, and O(log n) in number of passes.

    Job 1: Create two tuples per edge.

    Map:
    For each edge (a,b)
    Let M = min(a,b)
    Emit (a, M, a, b)
    Emit (b, M, a, b)

    Reduce:
    None

    Job 2: Find the smallest node ID that holds a reference to a node.

    Map:
    For each tuple (n, M, a, b)
    Emit K = n, V = (M, a, b)

    Reduce:
    1. Iterate over all V
    Let M’ = min (V.M, M’)

    2. Iterate over all V,
    Emit (V.a, M’, V.a, V.b)
    Emit (V.b, M’, V.a, V.b)

    Repeat Job 2 until there are no passes where M’ > M.

    Job 3: Cleanup (the reduce step here is just to ensure a unique line per node)

    Map:
    For each tuple (n, M, a, b)
    Emit K = n, V = M

    Reduce:
    Emit (K, min (V))

  10. Dick King
    Posted April 6, 2009 at 5:40 pm | Permalink

    I’m interpreting the statement “Rapleaf’s algorithm takes O(log(n)) iterations to run” to implicitly allow “and each iteration processes O(m) records through map/reduce, unlike the less performant solution that processes many more records through later iterations. Therefore, a total of O(m log(n)) records are processed through many states of map/reduce, and the total run time is O(m log(m) log(n)) holding cluster size constant and considering each cluster’s loading to be big enough to require log linear sort time”.

    I believe I have a solution that has more iterations than that, but for which after the first O(log(log(n)) iterations the size of the map/reduce jobs gets smaller and smaller, decreasing exponentially, so the total number of records processed through my algo is something like O(m * log(m) * log(log(n))) which seriously trumps O(m * log(m) * log(n)) .

    I have submitted my solution to Nathan to referee.

    -dk

  11. Dick King
    Posted April 15, 2009 at 10:34 am | Permalink

    I hadn’t read this blog when I wrote my solution. I therefore didn’t know that the graphs had no large connected component. My solution will perform well even if there is only a single connected component or one huge component and a second with only two nodes in it.

    The reason that requires care to work well is that a solution that didn’t care about this issue might well have a job whose reducer sees a large number of records with a single key on the last iteration. This would be bad because map/reduce jobs are run on large clusters with hundreds of machines. Each machine runs the mapper code on a subset of the reducer output data, but all of the records with the same key have to go to the same reducer instance — leaving the rest of the reducer machines idle if most of the records share a common key.

    -dk

  12. Posted May 17, 2010 at 10:36 pm | Permalink

    Matrix Madness problem was kind of fun… As a non-CS major, it took me around three hours to figure out. It has a very simple yet non-intuitive solution. You should find it if you’re good at seeing patterns.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>