<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Rapleaf Challenge Problem</title>
	<atom:link href="http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/</link>
	<description>For engineers, by engineers.</description>
	<lastBuildDate>Mon, 07 May 2012 08:24:59 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
	<item>
		<title>By: Jonathan Feucht</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1782</link>
		<dc:creator>Jonathan Feucht</dc:creator>
		<pubDate>Tue, 18 May 2010 05:36:23 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1782</guid>
		<description>The Matrix Madness problem was kind of fun... As a non-CS major, it took me around three hours to figure out. It has a very simple yet non-intuitive solution. You should find it if you&#039;re good at seeing patterns.</description>
		<content:encoded><![CDATA[<p>The Matrix Madness problem was kind of fun&#8230; As a non-CS major, it took me around three hours to figure out. It has a very simple yet non-intuitive solution. You should find it if you&#8217;re good at seeing patterns.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dick King</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1772</link>
		<dc:creator>Dick King</dc:creator>
		<pubDate>Wed, 15 Apr 2009 17:34:29 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1772</guid>
		<description>I hadn&#039;t read this blog when I wrote my solution.  I therefore didn&#039;t know that the graphs had no large connected component.  My solution will perform well even if there is only a single connected component or one huge component and a second with only two nodes in it.

This takes care to get right because a solution that ignored the issue might well have a job whose reducer sees a large number of records with a single key on the last iteration. That would be bad because map/reduce jobs are run on large clusters with hundreds of machines. Each machine runs the mapper code on a subset of the reducer output data, but all of the records with the same key have to go to the same reducer instance, leaving the rest of the reducer machines idle if most of the records share a common key.

-dk</description>
		<content:encoded><![CDATA[<p>I hadn&#8217;t read this blog when I wrote my solution.  I therefore didn&#8217;t know that the graphs had no large connected component.  My solution will perform well even if there is only a single connected component or one huge component and a second with only two nodes in it.</p>
<p>This takes care to get right because a solution that ignored the issue might well have a job whose reducer sees a large number of records with a single key on the last iteration. That would be bad because map/reduce jobs are run on large clusters with hundreds of machines. Each machine runs the mapper code on a subset of the reducer output data, but all of the records with the same key have to go to the same reducer instance, leaving the rest of the reducer machines idle if most of the records share a common key.</p>
<p>-dk</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dick King</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1762</link>
		<dc:creator>Dick King</dc:creator>
		<pubDate>Tue, 07 Apr 2009 00:40:55 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1762</guid>
		<description>I&#039;m interpreting the statement &quot;Rapleaf&#039;s algorithm takes O(log(n)) iterations to run&quot; to implicitly allow &quot;and each iteration processes O(m) records through map/reduce, unlike the less performant solution that processes many more records through later iterations. Therefore, a total of O(m log(n)) records are processed through many stages of map/reduce, and the total run time is O(m log(m) log(n)) holding cluster size constant and considering each cluster&#039;s loading to be big enough to require log-linear sort time&quot;.

I believe I have a solution that has more iterations than that, but for which after the first O(log(log(n))) iterations the size of the map/reduce jobs gets smaller and smaller, decreasing exponentially, so the total number of records processed through my algo is something like O(m * log(m) * log(log(n))), which seriously trumps O(m * log(m) * log(n)).

I have submitted my solution to Nathan to referee.

-dk</description>
		<content:encoded><![CDATA[<p>I&#8217;m interpreting the statement &#8220;Rapleaf&#8217;s algorithm takes O(log(n)) iterations to run&#8221; to implicitly allow &#8220;and each iteration processes O(m) records through map/reduce, unlike the less performant solution that processes many more records through later iterations. Therefore, a total of O(m log(n)) records are processed through many stages of map/reduce, and the total run time is O(m log(m) log(n)) holding cluster size constant and considering each cluster&#8217;s loading to be big enough to require log-linear sort time&#8221;.</p>
<p>I believe I have a solution that has more iterations than that, but for which after the first O(log(log(n))) iterations the size of the map/reduce jobs gets smaller and smaller, decreasing exponentially, so the total number of records processed through my algo is something like O(m * log(m) * log(log(n))), which seriously trumps O(m * log(m) * log(n)).</p>
<p>I have submitted my solution to Nathan to referee.</p>
<p>-dk</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Patrick Angeles</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1752</link>
		<dc:creator>Patrick Angeles</dc:creator>
		<pubDate>Sat, 21 Mar 2009 17:08:29 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1752</guid>
		<description>How&#039;s this? I think it&#039;s O(2n) in space, and O(log n) in number of passes.

Job 1: Create two tuples per edge.

Map:
For each edge (a,b)
Let M = min(a,b)
Emit (a, M, a, b)
Emit (b, M, a, b)

Reduce:
None

Job 2: Find the smallest node ID that holds a reference to a node.

Map:
For each tuple (n, M, a, b)
Emit K = n, V = (M, a, b)

Reduce:
1. Iterate over all V (M&#039; starts as the first V.M)
  Let M&#039; = min (V.M, M&#039;)

2. Iterate over all V,
  Emit (V.a, M&#039;, V.a, V.b)
  Emit (V.b, M&#039;, V.a, V.b)

Repeat Job 2 until a pass makes no change, i.e. no M&#039; &lt; M.

Job 3: Cleanup (the reduce step here is just to ensure a unique line per node)

Map:
For each tuple (n, M, a, b)
Emit K = n, V = M

Reduce:
Emit (K, min (V))</description>
		<content:encoded><![CDATA[<p>How&#8217;s this? I think it&#8217;s O(2n) in space, and O(log n) in number of passes.</p>
<p>Job 1: Create two tuples per edge.</p>
<p>Map:<br />
For each edge (a,b)<br />
Let M = min(a,b)<br />
Emit (a, M, a, b)<br />
Emit (b, M, a, b)</p>
<p>Reduce:<br />
None</p>
<p>Job 2: Find the smallest node ID that holds a reference to a node.</p>
<p>Map:<br />
For each tuple (n, M, a, b)<br />
Emit K = n, V = (M, a, b)</p>
<p>Reduce:<br />
1. Iterate over all V (M&#8217; starts as the first V.M)<br />
  Let M&#8217; = min (V.M, M&#8217;)</p>
<p>2. Iterate over all V,<br />
  Emit (V.a, M&#8217;, V.a, V.b)<br />
  Emit (V.b, M&#8217;, V.a, V.b)</p>
<p>Repeat Job 2 until a pass makes no change, i.e. no M&#8217; &lt; M.</p>
<p>Job 3: Cleanup (the reduce step here is just to ensure a unique line per node)</p>
<p>Map:<br />
For each tuple (n, M, a, b)<br />
Emit K = n, V = M</p>
<p>Reduce:<br />
Emit (K, min (V))</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: nathan</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1742</link>
		<dc:creator>nathan</dc:creator>
		<pubDate>Thu, 19 Feb 2009 20:54:31 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1742</guid>
		<description>@Jon: The graphs that Rapleaf deals with in production have a lot of components. The largest component contains a tiny fraction of all the nodes. Of course, different problems could have different graph topologies and a solution to this problem should be able to handle many different kinds of graphs.</description>
		<content:encoded><![CDATA[<p>@Jon: The graphs that Rapleaf deals with in production have a lot of components. The largest component contains a tiny fraction of all the nodes. Of course, different problems could have different graph topologies and a solution to this problem should be able to handle many different kinds of graphs.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: nathan</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1732</link>
		<dc:creator>nathan</dc:creator>
		<pubDate>Thu, 19 Feb 2009 20:00:20 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1732</guid>
		<description>@uday: Rapleaf&#039;s algorithm takes O(log(n)) iterations to run. The same is true of the example &quot;suboptimal solution.&quot;</description>
		<content:encoded><![CDATA[<p>@uday: Rapleaf&#8217;s algorithm takes O(log(n)) iterations to run. The same is true of the example &#8220;suboptimal solution.&#8221;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: uday mantripragada</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1722</link>
		<dc:creator>uday mantripragada</dc:creator>
		<pubDate>Wed, 18 Feb 2009 19:49:50 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1722</guid>
		<description>I think I have a solution that takes O(n) space and log(n-1) iterations. Looking at the description, Rapleaf&#039;s algorithm takes n iterations. Is that true?</description>
		<content:encoded><![CDATA[<p>I think I have a solution that takes O(n) space and log(n-1) iterations. Looking at the description, Rapleaf&#8217;s algorithm takes n iterations. Is that true?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jon</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1712</link>
		<dc:creator>Jon</dc:creator>
		<pubDate>Mon, 09 Feb 2009 05:55:50 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1712</guid>
		<description>Hi guys, this is a really cool problem, thanks for posting it!

I have some questions:

1) I&#039;m curious how many vertices are in the largest connected component?

Presuming your graph has n = ~1B and m = ~10B+ (undirected), and presuming all vertices are of the same general class (e.g. email addresses that are in some way associated), wouldn&#039;t you expect 95%+ of the vertices to be in the largest component?

That is, when the average number of links per vertex is &gt; 1, the fraction of all vertices in the largest component should increase quickly, even in a highly clustered graph, presuming at least some ties are non-local (see e.g. articles by Watts, Strogatz, Newman, Kleinberg, etc.).

2) Do you wish to distinguish between strongly and weakly connected components for the sake of this problem? I know you use undirected ties in the example algorithm, but the underlying data seems to be directed, which could be neat for calculating diffusion probability scores or Shannon entropy over time.

3) Presuming the ultimate goal is to identify cohesive subgroups in a graph that reflect some group shared identity or pattern of homophily (e.g. propensity to purchase a product, donate to a political cause, etc.), I believe subgroup measures would likely better identify meaningful social groups than the connected components (as the CC is so broad, homophily predictions based on it will have a lot of noise).

Therefore, I&#039;m curious what other network structure measures you have implemented or are considering?</description>
		<content:encoded><![CDATA[<p>Hi guys, this is a really cool problem, thanks for posting it!</p>
<p>I have some questions:</p>
<p>1) I&#8217;m curious how many vertices are in the largest connected component?</p>
<p>Presuming your graph has n = ~1B and m = ~10B+ (undirected), and presuming all vertices are of the same general class (e.g. email addresses that are in some way associated), wouldn&#8217;t you expect 95%+ of the vertices to be in the largest component?</p>
<p>That is, when the average number of links per vertex is &gt; 1, the fraction of all vertices in the largest component should increase quickly, even in a highly clustered graph, presuming at least some ties are non-local (see e.g. articles by Watts, Strogatz, Newman, Kleinberg, etc.).</p>
<p>2) Do you wish to distinguish between strongly and weakly connected components for the sake of this problem? I know you use undirected ties in the example algorithm, but the underlying data seems to be directed, which could be neat for calculating diffusion probability scores or Shannon entropy over time.</p>
<p>3) Presuming the ultimate goal is to identify cohesive subgroups in a graph that reflect some group shared identity or pattern of homophily (e.g. propensity to purchase a product, donate to a political cause, etc.), I believe subgroup measures would likely better identify meaningful social groups than the connected components (as the CC is so broad, homophily predictions based on it will have a lot of noise).</p>
<p>Therefore, I&#8217;m curious what other network structure measures you have implemented or are considering?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: nathan</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1702</link>
		<dc:creator>nathan</dc:creator>
		<pubDate>Tue, 06 Jan 2009 23:54:21 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1702</guid>
		<description>Union-find is an algorithm to solve the same problem in a shared-memory setting. What makes this problem difficult is that there is no shared memory in a MapReduce setting, so different techniques are needed.</description>
		<content:encoded><![CDATA[<p>Union-find is an algorithm to solve the same problem in a shared-memory setting. What makes this problem difficult is that there is no shared memory in a MapReduce setting, so different techniques are needed.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Vijay Chakravarthy</title>
		<link>http://blog.rapleaf.com/dev/2008/12/11/rapleaf-challenge-problem/#comment-1692</link>
		<dc:creator>Vijay Chakravarthy</dc:creator>
		<pubDate>Fri, 26 Dec 2008 23:47:40 +0000</pubDate>
		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=40#comment-1692</guid>
		<description>Isn&#039;t this just a distributed union-find?</description>
		<content:encoded><![CDATA[<p>Isn&#8217;t this just a distributed union-find?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

