<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Engineering Rapleaf &#187; Bryan Duxbury</title>
	<atom:link href="http://blog.rapleaf.com/dev/author/bryan/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.rapleaf.com/dev</link>
	<description>For engineers, by engineers.</description>
	<lastBuildDate>Mon, 12 Dec 2011 08:57:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>More Compact Than CompactProtocol: TupleProtocol</title>
		<link>http://blog.rapleaf.com/dev/2011/10/17/more-compact-than-compactprotocol-tupleprotocol/</link>
		<comments>http://blog.rapleaf.com/dev/2011/10/17/more-compact-than-compactprotocol-tupleprotocol/#comments</comments>
		<pubDate>Mon, 17 Oct 2011 18:18:41 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=6193</guid>
		<description><![CDATA[Rapleaf makes extensive use of Thrift&#8217;s CompactProtocol to save space for long-term data storage and for communicating between services. However, this summer, star Rapleaf intern Armaan Sarkar took us to a new level of compact-ness with his work on the new TupleProtocol. While we are completely happy with the CompactProtocol for permanent data storage and [...]]]></description>
			<content:encoded><![CDATA[<p>Rapleaf makes extensive use of <a href="http://thrift.apache.org">Thrift</a>&#8217;s CompactProtocol to save space for long-term data storage and for communicating between services. However, this summer, star Rapleaf intern Armaan Sarkar took us to a new level of compact-ness with his work on the new <a href="https://issues.apache.org/jira/browse/THRIFT-1239">TupleProtocol</a>.</p>
<p>While we are completely happy with the CompactProtocol for permanent data storage and RPC, there is one place where it&#8217;s not a perfect fit. Since all our objects are Thrift structs, we pass those same structs around during our <a href="http://www.cascading.org">Cascading</a> flows. We&#8217;ve built a connector (extremely similar to <a href="https://github.com/nathanmarz/cascading-thrift">this one</a>) that allows these Thrift structs to be serialized in the CompactProtocol when Cascading serializes Tuples between mappers and reducers, and this all works great. However, as part of our never-ending performance efforts, one of the things we are always striving to do is to decrease the amount of data that needs to be shipped between mappers and reducers. You&#8217;re probably thinking, isn&#8217;t the CompactProtocol&#8230; compact? The answer is yes. It&#8217;s the smallest representation we could come up with for a general-purpose Thrift protocol.</p>
<p>But wait! Recall that Thrift as a framework is designed to allow smooth transitions for users using different versions of their schema (like an old client contacting an updated server). Thrift does this by making sure that the data that gets serialized is sufficient to fully describe itself &#8211; serialized messages contain markers about field IDs and field data types. Using these markers, it&#8217;s possible for the Thrift library to skip over unrecognized fields. But this ability comes at the cost of a few bytes of overhead here and there, and these bytes can really add up when you have 200 billion records to move around.</p>
<p>Look back at our imperfect use case: intermediate serialization during a Cascading flow. When would the Thrift schema ever change during a single job? The answer is that it wouldn&#8217;t. This means that the extra bytes that the CompactProtocol spends to support changing schemas are 100% waste. It would be great if we didn&#8217;t have to pay the cost of a feature we don&#8217;t use.</p>
<p>This is exactly what the TupleProtocol does. Instead of being a general-purpose protocol, TupleProtocol is a purpose-built protocol designed specifically for the case when you know beyond the shadow of a doubt that the schema for a record cannot change. With this precondition, it can avoid writing out the type markers and field IDs and just use the metadata implicit in the code itself to guide deserialization. For instance, it knows that what&#8217;s coming up next is going to be a string field called &#8220;foo&#8221; because it can assume the writer of the data wrote it exactly to that spec. By dumping this extra overhead, users of the TupleProtocol can see a significant decrease in the size of serialized objects. One sample job we tested saw about a 5% size decrease. The actual savings you see will vary a lot depending on what kind of data you have in your structs, but in general, the more non-string fields you have, the more benefit you&#8217;ll see. For us, just based on the 5% figure, it amounts to tens of gigabytes less shuffle in our biggest jobs.</p>
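<p>To make the overhead concrete, here is a minimal sketch in plain Java. It does not use the Thrift library, and the byte layouts are simplified stand-ins, not the real CompactProtocol or TupleProtocol wire formats. It writes the same two-field record twice: once with a type marker and field ID in front of every field plus a stop marker, and once with just the values in a fixed order. The class, method, and constant names are hypothetical.</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical illustration only: simplified layouts, not Thrift's wire formats.
class EncodingSketch {
  static final byte TYPE_I32 = 1;
  static final byte TYPE_I64 = 2;
  static final byte STOP = 0;

  // Self-describing: every field carries a type marker and a field ID,
  // so a reader with a different schema could skip fields it does not know.
  static byte[] writeSelfDescribing(int age, long userId) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeByte(TYPE_I32);
      out.writeByte(1);      // field ID 1
      out.writeInt(age);
      out.writeByte(TYPE_I64);
      out.writeByte(2);      // field ID 2
      out.writeLong(userId);
      out.writeByte(STOP);   // end-of-struct marker
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      // An in-memory stream should never fail; surface it unchecked if it does.
      throw new IllegalStateException(e);
    }
  }

  // Tuple-style: writer and reader agree on the schema up front,
  // so only the values are written, in declaration order.
  static byte[] writeTupleStyle(int age, long userId) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(age);
      out.writeLong(userId);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Prints 17 bytes for the self-describing form and 12 for the tuple form.
    System.out.println("self-describing: " + writeSelfDescribing(30, 42L).length + " bytes");
    System.out.println("tuple-style:     " + writeTupleStyle(30, 42L).length + " bytes");
  }
}
```

<p>In this toy layout the markers cost a fixed number of bytes per field, so they weigh most heavily when the values themselves are small, which fits the observation that structs with many non-string fields benefit most.</p>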
<p>The TupleProtocol is currently committed to Thrift TRUNK and will be released as part of Thrift 0.8. If you&#8217;d like to try it out now, we&#8217;d love to hear your feedback!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2011/10/17/more-compact-than-compactprotocol-tupleprotocol/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Google Calendar + Arduino = The Roominator</title>
		<link>http://blog.rapleaf.com/dev/2011/08/01/google-calendar-arduino-the-roominator/</link>
		<comments>http://blog.rapleaf.com/dev/2011/08/01/google-calendar-arduino-the-roominator/#comments</comments>
		<pubDate>Mon, 01 Aug 2011 23:40:01 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=3383</guid>
		<description><![CDATA[Last week was our quarterly Hackleaf, and this time around a handful of us set out to solve a slightly more physical problem than we usually tackle: conference room abuse. We have nine conference rooms in our office these days, and even though we regularly use Google Calendar to schedule meetings, we still struggle with [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignnone" src="http://distillery.s3.amazonaws.com/media/2011/07/28/481918b1e30e4dec804732da7848b407_7.jpg" alt="" width="612" height="612" /></p>
<p>Last week was our quarterly Hackleaf, and this time around a handful of us set out to solve a slightly more <em>physical</em> problem than we usually tackle: conference room abuse. We have nine conference rooms in our office these days, and even though we regularly use Google Calendar to schedule meetings, we still struggle with rooms being hijacked.</p>
<p>It&#8217;s almost always an accident &#8211; two people start chatting, one inevitably says &#8220;Let&#8217;s grab a room,&#8221; and then they just walk around the office until they happen upon an unoccupied room.  Unfortunately, it&#8217;s often the case that while the room is currently unoccupied, another meeting is scheduled to start very soon. In a few minutes, the rightful occupants show up and discover that someone is already using their room! If the squatters are just using the whiteboard, it&#8217;s easy to boot them out, but if they&#8217;re taking a phone call, the people who actually reserved the room have to find a new one. The ability to do ad-hoc meetings is something we want to keep. We just want to help room-grabbers know when it is OK to grab and when it isn&#8217;t.</p>
<p>Enter the Roominator, an open-source system of hardware and software that helps facilitate informed ad-hoc reservations.</p>
<p>The hardware consists of two parts: a display unit that&#8217;s posted outside of each conference room, and a controller unit that&#8217;s stashed in our wiring closet. The display unit shows the current and upcoming reservations and an LED status indicator that can tell you from a distance whether a room is &#8220;good to grab&#8221;. It also has a pair of buttons &#8211; one to make an ad-hoc reservation and one to cancel the current reservation. The controller unit interfaces with all the displays to distribute power and data, both of which run over a single standard Cat5e cable. Both the controller and the displays are <a title="Arduino Homepage" href="http://www.arduino.cc">Arduino</a>-based.</p>
<p>The software component is a Rails web site that allows for configuration and integrates with Google Calendar. Reservations made via Google Calendar are sync&#8217;d with the Roominator, and vice-versa. The controller unit polls the web site for the information it should pass to the displays.</p>
<p>This project was a lot of fun for all involved, and we definitely went outside of our comfort zone with this one. There&#8217;s a lot of polishing to do in order to get the UI just right &#8211; and we still have to manufacture another seven display units by hand! (No copy/paste in the real world, after all.)</p>
<p>Want your own Roominator setup? All the <a title="Roominator on GitHub" href="https://github.com/bryanduxbury/roominator">source code and schematics</a> are on GitHub. If you decide to go down the road of building your own units, please get in touch with us so we can collaborate!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2011/08/01/google-calendar-arduino-the-roominator/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Java Performance: synchronized() vs Lock</title>
		<link>http://blog.rapleaf.com/dev/2011/06/16/java-performance-synchronized-vs-lock/</link>
		<comments>http://blog.rapleaf.com/dev/2011/06/16/java-performance-synchronized-vs-lock/#comments</comments>
		<pubDate>Thu, 16 Jun 2011 23:29:54 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=3163</guid>
		<description><![CDATA[Yesterday, I noticed that one of our systems was using a Lock where a plain old synchronized() block would suffice, and I thought to myself, does this matter? Since the Lock was already fulfilling the same role, the only real question was performance. My gut told me that there should be a performance difference between [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday, I noticed that one of our systems was using a <a href="http://download.oracle.com/javase/6/docs/api/java/util/concurrent/locks/Lock.html">Lock</a> where a plain old synchronized() block would suffice, and I thought to myself, does this matter? Since the Lock was already fulfilling the same role, the only real question was performance.</p>
<p>My gut told me that there <em>should</em> be a performance difference between a built-in language construct and a library, but experience has taught me not to guess when it comes to performance. A quick Google search led to a lot of posts deflecting the main question with cries of &#8220;premature optimization!&#8221;, and thus did not help me at all.</p>
<p>I ended up writing a micro-benchmark program (<a href="https://github.com/bryanduxbury/sync_vs_lock">code on GitHub</a>) that exercises Lock and synchronized() equally to measure total throughput. I measured two different situations, single-threaded and two-threaded, because of <a href="http://www.mailinator.com/tymaPaulMultithreaded.pdf">previous reading</a> that indicated the JVM was pretty good at optimizing the uncontended case.</p>
<p>The results were clear and fairly unsurprising: synchronized() is substantially faster. In the single-threaded test, synchronized() was about 7.5x faster on average than Lock.lock(). In the two-threaded test, synchronized() was still the clear winner, about 2x faster on average.</p>
<p>Bottom line: if you can use synchronized() instead of Lock, then you definitely should use synchronized().</p>
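<p>If you want to try a comparison like this yourself, a stripped-down, single-threaded version might look like the sketch below. It is not the benchmark linked above: the class name, iteration count, and structure are assumptions. A loop like this is also at the mercy of JIT warm-up and lock optimizations, so treat whatever it prints as a rough indication that will vary with JVM version and hardware.</p>

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical single-threaded sketch; not the benchmark from the GitHub repo.
class SyncVsLockSketch {
  private final Object monitor = new Object();
  private final Lock lock = new ReentrantLock();
  private long counter = 0;

  // Increment a counter under a synchronized block.
  long runSynchronized(int iterations) {
    counter = 0;
    for (int i = 0; i < iterations; i++) {
      synchronized (monitor) {
        counter++;
      }
    }
    return counter;
  }

  // Do the same work under an explicit Lock.
  long runLock(int iterations) {
    counter = 0;
    for (int i = 0; i < iterations; i++) {
      lock.lock();
      try {
        counter++;
      } finally {
        lock.unlock();
      }
    }
    return counter;
  }

  public static void main(String[] args) {
    SyncVsLockSketch bench = new SyncVsLockSketch();
    int iterations = 10_000_000;

    // Warm-up pass so both paths are compiled before timing.
    bench.runSynchronized(iterations);
    bench.runLock(iterations);

    long start = System.nanoTime();
    bench.runSynchronized(iterations);
    long syncNanos = System.nanoTime() - start;

    start = System.nanoTime();
    bench.runLock(iterations);
    long lockNanos = System.nanoTime() - start;

    System.out.println("synchronized: " + syncNanos / 1_000_000 + " ms");
    System.out.println("Lock:         " + lockNanos / 1_000_000 + " ms");
  }
}
```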
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2011/06/16/java-performance-synchronized-vs-lock/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Lightweight Trie</title>
		<link>http://blog.rapleaf.com/dev/2011/04/12/lightweight-trie/</link>
		<comments>http://blog.rapleaf.com/dev/2011/04/12/lightweight-trie/#comments</comments>
		<pubDate>Tue, 12 Apr 2011 22:01:30 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=2662</guid>
		<description><![CDATA[One of the most interesting things that we do at Rapleaf is use our existing data to deduce or infer new data. For example, a person&#8217;s name is often highly correlated with a specific gender. After doing lots and lots of regression, we usually end up with a simple HashMap loaded from a file that [...]]]></description>
			<content:encoded><![CDATA[<p>One of the most interesting things that we do at Rapleaf is use our existing data to deduce or infer new data. For example, <a title="Introducing the Utilities API" href="http://blog.rapleaf.com/dev/2011/01/25/introducing-the-utilities-api/">a person&#8217;s name is often highly correlated with a specific gender</a>. After doing lots and lots of regression, we usually end up with a simple HashMap loaded from a file that our code uses to actually make the deduction at runtime. The map is small enough to fit in memory, so access is very fast.</p>
<p>However, memory is a scarce resource, particularly in the <a href="http://hadoop.apache.org">Hadoop</a> context. If we don&#8217;t manage it carefully, a misbehaving task can lead to our entire cluster swapping or crashing. Ideally, we&#8217;d like to get the most bang for our memory buck, even if it costs us a little CPU time, since we can bear a slowdown if it avoids a catastrophic failure.</p>
<p>Let&#8217;s back up for a second and think about the kind of data we want to store. For example, look at the &#8220;name to gender&#8221; mapping. In essence, you have a name string and a gender enum, like this:</p>
<table border="1">
<tbody>
<tr>
<td>bryn</td>
<td>female</td>
</tr>
<tr>
<td>bryan</td>
<td>male</td>
</tr>
<tr>
<td>bryant</td>
<td>male</td>
</tr>
<tr>
<td>chris</td>
<td>unknown</td>
</tr>
<tr>
<td>christine</td>
<td>female</td>
</tr>
</tbody>
</table>
<p>The critical thing to notice is that a lot of the names in the map share common prefixes. It&#8217;d be great if we could avoid storing that duplicate data in-memory when we&#8217;re running.</p>
<p>It turns out that there&#8217;s a great data structure for this purpose called a <a href="http://en.wikipedia.org/wiki/Radix_tree">radix tree</a>, which is an optimized form of a <a href="http://en.wikipedia.org/wiki/Trie">trie</a>. This structure compresses common prefixes together so that they don&#8217;t have to be stored, and has performance comparable to a hash. For things like English words or names, you can often get a huge amount of savings from using a radix tree.</p>
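<p>To show the prefix sharing at work, here is a minimal sketch of a plain, uncompressed trie holding the names from the table above. It is an illustration under simplifying assumptions, not a radix tree and not a memory-efficient design: keys are limited to lowercase a-z, every node carries a 26-slot child array, and all names in the code are hypothetical. A radix tree would go one step further and merge chains of single-child nodes into a single edge.</p>

```java
// Hypothetical sketch of a plain trie; shows prefix sharing only.
class TrieSketch {
  enum Gender { MALE, FEMALE, UNKNOWN }

  private static final class Node {
    final Node[] children = new Node[26]; // one slot per letter a-z
    Gender value;                          // null when no key ends here
  }

  private final Node root = new Node();
  private int nodeCount = 1; // counts the root

  private static int slotFor(char c) {
    if (c < 'a' || c > 'z') {
      throw new IllegalArgumentException("this sketch only supports a-z: " + c);
    }
    return c - 'a';
  }

  void put(String key, Gender value) {
    Node node = root;
    for (int i = 0; i < key.length(); i++) {
      int slot = slotFor(key.charAt(i));
      if (node.children[slot] == null) {
        node.children[slot] = new Node(); // only new suffixes allocate nodes
        nodeCount++;
      }
      node = node.children[slot];
    }
    node.value = value;
  }

  Gender get(String key) {
    Node node = root;
    for (int i = 0; i < key.length(); i++) {
      node = node.children[slotFor(key.charAt(i))];
      if (node == null) {
        return null; // no key with this prefix
      }
    }
    return node.value;
  }

  int nodeCount() {
    return nodeCount;
  }

  public static void main(String[] args) {
    TrieSketch names = new TrieSketch();
    names.put("bryn", Gender.FEMALE);
    names.put("bryan", Gender.MALE);
    names.put("bryant", Gender.MALE);
    names.put("chris", Gender.UNKNOWN);
    names.put("christine", Gender.FEMALE);
    System.out.println(names.get("bryant"));  // MALE
    System.out.println(names.nodeCount());    // 17
  }
}
```

<p>The five keys contain 29 characters in total, but the trie needs only 16 letter nodes (plus the root) because &#8220;bry&#8221; and &#8220;chris&#8221; are each stored once.</p>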
<p>Surprisingly, we had trouble locating a memory-efficient implementation of a String-keyed radix tree map out on the Internet, so we decided to make one ourselves. You can find <a href="https://github.com/bryanduxbury/lightweight_trie">the project on GitHub</a>, complete with tests. The basics (get, put) work pretty well, but there aren&#8217;t a lot of frills right now, so we&#8217;d love to see contributions in that department. On our test data set, we found that the ImmutableStringRadixTreeMap shaved about 44% off the overhead of HashMap! That&#8217;s great savings.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2011/04/12/lightweight-trie/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Bringing Ruby&#8217;s ActiveRecord to Java</title>
		<link>http://blog.rapleaf.com/dev/2011/03/28/bringing-rubys-activerecord-to-java/</link>
		<comments>http://blog.rapleaf.com/dev/2011/03/28/bringing-rubys-activerecord-to-java/#comments</comments>
		<pubDate>Mon, 28 Mar 2011 20:42:05 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=2442</guid>
		<description><![CDATA[Rapleaf started out using Ruby and Rails extensively to build out our systems. We loved the flexibility that it gave us to quickly put together a functional application. Tools like ActiveRecord are huge productivity boosters, saving us the trouble of hand-coding database interaction and letting us focus directly on our application. However, we evolved to [...]]]></description>
			<content:encoded><![CDATA[<p>Rapleaf started out using Ruby and <a href="http://rubyonrails.org/">Rails</a> extensively to build out our systems. We loved the flexibility that it gave us to quickly put together a functional application. Tools like <a href="http://ar.rubyonrails.org/">ActiveRecord</a> are huge productivity boosters, saving us the trouble of hand-coding database interaction and letting us focus directly on our application.</p>
<p>However, we evolved to need something more than Ruby. While we continue to use Rails for our website and our internal administration site, we&#8217;ve moved to using Java (and <a href="http://hadoop.apache.org/">Hadoop</a>/<a href="http://www.cascading.org/">Cascading</a>) to power our processing pipeline. Oftentimes, our Java applications need to access the same databases as our Ruby sites.</p>
<p>This seems like a simple problem, but we&#8217;ve encountered a couple of issues. Firstly, although Java has a number of database abstraction packages, we didn&#8217;t find them quite up to the level of ActiveRecord in terms of ease of use or flexibility. Packages like <a href="http://www.hibernate.org/">Hibernate</a> are cumbersome and complex compared to ActiveRecord, and that&#8217;s led to our developers using them inconsistently or avoiding frameworks altogether.</p>
<p>Framework aside, there&#8217;s still an even more significant problem &#8211; namely, how do we scalably share our growing schema across two separate languages/frameworks? Between our two main production databases, we have over 220 tables. At that number, manually maintaining more than one version of our data definition for different frameworks is tedious and error-prone. In practice, they won&#8217;t be synchronized correctly, leading to confusion at deployment time when problems are finally uncovered.</p>
<p>Our initial solution was to limp along with carefully hand-coded Java models that mirrored their Ruby counterparts. This was a pretty poor solution, as it meant that only the minimum set of models required for the current application would get built, and features were lacking. We finally decided that we&#8217;d had enough. What if we could establish a parallel framework in Java that leveraged the work we already do in ActiveRecord?</p>
<p>It turns out that this is easier than it sounds. Over the course of the last month, we&#8217;ve built Jack, a set of Ruby scripts and a concordant library of Java classes that together allow you to convert your Ruby ActiveRecord migrations and models into automatically generated Java classes. The general idea is that we parse schema.rb for the table schema and each of the models for their associations and other configuration elements. Once we&#8217;ve got it all parsed up, it&#8217;s trivial to generate brand-new Java classes.</p>
<p>We&#8217;ve decided to open source this project in hopes that others can benefit from and contribute to the project. You can find <a href="https://github.com/bryanduxbury/jack">the code and some documentation</a> on GitHub. We look forward to hearing your thoughts!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2011/03/28/bringing-rubys-activerecord-to-java/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Announcing Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store</title>
		<link>http://blog.rapleaf.com/dev/2011/03/15/announcing-hank-a-fast-open-source-batch-updatable-distributed-key-value-store/</link>
		<comments>http://blog.rapleaf.com/dev/2011/03/15/announcing-hank-a-fast-open-source-batch-updatable-distributed-key-value-store/#comments</comments>
		<pubDate>Tue, 15 Mar 2011 22:30:02 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=1952</guid>
		<description><![CDATA[We&#8217;re really excited to announce the open-source debut of a cool piece of Rapleaf&#8217;s internal infrastructure, a distributed database project we call Hank. Our use case is very particular: we have tons of data that needs to get processed, producing a lot of data points for individual people, which then need to be made randomly [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re really excited to announce the open-source debut of a cool piece of Rapleaf&#8217;s internal infrastructure, a distributed database project we call Hank.</p>
<p>Our use case is very particular: we have tons of data that needs to get processed, producing a lot of data points for individual people, which then need to be made randomly accessible so they can be served through <a href="https://www.rapleaf.com/developers">our API</a>. You can think of it as the &#8220;process and publish&#8221; pattern.</p>
<p>For the processing component, <a href="http://hadoop.apache.org/">Hadoop</a> and <a href="http://www.cascading.org/">Cascading</a> were an obvious choice. However, making our results randomly accessible for the API was more challenging. We couldn&#8217;t find an existing solution that was fast, scalable, and perhaps most importantly, wouldn&#8217;t degrade performance during updates. Our API needs to have lightning-fast responses so that our customers can use it in realtime to personalize their users&#8217; experiences, and it&#8217;s just not acceptable for us to have periods where reads contend with writes while we&#8217;re updating.</p>
<p>We boiled this all down to the following key requirements:</p>
<ol>
<li>Random reads need to be fast &#8211; reliably on the order of a few milliseconds.</li>
<li>Datastores need to scale to terabytes, with keys and values on the order of kilobytes.</li>
<li>We need to be able to push out hundreds of millions of updates a day, but they don&#8217;t have to happen in realtime. Most will come from our Hadoop cluster.</li>
<li>Read performance should not suffer while updates are in progress.</li>
</ol>
<p>Additionally, we identified a few non-requirements:</p>
<ol>
<li>During the update process, it doesn&#8217;t matter if there is more than one version of our datastores available. Our application is tolerant of this inconsistency.</li>
<li>We have no need for random writes.</li>
</ol>
<p>The system we came up with is tailored to meet these needs. It consists of a fast, read-only data server backed by a custom-designed batch-updatable file format, a set of tools for writing these files from Hadoop, and a special daemon process that manages the deploy of data from the Hadoop cluster to the actual server machines. Clients of Hank are aware of ongoing updates and avoid connecting to servers that are busy. When the time comes to push out a new version of our data, the data deployer allows only a fraction of the data servers to perform an update at a time, making sure that sufficient data serving capacity remains online.</p>
<p>There&#8217;s a more detailed look at the <a href="https://docs.google.com/document/d/1enJjeleYlJZWETiceJUGGotzG_5DbysTRytNHi8Tpi0/edit?hl=en&amp;authkey=CIeVw8EE">architecture and infrastructure of the project</a>, and you can find <a href="https://github.com/bryanduxbury/hank">the code on GitHub</a>, which is shared under the Apache Software License. This codebase is still a work in progress &#8211; our older, internal version was in need of a serious refactor &#8211; but most of the necessary pieces are there, and we&#8217;re going to finish the development in the open. We&#8217;d love to hear your thoughts on the project and would doubly love to get your contributions, whatever form they might take.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2011/03/15/announcing-hank-a-fast-open-source-batch-updatable-distributed-key-value-store/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Memory-efficient sparse bitsets</title>
		<link>http://blog.rapleaf.com/dev/2010/12/17/memory-efficient-sparse-bitsets/</link>
		<comments>http://blog.rapleaf.com/dev/2010/12/17/memory-efficient-sparse-bitsets/#comments</comments>
		<pubDate>Fri, 17 Dec 2010 19:37:34 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=971</guid>
		<description><![CDATA[A bitset is a data structure designed to store a vector of boolean values very compactly &#8211; one bit per value. In practice, they&#8217;re a really handy way to save memory. However, we had a situation in one of our extremely memory-intensive applications where a simple bitset wouldn&#8217;t cut it. We have over 2500 variables [...]]]></description>
			<content:encoded><![CDATA[<p>A <a href="http://en.wikipedia.org/wiki/Bit_array">bitset</a> is a data structure designed to store a vector of boolean values very compactly &#8211; one bit per value. In practice, they&#8217;re a really handy way to save memory. However, we had a situation in one of our extremely memory-intensive applications where a simple bitset wouldn&#8217;t cut it. We have over 2500 variables to store bits on, meaning that our bitsets took up over 300 bytes each!</p>
<p>If most or all of the bits were usually set, this would be an unavoidable issue, but in our case, the set bits were very rare &#8211; usually no more than 20 or 30. This leads to a large amount of wasted space. The traditional approach to sparse sets like these is to just store the position number of the set variables directly in a collection. To allow for the full 2500 position numbers, we needed a short int, meaning that with this approach, the memory size of the collection is 2 bytes times the number of elements. A sparse set of 30 elements will take a lot less memory than the equivalent bitset (60 bytes versus 313), but in our application, it&#8217;s still too much.</p>
<p>What if we could combine the tight-packing benefits of a bitset with the low impact of the sparse set? It turns out that we can. In our application, some of the variables are set pretty frequently (on about 1 in 4 records) and some are set very rarely (on about 1 in 10,000), with a gradient of various frequencies between. We can exploit this gradient to make really efficient storage decisions. Here&#8217;s the graph of frequencies over our set:</p>
<p><a href="http://blog.rapleaf.com/dev/wp-content/uploads/2010/12/freq_ordered.png"><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/12/freq_ordered.png" alt="" width="479" height="338" class="aligncenter size-full wp-image-1021" /></a></p>
<p>It comes down to this: in a bitset, each possible variable in your set costs you one bit, whether or not it&#8217;s set. In a sparse set, with our example of 2500 variables, each variable costs 16 bits, but only when it&#8217;s set. Comparing these two costs gives a clear tradeoff point. When a given variable is set on at least one of every 16 records, you should store it in a bitset; when it is set on one out of every 17 or more records, then you should store it in the sparse set.</p>
<p>To build this type of set in practice, you just need to get your variables into a list ordered by frequency of occurrence and then apply the cutoff. Everything above should be managed through a bitset, and everything below should be managed by a collection. We found it handy to wrap both of these sets up in a single class so that the user can be indifferent to where the data is stored.</p>
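<p>As a rough illustration of that wrapper class, here is a minimal sketch. It assumes variables have already been renumbered by descending frequency, so every ID below the cutoff is a &#8220;frequent&#8221; variable; the class name, the cutoff of 200 used in the example, and the size accounting are all hypothetical and not taken from our production code.</p>

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch of a hybrid set: frequent variables live in a bitset,
// rare ones in a sorted array of 16-bit IDs.
class HybridBitSetSketch {
  private final int cutoff;               // IDs below this go in the bitset
  private final BitSet dense;
  private short[] sparse = new short[0];  // sorted IDs of rare variables that are set

  HybridBitSetSketch(int cutoff) {
    this.cutoff = cutoff;
    this.dense = new BitSet(cutoff);
  }

  private static void check(int variable) {
    if (variable < 0 || variable > Short.MAX_VALUE) {
      throw new IllegalArgumentException("variable out of range: " + variable);
    }
  }

  void set(int variable) {
    check(variable);
    if (variable < cutoff) {
      dense.set(variable);
      return;
    }
    short id = (short) variable;
    int pos = Arrays.binarySearch(sparse, id);
    if (pos >= 0) {
      return; // already present
    }
    int insertAt = -(pos + 1);
    // Grow by one and keep the array sorted so lookups can binary-search.
    short[] grown = new short[sparse.length + 1];
    System.arraycopy(sparse, 0, grown, 0, insertAt);
    grown[insertAt] = id;
    System.arraycopy(sparse, insertAt, grown, insertAt + 1, sparse.length - insertAt);
    sparse = grown;
  }

  boolean get(int variable) {
    check(variable);
    if (variable < cutoff) {
      return dense.get(variable);
    }
    return Arrays.binarySearch(sparse, (short) variable) >= 0;
  }

  // Idealized payload size: one bit per dense slot plus two bytes per sparse
  // entry. Ignores JVM object headers and BitSet's word-sized allocation.
  int payloadBytes() {
    return (cutoff + 7) / 8 + 2 * sparse.length;
  }

  public static void main(String[] args) {
    HybridBitSetSketch record = new HybridBitSetSketch(200);
    record.set(3);     // frequent variable, stored as a bit
    record.set(150);   // frequent variable, stored as a bit
    record.set(1200);  // rare variable, stored as a short
    record.set(2499);  // rare variable, stored as a short
    System.out.println(record.get(1200));        // true
    System.out.println(record.get(1201));        // false
    System.out.println(record.payloadBytes());   // 29
  }
}
```

<p>Callers only see set() and get(), so they never need to know which side of the cutoff a variable falls on. Whether the hybrid wins over a pure sparse set for a particular record depends on where the cutoff sits; the one-in-16 rule above is what keeps the dense part limited to variables that are cheaper as a bit on average.</p>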
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2010/12/17/memory-efficient-sparse-bitsets/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Striving for zero copies with Thrift 0.5</title>
		<link>http://blog.rapleaf.com/dev/2010/10/19/striving-for-zero-copies-with-thrift-0-5/</link>
		<comments>http://blog.rapleaf.com/dev/2010/10/19/striving-for-zero-copies-with-thrift-0-5/#comments</comments>
		<pubDate>Tue, 19 Oct 2010 16:49:39 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=898</guid>
		<description><![CDATA[&#8220;Zero copies&#8221; is a common optimization principle used in high-performance applications. The gist of the technique is to have the smallest number of byte array copies necessary for the server to perform its task. Byte array copies are one of those insidious time-wasters that are hard to understand or even detect until you start looking [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Zero-copy">&#8220;Zero copies&#8221;</a> is a common optimization principle used in high-performance applications. The gist of the technique is to have the smallest number of byte array copies necessary for the server to perform its task. Byte array copies are one of those insidious time-wasters that are hard to understand or even detect until you start looking for them. It seems intuitive to use a perfectly sized byte array for everything you do: it&#8217;s straightforward, it reduces the number of arguments you have to pass to each method, and most of all, it&#8217;s simple. However, you actually pay a steep price every time you copy a byte array &#8211; the CPU is spinning away shuffling bytes from one memory location to another. It&#8217;s actually even worse in Java: every time you create a new byte[], you&#8217;re both allocating memory <em>and</em> looping over it to zero each position out. This means you pay a price now as you iterate over every position and a price later when you ultimately have to garbage collect the new byte[] you threw away. An ideal server would <em>never</em> copy a byte[] unnecessarily, preferring to reuse the same one over and over again.</p>
<p>Before Thrift 0.4, no matter how much you might want to, there was no way to avoid doing an extra byte[] copy for each binary field that you deserialized, despite the fact that virtually all deserialization happens directly from an in-memory byte[] buffer. Thrift 0.4 <a href="https://issues.apache.org/jira/browse/THRIFT-830">changed that</a> by switching the underlying type of binary fields from byte[] to the Java NIO construct <a href="http://download.oracle.com/javase/6/docs/api/java/nio/ByteBuffer.html">ByteBuffer</a>; Thrift 0.5 <a href="https://issues.apache.org/jira/browse/THRIFT-894">elaborated on this theme</a> by making it easier to get the byte[]s that everyone expected while still offering access to the ByteBuffer for more advanced operations.</p>
<p>So how do you actually use this feature to speed up your servers? Let&#8217;s take a look at a pair of examples. In both examples, we&#8217;ll use the following Thrift file as our base:</p>
<p><code><br />
struct A {<br />
  1: required binary foo;<br />
}</p>
<p>service SomeService {<br />
  A read();<br />
  void logFoo(1: A a);<br />
}<br />
</code></p>
<p><strong>The logFoo method</strong></p>
<p>Let&#8217;s pretend that your objective is to log the contents of the foo field to some stream. Here&#8217;s how you might do it naively:</p>
<p><code><br />
private DataOutputStream out;</p>
<p>public void logFoo(A a) throws TException {<br />
  byte[] value = a.getFoo();<br />
  out.writeInt(value.length);<br />
  out.write(value);<br />
}<br />
</code></p>
<p>Seems simple enough. So what&#8217;s wrong here? The problem is that calling getFoo() causes a byte[] copy. It&#8217;s hidden from you by the method, but it&#8217;s happening nonetheless. The copy you create is only used for an instant, becoming garbage after you pass it to write(), and then the entire A object becomes garbage.</p>
<p>Here&#8217;s the right way to do it:</p>
<p><code><br />
private DataOutputStream out;</p>
<p>public void logFoo(A a) throws TException {<br />
  ByteBuffer value = a.bufferForFoo();<br />
  out.writeInt(value.remaining());<br />
  out.write(value.array(), value.arrayOffset() + value.position(), value.remaining());<br />
}<br />
</code></p>
<p>There&#8217;s a lot here, so let&#8217;s break it down. First, notice that we called &#8220;bufferForFoo&#8221; instead of &#8220;getFoo&#8221;. This returns a ByteBuffer instead of a byte[]. Then, we use the remaining() method to get the number of bytes in the buffer that belong to this value. Finally, we go to the write() call, but this time using the &#8220;array, offset, length&#8221; version of write. This allows us to reference a subarray directly from the array that backs <em>value</em> without any intermediate copying. There&#8217;s some trickiness that goes into understanding why the first element in the backing byte array is arrayOffset() + position(), but for right now, trust me that it&#8217;s the case.</p>
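<p>To make the arrayOffset() + position() arithmetic a little less mysterious, here&#8217;s a minimal standalone sketch that uses only java.nio. The class name BackingArrayDemo, the helper viewAsString, and the sample bytes are made up for illustration and aren&#8217;t part of Thrift. It builds two buffers over the same array, one with wrap() and one with slice(), and reads the same bytes through both:</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Thrift.
final class BackingArrayDemo {
  // Decodes the bytes a buffer currently exposes, reading them straight
  // from the backing array without copying them into a new byte[] first.
  static String viewAsString(ByteBuffer buf) {
    int start = buf.arrayOffset() + buf.position();
    return new String(buf.array(), start, buf.remaining(), StandardCharsets.US_ASCII);
  }

  public static void main(String[] args) {
    byte[] shared = "headerPAYLOADtrailer".getBytes(StandardCharsets.US_ASCII);

    // wrap(array, offset, length): arrayOffset() is 0, position() is 6.
    ByteBuffer wrapped = ByteBuffer.wrap(shared, 6, 7);

    // slice(): position() is 0, but arrayOffset() is now 6.
    ByteBuffer sliced = wrapped.slice();

    System.out.println(viewAsString(wrapped)); // PAYLOAD
    System.out.println(viewAsString(sliced));  // PAYLOAD
  }
}
```

<p>In the wrapped buffer the offset lives in position(); in the sliced buffer it lives in arrayOffset(). Adding the two covers both cases, which is why the write() call above uses the sum.</p>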
<p>It&#8217;s a small difference with a bit more code, but depending on the size of foo, you could see a substantial boost in performance.</p>
<p><strong>The read method</strong></p>
<p>Now let&#8217;s look at things from the other side of the equation. Let&#8217;s say that the objective of the read() method is to read the bytes of foo from an input stream and return them wrapped in an instance of A. Here&#8217;s what the naive approach might look like:</p>
<p><code><br />
public A read() throws TException {<br />
  // assume that "in" is a DataInputStream<br />
  int fooLength = in.readInt();<br />
  byte[] value = new byte[fooLength];<br />
  in.readFully(value);<br />
  A result = new A();<br />
  result.setFoo(value);<br />
  return result;<br />
}<br />
</code></p>
<p>The problem with this method is that every call leads to a new short-lived instance of A and a brand new perfect-sized byte[]. Both of these will become garbage very soon, and allocating the new byte[] every time is a drag on your CPU.</p>
<p>Let&#8217;s focus on how we can reuse the byte[] for now and think about the A instance some other time. There are many possible strategies for caching your buffers, but here&#8217;s a simple one:</p>
<p><code><br />
private final ThreadLocal&lt;byte[]&gt; bufferCache;</p>
<p>public A read() throws TException {<br />
  int fooLength = in.readInt();<br />
  byte[] value = bufferCache.get();<br />
  if (value == null || value.length &lt; fooLength) {<br />
    value = new byte[fooLength];<br />
    bufferCache.set(value);<br />
  }<br />
  in.readFully(value, 0, fooLength);<br />
  A result = new A();<br />
  result.setFoo(ByteBuffer.wrap(value, 0, fooLength));<br />
  return result;<br />
}<br />
</code></p>
<p>There&#8217;s a good bit more to this version. First, note that we&#8217;re using Java&#8217;s ThreadLocal capability to support us keeping a single byte[] per active thread. This makes sure that each thread servicing a client won&#8217;t interfere with any other, and there&#8217;s no contention (synchronization) for the thread-local buffer. Next, after we figure out how much we need to read, we make a point of checking if we have enough buffer space to complete the read. If not, we replace our buffer with a new, bigger one. Then we complete the read into the buffer &#8211; this time specifying the length we want to read, rather than letting the length of the buffer imply the size of the read. This ensures that on subsequent reads, when fooLength is less than value.length, we don&#8217;t try to read more than we wanted to. Finally, instead of passing the entire buffer into the foo field, we pass in a ByteBuffer that wraps just the portion that contains the value we read this time. </p>
<p>By using this technique, we&#8217;ve avoided one copy per call plus an unknown number of byte[] allocations. If your record sizes vary a lot, it may take a while for the buffer to grow large enough to hold the biggest one, but after that, you won&#8217;t need any more allocations. If your records are fixed size, then you should reach that point immediately.</p>
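<p>If you want to try the buffer reuse pattern without standing up a Thrift service, here&#8217;s a self-contained sketch under a few assumptions: ReusableReader and readRecord are hypothetical names, the input is a hand-built length-prefixed byte array, and the thread-local starts out as an empty array so there&#8217;s no null to check for.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical stand-in for the read() method above; no Thrift required.
final class ReusableReader {
  // One scratch array per thread. It starts empty, so the first read allocates.
  private static final ThreadLocal<byte[]> SCRATCH = new ThreadLocal<byte[]>() {
    @Override
    protected byte[] initialValue() {
      return new byte[0];
    }
  };

  // Reads one length-prefixed record into this thread's scratch array and
  // returns a ByteBuffer covering only the bytes that were just read.
  static ByteBuffer readRecord(DataInputStream in) throws IOException {
    int length = in.readInt();
    byte[] buf = SCRATCH.get();
    if (buf.length < length) {
      buf = new byte[length]; // grow only when the record doesn't fit
      SCRATCH.set(buf);
    }
    in.readFully(buf, 0, length);
    return ByteBuffer.wrap(buf, 0, length);
  }

  public static void main(String[] args) throws IOException {
    // Two records: a 3-byte one followed by a 2-byte one.
    byte[] stream = {0, 0, 0, 3, 'a', 'b', 'c', 0, 0, 0, 2, 'x', 'y'};
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(stream));

    byte[] firstBacking = readRecord(in).array();
    ByteBuffer second = readRecord(in);

    System.out.println(second.remaining());               // 2
    System.out.println(firstBacking == second.array());   // true: no new allocation
  }
}
```

<p>One caveat worth keeping in mind with this sketch: the returned ByteBuffer shares the thread&#8217;s scratch array, so its contents are only valid until the next read on the same thread.</p>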
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2010/10/19/striving-for-zero-copies-with-thrift-0-5/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Analyzing some interesting networks for Map/Reduce clusters</title>
		<link>http://blog.rapleaf.com/dev/2010/08/26/analyzing-some-interesting-networks-for-mapreduce-clusters/</link>
		<comments>http://blog.rapleaf.com/dev/2010/08/26/analyzing-some-interesting-networks-for-mapreduce-clusters/#comments</comments>
		<pubDate>Thu, 26 Aug 2010 17:20:56 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Miscellaneous]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=775</guid>
		<description><![CDATA[In a previous post, I described how Rapleaf had built a conceptual model for determining the average aggregate peak throughput (which we&#8217;ll call T) that a given network architecture could support. This post applies that model to a variety of network topologies you might consider for your cluster. Just as a brief refresher, T represents [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://blog.rapleaf.com/dev/2010/08/24/analyzing-network-load-in-mapreduce/">previous post</a>, I described how Rapleaf had built a conceptual model for determining the average aggregate peak throughput (which we&#8217;ll call T) that a given network architecture could support. This post applies that model to a variety of network topologies you might consider for your cluster.</p>
<p>Just as a brief refresher, T represents the max throughput that the shuffle operation will be able to demand during its peak, and we compute it by determining which component of the network will saturate first.</p>
<p><strong>The &#8220;triangle&#8221;</strong><br />
When Rapleaf had only 120 machines in three racks, our network looked like this:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/original_network.png" alt="" width="293" height="262" class="aligncenter size-full wp-image-652" /></p>
<p>The <a href="http://www.dell.com/us/en/business/networking/pwcnt_6248/pd.aspx?refid=pwcnt_6248&amp;s=bsd&amp;cs=04">Dell switches</a> we&#8217;d purchased have a configuration that allows them to be &#8220;stacked&#8221; together via a proprietary 48Gbps link. The interesting thing about this configuration is that communications from any one rack switch to another rack switch only have to cross one link and two of the three total switches. If we compute the T numbers for this architecture, we get:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/original_network_annotated1.png" alt="" width="294" height="260" class="aligncenter size-full wp-image-776" /></p>
<p>This architecture provides a whopping 331Gbps T! That&#8217;s more than all the machines connected to the network could even use &#8211; with all 144 ports in use, we could only produce 288Gbps of traffic. It&#8217;s really interesting that this network architecture is so speedy, particularly because it&#8217;s so cheap to put together &#8211; in the neighborhood of $6000 for the entire network. This is a great starter network. You can build it up organically from one to three switches and never have to worry about your network being a bottleneck.</p>
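<p>The diagram carries the arithmetic, so here&#8217;s a rough sketch of how the 331Gbps figure can be reproduced. It assumes the counting rule from the previous post (count each byte once, as it enters a switch), three equal racks, the 184Gbps backplane figure quoted there for these Dell switches, and the 48Gbps stacking links. TriangleThroughput and its method names are invented for this example.</p>

```java
// Hypothetical helper for the fully connected "triangle" layout.
final class TriangleThroughput {
  // Fraction of T entering one switch when every rack links to every other:
  // all the data its own machines send (1/n) plus what the other racks
  // send to it ((n-1)/n^2).
  static double switchFraction(int racks) {
    double n = racks;
    return 1.0 / n + (n - 1.0) / (n * n);
  }

  // Fraction of T on one direction of a rack-to-rack link: 1/n of the data
  // starts on the source rack and 1/n of that is bound for the destination.
  static double linkFraction(int racks) {
    double n = racks;
    return 1.0 / (n * n);
  }

  // Value of T at which a component of the given capacity saturates.
  static double saturationT(double capacityGbps, double fraction) {
    return capacityGbps / fraction;
  }

  public static void main(String[] args) {
    // Roughly 331.2: each switch carries 5/9 of T.
    System.out.println(saturationT(184.0, switchFraction(3)));
    // Roughly 432: each stacking link carries 1/9 of T per direction.
    System.out.println(saturationT(48.0, linkFraction(3)));
  }
}
```

<p>Under these assumptions the switches, not the stacking links, are the first component to saturate.</p>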
<p>It has some downsides, though. Using both the provided stacking modules prevents you from taking advantage of any of the SFP ports, so you don&#8217;t have any high-speed options for connecting to an external non-cluster network. This means that getting data in and out of the cluster can be a challenge, and that there really isn&#8217;t any way to expand this architecture with more computing power. In practice, you&#8217;ll have to sacrifice 4-8 of the ports on each switch to be bonded together into an uplink to another network. Also, the stacking cables only reach a max of 10 feet, so your racks will have to be physically close to each other.</p>
<p><strong>Naive expansion of the &#8220;triangle&#8221;</strong><br />
When we were planning to expand from three to four racks, the first thing we considered was connecting a fourth switch to our existing triangle network. The best uplink we could have managed was an 8Gbps bonded Ethernet link like so:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/naive_triangle_plus.png" alt="" width="452" height="255" class="aligncenter size-full wp-image-777" /></p>
<p>You might already see where this is going. Take a look at the T numbers:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/naive_triangle_plus_annotated.png" alt="" width="452" height="253" class="aligncenter size-full wp-image-778" /></p>
<p>It turns out that the weakest link in the entire network, the new 8Gbps link, has to carry a really large proportion of T, and saturates at a measly 42Gbps! It&#8217;s a massive reduction from the configuration described above. If you were to implement this network, you might find yourself in the unenviable position of having your jobs <em>execute more slowly</em> despite having <em>added more machines</em>.</p>
<p>I don&#8217;t think that anyone should ever build this network. Even if 42Gbps is enough for your application, this network just won&#8217;t grow with you.</p>
<p><strong>Daisy chains</strong><br />
A network topology you might find tempting when reading about various &#8220;stacking&#8221; options is a daisy chain. Per Dell&#8217;s product sheet, you can stack up to 12 of their switches together via the 48Gbps links:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/daisy_chain1.png" alt="" width="429" height="330" class="aligncenter size-full wp-image-801" /></p>
<p>With the T values computed:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/daisy_chain_annotated1.png" alt="" width="279" height="323" class="aligncenter size-full wp-image-799" /></p>
<p>It should come as no surprise that the network link in the dead center of the chain will be where saturation occurs first &#8211; about 1/4 of all traffic is going to have to cross this link in each direction, leading to a saturation point of 192Gbps. An interesting observation, though, is that no matter the number of racks in the chain, the middle link will always have the same saturation point, since it will always see 1/4 of the traffic.</p>
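<p>Here&#8217;s a small sketch of that observation, assuming equal-sized racks, evenly spread shuffle traffic, and 48Gbps stacking links. It only looks at the links and ignores the switch backplanes; DaisyChain and its methods are hypothetical names.</p>

```java
// Hypothetical helper for a chain of equal-sized racks.
final class DaisyChain {
  // Fraction of T crossing, in one direction, the link that has k racks on
  // one side and (racks - k) on the other: the k racks hold k/racks of the
  // data, and (racks - k)/racks of that is bound for the far side.
  static double linkFraction(int racks, int k) {
    double near = (double) k / racks;
    double far = (double) (racks - k) / racks;
    return near * far;
  }

  // Lowest T at which any link in the chain saturates (needs racks >= 2).
  static double saturationT(int racks, double linkGbps) {
    double busiest = 0.0;
    for (int k = 1; k < racks; k++) {
      busiest = Math.max(busiest, linkFraction(racks, k));
    }
    return linkGbps / busiest;
  }

  public static void main(String[] args) {
    // Each of these chain lengths reports 192.0 Gbps.
    for (int racks : new int[] {4, 8, 12}) {
      System.out.println(racks + " racks: " + saturationT(racks, 48.0) + " Gbps");
    }
  }
}
```

<p>With an even number of racks the middle link always carries exactly 1/4 of T. With an odd number there is no exact middle, and the busiest link in this sketch carries slightly less than 1/4.</p>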
<p>Since it has such a high saturation point, this topology is somewhat attractive. However, it can be impractical to implement. Since the stacking cables are so short, you are required to locate all your cabinets right next to one another &#8211; or to keep all of the switches in one rack and run 480 Ethernet cables to the individual racks! Neither of these approaches is conducive to incremental growth and easy management in a shared facility. Also, once you expand to the max of 12 switches, you are left with no incremental growth path. Finally, you are still subject to all the other limitations of the stacking configuration as described above.</p>
<p>This style of network only really seems to be merited when you know you won&#8217;t grow beyond a certain size and you have great control over your racks&#8217; physical positioning.</p>
<p><strong>&#8220;Star&#8221; networks</strong><br />
The <a href="http://en.wikipedia.org/wiki/Star_network">star network topology</a> is one of the most common, and it benefits from being easy to assemble and well understood. Consider this sample from my original post on the analysis technique, pre-annotated with T values: </p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/shuffle_annotated_2.png" alt="" width="558" height="327" class="aligncenter size-full wp-image-767" /></p>
<p>It consists of four rack switches and one central switch, all of them the aforementioned Dell 6248s, connected by 8Gbps links. It saturates at a relatively low 42Gbps. Unlike the triangle network, all the traffic headed to another rack must pass through a single link to the backbone, and unlike the daisy chain network, the links aren&#8217;t strong enough to make up for that.</p>
<p>However, this topology still has room for more racks. Let&#8217;s see what happens if we plug in another two racks&#8217; worth of machines like so:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/6-rack-star-annotated1.png" alt="" width="559" height="327" class="aligncenter size-full wp-image-785" /></p>
<p>Note that while the uplinks still saturate first, the T that it takes to saturate them has gone up! This is because the relative proportion of the total traffic that goes over each link has actually reduced. The thing to take away is that some network topologies, like this one, keep scaling as you add new nodes. It might make sense to leave yourself room to grow, rather than buying the minimum that fits your needs. </p>
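<p>To see why the uplink saturation point rises, here&#8217;s a minimal sketch that assumes equal-sized racks and the 8Gbps uplinks from the diagrams. StarUplink and its method names are invented for the example, and the sketch looks only at the uplinks, not at the switches.</p>

```java
// Hypothetical helper for the rack-to-backbone uplinks of a star network.
final class StarUplink {
  // Fraction of T on one direction of a rack's uplink: the rack holds
  // 1/n of the data and sends (n-1)/n of it to other racks.
  static double uplinkFraction(int racks) {
    double n = racks;
    return (n - 1.0) / (n * n);
  }

  // Value of T at which the uplinks saturate.
  static double saturationT(int racks, double uplinkGbps) {
    return uplinkGbps / uplinkFraction(racks);
  }

  public static void main(String[] args) {
    // Roughly 42.7 for four racks and 57.6 for six.
    for (int racks : new int[] {4, 6}) {
      System.out.println(racks + " racks: " + saturationT(racks, 8.0) + " Gbps");
    }
  }
}
```

<p>Each uplink&#8217;s share of T is (n-1)/n^2, which shrinks as the number of racks n grows, so the same 8Gbps link tolerates a larger T.</p>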
<p>Rapleaf&#8217;s current network has taken these lessons to heart. Here&#8217;s what we&#8217;re using today:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/rapleaf_actual.png" alt="" width="559" height="323" class="aligncenter size-full wp-image-787" /></p>
<p>We switched from Dell to Juniper for our hardware, using <a href="http://www.juniper.net/us/en/products-services/switching/ex-series/ex2500/">this switch</a> as our backbone and one of <a href="http://www.juniper.net/us/en/products-services/switching/ex-series/ex3200">these</a> at the top of each rack. The links between the rack switches and the backbone are made over a single 10G Ethernet connection. We also reduced the size of a rack from 40 machines to 20, since we&#8217;re power- and cooling-limited to 20 nodes per cabinet anyway. If you run the T numbers, you end up with:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/rapleaf_actual_annotated.png" alt="" width="559" height="324" class="aligncenter size-full wp-image-789" /></p>
<p>The links still saturate first at a respectable 92Gbps. This performance seems to be suitable for our application, but the really great thing about this topology is its flexibility. We can expand from our current 160 machines all the way up to 480 machines without any of the switches ever becoming overwhelmed. Or, if we discover that our applications are especially bandwidth-hungry, we can bond an extra 10G Ethernet from each rack switch to the backbone, doubling our T for the price of new cables.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2010/08/26/analyzing-some-interesting-networks-for-mapreduce-clusters/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Analyzing network load in Map/Reduce</title>
		<link>http://blog.rapleaf.com/dev/2010/08/24/analyzing-network-load-in-mapreduce/</link>
		<comments>http://blog.rapleaf.com/dev/2010/08/24/analyzing-network-load-in-mapreduce/#comments</comments>
		<pubDate>Tue, 24 Aug 2010 16:54:48 +0000</pubDate>
		<dc:creator>Bryan Duxbury</dc:creator>
				<category><![CDATA[Hadoop]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=758</guid>
		<description><![CDATA[Hadoop Map/Reduce can put a heavy toll on your network. Just how heavy, though, isn&#8217;t obvious. This is an especially important consideration when you are expanding your cluster. Rapleaf recently encountered this situation, and in the process we devised a neat theoretical model for analyzing how network topology affects Map/Reduce. When does Hadoop put the [...]]]></description>
			<content:encoded><![CDATA[<p>Hadoop Map/Reduce can put a heavy toll on your network. Just how heavy, though, isn&#8217;t obvious. This is an especially important consideration when you are expanding your cluster. Rapleaf recently encountered this situation, and in the process we devised a neat theoretical model for analyzing how network topology affects Map/Reduce.</p>
<p><strong>When does Hadoop put the most stress on the network?</strong><br />
The two phases of a Map/Reduce job that are candidates for high network stress are shuffling and reduce output. During the shuffle phase, each of your reducers will contact every other machine in the cluster to collect intermediate files. During the reduce output phase, the final results of the whole job will be written to HDFS, usually with three replicas.</p>
<p>Intuitively, since the data is written out three times, it might seem like the reduce output stage is the most intense period of network traffic; however, it turns out that the shuffle has the most potential to max out your network due to the fact that each node contacts each other node, rather than exactly two other nodes. Note, however, that the reduce output phase might <em>take longer</em> than the shuffle &#8211; it just won&#8217;t stress the network as much.</p>
<p><strong>Just what do you mean, &#8220;stress the network&#8221;?</strong><br />
When I speak of &#8220;stressing the network&#8221;, I&#8217;m talking about throughput. More specifically, I am talking about the <em>average aggregate peak throughput</em>. There&#8217;s a lot in that term, so let&#8217;s break it down. &#8220;Throughput&#8221; refers to the rate at which we can transfer data, usually measured in Gbps. &#8220;Aggregate&#8221; refers to the fact that we&#8217;re summing up all the throughput across the whole cluster. &#8220;Average&#8221; refers to the fact that while some components may see higher or lower values, we want the average. Finally, &#8220;peak&#8221; refers to the fact that we&#8217;re interested in the throughput at the peak of stress, rather than at some other time.</p>
<p>Put another way, the average aggregate peak throughput is the aggregate throughput at which some component in the network saturates, that is, when it is carrying its maximum throughput capacity. For a link like an Ethernet cable, the max capacity is determined by the rating of the cable and the ports it&#8217;s plugged into. For instance, Gigabit Ethernet is rated for 1Gbps of throughput. For a switch, the rating of the backplane, also specified in Gbps, determines its capacity. Once one component in the network saturates, even if there are other unsaturated components, the job as a whole won&#8217;t be able to go any faster.</p>
<p><strong>The model</strong><br />
The objective of our model is to figure out exactly what average aggregate peak throughput (henceforth abbreviated T) a given network topology can bear before some component saturates. To do this, we&#8217;ll figure out what proportion of T each link and switch in the network has to carry, then solve some simple equations to determine what value of T causes each component to saturate. The one with the lowest value of T is the one that we care about. </p>
<p>First, let&#8217;s quickly look at what happens during the shuffle phase. Let&#8217;s assume that we&#8217;re operating on a common star-topology network, with four individual top-of-rack switches (labeled A through D) connected via a central backbone switch:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/shuffle_illustration.png" alt="" width="621" height="384" class="aligncenter size-full wp-image-762" /></p>
<p>Each of the racks in this cluster contains the same number of nodes, so when the map phase is over, each rack has exactly 1/4 of the total intermediate data, represented in the diagram by the small boxes behind each cloud of nodes. On an individual rack, say Rack A, 1/4 of the data will be transferred to another host in the same rack, with the remaining 3/4 to be split evenly amongst racks B, C, and D. Likewise, each of the other racks will be sending 1/4 of <em>their</em> data to Rack A. </p>
<p>Since this network is symmetrical, each rack behaves identically. And since each rack has 1/4 of the total data, you can multiply things out to determine the following: 1/16 of the total data is going to stay in place in each rack, 3/16 will be transferred in, and 3/16 will be transferred out.</p>
<p>If you sum up all the numbers, you can annotate the diagram like so:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/shuffle_annotated.png" alt="" width="557" height="326" class="aligncenter size-full wp-image-764" /></p>
<p>There are a few things worth noting here. First, the links between switches have two separate numbers, one for each direction. This is because virtually all network connections are full-duplex, meaning they can carry their rated capacity in both directions. If you happen to have a half-duplex connection, then you <strong>must</strong> sum both of the numbers together to get the true proportion. Second, the backbone switch carries more of the throughput than the top-of-rack switches. This is because the backbone has to carry all of the traffic that leaves each of the rack switches. Also, it&#8217;s important to avoid double-counting the traffic that passes through the backbone. Think of a switch as a toll booth that can take traffic in either direction &#8211; you only need to count traffic that comes in and out once. Thus, the 3/4 proportion in the backbone switch is computed by summing up only the traffic that is coming in.</p>
<p>Now that we have computed what proportion of T passes through each component, we can solve for the value of T that causes the component to saturate. To do that, we&#8217;ll need concrete numbers for the capacities of our links and switches. Let&#8217;s assume that the links between the rack switches and the backbone are 8Gbps, and that all the switches are <a href="http://www.dell.com/us/en/enterprise/networking/pwcnt_6248/pd.aspx?refid=pwcnt_6248&amp;cs=555&amp;s=biz">inexpensive 48-port Dell switches</a> that have a backplane capacity of 184Gbps. The connections to individual machines are made via 1Gbps Ethernet. If a switch can handle up to 184Gbps, and it must carry 7/16 of T, then for what T does the switch carry 184Gbps? You can produce an equation from this question: 7/16 T = 184Gbps, which you can solve easily: T = 184 * 16 / 7 = 420Gbps. Here&#8217;s what the diagram looks like with this step applied to all switches and links:</p>
<p><img src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/08/shuffle_annotated_2.png" alt="" width="558" height="327" class="aligncenter size-full wp-image-767" /></p>
<p>I&#8217;ve highlighted the inter-switch links in this version, because as you can see, they become saturated when T is only 42Gbps, as opposed to the switches, which saturate at between 250Gbps and 420Gbps. This means that at its most intense, the shuffle phase will average about 42Gbps of throughput across the entire cluster. This architecture could support up to 160 machines; if they are sharing fairly, this means that each machine would have about 42Gbps / 160, or roughly 260Mbps, of throughput available for its transfers.</p>
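<p>The whole procedure can be sketched in a few lines of code. This is only an illustration under the assumptions used above (equal-sized racks, a single backbone, traffic counted once as it enters a switch); StarModel and its method names are hypothetical.</p>

```java
// Hypothetical sketch of the model for a star network of equal-sized racks.
final class StarModel {
  // Proportion of T entering a top-of-rack switch: 1/n from its own
  // machines plus (n-1)/n^2 arriving from the other racks.
  static double rackSwitchFraction(int n) {
    return 1.0 / n + (n - 1.0) / ((double) n * n);
  }

  // Proportion of T entering the backbone: every rack sends (n-1)/n^2 up.
  static double backboneFraction(int n) {
    return (n - 1.0) / n;
  }

  // Proportion of T on one direction of a rack-to-backbone link.
  static double uplinkFraction(int n) {
    return (n - 1.0) / ((double) n * n);
  }

  // The lowest saturation point among all components is the network's T.
  static double bottleneckT(int n, double switchGbps, double linkGbps) {
    double rackSwitch = switchGbps / rackSwitchFraction(n);
    double backbone = switchGbps / backboneFraction(n);
    double uplink = linkGbps / uplinkFraction(n);
    return Math.min(uplink, Math.min(rackSwitch, backbone));
  }

  public static void main(String[] args) {
    System.out.println(184.0 / rackSwitchFraction(4)); // roughly 420.6
    System.out.println(184.0 / backboneFraction(4));   // roughly 245.3
    System.out.println(bottleneckT(4, 184.0, 8.0));    // roughly 42.7, set by the uplinks
  }
}
```

<p>Swapping in different capacities or rack counts is a quick way to check which component would give out first before buying any hardware.</p>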
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2010/08/24/analyzing-network-load-in-mapreduce/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

