<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Engineering Rapleaf &#187; Anonymouse</title>
	<atom:link href="http://blog.rapleaf.com/dev/tag/anonymouse/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.rapleaf.com/dev</link>
	<description>For engineers, by engineers.</description>
	<lastBuildDate>Mon, 12 Dec 2011 08:57:30 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	
		<item>
		<title>Why Rapleaf Does Not Use Unique Identifiers in Cookies</title>
		<link>http://blog.rapleaf.com/dev/2010/09/10/why-rapleaf-does-not-use-unique-identifiers-in-cookies/</link>
		<comments>http://blog.rapleaf.com/dev/2010/09/10/why-rapleaf-does-not-use-unique-identifiers-in-cookies/#comments</comments>
		<pubDate>Fri, 10 Sep 2010 17:51:01 +0000</pubDate>
		<dc:creator>greg</dc:creator>
				<category><![CDATA[Anonymouse]]></category>
		<category><![CDATA[privacy]]></category>
		<category><![CDATA[uuid]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=834</guid>
		<description><![CDATA[Update-11/4/11:  This article reflects our policies as of Fall 2010.  Since that time, we&#8217;ve continued to update our policies in line with industry best practices, including those developed by the DMA, NAI and IAB.  As our technology and products continue to evolve, we&#8217;re always committed to certain fundamental privacy principles:  that users have control over their data, [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>Update-11/4/11:  </strong>This article reflects our policies as of Fall 2010.  Since that time, we&#8217;ve continued to update our policies in line with industry best practices, including those developed by the DMA, NAI and IAB.  As our technology and products continue to evolve, we&#8217;re always committed to certain fundamental privacy principles:  that users have control over their data, that data collection and use be made as transparent as possible, and that online behavioral tracking data should never be merged with a person&#8217;s real-life identity.   For our most current privacy practices in our online advertising data business (a division within Rapleaf called LiveRamp), refer to LiveRamp&#8217;s current privacy standards <a href="http://liveramp.com/privacy/">here</a>.</em></p>
<hr />
<p>If you ever need to drop data in a browser cookie, you generally have two options: dropping the data directly, or dropping a unique ID (or <strong>UUID</strong>, for <em>universally unique identifier</em>). In the latter case you&#8217;d have to store a mapping from UUIDs to data on your server, and whenever you see a cookie you&#8217;d query this map to acquire the data you want.</p>
<p>The UUID approach is nice from a technical perspective because it limits the size of the cookies you drop: a UUID is only 16 bytes.<sup>1</sup> Cookies get sent during browser requests, and may be uploaded multiple times during a browsing session. If a cookie is large enough, it can dominate the size of the request and noticeably hurt the user&#8217;s browsing experience. This issue is mitigated somewhat by the fact that cookies can&#8217;t be larger than 4K—but then you run into an upper limit on the amount of data a cookie can contain, and the UUID approach becomes attractive once again.</p>
<p>UUIDs are also convenient because all the data lives on the server, simplifying the task of updating that data. If the data lives in the cookie, then we cannot update it until we have an opportunity to drop another cookie on the user.</p>
<p>Because of these features, UUIDs are used by almost every ad network and advertising technology company today. However, although UUIDs are attractive, we&#8217;ve prohibited their use here at Rapleaf due to privacy concerns. UUIDs are, by design, uniquely identifying. If you use UUIDs, it means you have a mapping from UUID to data on your servers.</p>
<h2>Unique Identifiers Are Often Personally-Identifiable</h2>
<p>Here&#8217;s a simple example of how a UUID system might work. Let&#8217;s say we have the following database of information:</p>
<p><a href="http://blog.rapleaf.com/dev/wp-content/uploads/2010/09/uuid_table.png"><img class="aligncenter size-full wp-image-847" src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/09/uuid_table.png" alt="UUID Table" width="469" height="104" /></a></p>
<p>Now imagine we want to drop a cookie based on the email <tt>jsmith@example.com</tt>. Rather than putting the actual data in the cookie (e.g., <tt>gender = male</tt> and whatever other information there might be in subsequent columns), we could simply drop the UUID <tt>0800200c9a67</tt>. If we see this cookie later, then all we need to do is take the UUID, find its row in the database, and grab the data associated with that user.</p>
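<p>As a rough illustration, the lookup described above might be sketched like this in Python. The table contents, the <tt>uid</tt> cookie key, and the function names are all hypothetical; only the email, gender, and UUID values come from the example.</p>

```python
# Hypothetical server-side table keyed by UUID. Only the email, gender, and
# UUID values are taken from the example in the text; the rest is assumed.
PROFILES_BY_UUID = {
    "0800200c9a67": {"email": "jsmith@example.com", "gender": "male"},
}


def build_cookie(uuid):
    """The cookie carries only the opaque identifier, never the data."""
    return {"uid": uuid}


def lookup_profile(cookie):
    """Resolve a cookie back to its stored row, or None if the ID is unknown."""
    return PROFILES_BY_UUID.get(cookie.get("uid"))


cookie = build_cookie("0800200c9a67")
profile = lookup_profile(cookie)
print(profile["gender"])
```

<p>Note that the same lookup that returns the gender also returns the email address stored in that row.</p>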
<p>If that data contains any personally identifiable information (like a user&#8217;s name or email address), it&#8217;s completely trivial to map from a browser cookie to a person&#8217;s identity. In fact, many companies are doing this today. They claim not to include personally identifiable information in cookies, but in fact they store UUIDs that map directly to email addresses or hashed email addresses, making it trivial to reconstruct the identity of the browser&#8217;s user.</p>
<p>For example, from the UUID <tt>0800200c9a67</tt>, it is trivial to derive that the user is actually jsmith@example.com, <strong>so the UUID itself is personally identifiable</strong>. The danger of this system is that the ad network can merge the data about what sites you visit back into a database attached to your email address, name, and address, building a permanent data set of what sites you&#8217;ve visited.</p>
<p>And even if you can&#8217;t map a UUID to personally identifiable information, there are still privacy issues. Specifically, a UUID can act as a unique identifier for a particular browser. This means that you can know a user&#8217;s browsing history, even if you don&#8217;t explicitly know <em>who</em> the user is. By combining enough pieces of information about a user, you can often figure out that user&#8217;s identity, making it possible for a rogue company (or government) to link browsing behavior to specific individuals.</p>
<p>At Rapleaf, we actively avoid collecting data on browsing history: we don&#8217;t want to know it, it&#8217;s not our business to know it, and we want to control the amount of information we know about the user to ensure that they maintain anonymity online. Full stop.</p>
<h2>Privacy-Centric Alternatives</h2>
<p>That&#8217;s why we store data on the cookie itself. We don&#8217;t put any personally identifiable information in our cookie, so there&#8217;s no straightforward way to know who a browser might belong to. Likewise, we don&#8217;t put a UUID in there, so there&#8217;s no straightforward way to determine browsing history.</p>
<p>Now, we recognize that this system isn&#8217;t perfect. Given enough data on a user, it is often possible to de-anonymize that data back to a particular user. If you read our <a href="http://blog.rapleaf.com/dev/2010/07/20/anonymouse/">post about Anonymouse</a> a few weeks ago, you&#8217;ll know that we&#8217;re spending a lot of resources on solving this problem. Once a cookie has been anonymized, this should provide a strong guarantee on the user&#8217;s privacy.</p>
<p>There&#8217;s one last alternative to UUIDs that combines the best of both worlds: the privacy advantages of putting data directly in the cookie, as well as the technical advantages of using UUIDs. After a set of cookies has been anonymized, each cookie will belong to an <em>equivalence class</em> with several others. For example, if we perform 10,000-anonymization on the data set, then each cookie will look identical to at least 9,999 other potential cookies.</p>
<p>Now, instead of storing all the data in the cookie, what if we simply stored an equivalence class ID? This gains us all the technical advantages of dropping a UUID, since we&#8217;re only dropping a single key in the cookie. But from a privacy standpoint, it is <em>fundamentally different</em> from a UUID. An equivalence class tells us nothing about an individual user; if we have 10,000-anonymized the data set, then by design the user could be any one of at least 10,000 people. It is impossible to gather a browsing history, since multiple browsers can and will have the same equivalence class ID. Of course, this relies on a strong degree of confidence in the anonymization algorithm, and this is a change we have not yet implemented, but we think it&#8217;s a promising idea.</p>
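<p>Since we haven&#8217;t built this, the following is only a sketch of how equivalence class IDs might be assigned, assuming each record has already been anonymized down to a set of attributes. The function name, record IDs, and the <tt>class</tt> cookie key are made up for illustration.</p>

```python
def assign_class_ids(records):
    """Give every distinct (already anonymized) attribute set one shared ID.

    `records` maps a record id to the frozenset of attributes it expresses.
    Returns (record id -> class id, class id -> attribute set).
    """
    class_id_by_attrs = {}
    class_of = {}
    for record_id, attrs in records.items():
        # Reuse the ID if this attribute set was seen before, else mint a new one.
        class_id = class_id_by_attrs.setdefault(attrs, len(class_id_by_attrs))
        class_of[record_id] = class_id
    attrs_of = {cid: attrs for attrs, cid in class_id_by_attrs.items()}
    return class_of, attrs_of


# Hypothetical anonymized records: "a" and "b" are indistinguishable.
records = {
    "a": frozenset({"male", "movies"}),
    "b": frozenset({"male", "movies"}),
    "c": frozenset({"female"}),
}
class_of, attrs_of = assign_class_ids(records)

# The cookie holds only the class ID; both "a" and "b" get the same cookie.
cookie_for_a = {"class": class_of["a"]}
cookie_for_b = {"class": class_of["b"]}
```

<p>Because the two cookies are byte-for-byte identical, the server has no key that singles out one browser from the other.</p>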
<p><sup>1</sup> There&#8217;s nothing special about 16 bytes. All that&#8217;s necessary is that the ID is large enough to be uniquely identifying within the domain of the ad network. I used 16 bytes because that&#8217;s the size specified in the <a href="http://en.wikipedia.org/wiki/Universally_unique_identifier">UUID standard</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2010/09/10/why-rapleaf-does-not-use-unique-identifiers-in-cookies/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Anonymouse</title>
		<link>http://blog.rapleaf.com/dev/2010/07/20/anonymouse/</link>
		<comments>http://blog.rapleaf.com/dev/2010/07/20/anonymouse/#comments</comments>
		<pubDate>Tue, 20 Jul 2010 14:07:39 +0000</pubDate>
		<dc:creator>greg</dc:creator>
				<category><![CDATA[Anonymouse]]></category>
		<category><![CDATA[anonymization]]></category>

		<guid isPermaLink="false">http://blog.rapleaf.com/dev/?p=579</guid>
		<description><![CDATA[Update-11/4/11:  This article reflects our policies as of Fall 2010.  Since that time, we&#8217;ve continued to update our policies in line with industry best practices, including those developed by the DMA, NAI and IAB.  As our technology and products continue to evolve, we&#8217;re always committed to certain fundamental privacy principles:  that users have control over their data, [...]]]></description>
			<content:encoded><![CDATA[<p><em><strong>Update-11/4/11:  </strong>This article reflects our policies as of Fall 2010.  Since that time, we&#8217;ve continued to update our policies in line with industry best practices, including those developed by the DMA, NAI and IAB.  As our technology and products continue to evolve, we&#8217;re always committed to certain fundamental privacy principles:  that users have control over their data, that data collection and use be made as transparent as possible, and that online behavioral tracking data should never be merged with a person&#8217;s real-life identity.   For our most current privacy practices in our online advertising data business (a division within Rapleaf called LiveRamp), refer to LiveRamp&#8217;s current privacy standards <a href="http://liveramp.com/privacy/">here</a>.</em></p>
<hr />
<p>Privacy is an incredibly important issue to us at Rapleaf; it informs all our business and engineering decisions. Occasionally, privacy concerns can lead us to some really interesting engineering challenges. We love this: not only do we get to work on protecting our users&#8217; privacy, we also get a chance to tackle ridiculously challenging problems, stuff no one else is working on. It&#8217;s a win-win situation. One recent effort that exemplifies this attitude at Rapleaf is our Anonymouse<sup>1</sup> project.</p>
<p><strong>The Problem</strong></p>
<p>Operating our <a href="http://www.rapleaf.com/acquire">Rapleaf Display Media</a> product involves dropping cookies on users&#8217; browsers. These cookies contain various tidbits of information about the user&#8212;basic demographics, interest data, and the like. Using these cookies, we make it possible for content providers to customize their websites to individual users. It&#8217;s pretty exciting stuff, and we think that it&#8217;ll change the face of the web.</p>
<p>In privacy circles, there&#8217;s a concept known as <a href="http://en.wikipedia.org/wiki/Personally_identifiable_information">Personally Identifiable Information</a> (or PII). PII refers to data that can be uniquely linked to a specific individual: things like name, address, phone number, or email. It goes without saying that we don&#8217;t drop any PII in our cookies. But it&#8217;s become <a href="http://www.techdirt.com/articles/20071130/114005.shtml">abundantly</a> <a href="http://techdirt.com/articles/20060807/0219238.shtml">clear</a> that simply stripping your dataset of PII is not enough to make it anonymous.</p>
<p>Here&#8217;s a sample dataset to help illustrate the problem:</p>
<div id="attachment_585" class="wp-caption aligncenter" style="width: 443px"><img class="size-full wp-image-585" src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/07/anonymouse_dataset.png" alt="Anonymouse Dataset" width="433" height="169" /><p class="wp-caption-text">A simple sample dataset.</p></div>
<p>Each row in the table represents a single <em>record</em>&#8212;an individual for whom we have some data. For example, we know that the first individual is a male between the ages of 35 and 55 who uses Facebook and enjoys watching movies; the third record represents a young female Facebook user; and so on. As you can see, these records are fairly anonymous&#8212;every record looks exactly like another record in the data. In particular, we would not be able to trace a given set of attributes back to a specific individual in this table.</p>
<p>Now let&#8217;s change our dataset just a bit:</p>
<div id="attachment_595" class="wp-caption aligncenter" style="width: 533px"><img class="size-full wp-image-595" src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/07/anonymouse_dataset_2.png" alt="Anonymouse Dataset 2" width="523" height="196" /><p class="wp-caption-text">A slightly less simple dataset.</p></div>
<p>Notice the new interest categories. Specifically, take a look at that bottom record: a 56+ year-old man who enjoys <em>Twilight</em>, knitting, and Motocross. In the dataset, there aren&#8217;t any other records that look like him. Furthermore, if we were given just that set of attributes, we&#8217;d be able to tie them back to that specific record. Even though each individual attribute is non-identifying, the dataset is no longer anonymous.</p>
<p><strong>Anonymouse</strong></p>
<p>The goal of Anonymouse is to selectively exclude data from the cookies we drop so that our users are sufficiently indistinguishable. We define “sufficiently indistinguishable” using the notion of <em>k</em>-anonymity. A dataset is <em>k</em>-anonymous as long as every record in the set is identical to no fewer than <em>k-1</em> other records. We can therefore think of a <em>k</em>-anonymous dataset as consisting of clusters of records, or <em>equivalence classes</em>, of size <em>k</em> or greater.</p>
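<p>To make the definition concrete, here is a minimal Python check for <em>k</em>-anonymity. The sample rows are assumptions loosely modeled on the prose descriptions of the datasets above (the actual tables are images), and the function names are ours for this sketch.</p>

```python
from collections import Counter


def equivalence_classes(records):
    """Count how many records share each exact attribute set."""
    return Counter(frozenset(r) for r in records)


def is_k_anonymous(records, k):
    """True when every record is identical to at least k-1 other records."""
    return all(size >= k for size in equivalence_classes(records).values())


# Hypothetical rows: two pairs of identical records.
records = [
    {"male", "35-55", "facebook", "movies"},
    {"male", "35-55", "facebook", "movies"},
    {"female", "18-34", "facebook"},
    {"female", "18-34", "facebook"},
]
# A record that looks like no other, as in the second dataset.
outlier = {"male", "56+", "twilight", "knitting", "motocross"}

print(is_k_anonymous(records, 2))
print(is_k_anonymous(records + [outlier], 2))
```

<p>The first dataset passes for <em>k=2</em>; adding the single unusual record makes the whole dataset fail, even though none of its attributes is identifying on its own.</p>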
<p>Furthermore, we wouldn&#8217;t just like to <em>k</em>-anonymize the dataset; we&#8217;d also like to maintain as much valuable data as possible. We could trivially anonymize the dataset by dropping all attributes from everyone, but for obvious reasons that&#8217;s not an acceptable solution. Instead, we define the value of an anonymization strategy by its <em>value retained</em>: the value of the anonymized dataset divided by the value of the un-anonymized data. We calculate the value of a dataset by taking the sum over all attributes of the value of that attribute times the number of entities exhibiting that attribute.</p>
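<p>The value-retained calculation can be written out directly. In this sketch the per-attribute values are invented numbers, and the function names are our own.</p>

```python
def dataset_value(records, attribute_value):
    """Sum, over all attributes, of value(attribute) * number of records with it."""
    return sum(attribute_value[a] for r in records for a in r)


def value_retained(original, anonymized, attribute_value):
    """Value of the anonymized dataset divided by the value of the original.

    Assumes the original dataset has a non-zero value.
    """
    return dataset_value(anonymized, attribute_value) / dataset_value(
        original, attribute_value
    )


# Hypothetical values: knowing about Motocross is worth 3x knowing gender.
attribute_value = {"male": 1.0, "motocross": 3.0}
original = [{"male", "motocross"}, {"male"}]
anonymized = [{"male"}, {"male"}]  # "motocross" suppressed on the first record

print(value_retained(original, anonymized, attribute_value))  # 2.0 / 5.0
```

<p>Dropping every attribute from every record gives a value retained of 0, which is why the trivial anonymization is not acceptable.</p>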
<p>Finding an optimal solution is really hard&#8212;NP-hard, actually&#8212;so the best we&#8217;ll be able to do is an approximation algorithm. Even so, it&#8217;s still an incredibly tough problem, and we&#8217;re still working on our solution.</p>
<p>There&#8217;s been <a href="http://spdp.dti.unimi.it/papers/k-Anonymity.pdf">some great research</a> into the <em>k</em>-anonymity problem, but we&#8217;ve found that our problem is idiosyncratic for a few reasons. First and foremost, we&#8217;re dealing with a dataset orders of magnitude larger than what most research has been done on: we have approximately 1 billion records and thousands of possible attributes available. Furthermore, our attributes are all boolean&#8212;they are either present or not. We also have differing values between attributes&#8212;e.g., it might be more valuable to know if someone enjoys watching Motocross than to know that they&#8217;re male. And we&#8217;d like to drop different attributes for different records; a lot of existing research simplifies the problem by focusing on dropping the same set of attributes across all records.</p>
<p><strong>Global vs. Entity-level Suppression</strong></p>
<p>Let&#8217;s expand on that last point in more depth. There are two general strategies for dropping attributes. First, we could choose to suppress attributes globally. When an attribute is suppressed <em>globally</em>, it will be suppressed for <strong>every</strong> entity in the dataset we&#8217;re anonymizing. The alternative is to perform <em>entity-level</em> suppression. To do this, we choose for <strong>each individual</strong> which attributes to express. This gives us much finer-grained control over what data we express, but at the cost of a much more complex problem.</p>
<div id="attachment_602" class="wp-caption aligncenter" style="width: 340px"><img class="size-full wp-image-602" src="http://blog.rapleaf.com/dev/wp-content/uploads/2010/07/global_vs_entity_suppression.png" alt="Global- vs. entity-level suppression" width="330" height="552" /><p class="wp-caption-text">An example of global- and entity-level suppression strategies on a dataset about media consumption. The asterisks indicate where an active attribute has been dropped.</p></div>
<p>In the above example, you can see the difference in approaches. Global-level suppression requires us to drop information across entire columns; entity-level suppression means we can drop individual cells in the table. Note that an inactive value doesn&#8217;t indicate an active negative; that is, just because we haven&#8217;t marked someone as &#8220;interested in movies&#8221; doesn&#8217;t mean they <strong>dis</strong>like movies. Rather, it means that we can&#8217;t say one way or the other. Therefore, removing cells from the table doesn&#8217;t actually introduce any misinformation into the table. It just decreases what we know about individuals.</p>
<p>While global attribute suppression may seem like a very blunt weapon to wield, there are several advantages to using it as an anonymization strategy. For instance, when there are a fairly small number of possible attributes, it&#8217;s actually tractable to find an optimal global suppression strategy. Using a clever heuristic, that&#8217;s what <a href="http://rakesh.agrawal-family.com/papers/icde05kanon.pdf">Bayardo and Agrawal</a> were able to do on a dataset with 160 possible attributes and 32,000 entities.</p>
<p>Another advantage to global-suppression strategies is more subtle, but can help from an implementation standpoint: a global-suppression solution can be described as a single set of expressed attributes. For instance, if you have 100 attributes, you might store your solution as a 100-bit vector: 1 means you keep the attribute and 0 means it gets dropped. The size of this solution is 100 bits, regardless of how many records you have in your dataset. If instead we wanted to do entity-level suppression, we&#8217;d have to keep track of which attributes get suppressed for each record in our dataset. Thus, just to store our solution requires space proportional to the number of records in our dataset&#8212;and if you have a billion records, this can get a bit unwieldy.</p>
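<p>Here is a small Python sketch of that storage difference, using integers as bit vectors. The three attribute names and the helper functions are assumptions for illustration only.</p>

```python
ATTRIBUTES = ["movies", "music", "books"]  # hypothetical attribute order


def to_bits(attrs):
    """Encode a collection of attribute names as an integer bit vector."""
    return sum(1 << i for i, name in enumerate(ATTRIBUTES) if name in attrs)


def from_bits(bits):
    """Decode a bit vector back into a set of attribute names."""
    return {name for i, name in enumerate(ATTRIBUTES) if (bits >> i) & 1}


records = [
    to_bits({"movies", "books"}),
    to_bits({"movies", "music"}),
    to_bits({"movies", "music", "books"}),
]

# Global suppression: one mask for the whole dataset (1 = keep, 0 = drop).
global_mask = to_bits({"movies", "music"})  # "books" is dropped for everyone
globally_suppressed = [r & global_mask for r in records]

# Entity-level suppression: one mask per record, so the solution itself
# grows with the number of records.
entity_masks = [to_bits({"movies"}), to_bits(ATTRIBUTES), to_bits({"movies", "music"})]
entity_suppressed = [r & m for r, m in zip(records, entity_masks)]
```

<p>The global solution is a single integer no matter how many records there are; the entity-level solution is a list as long as the dataset.</p>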
<p><strong>Global Suppression</strong></p>
<p>With these advantages in mind, we first decided to try developing a heuristic global suppression anonymizer. Our strategy was very similar to the heuristic approach in the <a href="http://rakesh.agrawal-family.com/papers/icde05kanon.pdf">Bayardo and Agrawal</a> paper mentioned earlier. Essentially, we start with a solution with no attributes suppressed, and use a hill-climbing approach to add or remove attributes until no attribute can be added or suppressed to increase the solution value.</p>
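<p>For a flavor of what a greedy global anonymizer looks like, here is a deliberately simplified Python sketch. It is not our implementation: it starts from the opposite end (nothing expressed) and only ever adds attributes, which keeps every intermediate solution <em>k</em>-anonymous. The names, data, and attribute values are assumptions.</p>

```python
from collections import Counter


def is_k_anonymous(records, expressed, k):
    """Check k-anonymity when only the `expressed` attributes are visible."""
    sizes = Counter(frozenset(r & expressed) for r in records)
    return all(size >= k for size in sizes.values())


def greedy_global_suppression(records, attribute_value, k):
    """Add-only hill climbing: returns the set of attributes to express.

    Assumes len(records) >= k, so the empty solution is a valid start.
    """
    all_attrs = set().union(*records)
    # Value gained by expressing an attribute: its value times its frequency.
    gain = {a: attribute_value[a] * sum(a in r for r in records) for a in all_attrs}
    expressed = set()
    improved = True
    while improved:
        improved = False
        # Try the most valuable remaining attribute first.
        for a in sorted(all_attrs - expressed, key=lambda a: (-gain[a], a)):
            if is_k_anonymous(records, expressed | {a}, k):
                expressed.add(a)
                improved = True
                break
    return expressed


records = [
    {"male", "movies"},
    {"male", "movies"},
    {"male", "knitting"},
    {"male"},
]
attribute_value = {"male": 1.0, "movies": 1.0, "knitting": 1.0}
solution = greedy_global_suppression(records, attribute_value, k=2)
print(sorted(solution))
```

<p>In this toy dataset <tt>knitting</tt> appears on a single record, so it can never be expressed globally without breaking 2-anonymity.</p>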
<p>Implementation was not especially difficult, and we got a prototype working quickly. There was only one drawback to using this approach: our results were awful. Out of the 552 attributes we were testing with, barely 50 were included in the solution. The value retained by the solution we produced was barely over 50%.</p>
<p>We tried multiple variations on this approach. We started with all attributes expressed, and iteratively removed them. We tried to find optimal solutions on smaller datasets. In the end, though, our results simply weren&#8217;t close to where we needed them to be.</p>
<p>When we looked deeper into why global suppression fares so poorly on our data, it turns out to be for the same reason Rapleaf data is so valuable: many of the attributes we store are uncommon, describing very specific populations. But since the attributes cover such small parts of the population, no algorithm could add them to a global solution without doing serious damage to anonymity.</p>
<p><strong>Cluster-level Suppression</strong></p>
<p>So we moved on to suppressing attributes at an entity level. We found it easiest to think of the problem state as a set of <em>clusters</em>: sets of entities that express the same attributes, given the current anonymization. The solution is brought to <em>k</em>-anonymity by <em>merging</em> and <em>splitting</em> these clusters. Two clusters merge when the attributes they have suppressed make them indistinguishable from each other. Likewise, a cluster splits when some of the entities in it begin expressing an attribute that others do not.</p>
<p>A key question is: how do we know which attributes to suppress in a cluster, in the hope that dropping them will make it merge with another cluster? There is no easy answer here. We&#8217;ve tried a number of heuristic approaches: first dropping the least frequent attribute; first dropping the least valuable attribute; or using a combination of value and frequency. We can also try to perform further look-ahead, i.e., instead of looking only for clusters that differ from the current cluster by 1 attribute, also searching those at a distance of 2 or 3. This is an area in which we see room for improvement, and we are continuing to refine our strategy.</p>
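<p>One way to picture a merge, assuming clusters are stored as a dictionary from attribute set to cluster size (our choice for this Python sketch, not a description of the real data structure):</p>

```python
def suppress_in_cluster(clusters, attrs, attribute):
    """Drop `attribute` from one cluster, merging if the result already exists.

    `clusters` maps a frozenset of expressed attributes to the number of
    entities in that cluster. Returns the attribute set the entities now have.
    """
    size = clusters.pop(attrs)
    reduced = attrs - {attribute}
    # If another cluster already expresses exactly `reduced`, the two merge.
    clusters[reduced] = clusters.get(reduced, 0) + size
    return reduced


clusters = {
    frozenset({"male", "movies"}): 3,
    frozenset({"male", "movies", "knitting"}): 1,
}
merged_into = suppress_in_cluster(
    clusters, frozenset({"male", "movies", "knitting"}), "knitting"
)
print(clusters[merged_into])
```

<p>A split is the reverse operation: some entities in a cluster start expressing an attribute again and move to a new key. It is omitted here to keep the sketch short.</p>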
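<p>The heuristics above might be sketched as follows in Python. Two simplifications are assumed: the &#8220;combined&#8221; score is just frequency times value, and look-ahead only considers dropping attributes from the current cluster (not attributes the other cluster would have to drop). All names and numbers are illustrative.</p>

```python
from itertools import combinations


def rank_attributes(attrs, freq, attribute_value, strategy):
    """Order a cluster's attributes by which one to try dropping first."""
    keys = {
        "least_frequent": lambda a: (freq[a], a),
        "least_valuable": lambda a: (attribute_value[a], a),
        # Assumed combination: expected value lost across the dataset.
        "combined": lambda a: (freq[a] * attribute_value[a], a),
    }
    return sorted(attrs, key=keys[strategy])


def merge_candidates(clusters, attrs, max_distance):
    """Attribute tuples whose removal makes `attrs` equal to an existing cluster."""
    found = []
    for distance in range(1, max_distance + 1):
        for drop in combinations(sorted(attrs), distance):
            if attrs - set(drop) in clusters:
                found.append(drop)
    return found


clusters = {
    frozenset({"male", "movies"}): 3,
    frozenset({"male", "movies", "knitting", "twilight"}): 1,
}
lonely = frozenset({"male", "movies", "knitting", "twilight"})
freq = {"male": 4, "movies": 4, "knitting": 1, "twilight": 1}
attribute_value = {"male": 1, "movies": 2, "knitting": 5, "twilight": 3}

print(merge_candidates(clusters, lonely, max_distance=1))
print(merge_candidates(clusters, lonely, max_distance=2))
print(rank_attributes(lonely, freq, attribute_value, "combined"))
```

<p>With a look-ahead of 1 the lonely cluster finds no partner; widening the search to 2 reveals that dropping <tt>knitting</tt> and <tt>twilight</tt> together lets it merge.</p>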
<p><strong>Memory Challenges</strong></p>
<p>A constant challenge we face is fitting all of our data into memory. It is highly advantageous to hold as many clusters in memory as possible; the more data we have in memory, the more easily we can find clusters to merge together, and the faster we can get to <em>k</em>-anonymity while suppressing as few attributes as possible. The various techniques we use to cut down our memory usage could fill a blog on their own, but one strategy has been particularly powerful: incremental anonymization.</p>
<p>The key observation is that as a dataset is anonymized and clusters merge, its memory footprint decreases. So even if we can only fit 80 million clusters in memory initially, after anonymizing the dataset to <em>k=2</em>, our memory usage may have dropped to 60%. We can continue reading in data and 2-anonymizing it, until we hit the point that we have maxed out our memory usage, even after anonymization. At this point, we can double our k value, and anonymize to <em>k=4</em>, and repeat the process. This approach gives us a lot of freedom to tweak JVM performance, ensuring that our free heap space never drops too low (as we don&#8217;t want garbage collection costs to dominate our runtime).</p>
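<p>The control loop might look roughly like this in Python. The real system runs on the JVM and measures heap usage; here a cluster-count limit stands in for memory, and <tt>crude_anonymize</tt> is a throwaway stand-in for the actual anonymizer (it simply strips every attribute from undersized clusters).</p>

```python
def crude_anonymize(clusters, k):
    """Stand-in anonymizer: clusters smaller than k lose all their attributes."""
    result = {}
    for attrs, size in clusters.items():
        key = attrs if size >= k else frozenset()
        result[key] = result.get(key, 0) + size
    return result


def incremental_anonymize(batches, anonymize, max_clusters, k=2):
    """Read data batch by batch, doubling k whenever memory stays full."""
    clusters = {}
    for batch in batches:
        for attrs in batch:
            clusters[attrs] = clusters.get(attrs, 0) + 1
        clusters = anonymize(clusters, k)
        total = sum(clusters.values())
        # Still over budget after anonymizing: raise k and try again.
        while len(clusters) > max_clusters and k <= total:
            k *= 2
            clusters = anonymize(clusters, k)
    return clusters, k


f = frozenset
batches = [
    [f({"a"}), f({"a"}), f({"b"})],
    [f({"b"}), f({"c"}), f({"c"}), f({"d"})],
]
final_clusters, final_k = incremental_anonymize(
    batches, crude_anonymize, max_clusters=2
)
print(final_k, final_clusters)
```

<p>In this toy run the first batch fits at <em>k=2</em>, the second does not, and the loop moves on to <em>k=4</em>. With such a crude anonymizer that wipes out every attribute, which is exactly the kind of loss a better merging strategy is meant to avoid.</p>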
<p><strong>What comes next?</strong></p>
<p>Our first set of goals remains the same as ever—refine our strategies to increase attribute retention.</p>
<p>One interesting idea has been to use non-determinism in our algorithm, and slightly adjust our understanding of what it means to be <em>k</em>-anonymized with respect to our data. The current goal of anonymization is, when presented with a set of attributes we served, to be able to say: “there are over a thousand people in our database we dropped this set of attributes on.”</p>
<p>If, however, we added an element of chance into the selection of which attributes to actually drop in a person&#8217;s cookie, we could instead say “there are over a thousand people in our database we <em>could have</em> dropped this set of attributes on”: <em>k</em>-anonymity would be preserved. The big challenge in implementing this non-deterministic anonymization is finding a full list of entities which could be in a cluster. We have some ideas on how to do this efficiently though—and when we do, it will probably be a blog post of its own!</p>
<p><strong>Conclusion</strong></p>
<p>Again, this is a really hard problem, and we don&#8217;t purport to have solved it yet. But it&#8217;s incredibly important to us, and we expect to be continually working on and refining our solution well into the future.</p>
<p>Many thanks to Ben Podgursky, our awesome summer intern who helped me write this post and who&#8217;s responsible for a lot of the work described above.</p>
<p>&#8212;</p>
<p><sup>1</sup> Why the name Anonymouse? Well, as you&#8217;ll see, the project involves anonymizing browser cookies that we drop&#8212;and as we all know, <a href="http://www.amazon.com/If-You-Give-Mouse-Cookie/dp/B00159UUZA">if you give Anonymouse a cookie&#8230;</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.rapleaf.com/dev/2010/07/20/anonymouse/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>

