Anonymouse

Update-11/4/11:  This article reflects our policies as of Fall 2010.  Since that time, we’ve continued to update our policies in line with industry best practices, including those developed by the DMA, NAI and IAB.  As our technology and products continue to evolve, we’re always committed to certain fundamental privacy principles:  that users have control over their data, that data collection and use be made as transparent as possible, and that online behavioral tracking data should never be merged with a person’s real-life identity.   For our most current privacy practices in our online advertising data business (a division within Rapleaf called LiveRamp), refer to LiveRamp’s current privacy standards here.


Privacy is an incredibly important issue to us at Rapleaf; it informs all our business and engineering decisions. Occasionally, privacy concerns can lead us to some really interesting engineering challenges. We love this: not only do we get to work on protecting our users’ privacy, we also get a chance to tackle ridiculously challenging problems—stuff no one else is working on. It’s a win-win situation. One recent effort that exemplifies this attitude at Rapleaf is our Anonymouse1 project.

The Problem

Operating our Rapleaf Display Media product involves dropping cookies on users’ browsers. These cookies contain various tidbits of information about the user—basic demographics, interest data, and the like. Using these cookies, we make it possible for content providers to customize their websites to individual users. It’s pretty exciting stuff, and we think that it’ll change the face of the web.

In privacy circles, there’s a concept known as Personally Identifiable Information (or PII). PII refers to data that can be uniquely linked to a specific individual: things like name, address, phone number, or email. It goes without saying that we don’t drop any PII in our cookies. But it’s become abundantly clear that simply stripping your dataset of PII is not enough to make it anonymous.

Here’s a sample dataset to help illustrate the problem:

Anonymouse Dataset

A simple sample dataset.

Each row in the table represents a single record—an individual for whom we have some data. For example, we know that the first individual is a male between the ages of 35 and 55 who uses Facebook and enjoys watching movies; the third record represents a young female Facebook user; and so on. As you can see, these records are fairly anonymous—every record looks exactly like another record in the data. In particular, we would not be able to trace a given set of attributes back to a specific individual in this table.

Now let’s change our dataset just a bit:

Anonymouse Dataset 2

A slightly less simple dataset.

Notice the new interest categories. Specifically, take a look at that bottom record: a 56+ year-old man who enjoys Twilight, knitting, and Motocross. In the dataset, there aren’t any other records that look like him. Furthermore, if we were given just that set of attributes, we’d be able to tie them back to that specific record. Even though each individual attribute is non-identifying, the dataset is no longer anonymous.

Anonymouse

The goal of Anonymouse is to selectively exclude data from the cookies we drop so that our users are sufficiently indistinguishable. We define “sufficiently indistinguishable” using the notion of k-anonymity. A dataset is k-anonymous as long as every record in the set is identical to no fewer than k-1 other records. We can therefore think of a k-anonymous dataset as consisting of clusters of records, or equivalence classes, of size k or greater.

Furthermore, we wouldn’t just like to k-anonymize the dataset; we’d also like to maintain as much valuable data as possible. We could trivially anonymize the dataset by dropping all attributes from everyone, but for obvious reasons that’s not an acceptable solution. Instead, we define the value of an anonymization strategy by its value retained: the value of the anonymized dataset divided by the value of the un-anonymized data. We calculate the value of a dataset by taking the sum over all attributes of the value of that attribute times the number of entities exhibiting that attribute.

Finding an optimal solution is really hard—NP-hard, actually—so the best we’ll be able to do is an approximation algorithm. Even so, it’s still an incredibly tough problem, and we’re still working on our solution.

There’s been some great research into the k-anonymity problem, but we’ve found that our problem is idiosyncratic for a few reasons. First and foremost, we’re dealing with a dataset orders of magnitude larger than what most research has been done on: we have approximately 1 billion records and thousands of possible attributes available. Furthermore, our attributes are all boolean—they are either present or not. We also have differing values between attributes—e.g., it might be more valuable to know if someone enjoys watching Motocross than to know that they’re male. And we’d like to drop different attributes for different records; a lot of existing research simplifies the problem by focusing on dropping the same set of attributes across all records.

Global vs. Entity-level Suppression

Let’s expand upon that last point a bit more in-depth. There are two general strategies for dropping attributes. First, we could choose to suppress segments globally. When an attribute is suppressed globally, it will be suppressed for every entity in the dataset we’re anonymizing. The other alternative is to perform entity-level suppression. To do this, we choose for each individual which attributes to express. This gives us a much finer degree of granularity over what data we express, but at the cost of a much more complex problem.

Global- vs. entity-level suppression

An example of global- and entity-level suppression strategies on a dataset about media consumption. The asterisks indicate where an active attribute has been dropped.

In the above example, you can see the difference in approaches. Global-level suppression requires us to drop information across entire columns; entity-level suppress means we can drop individual cells in the table. Note that an inactive value doesn’t indicate an active negative—that is, just because we haven’t marked someone as “interested in movies” doesn’t mean they dislike movies. Rather, it means that we can’t say one way or the other. Therefore, removing cells from the table doesn’t actually introduce any misinformation into the table. It just decreases what we know about individuals.

While global attribute suppression may seem like a very blunt weapon to wield, there are several advantages to using it as an anonymization strategy. For instance, when there are a fairly small number of possible attributes, it’s actually tractable to find an optimal global suppression strategy. Using a clever heuristic, that’s what Bayardo and Agrawal were able to do on a dataset with 160 possible attributes and 32,000 entities.

Another advantage to global-suppression strategies is more subtle, but can help from an implementation standpoint: a global-suppression solution can be described as a single set of expressed attributes. For instance, if you have 100 attributes, you might store your solution as a 100-bit vector: 1 means you keep the attribute and 0 means it gets dropped. The size of this solution is 100 bits, regardless of how many records you have in your dataset. If instead we wanted to do entity-level suppression, we’d have to keep track of which attributes get suppressed for each record in our dataset. Thus, just to store our solution requires space proportional to the number of records in our dataset—and if you have a billion records, this can get a bit unwieldy.

Global Suppression

With these advantages in mind, we first decided to try developing a heuristic global suppression anonymizer. Our strategy was very similar to the heuristic approach mentioned in the Bayardo and Agrawal paper mentioned earlier. Essentially, we start with a solution with no attributes suppressed, and use a hill-climbing approach to add or remove attributes, until no attribute can be added or suppressed to increase the solution value.

Implementation was not especially difficult, and we got a prototype working soon. There was only one drawback to using this approach: our results were awful. Out of the 552 attributes we were testing with, barely 50 were included in the solution. The value of the solution we produced was barely over 50%.

We tried multiple variations on this approach. We started with all attributes expressed, and iteratively removed them. We tried to find optimal solutions on smaller datasets. In the end, though, our results simply weren’t close to where we needed them to be.

When we looked deeper into why global suppression fares so poorly on our data, it turns out to be for the same reason Rapleaf data is so valuable: many of the attributes we store are uncommon, describing very specific populations. But since the attributes cover such small parts of the population, no algorithm could add them to a global solution without doing serious damage to anonymity.

Cluster-level Suppression

So we moved onto suppressing attributes at an entity level. We found it easiest to think of the problem state as a set of clusters—sets of entities that express the same attributes, given the current anonymization. The solution is brought to k-anonymity by merging and splitting these clusters. Two clusters merge when the segments they have suppressed makes them indistinguishable from each other. Likewise, a cluster splits when some of the entities in it begin expressing an attribute which some others do not.

A key question is: how do we know which attributes to suppress in a cluster, in the hope that dropping the attributes will make it merge it with another cluster? There is no easy answer here. We’ve tried a number of heuristic approaches: first try dropping the least frequent attribute; first drop the least valuable attribute; a combination of value and frequency. We can try to perform further look-ahead, i.e., instead of looking for clusters with 1 attribute different from the current cluster, also search those a distance of 2 or 3. This is an area in which we see room for improvement, and are continuing to refine our strategy.

Memory Challenges

A constant challenge we face is fitting all of our data into memory. It is highly advantageous to hold as many clusters in memory as possible; the more data we have in memory, the more easily we can find clusters to merge together, and the faster we can get to k-anonymity while suppressing as few segments as possible. The various techniques we use to cut down our memory usage could fill a blog on their own, but one strategy has been particularly powerful: incremental anonymization.

The key observation is that as a dataset is anonymized and clusters merge, its memory footprint decreases. So even if we can only fit 80 million clusters in memory initially, after anonymizing the dataset to k=2, our memory usage may have dropped to 60%. We can continue reading in data and 2-anonymizing it, until we hit the point that we have maxed out our memory usage, even after anonymization. At this point, we can double our k value, and anonymize to k=4, and repeat the process. This approach gives us a lot of freedom to tweak JVM performance, ensuring that our free heap space never drops too low (as we don’t want garbage collection costs to dominate our runtime).

What comes next?

Our first set of goals remains the same as ever—refine our strategies to increase attribute retention.

One interesting idea has been to use non-determism in our algorithm, and slightly adjust our understanding of what it means to be k-anonymized with respect to our data. The current goal of anonymization is, when presented with a set of attributes we served, to be able to say: “there are over a thousand people in our database we dropped this set of attributes on.”

If, however, we added an element of chance into the selection of which attributes to actually drop in a person’s cookie, we could instead say “there are over a thousand people in our database we could have dropped this set of attributes on”: k-anonymity would be preserved. The big challenge in implementing this non-deterministic anonymization is finding a full list of entities which could be in a cluster. We have some ideas on how to do this efficiently though—and when we do, it will probably be a blog post of its own!

Conclusion

Again, this is a really hard problem, and we don’t purport to have solved it yet. But it’s incredibly important to us, and we expect to be continually working on and refining our solution well into the future.

Many thanks to Ben Podgursky, our awesome summer intern who helped me write this post and who’s responsible for a lot of the work described above.

1 Why the name Anonymouse? Well, as you’ll see, the project involves anonymizing browser cookies that we drop—and as we all know, if you give Anonymouse a cookie…

  • Facebook
  • HackerNews
  • Reddit
  • Twitter
  • del.icio.us
  • Digg
  • Slashdot
  • StumbleUpon
This entry was posted in Anonymouse and tagged , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

6 Comments

  1. Abram
    Posted July 22, 2010 at 1:03 am | Permalink

    This post is awesome. I’ve been kicking around ideas like this (though I don’t actually have datasets yet). This post brought my thinking ahead many steps down the path. Thank you! Keep it up! You’re solving a human problem, not just a business problem. Surely you’ve changed the future of privacy.

  2. Posted September 23, 2010 at 12:54 pm | Permalink

    Great post ! You guys are really solving interesting problems to deliver incredible intelligence without sacrificing privacy. Keep it up !

  3. Posted October 22, 2010 at 2:31 am | Permalink

    good post interesting solution….

  4. Don
    Posted October 24, 2010 at 2:24 am | Permalink

    It would seem that you could use an MD5 hash result to define each record owner and record type using hash results that uniquely defines a records attributes. Where the hash result is the same, the records match.

  5. Matthijs R. Koot
    Posted November 2, 2010 at 12:07 pm | Permalink

    Excellent post, Greg! I hope Ben Podgursky got a good grade on this work. It is *really* good to see ideas from the scientific community, such as k-anonymity, being applied, pondered on and improved ‘outside’ that community. Please consider submitting a write-up of this work to the HotPETS track of the PETS 2011 conference ( http://petsymposium.org/2011/ ) as your work continues; and talk about it at engineering conferences. Awesome!

  6. Posted December 15, 2010 at 9:47 pm | Permalink

    There are much simpler methods using n-dimensional bit encoding that do not require the k-1 solution.
    It is used in cellular call detail record analysis for telecom carriers.

9 Trackbacks

  1. By How Rapleaf Anonymizes Information | Rapleaf on July 21, 2010 at 12:02 pm

    [...] Read the full post here Share this post: [...]

  2. [...] The Rapleaf engineering team takes the microphone on the company's development blog to introduce what it calls "Anonymouse" and how it tackles privacy requirements. From the blog: "The goal of Anonymouse is to selectively exclude data from the cookies we drop so that our users are sufficiently indistinguishable." Take a deep dive here. [...]

  3. By Rattle » Week #1275 on July 30, 2010 at 7:48 am

    [...] Anonymouse – great example of user privacy protection [...]

  4. [...] the cookie is only part of the challenge. Data also needs to be anonymized appropriately.  Simply stripping personally identifiable information out of a cookie is not enough to make it anonymous. Recently, Netflix had to shut down its million-dollar Netflix recommendation contest as a result [...]

  5. By Anonymouse | Techarama on October 21, 2010 at 11:20 pm

    [...] Comments View full post on Hacker News [...]

  6. By SearchCap: The Day In Search, October 22, 2010 on October 22, 2010 at 2:09 pm

    [...] Anonymouse, Engineering Rapleaf [...]

  7. By The Science Behind Rapleaf’s Anonymization on May 24, 2011 at 12:00 pm

    [...] Read the full post here This entry was posted in Privacy Related Uses, Rapleaf Updates. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL. « USA World Cup Fans on Twitter [INFOGRAPHIC] How to Develop an Integrated Marketing Strategy » [...]

  8. [...] a user, it is often possible to de-anonymize that data back to a particular user. If you read our post about Anonymouse a few weeks ago, you’ll know that we’re spending a lot of resources on solving this [...]

  9. [...] the cookie is only part of the challenge. Data also needs to be anonymized appropriately.  Simply stripping personally identifiable information out of a cookie is not enough to make it anonymous. Recently, Netflix had to shut down its million-dollar Netflix recommendation contest as a result [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

  • Rapleaf Is Hiring!

    We are looking for engineers who want to solve challenging problems.

    We have great people, do great work, and have great perks.

    Know someone who might be interested? Refer a friend and get $5,000 for successful hires.

    See our current openings at
    www.rapleaf.com/careers