Announcing Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store

We’re really excited to announce the open-source debut of a cool piece of Rapleaf’s internal infrastructure, a distributed database project we call Hank.

Our use case is very particular: we have tons of data that needs to get processed, producing a lot of data points for individual people, which then need to be made randomly accessible so they can be served through our API. You can think of it as the “process and publish” pattern.

For the processing component, Hadoop and Cascading were an obvious choice. However, making our results randomly accessible for the API was more challenging. We couldn’t find an existing solution that was fast, scalable, and perhaps most importantly, wouldn’t degrade performance during updates. Our API needs to have lightning-fast responses so that our customers can use it in realtime to personalize their users’ experiences, and it’s just not acceptable for us to have periods where reads contend with writes while we’re updating.

We boiled this all down to the following key requirements:

  1. Random reads need to be fast – reliably on the order of a few milliseconds.
  2. Datastores need to scale to terabytes, with keys and values on the order of kilobytes.
  3. We need to be able to push out hundreds of millions of updates a day, but they don’t have to happen in realtime. Most will come from our Hadoop cluster.
  4. Read performance should not suffer while updates are in progress.

Additionally, we identified a few non-requirements:

  1. During the update process, it doesn’t matter if there is more than one version of our datastores available. Our application is tolerant of this inconsistency.
  2. We have no need for random writes.

The system we came up with is tailored to meet these needs. It consists of a fast, read-only data server backed by a custom-designed batch-updatable file format, a set of tools for writing these files from Hadoop, and a special daemon process that manages the deployment of data from the Hadoop cluster to the actual server machines. Clients of Hank are aware of ongoing updates and avoid connecting to servers that are busy. When the time comes to push out a new version of our data, the data deployer allows only a fraction of the data servers to perform an update at a time, making sure that sufficient data serving capacity remains online.
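The "only a fraction at a time" rule can be sketched roughly as follows. This is not Hank's actual deployer code: the class RollingUpdatePlanner, the method planBatches, and the maxOfflineFraction parameter are hypothetical names chosen for illustration, and the sketch assumes all servers are interchangeable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hank's real deployer: split the servers into
// update batches so that only a bounded fraction is offline at any moment.
final class RollingUpdatePlanner {
  static List<List<String>> planBatches(List<String> servers, double maxOfflineFraction) {
    if (maxOfflineFraction <= 0.0 || maxOfflineFraction > 1.0) {
      throw new IllegalArgumentException("maxOfflineFraction must be in (0, 1]");
    }
    // Always update at least one server per batch so the rollout makes progress.
    int batchSize = Math.max(1, (int) Math.floor(servers.size() * maxOfflineFraction));
    List<List<String>> batches = new ArrayList<>();
    for (int start = 0; start < servers.size(); start += batchSize) {
      int end = Math.min(servers.size(), start + batchSize);
      batches.add(new ArrayList<>(servers.subList(start, end)));
    }
    return batches;
  }
}
```

A deployer built along these lines would update one batch, wait for it to come back online, and only then move on to the next, so the remaining servers keep answering reads throughout.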

There’s a more detailed look at the architecture and infrastructure of the project, and you can find the code on GitHub, which is shared under the Apache Software License. This codebase is still a work in progress – our older, internal version was in need of a serious refactor – but most of the necessary pieces are there, and we’re going to finish the development in the open. We’d love to hear your thoughts on the project and would doubly love to get your contributions, whatever form they might take.

Posted in Uncategorized | 7 Comments

Reusing connections for performance

Some of our customers were seeing some really high response times when accessing our API from various places around the US and they asked us to help get to the bottom of the situation. We were surprised when one customer reported seeing times of 500+ms because we measure the response times of our requests and usually get an average of 5-6ms. Our hypothesis was that the lack of persistent connections and/or internet latency were the root cause, so we used Keynote to take some measurements from around the US.

 

The graph below shows the total request time (including DNS lookup) based on the region where your servers are located:

Latency is definitely an issue, though there’s very little we can do to alleviate it short of opening datacenters across the US. As our API is entirely dynamic content, a traditional CDN approach would have little impact, though we are exploring moving the SSL negotiation to an edge network to help in situations where connection re-use is not practical.

Taking a closer look at an individual request, you can see that 75% of the time is spent just setting up the secure connection. We require SSL for all our API calls in order to protect your users’ identities. To understand how this can affect response times let’s take a quick look at how SSL negotiation works.

A detailed description of SSL negotiation can be found on Wikipedia, but at a minimum you’re looking at 12 round trips before we ever get your API request – and this is under optimal circumstances. Sometimes conditions cause TCP fragmentation, retries, etc., which can further degrade response times.

To get the best performance when using our API you should reuse your connections to avoid setting up and tearing down both TCP and SSL for each request. Once you’ve got this enabled you can eliminate about 75% of the time spent on each request! In addition, if the API client in your application supports HTTP pipelining, you should enable it so you can send off a number of requests on the same connection without having to wait for the response to the first request.
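As a rough illustration of connection reuse, here is a minimal Java sketch using the standard java.net.http client (Java 11+), which keeps connections open and reuses them for later requests to the same host. The endpoint URL, the email query parameter, and the class name ReusingApiClient are placeholders invented for this example, not the real API.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class ReusingApiClient {
  // Hypothetical endpoint, used only for illustration.
  private static final String BASE_URL = "https://api.example.com/v1/lookup";

  // One client for the whole application: it keeps a pool of open connections,
  // so later requests to the same host can skip the TCP and SSL setup.
  private final HttpClient client = HttpClient.newHttpClient();

  static HttpRequest buildRequest(String email) {
    String query = "email=" + URLEncoder.encode(email, StandardCharsets.UTF_8);
    return HttpRequest.newBuilder(URI.create(BASE_URL + "?" + query)).GET().build();
  }

  String lookup(String email) throws IOException, InterruptedException {
    HttpResponse<String> response =
        client.send(buildRequest(email), HttpResponse.BodyHandlers.ofString());
    return response.body();
  }
}
```

The important design choice is that the client object is created once and shared. Creating a new client (or a new socket) per request throws the open connection away and pays the full handshake cost every time.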

 

Please feel free to ask any questions you have in the comments.

 



Posted in Operations, Personalization API | 2 Comments

Introducing the Utilities API

Rapleaf is a San Francisco-based startup with an ambitious vision: we want every person to have a meaningful, personalized experience. We want you to see the right content at the right time, every time. However, delivering relevant, personalized content requires a deep understanding of individuals’ attributes. At times, our knowledge of individuals is incomplete or in an irregular, unusable form and we need a way to fill in the gaps. For cases like these, we’ve built internal tools for data deduction and sanitization.

When tools like these are very general-purpose in nature, we add them to an internal API accessible to our whole development team. Today, I’m happy to announce that we’re starting an initiative to open up these tools to the public via the Utilities API. We hope that this API will allow external developers to leverage our toolset to improve the quality of their own applications.

Here are the first tools that we’ve added to the Utilities API:

Name to Gender

This function takes a string as a person’s first name and deduces the likelihood that the person has either gender.

Example: “mike” => “Gender: Male, Likelihood: .9946”

Name Deducer

This function takes a string as a user name and attempts to parse it into its constituent components.

Example: “dolegbob42” => “First: Bob, Middle: G, Last: Dole.”

Name Normalizer

This function takes a string as a name and attempts to parse it into its constituent components.

Example: “mr john g smith iv” => “Prefix: Mr., First: John, Middle: G, Last: Smith, Suffix: IV”

Often, you might find these functions useful in conjunction with one another. For instance, imagine that you have a sign up process that requires a user to provide their email, full name, and gender. Many email addresses contain information about the owner’s name, and a person’s name contains information about their gender. In order to save your users time and increase the likelihood that they follow through with the sign up process, you could eliminate the name and gender fields. In their stead, you could apply name deduction to their email to determine their name and apply gender inference to their first name to determine their gender.

Getting Started

Try out our demo: a simple UI that demonstrates the current offerings of the API. You can also check out the documentation for better information about how to get the API working for you.

We look forward to hearing back from you about ways that you’ve found the API useful and how we can make it better.

Posted in Miscellaneous | 5 Comments

Tag to Interest Mapping

We’re always looking for ways that our Personalization API can be made more valuable and intuitive. This week, we decided to see what things are like from the other side and actually build an application that uses the API. Our application uses tags attached by users and interest data provided by the API to prioritize site content to visitors. However, we ran into a problem: how do you map arbitrary user-entered tags to Rapleaf’s interest categories?

In particular, the Personalization API returns a list of interest categories (the full list is cataloged in our response guide). Rapleaf’s interest categories are limited to a careful selection, while the set of interests that a user might enter is effectively infinite. It’s easy enough for humans to decide whether or not tags like “Sports & Recreation” and “Athletics” are related, but, for our application to be useful, we needed a way to automate the process.

In order to improve our interest coverage, we created an open-source library that works in conjunction with our development kits to make a best-guess mapping from an arbitrary term to a Rapleaf interest category. You can find it on our GitHub page. We’d love to hear from you about the creative ways you’ve thought of using our tools, as well as ways in which we can make it easier for you to integrate the API, at developer@rapleaf.com.

How it Works

We consider two things when performing the mapping: syntax and semantics. In terms of syntax, we wanted to catch low-hanging fruit by matching singular and plural as well as a variety of conjugations. To accomplish this, we used the Levenshtein distance measure. The Levenshtein distance between two strings is given by the minimum number of operations (i.e. character insertion, deletion, or replacement) needed to transform one into the other. In terms of semantics, we wanted to find a way to relate words that were lexically dissimilar but were semantically equivalent. We accomplished this via the WordNet thesaurus tool. WordNet groups English words into sets of synonyms (“synsets” in WordNet lingo). If there exists a short WordNet path between two words, then it’s reasonable to think that the two words are semantically related.
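For readers who haven't met it before, the Levenshtein distance can be computed with a small dynamic program. The sketch below is a textbook version written for this post, not the code from our library; the class and method names are illustrative.

```java
// Textbook dynamic-programming Levenshtein distance (illustrative names).
final class Levenshtein {
  static int distance(String a, String b) {
    // prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // turning "" into the first j chars of b takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // turning the first i chars of a into "" takes i deletions
      for (int j = 1; j <= b.length(); j++) {
        int replaceCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        int viaInsert = curr[j - 1] + 1;
        int viaDelete = prev[j] + 1;
        int viaReplace = prev[j - 1] + replaceCost;
        curr[j] = Math.min(Math.min(viaInsert, viaDelete), viaReplace);
      }
      int[] swap = prev;
      prev = curr;
      curr = swap;
    }
    return prev[b.length()];
  }
}
```

With this measure, "sport" and "sports" are one edit apart, which is why a small distance threshold catches most singular/plural pairs.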


Image from the Visual Thesaurus, Copyright ©1998-2010 Thinkmap, Inc. All rights reserved.


Posted in Personalization API | Leave a comment

Memory-efficient sparse bitsets

A bitset is a data structure designed to store a vector of boolean values very compactly – one bit per value. In practice, bitsets are a really handy way to save memory. However, we had a situation in one of our extremely memory-intensive applications where a simple bitset wouldn’t cut it. We have over 2500 variables to store bits on, meaning that our bitsets take up over 300 bytes each!

If most or all of the bits were usually set, this would be an unavoidable issue, but in our case, the set bits were very rare – usually no more than 20 or 30. This leads to a large amount of wasted space. The traditional approach to sparse sets like these is to just store the position number of the set variables directly in a collection. To allow for the full 2500 position numbers, we needed a short int, meaning that with this approach, the memory size of the collection is 2 bytes times the number of elements. A sparse set of 30 elements will take a lot less memory than the equivalent bitset (60 bytes versus 313), but in our application, it’s still too much.

What if we could combine the tight-packing benefits of a bitset with the low impact of the sparse set? It turns out that we can. In our application, some of the variables are set pretty frequently (1/4 records have it set) and some are set very rarely (1/10000), with a gradient of various frequencies between. We can exploit this gradient to make really efficient storage decisions. Here’s the graph of frequencies over our set:

It comes down to this: in a bitset, each possible variable in your set costs you one bit, whether or not it’s set. In a sparse set, with our example of 2500 variables, each variable costs 16 bits, but only when it’s set. Comparing these two costs gives a clear tradeoff point. When a given variable is set on at least one of every 16 records, you should store it in a bitset; when it is set on one out of every 17 or more records, then you should store it in the sparse set.

To build this type of set in practice, you just need to get your variables into a list ordered by frequency of occurrence and then apply the cutoff. Everything above should be managed through a bitset, and everything below should be managed by a collection. We found it handy to wrap both of these sets up in a single class so that the user can be indifferent to where the data is stored.
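Here is a minimal sketch of what such a wrapper class might look like. It is not our production code: the name HybridBitSet, the chooseCutoff helper, and the assumption that variables have already been renumbered from most to least frequent are all illustrative. The cutoff rule is the one described above: a dedicated bit costs 1 bit per record, while a sparse entry costs 16 bits times the variable's frequency, so the bit wins whenever frequency × 16 ≥ 1.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative sketch. Variables are assumed to be renumbered so that index 0
// is the most frequent. Indices below `cutoff` live in a BitSet; the rest are
// kept as a sorted array of 16-bit positions.
final class HybridBitSet {
  private final int cutoff;
  private final BitSet dense;
  private short[] sparse = new short[0];

  HybridBitSet(int cutoff) {
    this.cutoff = cutoff;
    this.dense = new BitSet(cutoff);
  }

  // Counts the leading variables that are frequent enough to deserve their own bit.
  static int chooseCutoff(double[] frequenciesDescending, int bitsPerSparseEntry) {
    int cutoff = 0;
    while (cutoff < frequenciesDescending.length
        && frequenciesDescending[cutoff] * bitsPerSparseEntry >= 1.0) {
      cutoff++;
    }
    return cutoff;
  }

  void set(int index) {
    if (index < 0 || index > Short.MAX_VALUE) {
      throw new IndexOutOfBoundsException("index out of range: " + index);
    }
    if (index < cutoff) {
      dense.set(index);
      return;
    }
    short key = (short) index;
    int pos = Arrays.binarySearch(sparse, key);
    if (pos >= 0) {
      return; // already set
    }
    int insertAt = -(pos + 1);
    short[] grown = new short[sparse.length + 1];
    System.arraycopy(sparse, 0, grown, 0, insertAt);
    grown[insertAt] = key;
    System.arraycopy(sparse, insertAt, grown, insertAt + 1, sparse.length - insertAt);
    sparse = grown;
  }

  boolean get(int index) {
    if (index < 0 || index > Short.MAX_VALUE) {
      return false;
    }
    if (index < cutoff) {
      return dense.get(index);
    }
    return Arrays.binarySearch(sparse, (short) index) >= 0;
  }
}
```

java.util.BitSet is used here for brevity. A version tuned for memory would more likely pack the dense bits into a byte[] directly, since BitSet carries some object overhead of its own.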

Posted in Miscellaneous | 2 Comments

Día de los Proyectos Muertos

At Rapleaf, every employee is required to be entrepreneurial, and a big part of entrepreneurship is the willingness to invest enormous efforts in initiatives that might not succeed. It has thus always been an important part of our culture to celebrate well-executed successes and well-executed failures alike.

The cultural recognition of “constructive failure” is identified by Dan Senor and Saul Singer in their book Start-Up Nation as one of the ingredients of Israel’s entrepreneurial “miracle”:

Israeli attitude and informality flow also from a cultural tolerance for what some Israelis call “constructive failures” or “intelligent failures.” Most local investors believe that without tolerating a large number of these failures, it is impossible to achieve true innovation. In the Israeli military, there is a tendency to treat all performance – both successful and unsuccessful – in training and simulations, and sometimes even in battle, as value-neutral. So long as the risk was taken intelligently, and not recklessly, there is something to be learned.

In this spirit, we took a few moments last week to celebrate the memory of a handful of the wonderful projects that were terminated this past year. We derived inspiration from the Mexican holiday Día de los Muertos, also occurring last week, in which dead friends and loved ones are remembered and celebrated, sometimes in “a humorous tone, as celebrants remember funny events and anecdotes about the departed.”

Our event was designated Día de los Proyectos Muertos (Day of the Dead Projects), and each departed project received a poignant (read: spectacularly funny) eulogy given by one of our engineers. There were multiple highlights, but the top crowd-pleaser was a eulogy given for a reporting system that was put to rest after three productive years. The eulogy recalled the conception, birth, high-points and low-points of the system’s “full and meaningful life”, and featured a filtered selection of colorful svn commit comments.

Posted in Miscellaneous | 2 Comments

Striving for zero copies with Thrift 0.5

“Zero copies” is a common optimization principle used in high-performance applications. The gist of the technique is to have the smallest number of byte array copies necessary for the server to perform its task. Byte array copies are one of those insidious time-wasters that are hard to understand or even detect until you start looking for them. It seems intuitive to use a perfectly-sized byte array for everything you do: it’s straightforward, it reduces the number of arguments you have to pass to each method, and most of all, it’s just simple. However, you actually pay a steep price every time you copy a byte array – the CPU is spinning away shuffling bytes from one memory location to another. It’s actually even worse in Java: every time you create a new byte[], you’re both allocating memory and looping over it to zero each position out. This means you pay a price now as you iterate over every position and a price later when you ultimately have to garbage collect the new byte[] you threw away. An ideal server would never copy a byte[] unnecessarily, preferring to reuse the same one over and over again.

Before Thrift 0.4, no matter how much you might want to, there was no way to avoid doing an extra byte[] copy for each binary field that you deserialized, despite the fact that virtually all deserialization happens directly from an in-memory byte[] buffer. Thrift 0.4 changed that by switching the underlying type of binary fields from byte[] to the Java NIO construct ByteBuffer; Thrift 0.5 elaborated on this theme by making it easier to get the byte[]s that everyone expected while still offering access to the ByteBuffer for more advanced operations.

So how do you actually use this feature to speed up your servers? Let’s take a look at a pair of examples. In both examples, we’ll use the following Thrift file as our base:


struct A {
  1: required binary foo;
}

service SomeService {
  A read();
  void logFoo(1: A a);
}

The logFoo method

Let’s pretend that your objective is to log the contents of the foo field to some stream. Here’s how you might do it naively:


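private DataOutputStream out;

public void logFoo(A a) throws TException {
  // getFoo() returns a right-sized byte[], which costs a copy
  byte[] value = a.getFoo();
  try {
    out.writeInt(value.length);
    out.write(value);
  } catch (IOException e) {
    // the generated interface only declares TException
    throw new TException(e);
  }
}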

Seems simple enough. So what’s wrong here? The problem is that calling getFoo() causes a byte[] copy. It’s hidden from you by the method, but it’s happening nonetheless. The copy you create is only used for an instant, becoming garbage after you pass it to write(), and then the entire A object becomes garbage.

Here’s the right way to do it:


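private DataOutputStream out;

public void logFoo(A a) throws TException {
  // bufferForFoo() exposes the underlying ByteBuffer without copying
  ByteBuffer value = a.bufferForFoo();
  try {
    out.writeInt(value.remaining());
    // write straight out of the backing array: (array, offset, length)
    out.write(value.array(), value.arrayOffset() + value.position(), value.remaining());
  } catch (IOException e) {
    throw new TException(e);
  }
}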

There’s a lot here, so let’s break it down. First, notice that we called “bufferForFoo” instead of “getFoo”. This returns a ByteBuffer instead of a byte[]. Then, we use the remaining() method to get the number of bytes in the buffer that belong to this value. Finally, we go to the write() call, but this time using the “array, offset, length” version of write. This allows us to reference a subarray directly from the array that backs value without any intermediate copying. The first byte of the value sits at index arrayOffset() + position() in the backing array: arrayOffset() is where the buffer’s index 0 falls within the array, and position() is how far into the buffer the unread bytes begin.

It’s a small difference with a bit more code, but depending on the size of foo, you could see a substantial boost in performance.
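If the arrayOffset() + position() arithmetic feels opaque, this small standalone sketch shows it using only standard java.nio calls. The class name BackingArrayDemo and the sample bytes are made up for the illustration; nothing here is Thrift-specific.

```java
import java.nio.ByteBuffer;

final class BackingArrayDemo {
  // Index in the backing array of the first unread byte of the buffer.
  static int firstByteIndex(ByteBuffer buffer) {
    return buffer.arrayOffset() + buffer.position();
  }

  static ByteBuffer example() {
    byte[] backing = {10, 11, 12, 13, 14, 15, 16, 17};
    // slice() creates a view whose index 0 is backing[2], so arrayOffset() is 2.
    ByteBuffer view = ByteBuffer.wrap(backing, 2, 5).slice();
    // Consuming one byte moves position() to 1.
    view.get();
    return view;
  }
}
```

In this example the next unread byte is backing[3]: the view starts two bytes into the array and one byte has already been read. Using position() alone would point at the wrong byte whenever the buffer is a slice of a larger array.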

The read method

Now let’s look at things from the other side of the equation. Let’s say that the objective of the read() method is to read the bytes of foo from an input stream and return them wrapped in an instance of A. Here’s what the naive approach might look like:


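public A read() throws TException {
  // assume that "in" is a DataInputStream
  try {
    int fooLength = in.readInt();
    byte[] value = new byte[fooLength];
    in.readFully(value);
    A result = new A();
    result.setFoo(value);
    return result;
  } catch (IOException e) {
    // the generated interface only declares TException
    throw new TException(e);
  }
}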

The problem with this method is that every call leads to a new short-lived instance of A and a brand new perfect-sized byte[]. Both of these will become garbage very soon, and allocating the new byte[] every time is a drag on your CPU.

Let’s focus on how we can reuse the byte[] for now and think about the A instance some other time. There are many possible strategies for caching your buffers, but here’s a simple one:


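// assume that "in" is a DataInputStream, as before
private final ThreadLocal<byte[]> bufferCache = new ThreadLocal<byte[]>();

public A read() throws TException {
  try {
    int fooLength = in.readInt();
    byte[] value = bufferCache.get();
    // first call on this thread, or the cached buffer is too small
    if (value == null || value.length < fooLength) {
      value = new byte[fooLength];
      bufferCache.set(value);
    }
    in.readFully(value, 0, fooLength);
    A result = new A();
    // wrap only the bytes that belong to this value
    result.setFoo(ByteBuffer.wrap(value, 0, fooLength));
    return result;
  } catch (IOException e) {
    throw new TException(e);
  }
}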

There’s a good bit more to this version. First, note that we’re using Java’s ThreadLocal capability to support us keeping a single byte[] per active thread. This makes sure that each thread servicing a client won’t interfere with any other, and there’s no contention (synchronization) for the thread-local buffer. Next, after we figure out how much we need to read, we make a point of checking if we have enough buffer space to complete the read. If not, we replace our buffer with a new, bigger one. Then we complete the read into the buffer – this time specifying the length we want to read, rather than letting the length of the buffer imply the size of the read. This ensures that on subsequent reads, when fooLength is less than value.length, we don’t try to read more than we wanted to. Finally, instead of passing the entire buffer into the foo field, we pass in a ByteBuffer that wraps just the portion that contains the value we read this time.

By using this technique, we’ve avoided one copy per call plus an unknown number of byte[] allocations. If your record sizes vary a lot, it will take some time for the buffer to grow large enough to accommodate the biggest one, but after that, you won’t need any more allocations. If your records are a fixed size, then you should reach that point immediately.

Posted in Miscellaneous | 3 Comments

Safe Onlining

One of Rapleaf’s key services is helping marketing companies bring data they have traditionally used in offline contexts to the online world in a privacy-centric way. We have an internal name for this: “safe onlining”.

Safe Onlining: What Is it?

Many companies have marketing databases that have been gathered offline—for example, from customer surveys or questionnaires, or from purchasing histories. These have been used in previous decades to help companies sort the marketing messages they send, mostly by direct mail and sometimes by email or phone. Today, however, people spend much of their time online, and those previous delivery mechanisms are becoming outmoded.

Safe onlining helps businesses make their databases useful in online contexts, in a way that protects people’s online anonymity and privacy. It’s meant to benefit both people and businesses by enabling companies to personalize the online banner ads and content that people see without interrupting their daily activities (unlike direct mail).

Here’s how safe onlining works:

Step 1: Importing and Mapping

The process begins when a customer comes to Rapleaf with a marketing database that they would like to bring online for advertising, analytics, or personalization. Our first step is to import this database into our system. This isn’t trivial—data arrives in a surprising (and occasionally bewildering) array of formats, and we need to normalize things to match what’s in our existing database.

Moreover, data tends to come in high volume, which makes storage and processing a daunting task. We don’t use a traditional relational database for this purpose, so this import process involves converting our customer’s data into dataunits. If you’re interested in learning a bit more about our data storage schema, check out Bryan’s earlier post on Thrift and pseudo-RDF schemas.

We then need to match the marketing database to our Rapleaf system. Using a corpus of matching data, plus some high-confidence inference algorithms, we are able to map our customers’ input files to entities in the Rapleaf system. This process is a large part of our technology and will be the subject of a future post.

Step 2: Summing

Once the mapping process is complete, we need to package it into a more usable, consumable format. We call this process summing (“summarizing” was too long a word). The workflow that performs summing is called the summer, and the objects that are the end result of this process are called summs. Very creative nomenclature, I know.

The summer creates a summ for each entity. A summ is basically just a package of little tidbits of information, which we call segments. Here are a few examples of segments:

  • Demographic, Age 18-24
  • Demographic, Age 25-34
  • Demographic, Age 35-44
  • Interests, Sports
  • Interests, Sports > Baseball
  • Shopping, In-Market for new SUV

And so on. As you can see, it’s often pretty straightforward to translate an entity’s data into a segment. But you can also imagine how multiple pieces of data might contribute towards a single segment. For example, we might place an Interests, Media segment in a summ based on the fact that a person is interested in music, books, and movies.

In reality, summs and segments are even more complicated than this. For many reasons, we need to keep track of which data partners contributed data for a given segment. We also may need to create different packages of data depending on how it will be applied.

Step 3: Anonymization

Once we have our summs, we need to anonymize them in order to ensure that the data we eventually publish respects consumers’ privacy.

The idea behind anonymization is to ensure that every summ in our datastore looks like some number of other summs. For example, let’s say a person’s summ looks like the following:

  • Demographic, Age 25-34
  • Interests, Pets
  • Interests, Travel

Now let’s imagine that this person is the only individual in our entire datastore who has these three particular attributes. This is a privacy problem, because the segments in this summ are uniquely identifying. If these segments were to make their way into a browser cookie, it would be possible to trace that cookie back to this particular summ, and therefore, the person.

But let’s say we drop that last segment, Interests, Travel. As it turns out, there are many other people in our datastore with the Demographic, Age 25-34 and Interests, Pets segments. Therefore, the summ is no longer uniquely identifying; those two segments could apply to several thousand different entities in our system, not just one person. Based solely on those two segments, there is no longer any way to trace a browser cookie back to the person’s entity. The person is anonymous.

The size of the group of identical summs that any given summ belongs to is usually referred to as k, and we often speak of k-anonymization. This means that it is impossible to trace a given set of segments back to any fewer than k individuals. k = 2 is the most basic level of anonymization; we’re striving for k = 1000.
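To make the definition concrete, here is a small sketch of how one might measure k for a collection of summs. It is an illustration only, not our anonymization pipeline: the class KAnonymity and its methods are hypothetical, each summ is modeled as a plain set of segment names, and k is taken to be the size of the smallest group of identical summs, counting the summ itself.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: a summ is modeled as a set of segment names.
final class KAnonymity {
  // Size of the smallest group of summs that share exactly the same segments.
  // Returns 0 for an empty collection.
  static int smallestGroup(Collection<Set<String>> summs) {
    Map<Set<String>, Integer> groupSizes = new HashMap<>();
    for (Set<String> summ : summs) {
      groupSizes.merge(summ, 1, Integer::sum);
    }
    int smallest = Integer.MAX_VALUE;
    for (int size : groupSizes.values()) {
      smallest = Math.min(smallest, size);
    }
    return summs.isEmpty() ? 0 : smallest;
  }

  static boolean isKAnonymous(Collection<Set<String>> summs, int k) {
    return smallestGroup(summs) >= k;
  }
}
```

In the example above, the summ that includes Interests, Travel forms a group of one, so the dataset fails even k = 2. Once that segment is dropped, the summ joins the larger Age 25-34 plus Pets group and the check passes. Deciding which segments to drop at scale is the hard part, and this sketch does not attempt it.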

Anonymization is an incredibly tricky and interesting problem, especially when you’re dealing with datasets as huge as Rapleaf’s. I could go on for pages about it, but this is already quite a lengthy post, so I’ll leave it at that for now. If you want a more in-depth discussion, be sure to check out our previous post on anonymization.

Step 4: Publishing Cookies

After we’ve generated and anonymized the summs for all our entities, it’s time to actually bring them online.

The first step is to load the anonymized summs into a distributed hash table. The system we use for this is an internally developed read-only hashmap that we’ve optimized for extremely fast lookups. Using this hash table, we can quickly look up a summ from an online identifier.

To bring the summs online, we partner with a variety of publishers: web sites and services that support logged-in users. When a user logs into a publisher’s site, they make that user’s identifier available to Rapleaf for matching purposes only. This request can come via javascript, an HTML <iframe> element, or a small pixel embedded in the publisher’s page. For security purposes, we typically ask that the publisher send us a hashed version of the identifier in their request so that no new data is passed to us by the publishers, just the hashed identifier.

Now we’re at the home-stretch! From the hashed identifier that the publisher sends us, we query our distributed hash table to get the user’s summ. Our response to the publisher’s request is to drop a cookie based entirely on the information in the anonymized summ. The cookie we drop only contains a plain-vanilla set of non-identifying segments: there’s no personally identifiable information, and nothing about the user’s browsing behavior.
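The lookup flow can be sketched in a few lines. Everything specific here is an assumption made for illustration: the post does not say which hash function publishers use (SHA-256 is assumed), how identifiers are normalized (trimming and lowercasing is assumed), or how the cookie value is encoded (a "|"-joined list is assumed). The class SummLookup is hypothetical, and an in-memory map stands in for the distributed hash table.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SummLookup {
  // Stand-in for the distributed hash table: hashed identifier -> anonymized segments.
  private final Map<String, List<String>> summsByHashedId = new HashMap<>();

  // Publisher side (assumed scheme): normalize, hash with SHA-256, hex-encode.
  static String hashIdentifier(String identifier) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      String normalized = identifier.trim().toLowerCase(Locale.ROOT);
      byte[] hash = digest.digest(normalized.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : hash) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  void put(String hashedId, List<String> segments) {
    summsByHashedId.put(hashedId, segments);
  }

  // Server side: the cookie is built only from the anonymized segments.
  String cookieValueFor(String hashedId) {
    List<String> segments = summsByHashedId.get(hashedId);
    return segments == null ? "" : String.join("|", segments);
  }
}
```

Note that the server only ever sees the hashed identifier, and the cookie value is derived purely from the stored segments, never from the identifier itself.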

Step 5: Safe Onlining: Success!

Finally, our customer’s offline data has been safely “onlined” into a cookie in people’s browsers. This allows our customers to participate in the online world where most people are today. And this ultimately makes for a more tailored, personalized experience for people on the web, while protecting their privacy.

Posted in Miscellaneous | Leave a comment

Security Best Practices at Rapleaf

As a security professional I’ve gotten to work on systems from large banks to hospitals to government agencies, and even for the President of the United States. My background also has a healthy dose of cutting edge startups at every stage of growth, so when I came to Rapleaf I expected to walk into the “wild west” mentality of many small companies:

  • Every engineer has the root password to production systems (and doesn’t see a problem with this)
  • Code releases pushed to production 10 minutes before happy hour on Friday
  • Crazy deadlines for features pushing all considerations about security and scalability into the “future.”

What I found was quite the opposite: a cultural understanding that security and scalability are core features that our customers rely on. I’d like to share how Rapleaf has made these core values of the company and how that translates into some really good security practices.

Culture

One of the biggest differences here at Rapleaf is the orientation process new hires go through. We go a lot further than the usual code walkthrough or network diagram intro that only takes the first half day at many companies. New employees get to participate in 8 or so days of intense orientation covering everything from application architecture to security and privacy practices at Rapleaf. During this period you’re given the time to really learn how we do things and ask any questions you might have, all without the distraction of a pile of work that should have been done a month before you were hired. Any company that plans ahead for this sort of training gets employees who are contributing to the long term vision of the company instead of just scrambling to produce results before knowing how those results will even be used. When it comes to security this drives home the lesson of “take the time to do it right the first time.”

Interns

Rapleaf has an awesome intern program that brings in talented individuals from around the country to work on some great and innovative projects. One of the concerns with bringing interns in to work on software is they’re almost always inexperienced in working on production systems. What we’ve done is create a locked down Linux virtual machine that interns install on their own machine, and then they must use this VM for working on Rapleaf systems. This provides them with the common toolset used by all the full time engineers here, while also limiting their access to systems and source code to what is necessary. Plus, when they leave at the end of their internship, we can just delete the VM in a single step and be sure that they aren’t accidentally leaving with any code or data. Interns are also paired up with mentors who review any code they produce before committing it into source control.

Software Development

At Rapleaf when an individual raises a security concern it is immediately evaluated for risk and put into our issue tracking system. By treating these concerns the same as a new revenue-generating feature we ensure the best quality software makes it to our production systems.

One particular problem I’ve seen at many previous companies was database passwords hard-coded into the source code of production applications. When security inevitably became an issue, it was always a huge effort to excise these passwords and manage them in a secure way. I came to Rapleaf to find this problem solved in a manner more elegant than any I’d ever seen. First, the credentials and configuration are stored in separate meta-files in a secured repository. Access to the password part is carefully controlled, while the configuration part is easily accessible. Then, when it’s time to push code to production, the Operations team runs a deploy that constructs the full configuration file from the two parts and places it securely on the target server. This allows developers to make changes to their configuration as needed without exposing the sensitive database credentials to everyone with access to the code base.
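To make the two-part approach concrete, here is a minimal Python sketch of the merge step. The JSON format, the key names, and the function name build_deploy_config are all assumptions made for illustration; this post does not describe our actual file formats or deploy tooling.

```python
import json


def build_deploy_config(config_text: str, credentials_text: str) -> dict:
    """Merge developer-editable settings with restricted credentials.

    Both inputs are hypothetical JSON documents keyed by database name.
    """
    config = json.loads(config_text)
    credentials = json.loads(credentials_text)

    merged = {}
    for name, settings in config.items():
        if name not in credentials:
            # Fail loudly rather than deploy a config without credentials.
            raise KeyError(f"no credentials found for {name!r}")
        # Credentials are layered on top of the freely editable settings.
        merged[name] = {**settings, **credentials[name]}
    return merged


# Part 1 (hypothetical): developers can read and edit this freely.
public_part = '{"main_db": {"host": "db.internal.example", "port": 5432}}'
# Part 2 (hypothetical): only the deploy process can read this.
secret_part = '{"main_db": {"user": "app", "password": "not-a-real-password"}}'

full_config = build_deploy_config(public_part, secret_part)
```

The point of the design is that the merged result exists only at deploy time on the target server, so the code base never contains a password.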

Physical Security

This is a critical aspect of security that often stops at the datacenter. At Rapleaf, it starts with the employee’s laptop. In addition to issuing mandatory laptop locks, we make sure that every laptop uses encryption to protect sensitive code and data, and that it is set up to automatically lock its screen and expire network sessions when not in use. This limits the damage if a laptop is lost or stolen. At the datacenter level, we’ve gone the extra mile to use a SAS 70 Type 2 certified facility. This means that an external auditor has verified that critical controls are in place and in use, such as:

  • Biometric access controls
  • Dry pipe fire suppression systems
  • Man trap entrance
  • Video surveillance
  • Audited records of access

Stay tuned: in a future blog post, I will discuss security in an agile web company in more detail, as well as why security policies are a good thing.


Posted in Miscellaneous, Security

Why Rapleaf Does Not Use Unique Identifiers in Cookies

Update (11/4/11): This article reflects our policies as of Fall 2010. Since that time, we’ve continued to update our policies in line with industry best practices, including those developed by the DMA, NAI and IAB. As our technology and products continue to evolve, we remain committed to certain fundamental privacy principles: that users have control over their data, that data collection and use be made as transparent as possible, and that online behavioral tracking data should never be merged with a person’s real-life identity. For our most current privacy practices in our online advertising data business (a division within Rapleaf called LiveRamp), refer to LiveRamp’s current privacy standards.


If you ever need to drop data in a browser cookie, you generally have two options: dropping the data directly, or dropping a unique ID (or UUID, for universally unique identifier). In the latter case you’d have to store a mapping from UUIDs to data on your server, and whenever you see a cookie you’d query this map to acquire the data you want.

The UUID approach is nice from a technical perspective because it limits the size of the cookies you drop: a UUID is only 16 bytes.[1] Cookies get sent with browser requests, and may be uploaded many times during a browsing session. If a cookie is large enough, it can dominate the size of the request and noticeably hurt the user’s browsing experience. This issue is mitigated somewhat by the fact that browsers limit a cookie to about 4 KB. But that limit also caps the amount of data a cookie can contain, and the UUID approach becomes attractive once again.
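To get a feel for the size difference, here is a small Python sketch that compares a UUID with a cookie that carries data directly. The attribute names and values are made up for illustration, and URL encoding is just one assumed way to serialize cookie data.

```python
import uuid
from urllib.parse import urlencode

# Hypothetical attributes stored directly in the cookie.
attributes = {
    "gender": "male",
    "age_range": "25-34",
    "interests": "cycling,cooking,travel",
}
data_cookie = urlencode(attributes)

raw_id = uuid.uuid4()
uuid_raw_size = len(raw_id.bytes)   # a standard UUID is 16 raw bytes
uuid_text_size = len(raw_id.hex)    # 32 characters when hex-encoded for a cookie
data_cookie_size = len(data_cookie) # grows with every attribute added

print(uuid_raw_size, uuid_text_size, data_cookie_size)
```

The UUID stays the same size no matter how much data sits behind it on the server, while the data cookie grows with every attribute.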

UUIDs are also convenient because all the data lives on the server, simplifying the task of updating that data. If the data lives in the cookie, then we cannot update it until we have an opportunity to drop another cookie on the user.

Because of these features, UUIDs are used by almost every ad network and advertising technology company today. However, although UUIDs are attractive, we’ve prohibited their use here at Rapleaf due to privacy concerns. UUIDs are, by design, uniquely identifying. If you use UUIDs, it means you have a mapping from UUID to data on your servers.

Unique Identifiers Are Often Personally-Identifiable

Here’s a simple example of how a UUID system might work. Let’s say we have the following database of information:

[Table: UUID Table. Each row maps a UUID to an email address and attributes such as gender.]

Now imagine we want to drop a cookie based on the email jsmith@example.com. Rather than putting the actual data in the cookie (e.g., gender = male and whatever other information there might be in subsequent columns), we could simply drop the UUID 0800200c9a67. If we see this cookie later, then all we need to do is take the UUID, find its row in the database, and grab the data associated with that user.
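Here is a minimal Python sketch of that lookup. The in-memory dictionary stands in for the server-side database, and the function names are hypothetical; only the example row comes from the text above.

```python
# Hypothetical server-side table, holding the example row from the text.
uuid_table = {
    "0800200c9a67": {"email": "jsmith@example.com", "gender": "male"},
}


def uuid_for_email(email):
    """Find the UUID to drop in the cookie for a known email address."""
    for uid, row in uuid_table.items():
        if row["email"] == email:
            return uid
    return None


def data_for_cookie(cookie_value):
    """Resolve a cookie's UUID back to everything stored about that user."""
    return uuid_table.get(cookie_value)


cookie_value = uuid_for_email("jsmith@example.com")
row = data_for_cookie(cookie_value)
```

Note that the second lookup returns the whole row, email address included, which is exactly why the UUID is not as anonymous as it looks.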

If that data contains any personally identifiable information (like a user’s name or email address), it’s completely trivial to map from a browser cookie to a person’s identity. In fact, many companies are doing this today. They claim not to include personally identifiable information in cookies, but in fact they store UUIDs that map directly to email addresses or hashed email addresses, making it trivial to reconstruct the identity behind the browser.

For example, from the UUID 0800200c9a67, it is trivial to derive that the user is actually jsmith@example.com, so the UUID itself is personally identifiable. The danger of this system is that the ad network can merge the data about what sites you visit back into a database attached to your email address, name, and address, building a permanent data set of what sites you’ve visited.

And even if you can’t map a UUID to personally identifiable information, there are still privacy issues. Specifically, a UUID can act as a unique identifier for a particular browser. This means that you can know a user’s browsing history, even if you don’t explicitly know who the user is. By piecing together enough information about a user, you can often figure out that user’s identity, making it possible for a rogue company (or government) to link browsing behavior to specific individuals.
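The following Python sketch shows how little work this takes. The request log, the second cookie ID, and the site names are invented for illustration; the only assumption is that each request arrives with the cookie's UUID attached.

```python
from collections import defaultdict


def histories_by_cookie(request_log):
    """Group visited URLs by the UUID found in each request's cookie."""
    histories = defaultdict(list)
    for cookie_id, url in request_log:
        histories[cookie_id].append(url)
    return dict(histories)


# Hypothetical log of (cookie UUID, page visited) pairs.
request_log = [
    ("0800200c9a67", "news.example/politics"),
    ("1b4e28ba2fa1", "shop.example/shoes"),
    ("0800200c9a67", "health.example/symptoms"),
    ("0800200c9a67", "travel.example/flights"),
]

histories = histories_by_cookie(request_log)
```

Nothing here needs a name or an email address: the UUID alone is enough to tie every visit to the same browser.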

At Rapleaf, we actively avoid collecting data on browsing history: we don’t want to know it, it’s not our business to know it, and we want to control the amount of information we know about the user to ensure that they maintain anonymity online. Full stop.

Privacy-Centric Alternatives

That’s why we store data in the cookie itself. We don’t put any personally identifiable information in our cookie, so there’s no straightforward way to know who a browser might belong to. Likewise, we don’t put a UUID in there, so there’s no straightforward way to determine browsing history.

Now, we recognize that this system isn’t perfect. Given enough data on a user, it is often possible to de-anonymize that data back to a particular user. If you read our post about Anonymouse a few weeks ago, you’ll know that we’re spending a lot of resources on solving this problem. Once a cookie has been anonymized, this should provide a strong guarantee on the user’s privacy.

There’s one last alternative to UUIDs that combines the best of both worlds: the privacy advantages of putting data directly in the cookie, as well as the technical advantages of using UUIDs. After a set of cookies has been anonymized, each cookie will belong to an equivalence class with several others. For example, if we perform 10,000-anonymization on the data set, then each cookie will look identical to at least 9,999 other potential cookies.

Now, instead of storing all the data in the cookie, what if we simply stored an equivalence class ID? This gains us all the technical advantages of dropping a UUID, since we’re only dropping a single key in the cookie. But from a privacy standpoint, it is fundamentally different from a UUID. An equivalence class tells us nothing about an individual user; if we have 10,000-anonymized the data set, then by design the user could be any one of at least 10,000 people. It is impossible to gather a browsing history, since multiple browsers can and will have the same equivalence class ID. Of course, this relies on a strong degree of confidence in the anonymization algorithm, and this is a change we have not yet implemented, but we think it’s a promising idea.
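As a rough sketch of the idea, the Python below groups records that are already identical after anonymization and hands each group one shared ID. This is not the Anonymouse algorithm, which this post does not describe; the function name, the attribute names, and the tiny k used in the example are all assumptions for illustration.

```python
from collections import defaultdict


def assign_equivalence_classes(records, k):
    """Give every group of identical records one shared class ID.

    records maps a user key to a dict of already-anonymized attributes.
    Raises ValueError if any group has fewer than k members.
    """
    groups = defaultdict(list)
    for user, attrs in records.items():
        groups[tuple(sorted(attrs.items()))].append(user)

    class_of_user = {}
    class_data = {}
    for class_id, (attrs, users) in enumerate(sorted(groups.items())):
        if len(users) < k:
            raise ValueError(f"class {attrs} has only {len(users)} members")
        class_data[class_id] = dict(attrs)
        for user in users:
            class_of_user[user] = class_id
    return class_of_user, class_data


# Hypothetical records; a real data set would use a far larger k.
records = {
    "a": {"gender": "male", "age_range": "25-34"},
    "b": {"gender": "male", "age_range": "25-34"},
    "c": {"gender": "female", "age_range": "25-34"},
    "d": {"gender": "female", "age_range": "25-34"},
}
class_of_user, class_data = assign_equivalence_classes(records, k=2)
```

The cookie would carry only the class ID, and the server would look up the class's shared attributes. Because every member of a class carries the same ID, the ID cannot single out one browser.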

[1] There’s nothing special about 16 bytes. All that’s necessary is that the ID is large enough to be uniquely identifying within the domain of the ad network. I used 16 bytes because that’s the size specified in the UUID standard.

Posted in Anonymouse