One of Rapleaf’s key services is helping marketing companies bring data they have traditionally used in offline contexts to the online world in a privacy-centric way. We have an internal name for this: “safe onlining”.
Safe Onlining: What Is It?
Many companies have marketing databases that have been gathered offline—for example, from customer surveys or questionnaires, or from purchasing histories. These have been used in previous decades to help companies sort the marketing messages they send, mostly by direct mail and sometimes by email or phone. Today, however, people spend much of their time online, and those previous delivery mechanisms are becoming outmoded.
Safe onlining helps businesses make their databases useful in online contexts, in a way that protects people’s online anonymity and privacy. It’s meant to benefit both people and businesses by enabling companies to personalize the online banner ads and content that people see without interrupting their daily activities (unlike direct mail).
Here’s how safe onlining works:
Step 1: Importing and Mapping
The process begins when a customer comes to Rapleaf with a marketing database that they would like to bring online for advertising, analytics, or personalization. Our first step is to import this database into our system. This isn’t trivial—data arrives in a surprising (and occasionally bewildering) array of formats, and we need to normalize things to match what’s in our existing database.
Moreover, data tends to come in high volume, which makes storage and processing a daunting task. We don’t use a traditional relational database for this purpose, so this import process involves converting our customer’s data into dataunits. If you’re interested in learning a bit more about our data storage schema, check out Bryan’s earlier post on Thrift and pseudo-RDF schemas.
We then need to match the marketing database to our Rapleaf system. Using a corpus of matching data, plus some high-confidence inference algorithms, we are able to map our customers’ input files to entities in the Rapleaf system. This process is a large part of our technology and will be the subject of a future post.
Step 2: Summing
Once the mapping process is complete, we need to package it into a more usable, consumable format. We call this process summing (“summarizing” was too long a word). The workflow that performs summing is called the summer, and the objects that are the end result of this process are called summs. Very creative nomenclature, I know.
The summer creates a summ for each entity. A summ is basically just a package of little tidbits of information, which we call segments. Here are a few examples of segments:
- Demographic, Age 18-24
- Demographic, Age 25-34
- Demographic, Age 35-44
- Interests, Sports
- Interests, Sports > Baseball
- Shopping, In-Market for new SUV
And so on. As you can see, it’s often pretty straightforward to translate an entity’s data into a segment. But you can also imagine how multiple pieces of data might contribute towards a single segment. For example, we might place an Interests, Media segment in a summ based on the fact that a person is interested in music, books, and movies.
In reality, summs and segments are even more complicated than this. For many reasons, we need to keep track of which data partners contributed data for a given segment. We also may need to create different packages of data depending on how it will be applied.
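To make the simple case concrete, here is a minimal sketch of how an entity's data might be turned into a summ. It is not our actual summer: the field names (`age`, `interests`), the lookup tables, and the `build_summ` function are all hypothetical, and it ignores the data-partner tracking mentioned above.

```python
from typing import Set

# Hypothetical age buckets, mirroring the example segments listed earlier.
AGE_BUCKETS = [(18, 24), (25, 34), (35, 44)]

# Hypothetical one-to-one mapping from a raw interest to a segment.
DIRECT_INTERESTS = {
    "sports": "Interests, Sports",
    "baseball": "Interests, Sports > Baseball",
}

# Hypothetical rule: all of these interests together imply "Interests, Media".
MEDIA_INTERESTS = {"music", "books", "movies"}


def build_summ(entity: dict) -> Set[str]:
    """Translate one entity's raw attributes into a set of segments."""
    segments: Set[str] = set()

    # A single piece of data (age) maps directly to one segment.
    age = entity.get("age")
    if age is not None:
        for low, high in AGE_BUCKETS:
            if low <= age <= high:
                segments.add(f"Demographic, Age {low}-{high}")

    interests = {i.lower() for i in entity.get("interests", [])}
    for interest, segment in DIRECT_INTERESTS.items():
        if interest in interests:
            segments.add(segment)

    # Several pieces of data contribute towards a single segment.
    if MEDIA_INTERESTS <= interests:
        segments.add("Interests, Media")

    return segments
```

Calling `build_summ({"age": 29, "interests": ["Music", "Books", "Movies"]})` under these assumptions would yield a summ containing the Demographic, Age 25-34 and Interests, Media segments.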
Step 3: Anonymization
Once we have our summs, we need to anonymize them in order to ensure that the data we eventually publish respects consumers’ privacy.
The idea behind anonymization is to ensure that every summ in our datastore looks like some number of other summs. For example, let’s say a person’s summ looks like the following:
- Demographic, Age 25-34
- Interests, Pets
- Interests, Travel
Now let’s imagine that this person is the only individual in our entire datastore who has these three particular attributes. This is a privacy problem, because the segments in this summ are uniquely identifying. If these segments were to make their way into a browser cookie, it would be possible to trace that cookie back to this particular summ, and therefore, the person.
But let’s say we drop that last segment, Interests, Travel. As it turns out, there are many other people in our datastore with the Demographic, Age 25-34 and Interests, Pets segments. Therefore, the summ is no longer uniquely identifying; those two segments could apply to several thousand different entities in our system, not just one person. Based solely on those two segments, there is no longer any way to trace a browser cookie back to the person’s entity. The person is anonymous.
The size of the group of identical summs that any given summ belongs to (counting the summ itself) is usually referred to as k, and we often speak of k-anonymization. This means that it is impossible to trace a given set of segments back to any fewer than k individuals. k = 2 is the most basic level of anonymization; we’re striving for k = 1000.
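The drop-a-segment idea from the example above can be sketched in a few lines. This is only a naive greedy illustration, not the algorithm we run in production: the `anonymize` function, its tie-breaking rules, and the choice to drop the rarest segment first are all assumptions made for the sketch, and it would be far too slow for a datastore of our size.

```python
from collections import Counter
from typing import Dict, FrozenSet, Set


def anonymize(summs: Dict[str, Set[str]], k: int) -> Dict[str, FrozenSet[str]]:
    """Drop segments until every non-empty summ is shared by at least k entities."""
    current = {eid: frozenset(segs) for eid, segs in summs.items()}
    while True:
        # How many entities share each exact set of segments?
        group_sizes = Counter(current.values())
        offenders = [
            eid for eid, summ in current.items()
            if summ and group_sizes[summ] < k
        ]
        if not offenders:
            return current

        # How common is each individual segment across the datastore?
        segment_counts = Counter(seg for summ in current.values() for seg in summ)

        # Fix the most exposed summ first (smallest group); break ties by id.
        eid = min(offenders, key=lambda e: (group_sizes[current[e]], e))
        summ = current[eid]

        # Assumption: the rarest segment is the most identifying, so drop it.
        rarest = min(summ, key=lambda seg: (segment_counts[seg], seg))
        current[eid] = summ - {rarest}
```

Each pass removes exactly one segment, so the loop always terminates. In the worst case a summ ends up empty, which identifies nobody but also carries no information, so a real system would want to be much smarter about which segments to give up.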
Anonymization is an incredibly tricky and interesting problem, especially when you’re dealing with datasets as huge as Rapleaf’s. I could go on for pages about it, but this is already quite a lengthy post, so I’ll leave it at that for now. If you want a more in-depth discussion, be sure to check out our previous post on anonymization.
Step 4: Publishing Cookies
After we’ve generated and anonymized the summs for all our entities, it’s time to actually bring them online.
The first step is to load the anonymized summs into a distributed hash table. The system we use for this is an internally developed read-only hashmap that we’ve optimized for extremely fast lookups. Using this hash table, we can quickly look up a summ from an online identifier.
To bring the summs online, we partner with a variety of publishers: web sites and services that support logged-in users. When a user logs into a publisher’s site, the publisher makes that user’s identifier available to Rapleaf for matching purposes only. This request can come via JavaScript, an HTML <iframe> element, or a small pixel embedded in the publisher’s page. For security purposes, we typically ask that the publisher send us a hashed version of the identifier in their request, so that no new data is passed to us by the publishers, just the hashed identifier.
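Here is a rough sketch of what the publisher's side of that request could look like when it takes the form of a pixel. Everything specific in it is an assumption made for illustration: the endpoint URL, the `h` query parameter, the use of SHA-256, and the trim-and-lowercase normalization are hypothetical, not a description of our actual integration.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical endpoint; a real integration would use whatever URL is agreed on.
MATCH_ENDPOINT = "https://match.example.com/pixel.gif"


def hash_identifier(identifier: str) -> str:
    """Hash a user identifier so the raw value never leaves the publisher."""
    # Assumption: both sides normalize the same way before hashing,
    # otherwise the hashes will not match.
    normalized = identifier.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_pixel_url(identifier: str) -> str:
    """Build the URL of the tracking pixel to embed in the publisher's page."""
    return MATCH_ENDPOINT + "?" + urlencode({"h": hash_identifier(identifier)})
```

The important property is that only the output of `hash_identifier` appears in the request; the raw identifier stays on the publisher's servers.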
Now we’re in the home stretch! From the hashed identifier that the publisher sends us, we query our distributed hash table to get the user’s summ. Our response to the publisher’s request is to drop a cookie based entirely on the information in the anonymized summ. The cookie we drop contains only a plain-vanilla set of non-identifying segments: there’s no personally identifiable information, and nothing about the user’s browsing behavior.
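As a small illustration of this last hop, the sketch below looks up a summ by hashed identifier and builds a cookie from it. A plain Python dict stands in for the distributed hash table, and the cookie name `segs`, the short segment codes, and the `cookie_for` function are hypothetical choices made for the example.

```python
from http.cookies import SimpleCookie
from typing import Dict, FrozenSet, Optional

# Hypothetical short codes, so the cookie holds compact, opaque segment ids.
SEGMENT_CODES = {
    "Demographic, Age 25-34": "d2",
    "Interests, Pets": "i7",
}


def cookie_for(
    hashed_id: str, table: Dict[str, FrozenSet[str]]
) -> Optional[SimpleCookie]:
    """Build a segments-only cookie for a hashed identifier, if we know it."""
    # The dict lookup stands in for a query against the distributed hash table.
    summ = table.get(hashed_id)
    if not summ:
        return None  # Unknown user or empty summ: nothing to drop.

    cookie = SimpleCookie()
    # Only anonymized segment codes go in; the hashed identifier does not.
    codes = sorted(SEGMENT_CODES[s] for s in summ if s in SEGMENT_CODES)
    cookie["segs"] = ".".join(codes)
    cookie["segs"]["path"] = "/"
    return cookie
```

Note that the hashed identifier is used only as the lookup key and never written into the cookie itself.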
Step 5: Safe Onlining: Success!
Finally, our customer’s offline data has been safely “onlined” into cookies in people’s browsers. This allows our customers to participate in the online world, where most people are today. And this ultimately makes for a more tailored, personalized experience for people on the web, while protecting their privacy.