Privacy Preserving Measurement 1: Background
Posted by ekr on 07 Oct 2021
Depending on your point of view, we're in a golden age of big data or a golden age of surveillance. Unfortunately, with the technology we typically use, these are more or less the same thing: if you collect data from a lot of people you're going to learn a lot about them. While there are applications where you actually want to use people's individual data (e.g., targeted behavioral advertising), in many cases you just want to learn overall information about the population, not information about specific individuals.
As a concrete example, suppose that you want to take a survey to learn the prevalence of some medical condition: for obvious reasons you don't want to learn people's actual medical histories, you just want to know how many people have disease X. But if you just go around asking people, suddenly you have a bunch of incredibly sensitive information. That information then needs to be protected--including from yourself. This might seem counterintuitive, but it's important to remember that in a lot of cases data is being collected by big organizations with a lot of people in them, and you need to make sure that nobody in the organization mishandles it. Moreover, convincing people that their information will be protected is essential to getting them to give it to you in the first place; if people don't trust you, they won't tell you the truth.
The good news is that over the past few years there has been an incredible amount of progress in what are generically called privacy preserving measurement (PPM) technologies, which make it possible to take measurements while also protecting people's privacy. This series of posts attempts to provide an overview of these technologies. As a lead-in, this post covers the traditional way that people do things, which is basically to collect a pile of data and then analyze it directly.
Types of Measurement
A good place to start is by asking about the kinds of measurements you want to take. What I mean here isn't the data that you collect from users but rather the output of the analysis that you are trying to do. As we'll see later in this series, one of the major challenges with PPM technologies is that they are good for taking certain kinds of measurements and not others, so you have to be really clear about what you are trying to do before you start collecting data. To some extent this is true for any kind of measurement, as anyone who has ever done the kind of science that requires a lot of data collection can tell you, but as a practical matter, if you have a bunch of raw data in hand, there's usually quite a lot you can do. Indeed, it's quite common to be able to take data collected for one purpose and use it for an entirely different one, as in many economics "natural experiments". This is much less true with PPM technologies.
In this section, I go over some of the most common types of measurements.
Simple Aggregates
Probably the simplest type of measurement you might want to take is a population aggregate. For instance, you might want to ask the average height or income of a population or the fraction with some characteristic.
The traditional way to do this is just to survey a bunch of people (or thermometers, trees, whatever) and then collect their values for the variable of interest. This makes it sound a lot easier than it actually is: you usually can't measure the whole population, so you end up taking a sample instead, and getting a representative sample can be quite difficult, as seems to have been responsible for the severe polling errors in recent US elections. But at the end of the day you end up with a list of values. Once you have this list, you can compute a number of different aggregates, such as total, average (mean), median, quantiles, standard deviation, etc.; basically, the kind of descriptive statistics you would learn in a typical intro stats course.
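To make the "list of values" idea concrete, here is a minimal sketch in Python using only the standard library. The sample values and the `describe` helper are invented for illustration; they are not from any real survey.

```python
import statistics

# Hypothetical sample of heights in centimetres; every value is made up.
heights_cm = [158.0, 163.5, 170.0, 171.5, 175.0, 180.0, 182.5, 191.0]


def describe(values):
    """Return common descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "total": sum(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # Three cut points that split the data into four equal-sized groups.
        "quartiles": statistics.quantiles(values, n=4),
        # Sample standard deviation (divides by n - 1).
        "stdev": statistics.stdev(values),
    }


summary = describe(heights_cm)
for name, value in summary.items():
    print(name, value)
```

Note that every one of these aggregates is computed by someone who holds the full list of individual values, which is exactly the property the rest of this series tries to avoid.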
Relationship Between Multiple Values
The next most complicated kind of measurement captures the relationship between multiple variables. For instance, we might be interested in whether people who are taller make more money (spoiler alert: they do).
The standard techniques for this kind of analysis involve having data which is grouped by subject. For instance, we might have a table with one row per person, recording that person's gender, height, and income.
There are lots of things we can do with this kind of data. Obviously, we can compute the descriptive statistics for each variable that I mentioned above, but we can also ask about the relationship between gender and income, the relationship between gender and height, the relationship between height and income, or between all three. The nice thing about having this kind of data is that you don't need to know in advance what kind of analysis you want to run: as long as you have the raw data you can just run it. As I said above, it's quite common in economics to use data sets gathered for one purpose for researching new questions. In addition, this kind of raw data is very useful for double-checking your statistical analysis: there are lots of ways in which you can get results that look fine but are actually kind of spurious (this post by Jan Vanhove does a good job of making this case for correlation coefficients).
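As a rough illustration of why per-subject raw data is so flexible, the sketch below runs two different analyses over the same records without having planned either in advance. The records, field names, and the helpers `pearson` and `mean_by_group` are all hypothetical; the numbers are invented and say nothing about any real population.

```python
import math

# Hypothetical per-subject records; every value is invented for illustration.
records = [
    {"gender": "F", "height_cm": 160.0, "income": 52000.0},
    {"gender": "F", "height_cm": 165.0, "income": 61000.0},
    {"gender": "F", "height_cm": 172.0, "income": 58000.0},
    {"gender": "M", "height_cm": 170.0, "income": 50000.0},
    {"gender": "M", "height_cm": 178.0, "income": 64000.0},
    {"gender": "M", "height_cm": 185.0, "income": 70000.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_by_group(rows, group_key, value_key):
    """Average of value_key for each distinct value of group_key."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


# Analysis 1: relationship between height and income.
r = pearson(
    [row["height_cm"] for row in records],
    [row["income"] for row in records],
)
print("height/income correlation:", round(r, 3))

# Analysis 2: a different question asked of the same raw data.
print("mean income by gender:", mean_by_group(records, "gender", "income"))
```

Both analyses need each person's values linked together in one row, which is precisely what makes this kind of data set sensitive.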
For all these reasons, it's most convenient to have your data in this kind of raw form and to gather more data than you actually need; it's much better to have it and not need it than find you need it later when it's too late to gather it.
Everything Else
Beyond the simple stuff I've just listed, there is of course a giant universe of other kinds of analysis, including things like:
- Building natural language models for machine translation
- Matching images to names for facial recognition
- Video surveillance for criminal investigation
- Collecting images of streets and houses for mapping and autonomous vehicles as in Google Street View
I don't expect to be talking too much about privacy-preserving versions of these applications in this series of posts, in part because they use a different set of techniques and in part because I don't know this material as well.
There is, however, one important special case to cover, which is what's often called "heavy hitters". The basic scenario is that each user has some open-ended set of values (strings or sets of bytes or something) and you want to collect the most common ones. There are a lot of applications for this kind of measurement, such as discovering the most common URLs that people are visiting. A key point here is that the values probably aren't known in advance, so you need to discover the popular values themselves, not just count how often each member of a fixed list appears.
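In the traditional, centralized setting, heavy hitters is just counting. The sketch below assumes each user submits a list of strings directly to the collector; the `heavy_hitters` function, the example URLs, and the choice to count each value at most once per user are my assumptions for illustration.

```python
from collections import Counter


def heavy_hitters(reports, k):
    """Return the k most common values across all user reports.

    `reports` is an iterable with one collection of values per user.
    """
    counts = Counter()
    for user_values in reports:
        # Count each value once per user so a single user cannot inflate it.
        counts.update(set(user_values))
    return counts.most_common(k)


# Hypothetical reports: one list of visited URLs per user.
reports = [
    ["https://example.com/", "https://example.org/news"],
    ["https://example.com/"],
    ["https://example.com/", "https://example.net/"],
    ["https://example.org/news"],
]

top = heavy_hitters(reports, 2)
print(top)
```

The collector never needed a list of candidate URLs up front, but it did get to see every URL from every user, including the rare ones that might identify someone.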
As with the rest of the measurements in this post, the easiest thing to do with all these measurements is just to have everyone send their values to some central data collector, where they can be processed. This is especially useful in machine learning applications where you might develop better algorithms later and want to re-run them on the old data set.
Who do you trust?
As I keep repeating, the easiest way to do most kinds of data collection is just to gather as much data as you can in raw form and then post-process it at your leisure. It's cheap, conceptually simple and easy to execute, and it's very flexible in case you later discover that you want to do a different kind of analysis or investigate some different question than you originally intended, all of which happen quite often.
The problem, of course, is that the data collector now has this big pile of potentially sensitive data which they have to protect. From the user's perspective, the situation is even worse: they have to trust you to manage that data in an appropriate way. This might be fine if the data is about trees, but perhaps less acceptable if it's about people's medical history.
With a conventional system, data protection mostly comes down to the data collector having some kind of policy about how they handle the data. Typically this consists of some combination of internal anonymization (stripping user information, etc.) and access controls. A good example of this is the US Census, which collects a pile of potentially confidential information but then promises to protect it:
The problem with policy controls is that they require the subjects of data collection to trust that they are correctly executed, not just now but in the future. For instance, US Census data was used to identify Japanese-Americans for internment during World War II, after repealing existing Census confidentiality protections:
The Census Bureau surveys the population every decade with detailed questionnaires but is barred by law from revealing data that could be linked to specific individuals. The Second War Powers Act of 1942 temporarily repealed that protection to assist in the roundup of Japanese-Americans for imprisonment in internment camps in California and six other states during the war.
Lawmakers restored the confidentiality of census data in 1947.
For these reasons, centralized data collection plus policy controls isn't really an ideal answer. What we really want is technical protections. Fortunately, we finally have the technology to collect sensitive data in a way that (1) lets us do significant amounts of useful analysis and (2) significantly improves user privacy in a way that doesn't just depend on trusting the data collector. In the next post, I'll be covering the simplest such technique: anonymizing proxies.
One of the most common experiences is collecting your data, finding out that you've done something wrong, and then having to collect it again... and again. ↩︎
Note that you have to be quite careful if you try to ask too many questions out of the same data set. Each time you run a statistical test, there is a certain risk of a false positive result (the technical jargon here is Type 1 error), so if you try out a lot of different things, there's a risk that you're just going to get false positive results (see p-hacking). ↩︎
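To put a rough number on this, here is a small sketch of how the chance of at least one false positive grows with the number of tests. It assumes the tests are independent and that each has the same false positive rate `alpha`, which is a simplification; the function name is hypothetical.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability of at least one false positive across independent tests.

    Assumes every test has the same per-test false positive rate `alpha`
    and that the tests are statistically independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:>3} tests at alpha=0.05: {rate:.3f}")
```

Under these assumptions, running 20 tests at the conventional 0.05 threshold gives roughly a 64% chance of at least one spurious "finding".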