Privacy Preserving Measurement 5: Randomization
Posted by ekr on 05 Nov 2021
This is part V of my series on Privacy Preserving Measurement (see parts I, II, III, and IV). Today we'll be addressing techniques that use randomization to provide privacy.
The aggregate measurement techniques I have described so far provide exact answers (which is good) but require multiple servers, at least one of which you have to trust to behave properly (which is less good). What if you want to collect a measurement but your subjects are unwilling to trust you--or anyone else--at all? It's still possible to collect some aggregate measurements using what's called randomized response.
Basic Randomized Response #
Imagine you want to measure the rate at which people engage in some behavior that they don't want others to know about or that is illegal, such as using heroin. For obvious reasons, people might not be excited about that kind of admission, no matter what security precautions are used for data collection. Randomized Response offers a solution to this problem without any fancy cryptography.
The basic idea is simple and goes back to 1965. Instead of just answering the question, you generate a random number (e.g., by flipping a coin). Then you respond as follows:
| Real Answer | Coin = Heads | Coin = Tails |
|---|---|---|
| Yes | Yes | Yes |
| No | Yes | No |
If we assume that the fraction of the population who would have answered "Yes" to the basic question is $X$, then the fraction of people who answer "Yes" will be $(1 + X)/2$. So if the fraction of "Yes" answers is $Y$, then it's easy to estimate $X$ by computing $X = 2Y - 1$. Note that this answer is approximate, not exact, because the coin won't come up Heads exactly half the time. If it comes up Heads more often than half the time, this will cause you to overestimate $X$, and if it comes up Heads less than half the time, this will lead you to underestimate $X$. The size of this error is given by the binomial distribution, but the gist is that, as with most sampling techniques, you get increased accuracy with more samples.
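To make the estimation concrete, here's a minimal simulation sketch in Python (the true rate and sample size are made-up values for illustration):

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report "Yes" (True) if the coin comes up Heads; otherwise report the true answer."""
    heads = random.random() < 0.5
    return True if heads else true_answer

def estimate_rate(reports: list[bool]) -> float:
    """Recover the true rate X from the observed "Yes" fraction Y via X = 2Y - 1."""
    y = sum(reports) / len(reports)
    return 2 * y - 1

# Simulate 100,000 subjects whose true "Yes" rate is 5%.
TRUE_RATE = 0.05
reports = [randomized_response(random.random() < TRUE_RATE) for _ in range(100_000)]
print(f"estimated rate: {estimate_rate(reports):.3f}")  # noisy estimate near 0.05
```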
Randomized Response provides plausible deniability because a lot of the "Yes" answers are from people whose coin came up "Heads". If only a small fraction of people would have answered Yes, then the vast majority of people who say "Yes" actually just had a coin which came up heads. Note that this doesn't give zero information:
- Anyone who answers "No" really would have answered "No".
- If a subject reports "Yes", you can estimate the probability that their true answer is "Yes": a fraction $X / \frac{1+X}{2} = 2X/(1+X)$ of the subjects who report "Yes" actually would have answered "Yes".
In addition, if you take multiple independent measurements of the same value from the same user, randomized response starts to leak information. For instance, consider what happens when you ask someone to use randomized response to answer the same question every month for a year. The chance of a fair coin coming up Heads 12 times in a row is $1/4096$, so if you get 12 "Yes" responses in a row, then it is much more likely that the true answer is "Yes".[1]
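To see how quickly the deniability erodes, here's a back-of-the-envelope Bayes calculation (a sketch; the 5% prior is just an assumed value):

```python
def posterior_true_yes(prior: float, num_reports: int) -> float:
    """P(true answer is "Yes" | the subject reported "Yes" every time).

    A true "Yes" reports "Yes" with probability 1 each time; a true "No"
    reports "Yes" only on Heads, i.e. with probability (1/2)**num_reports.
    """
    likelihood_yes = 1.0
    likelihood_no = 0.5 ** num_reports
    return (prior * likelihood_yes) / (
        prior * likelihood_yes + (1 - prior) * likelihood_no
    )

print(posterior_true_yes(prior=0.05, num_reports=1))   # ~0.10
print(posterior_true_yes(prior=0.05, num_reports=12))  # ~0.995
```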
Background/Digression: Bloom Filters #
A common problem in computer science is representing a list of "known" values so that (1) the stored list is small and (2) it's fast to look up whether a given value is in the list. For instance, suppose that I want a Web browser to filter out malicious domains (as in Google Safe Browsing) or revoked Web site certificates. The natural solutions (lists, hash tables, etc.) require actually storing the strings in the data structure, but this is wasteful because I don't actually want to retrieve the strings, I just want to check for presence or absence, so why should I have to store all of that stuff? This suggests a simpler solution: instead of storing the value of the string in the hash table, just store a single bit representing whether the string with a given hash is present or not. This data structure is called a Bloom filter.
Bloom filters are obviously quite a bit smaller, but they come with a disadvantage: false positives. Suppose we have two domains--one innocuous and one illicit and blocked--which hash to the same value. In this case, the innocuous domain will also be blocked. In general, the false positive rate of a single-hash data structure will be $2^{-b}$, where $b$ is the number of bits. You can improve the situation somewhat by using multiple hash functions (this is how Bloom filters are typically used); in this case, a string is present in the filter if all of the corresponding bits for each hash are set to 1. However, there will still be false positives and any use of Bloom filters needs to deal with this somehow.[2]
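To make this concrete, here's a toy Bloom filter (a sketch for illustration, using salted SHA-256 as the hash functions; real implementations are more careful about sizing and hashing):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k salted hashes, each setting one bit in an m-bit array."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, value: str):
        # Derive k bit positions from k independently salted hashes.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits[pos] = 1

    def maybe_contains(self, value: str) -> bool:
        # True means "possibly present" (false positives are possible);
        # False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(value))

blocked = BloomFilter(num_bits=1024, num_hashes=3)
blocked.add("evil.example")
print(blocked.maybe_contains("evil.example"))       # True
print(blocked.maybe_contains("innocuous.example"))  # usually False; occasionally a false positive
```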
RAPPOR #
Just as with Prio, basic randomized response is not good for reporting arbitrary strings: each string must be separately encoded, which gets out of hand quickly, and it doesn't work at all for unknown strings without consuming impractical amounts of space. Bittau et al. described Randomized Aggregatable Privacy Preserving Ordinal Responses (RAPPOR), which attempts to address this problem, as well as the problem of repeated measurements.
As you have probably guessed from the previous section, RAPPOR uses Bloom filters to store strings. The basic idea is straightforward: you take the string you want to report, insert it into a Bloom filter, randomly flip some bits to add noise, and then send the filter to the server. Because you are already adding random noise, false positives from the Bloom filter aren't as big an issue. This lets you send string values in finite space because the Bloom filter is a fixed size no matter how big the strings are.
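As a rough sketch of what a single report might look like (the parameters here--filter size, number of hashes, and flip probability--are made-up values, not RAPPOR's actual ones):

```python
import hashlib
import random

NUM_BITS = 128           # size of the Bloom filter (made-up value)
NUM_HASHES = 2           # number of hash functions (made-up value)
FLIP_PROBABILITY = 0.25  # chance of flipping each bit (made-up value)

def encode(value: str) -> list[int]:
    """Set the Bloom filter bits corresponding to the reported string."""
    bits = [0] * NUM_BITS
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
        bits[int.from_bytes(digest, "big") % NUM_BITS] = 1
    return bits

def randomize(bits: list[int]) -> list[int]:
    """Flip each bit independently with probability FLIP_PROBABILITY."""
    return [(b ^ 1) if random.random() < FLIP_PROBABILITY else b for b in bits]

# The client encodes its value, adds noise, and sends only the noisy bits.
report = randomize(encode("https://example.com/"))
```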
Reading the data out of the Bloom filter isn't totally straightforward. There are two problems:
- A given set of Bloom filters is consistent with more than one set of candidate input strings.
- You have to already know the input strings. This requirement is slightly subtle because you don't need to know the strings in advance, only when you want to query the results. For instance, suppose that you deploy RAPPOR to measure the most popular home pages and then later learn about a home page you didn't know of via some other mechanism; you can then use RAPPOR to find out whether it's popular. But unlike with the techniques described in part IV, you can't learn the home page from RAPPOR because the Bloom filter doesn't store the values (remember, that's how you get it small).
Anyway, once you have a set of candidate strings, you can use some fancy statistics (including LASSO) to estimate the most likely values. RAPPOR also tries to address the problem of repeated querying by using two layers of randomization. The first layer is stable, so that multiple reports provide the same answer. The second layer changes with every report so that you can't use the precise reports as a tracking vector. Together they provide privacy for the user's data.
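Here's a rough sketch of that two-layer structure (a simplification of RAPPOR's permanent and instantaneous randomized response, not the real algorithm; the flip probabilities are assumed values, and `true_bits` is the Bloom-filter encoding from the previous sketch):

```python
import random

PERMANENT_FLIP = 0.25      # layer 1 flip probability (assumed value)
INSTANTANEOUS_FLIP = 0.25  # layer 2 flip probability (assumed value)

_memoized: dict[str, list[int]] = {}

def permanent_randomize(key: str, true_bits: list[int]) -> list[int]:
    """Layer 1: randomize once per reported value and then reuse the result,
    so that repeated reports can't be averaged to recover the true bits."""
    if key not in _memoized:
        _memoized[key] = [
            (b ^ 1) if random.random() < PERMANENT_FLIP else b for b in true_bits
        ]
    return _memoized[key]

def instantaneous_randomize(bits: list[int]) -> list[int]:
    """Layer 2: fresh randomness on every report, so that the report itself
    can't be used as a stable tracking identifier for the user."""
    return [(b ^ 1) if random.random() < INSTANTANEOUS_FLIP else b for b in bits]

def report(key: str, true_bits: list[int]) -> list[int]:
    return instantaneous_randomize(permanent_randomize(key, true_bits))
```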
The big problem with RAPPOR is that it's very inefficient. As some of the same authors write in their paper introducing PROCHLO:
Regrettably, there are strict limits to the utility of locally differentially-private analyses. Because each reporting individual performs independent coin flips, any analysis results are perturbed by noise induced by the properties of the binomial distribution. The magnitude of this random Gaussian noise can be very large: even in the theoretical best case, its standard deviation grows in proportion to the square root of the report count, and the noise is in practice higher by an order of magnitude [7, 28–30, 74]. Thus, if a billion individuals’ reports are analyzed, then a common signal from even up to a million reports may be missed.
Unfortunately, this is a generic problem with any technique where users randomize their data before collection: there's a tradeoff between accuracy and privacy and none of the points on the curve are particularly great. For this reason, interest in these techniques has waned in favor of non-randomized techniques like those described in parts II, III, and IV of this series.
Publishing Aggregate Data #
There is, however, a situation in which randomization is very useful: when used in combination with some exact aggregate data collection mechanism, whether conventional or privacy preserving. As discussed at the very start of this series, what you want out of these systems is usually to measure aggregate data rather than individual data, but even aggregate data can reveal information about individuals.
For instance, suppose that we are collecting household income from everyone in a given neighborhood and publishing the number of people in each 10,000/year bracket. This seems like it's fine, but what happens if there's one person with an income of 1,000,000/year and everyone else has an average income? In that case, the aggregate will immediately give you a fairly close approximation of the rich person's income because theirs is the only entry in the 1,000,000 bracket. As you probably expect, the fix here is to add random noise to the aggregate values before you publish them. The precise details of how to do this are somewhat complicated and depend on the structure of the data, how many different slices you are going to publish, etc. Similar techniques have been proposed to address the multiple querying problem described in part III, though the details are presently somewhat fuzzy.
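As a minimal sketch of that fix (the bracket counts and the noise scale are made-up values; one standard choice of noise distribution is the Laplace distribution, used here via numpy):

```python
import numpy as np

# Number of households in each 10,000/year income bracket (made-up data);
# the lone count of 1 in the last bracket is the millionaire.
bracket_counts = np.array([12, 45, 30, 8, 0, 0, 1])

# Add symmetric random noise to every count before publishing. The scale here
# is an assumed value; how to calibrate it is the subject of the next section.
noise = np.random.laplace(loc=0.0, scale=2.0, size=bracket_counts.shape)
noisy_counts = np.round(bracket_counts + noise).astype(int)

# After noising, a published count of 1 is consistent with the true count
# being 0, so the top bracket no longer pinpoints a single person's income.
print(noisy_counts)
```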
What about Differential Privacy? #
You may have heard the term differential privacy (DP). Technically speaking, DP is a definition of privacy. The idea here is that you have some database which you let people query and you want to limit the amount of information that the querier can learn about individuals. The idea is sort of a generalization of randomized response: instead of providing an exact answer, you provide a randomized answer structured so that the distribution of responses is similar regardless of whether a given individual's information is included in the database or not.
Formally, this is defined by Cynthia Dwork as:
Definition 2. A randomized function $\mathcal{K}$ gives $\epsilon$-differential privacy if for all data sets $D_1$ and $D_2$ differing on at most one element, and all $S \subseteq Range(\mathcal{K})$,
$\Pr[\mathcal{K}(D_1) \in S] \leq \exp(\epsilon) \times \Pr[\mathcal{K}(D_2) \in S]$
What this math translates to is that any query produces an answer that is based on the database but also randomized over a given distribution. The chance of any given answer is roughly (to within a factor of $e^{\epsilon}$) the same whether a given person's information is in the database or not. This is the same intuition as randomized response: the aggregate result is broadly accurate, but any individual response doesn't have much impact on the result.
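As a worked example of the definition, consider the classic symmetric variant of randomized response: answer truthfully if the coin comes up Heads; if Tails, flip a second coin and report "Yes" on Heads and "No" on Tails. Treating a single subject's answer as a one-row database, the two databases $D_1$ and $D_2$ are a true "Yes" and a true "No", and:

$\Pr[\text{report Yes} \mid \text{true Yes}] = \tfrac{3}{4}, \qquad \Pr[\text{report Yes} \mid \text{true No}] = \tfrac{1}{4}$

$\frac{\Pr[\text{report Yes} \mid \text{true Yes}]}{\Pr[\text{report Yes} \mid \text{true No}]} = 3 = e^{\ln 3}$

The same ratio bounds the "No" outputs, so this mechanism provides $\epsilon$-differential privacy with $\epsilon = \ln 3 \approx 1.1$. (The simpler scheme from the start of this post never reports "No" for a true "Yes", so this particular calculation doesn't apply to it.)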
In order to implement DP in practice, you choose a privacy value $\epsilon$ and then tune the amount of randomness you add in order to provide that value. Each query consumes a certain amount of the available privacy budget, and at some point you refuse to allow any more queries (the simple case here is when you publish the database once, in which case you just model this as one query). Actually determining how much randomness to add and how is non-trivial, but the original theory comes from a paper by Dwork, McSherry, Nissim, and Smith.
The terminology is a bit confusing here because people often talk about "implementing" or "using" differential privacy to mean that they are adding randomness in order to provide $\epsilon$-differential privacy. Moreover, there are two kinds of differential privacy depending on where the noise is added:
- Local differential privacy (LDP), where the noise is added at the endpoints, so that even the data collector doesn't learn much about the user's information. Randomized response techniques such as we have been discussing throughout this post provide LDP.
- Central differential privacy (CDP), where the collector gets accurate information but then adds randomness before disclosing it to people, as discussed in the previous section.
One important real-world example of central differential privacy is the US census, which is adding randomness before publishing in order to provide differential privacy. Unsurprisingly, this has resulted in complaints that the accuracy of the data will be degraded in important ways. I haven't learned enough about this to have an informed opinion on whether DP will have an important impact on census results, but it will obviously have some impact and, in general, any use of DP involves some sort of tradeoff between accuracy and privacy.
Importantly, it's not enough to say that a system is differentially private: you need to specify the $\epsilon$ value, which embodies that privacy/accuracy tradeoff by dictating how much randomness to add. Unfortunately, especially for LDP systems, it's hard to find a set of parameters which allows for good data collection and also has good privacy properties. For instance, Apple implemented an LDP system for collecting user telemetry, but concerns have been raised about the level of actual leakage in practice. The authors of RAPPOR report similar problems, where their choice of $\epsilon$ led to relatively low measurement power, and it's not really clear that it's possible to build a general-purpose LDP system that has both good privacy and acceptable accuracy. CDP systems also have this problem to some extent, but less so because you only need to add enough total randomness to protect users, rather than adding that much randomness to each individual submission. With that said, selecting the right $\epsilon$ value is still an open problem.
Summary #
The bottom line here is that while randomization is an important technique, local randomization is pretty hard to use except for a fairly narrow category of questions because the level of randomization required in order to provide privacy has such a large negative impact on accuracy. By contrast, central/global randomization techniques seem much more promising as a way to safely query data which has been gathered with exact techniques.
This is a general statistical phenomenon: if you have some response variable that is a function of some independent variables plus some random effects, then the more measurements you take the more the random effects will tend to wash out, leaving the predictable effects. This is why it is helpful to have a large data set. ↩︎
As a concrete example, Mozilla is working on a technique for certificate revocation for Firefox called CRLite in which you use multiple "cascading" Bloom filters with the second Bloom filter allowlisting certificates which are spuriously blocked in the first filter, the third blocklisting those spuriously allowed in the second, etc. Google Safe Browsing is effectively a Bloom filter with only one hash function. ↩︎