Educated Guesswork

Privacy Preserving Measurement 4: Heavy Hitters

This is part IV of my series on Privacy Preserving Measurement (see parts I, II, and III). Today we'll be addressing techniques for collecting so-called frequent strings (i.e., "heavy hitters").

Prio and similar technologies mostly operate at the level of sets of numeric values. As we've seen, this can be surprisingly useful, but doesn't work well when you want to collect non-numeric values. For example, suppose you wanted to see what web sites people visited commonly? You might, I suppose, make a list of the top million web sites and have each client report the number of visits. This has two obvious drawbacks:

  1. It is very expensive because most of these values will be zeros but you still need to send them (otherwise the server can tell which sites you visited by the fact that they were sent).

  2. It doesn't let you discover unknown values (i.e., new sites) because they won't be on the list.

The general form of this problem is what's known as collecting "heavy hitters", i.e., frequent strings. Recently, we've seen a fair amount of work on this problem, which I'll sketch a bit here.

Shamir Secret Sharing Designs (STAR) #

The first design I want to talk about is based on Shamir secret sharing. Briefly, this is a system in which you can break a secret S into an arbitrary number of shares such that any N are sufficient to reconstruct it.[1]

If we assume that each client i has a value S_i (e.g., the URL), then the client computes a key K_i as a deterministic function of S_i (e.g., by hashing it) and then sends the central server:

Secret-Share(K_i), Encrypt(K_i, S_i)

If the encryption is also deterministic, then the server can group all of the shares that correspond to the same value. Once it has collected enough shares (whatever the level of secret sharing is) it can then reconstruct K_i and decrypt S_i (which was of course shared by other people). This scheme, originally described by Bittau et al. in their paper on PROCHLO, is efficient and easy to understand and has the nice property that it doesn't require a trusted server. However, it has two problems. First, it's easy to tell when two subjects have the same value, so even if you don't know the value, you can group subjects that have shared values. Second, if the values are not high entropy (i.e., they are easy to guess) then the server can just iterate over possible values and generate its own K values until it finds a matching encryption value.

The first problem--telling which subjects have the same value--can be partly addressed by having values submitted via a proxy. This still reveals the distribution of values but not who has them, as long as you don't have separate identifying information, as I discussed in the post on proxies.

Davidson et al. propose STAR to partly address this second problem by having a separate (non-colluding) server which generates a per-value salt which can then be fed into the hashing process. The way this works is that the server computes what's called an oblivious pseudorandom function, which is a function that allows the randomness server to compute a deterministic function of the input (i.e., S_i) without seeing the actual value. This result is then used as an input to the hashing process along with the S_i. The result is that the data collector can't just exhaustively search all the potential S_i values on its own offline, but has to query the other server for each candidate value. This makes it possible to learn about specific values--even if they are uncommon--but hard to learn about unknown values.

IDPF-based Designs #

The second class of design was introduced by Boneh et al. (including the authors who designed Prio). Conceptually it's analogous to Prio in that the client splits up its value into two shares and sends one to each server along with a proof. The servers collaborate to verify the proof and then can return aggregated values. The actual encoding of the values is more complicated and depends on something called an incremental discrete point function (IDPF) which I won't describe here. (Incidentally, this system is badly in need of a cool name, because "IDPF-based systems" doesn't really roll off the tongue").

The way that this system is used in practice is that you think of each value as just being a sequence of bits, and can then query the system for the number of submissions with a given bit prefix. In other words, you can ask "how many submissions have the first bit 1? How many have the first bits 11? etc." This lets you quickly discard regions of the value space with low numbers of submissions and also discover the prefixes which are common (i.e., heavy hitters).

IDPF-based systems have a number of advantages. First, the server only learns the top values and the query path that got there; by contrast, in secret sharing approaches the servers learn all values over the secret sharing threshold. Second, like Prio it can be combined with demographic values in the clear which can then be later used for crosstabs and the like (with the same privacy caveats as with Prio). The price for this flexibility is rather higher computational cost, though still within practical limits (they can find the top 200 strings out of 400,000 clients with two servers in a bit less than an hour, so this is significantly slower than either STAR or Prio).

Next Up #

Everything I've discussed so far gives exact answers. This is convenient from the perspective of the data collector but can make the privacy properties hard to analyze. In the next post I'll be talking about randomized techniques that give approximate answers (i.e. "differential privacy" of both the local and central varieties).

  1. The way this is done technically is by constructing a polynomial of degree N-1 with the y-intercept being the secret. Any N distinct points are sufficient to reconstruct the polynomial. In this case, the polynomial has to be deterministically computed from the secret as well. ↩︎

Keep Reading