Educated Guesswork

New EV Habits for ICE Vehicle Owners

2024-06-03T00:00:00Z

Generated by Midjourney. Prompt "Man waiting for EV to charge, bored expression, EV charging station, photorealistic --ar 4:3"

I spent some time reading this HN thread in response to Wired's article on how many EV charging stations we need and I'm dumber than when I started (isn't that usually the way it is on the orange site?). On one side, we have the Internal Combustion Engine (ICE) forever crowd, worried about the tragedy of wasting 30 minutes charging on their 500 mile road trip, and on the other side we have EV lovers acting as if there's really no tradeoff.

I have two EVs so there's no doubt about what side of the argument I'm on, but I'm also not going to tell you that it's not inconvenient at times. The truth is that EVs really are a lot more convenient for most people for day to day driving but less convenient for long road trips, especially if you treat them the way you would an ICE vehicle rather than adapting yourself to their idiosyncrasies.^[1]

Background Facts #

The two basic vehicle parameters that dominate any discussion of EVs versus ICE vehicles are:

range: how far you can drive without refueling
refueling speed: how long it takes to refuel

Range #

When EVs were first introduced, range was fairly bad,^[2] but things have gotten a lot better. Edmunds lists the Toyota RAV4 as the most popular non-truck ICE vehicle,^[3] and the Tesla Model Y as the top EV. Both of these are compactish SUVs, so pretty comparable. The RAV4 Hybrid gets about 38 mpg highway with a 14.5 gallon tank, so has a range of about 550 miles (this is a little hard to estimate because it's a hybrid). The most popular EV, the Tesla Model Y, has a listed range of 320 miles.

This is a real physics problem for EVs because the energy density of batteries is much worse than that of gasoline: the RAV4's gas weighs about 90 lbs (14.5 gallons at roughly 6 lbs per gallon); the Tesla's battery weighs 1700 lbs. What this means in practice is that adding range to an EV involves tradeoffs in terms of cost and weight but it's trivial to add range to an ICE vehicle just by making the tank a bit bigger. If Toyota has chosen 14.5 gallons, that's because they don't think you need more.

Power Units versus Energy Units #

The terminology around EV units can be a bit confusing. A battery stores a certain amount of energy, which is conventionally measured in kilowatt hours (kWh), which is to say the amount of energy you would put into the battery if you added it at the rate of one kilowatt (kW) for an hour. What's a kilowatt, then? It's 1000 watts, where a watt is the power needed to transfer one joule (the SI unit of energy) per second.^[4] In other words, a kilowatt hour is 3.6 million joules (3.6 megajoules (MJ)). Electricity tends to get sold in units of kWh, which is probably why batteries are rated this way rather than in MJ.^[5]

EV efficiency varies dramatically, but a reasonable estimate is around 3-4 mi/kWh (5-7 km/kWh). Battery size also varies quite dramatically, but the median is around 75kWh. Multiplying these two values you get a range of 225-300 mi, which is about what you should expect from the above.
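As a quick check on the arithmetic above, here is a minimal Python sketch of the unit conversion and the range estimate. The function names are made up for illustration, and the inputs are just the rough figures quoted in the text (75 kWh, 3-4 mi/kWh).

```python
JOULES_PER_KWH = 1000 * 3600  # 1 kW (1000 J/s) sustained for 3600 s


def kwh_to_mj(kwh: float) -> float:
    """Convert kilowatt hours to megajoules."""
    return kwh * JOULES_PER_KWH / 1_000_000


def estimated_range_miles(battery_kwh: float, miles_per_kwh: float) -> float:
    """Range is stored energy multiplied by efficiency."""
    return battery_kwh * miles_per_kwh


if __name__ == "__main__":
    print(kwh_to_mj(1))                  # 3.6
    print(estimated_range_miles(75, 3))  # 225
    print(estimated_range_miles(75, 4))  # 300
```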

Refueling Speed #

ICE vehicles refuel faster than EVs. Period. The HN thread had some crazy fast estimates, but gasoline pumps do about 50 l (13 gallons) per minute, so we're looking at on the order of 5 minutes to fill up your tank. No deployed EV battery charges even remotely this fast.

At a high level, there are three main types of charger in the US:

AC level 1: Plugs into an ordinary 110V socket. About 1-2 kW.
AC level 2: Requires a dedicated circuit but installable in your home. Typically around 7 kW.
DC Fast Charging ("Level 3"): Commercial charging stations. Typically between 30 and 350 kW. My experience is that there is a lot of variation in actual charging speed for fast chargers, both in terms of rated power and in terms of actual power delivery. In addition, not all cars will charge at the maximum speed of the charger, with newer cars doing better.

Of course, what really matters isn't the rate of power delivery but rather the rate of range added. If we assume 3.5 mi/kWh, we get something like:

Charger type	Charging power (kW)	Miles added/hr	Time to add 250 miles of range
L1	1.5	5.25	47 hrs
L2	7	24.5	10 hrs
L3 (normal)	50	175	85 minutes
L3 (fast)	150	525	29 minutes
Tesla Supercharger (rated)	250	875	17 minutes
L3 (ultrafast)	350	1225	12 minutes
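The table can be reproduced, to within rounding, with a few lines of Python. This is only a sketch: it assumes the 3.5 mi/kWh figure from the text and a constant charging power, which ignores the taper that real batteries show as they fill up. The dictionary and function names are illustrative.

```python
MILES_PER_KWH = 3.5  # efficiency assumed in the text

CHARGERS_KW = {
    "L1": 1.5,
    "L2": 7,
    "L3 (normal)": 50,
    "L3 (fast)": 150,
    "Tesla Supercharger (rated)": 250,
    "L3 (ultrafast)": 350,
}


def miles_added_per_hour(power_kw: float, miles_per_kwh: float = MILES_PER_KWH) -> float:
    """Range gained per hour of charging at a constant power."""
    return power_kw * miles_per_kwh


def hours_to_add(miles: float, power_kw: float, miles_per_kwh: float = MILES_PER_KWH) -> float:
    """Hours needed to add the given number of miles of range."""
    return miles / miles_added_per_hour(power_kw, miles_per_kwh)


if __name__ == "__main__":
    for name, kw in CHARGERS_KW.items():
        rate = miles_added_per_hour(kw)
        hours = hours_to_add(250, kw)
        print(f"{name}: {rate:g} mi/hr, {hours:.1f} hr ({hours * 60:.0f} min)")
```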

As I mentioned above, real world experience varies. As a reference point, I have a BMW i3 and a Kia EV6. The BMW will nominally accept up to 49kW, but I don't think I've ever seen above 40. The Kia will nominally charge at up to 233 kW, but I think the highest I have ever seen is around 180 kW. It's also important to know that charging slows down quite a bit once the battery hits 80%, so as a practical matter it takes a lot longer to get to the full nominal range of the car than it does to get to 80% range. Again, this isn't an issue with gas cars where filling rate is comparatively constant.

Day to Day Driving #

The day-to-day driving experience for an EV is totally different from an ICE vehicle. With an ICE vehicle, you just drive around until you are low on gas and then visit the filling station. If you have an EV and a home charger (which you really want to have^[6]), you basically never have to use a public charger on a daily basis, even if all you have at home is an L1 charger. All you do is plug your car in when you get home, which quickly becomes a habit. If you have a home L2 charger, you don't even need to do it every day.

The average US commute distance is 42 miles, which represents about 8 hrs on an L1 charger. This means that if you just drive to work and back and you're at home for 12 hrs a day, you'll always have a full battery when you leave in the morning, with about 4 hrs to spare. As long as you don't drive more than 60ish miles, you'll still have a full battery every morning. This means that you almost never have situations where you get up, are late for something, and realize you need to stop and get gas, as happens with ICE vehicles.

Obviously people don't just commute, and if you take a longer drive then you'll use up more of your battery. However, on a day to day basis, most people don't drive more than the range of their car. If you drive more than an overnight charge's worth in one day, then you'll just start the next day with a slightly less than full battery. The net drain is however many miles you drove minus the amount you can charge overnight, so unless you have a lot of days with long trips, your battery never gets too low. When you return to a normal pattern it will refill again, unless you routinely drive as many miles as your charger can support.

For example, consider someone who has an EV with a range of 200 miles and drives 40 miles a day regularly, then has a few days where they need to drive 80. Here's what their battery state looks like after the overnight charge:

Day	Morning Range	Miles Driven	Evening Range
1	200	80	120
2	180	80	100
3	160	80	80
4	140	40	100
5	160	40	120
6	180	40	140
7	200	40	160

The bigger your battery, the longer you can sustain periods when you're consuming more than you're charging (this is of course also the situation when you're driving the car). Consider a vehicle with a 100 mile battery driven the same way:

Day	Morning Range	Miles Driven	Evening Range
1	100	80	20
2	80	80	0
3	60	80	-20 (oops)

On day three, instead of being down to less than half the battery, you're actually at negative battery! The only difference here is that you don't have as big a buffer, so when you consume more than you charge you run out. The bigger the battery, the more buffer you have and therefore the less of a big deal it is if you do a long drive one day. This buffer is built up in the days before your long drive when you're charging more than the drain; with a small battery, the car is just sitting fully charged whereas with a big battery it would still be charging.
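The buffer effect is easy to play with in a small simulation. The sketch below assumes the battery starts full, that each night adds a fixed 60 miles of range (roughly 12 hrs on an L1 charger), and that charging stops at the battery's capacity. The function name and parameters are illustrative only.

```python
def simulate(capacity_miles, overnight_charge_miles, daily_miles):
    """Return rows of (day, morning_range, miles_driven, evening_range).

    The battery starts full. Each night it gains overnight_charge_miles,
    capped at capacity_miles. A negative evening range means the car ran
    out of charge that day, and the simulation stops there.
    """
    rows = []
    morning = capacity_miles
    for day, driven in enumerate(daily_miles, start=1):
        evening = morning - driven
        rows.append((day, morning, driven, evening))
        if evening < 0:
            break
        morning = min(capacity_miles, evening + overnight_charge_miles)
    return rows


if __name__ == "__main__":
    pattern = [80, 80, 80, 40, 40, 40, 40]
    for capacity in (200, 100):
        print(f"{capacity} mile battery")
        for row in simulate(capacity, 60, pattern):
            print("\t".join(str(value) for value in row))
```

Changing the overnight charge to something like 290 miles (an L2 charger for 12 hrs) shows why the larger charger makes the whole question mostly go away.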

Of course, all of this is just with an L1 charger. If you have an L2 charger at home, then 12 hrs of charge is around 300 miles of range and so you'll nearly always have a full battery in the morning and will essentially never have to visit a public charger. You really just have to worry about situations where you do enough driving in one day to completely deplete your battery. This brings us to the topic of road trips.

Road Trips #

While it's clearly more convenient to not have to worry about refueling on a day-to-day basis, once you want to drive more than the range of your vehicle in one day, the situation gets quite a bit worse. In an ICE vehicle you can just generally drive from point A to point B and when you get low on gas, pull out your phone and look for a gas station. This is not a good plan for an EV for several reasons.

First, there are a lot fewer EV chargers than there are gas stations. As of Jan 2024, California had less than 2000 DC fast charging stations (there are around 7000 total in the US). By comparison there are over 13000 gas stations in California. Moreover, because of charging network incompatibility, you can't use every charger (though Tesla is supposed to be opening up its network to non-Tesla cars, which will improve the situation for non-Tesla owners, as Tesla operates the biggest network).^[7] The result of this is that when you get down to (say) 30 miles of range, you may not be able to find a conveniently located fast charger. And because the range of EVs is somewhat lower you will need to find a charger more often.

Second, as should be clear from above, EV charging is significantly slower than filling your gas tank even in the best case scenario where you have a fast DC charger (say 10-20 minutes). If you can only find a normal L3, you're looking at closer to an hour. I wouldn't generally even bother with stopping at a L2 charger, though they can be useful for charging overnight at a hotel or something. You can, of course, sit in your car at the charger for 30-60 minutes but it's not an ideal experience. Worse yet, it's not uncommon for the chargers to be full and/or one of the ports to be broken, in which case you also need to wait for someone else to finish.

Basic Strategy #

My recommendation instead is to lean into the way an EV behaves rather than trying to treat it like an ICE vehicle. What this mostly means is to plan your trip around actually stopping to charge.

As a real example, consider a trip from Palo Alto to Los Angeles in my Kia EV 6 GT (range: 210 miles). The total trip is 360 miles, so I should be able to do it with one charging stop, as long as it's located more or less halfway through. There's really only one choice here, which is Kettleman City, located 184 miles from Palo Alto and 178 from Los Angeles, where there is a 10 port Electrify America charging station. The station itself is located at Chalios Mexican Restaurant, but it's in a complex with a pile of other fast food restaurants (In-n-Out, Baja Fresh, McDonalds, etc.). To be honest, this is actually on the good side in terms of location options; lots of Electrify America stations are in Walmart parking lots. Anyway, what you want to do here is plan to get there around lunchtime, plug your car in, and then go grab some food while it charges. If there's a spare port when you arrive, it's actually reasonably likely that charging will be done before you finish eating (be nice, move your car), but even if not, you can just chill in In-n-Out for a bit.

This is basically the only good option if you have an EV with a 200-odd mile range and you want to make one stop: the next closest choices are Coalinga (203 miles from LA) and (214 miles from Palo Alto). You might make it with one of these, but you're cutting it a lot closer than I like. By contrast, if you have an EV with a 300 mile range, you could pick either of these, or even make it down to Bakersfield before finding a charging station. Of course, if you had a 400 mile range (e.g., Tesla Model 3 or S long range, Rivian R1, etc.) then you could actually do the whole trip in one shot, though you'd need to charge when you got there.
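The reasoning here boils down to checking that every leg of the trip fits inside the car's usable range. Here is a rough sketch of that check. The 20 mile reserve is an assumption of mine, not a figure from the text, and only the distances quoted above are used.

```python
def legs_fit(range_miles, legs, reserve_miles=20):
    """True if every leg is no longer than the range minus a safety reserve.

    reserve_miles is a hypothetical margin for detours, weather, and
    chargers that turn out to be full or broken.
    """
    usable = range_miles - reserve_miles
    return all(leg <= usable for leg in legs)


if __name__ == "__main__":
    # Kettleman City: 184 miles from Palo Alto, 178 miles from Los Angeles.
    print(legs_fit(210, [184, 178]))
    # Coalinga: the text only gives the Los Angeles leg (203 miles).
    print(legs_fit(210, [203]))
    print(legs_fit(300, [203]))
    # The whole 360 mile trip as a single leg in a 400 mile car.
    print(legs_fit(400, [360]))
```

With a 210 mile range only the Kettleman City stop passes; with 300 miles the Coalinga leg fits too, and with 400 miles the whole trip fits as a single leg.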

Trip Planning #

For the best result, an EV trip requires a lot more planning than with an ICE vehicle. I've certainly done trips where I just drove for a while and then searched for a charger, but this definitely has a higher risk of charging in a Walmart parking lot. You're going to be happier if you do some advance research. There are a number of trip planning tools available to you (Tesla, PlugShare, A Better Route Planner). There's no magic here, just put in your source and destination and play around a bit. Some of the tools will actually recommend specific stops and some you have to do it manually, but in either case you end up with an itinerary telling you where to stop.

Less good cases #

The Palo Alto to Los Angeles trip is basically the best case scenario: California has a lot of EV chargers and you can take Interstate 5 pretty much the whole way, so you're never that far from something. Even so, I tried a few experimental but realistic trips (Palo Alto to Yosemite, Denver to Silverton, Los Angeles to Beaver UT^[8]), and was usually able to find some kind of route. There's even an Electrify America charger at the Days Inn in Beaver, so you're not stuck at 10% when you arrive. With that said, you could easily spend a lot of time in gas station and Walmart parking lots.

Probably the worst case is when you are headed to somewhere remote and there may not be a charger, so you need to plan for a round trip. For instance, Lone Pine, California doesn't really have anything in the way of non-Tesla chargers, and there are only two stations on the way:

A Chargepoint L2 that might have one L3 port in Beatty
A pair of non-networked L2 plugs in Stovepipe Wells

Honestly, this would all leave me feeling pretty antsy and I'm not sure I'd want to do that trip in an EV that didn't have a really long range. You don't want to be stuck out in the middle of Death Valley with a dead battery.

Summing Up and the Future #

The bottom line is that neither an EV nor an ICE vehicle is overall better in terms of convenience. For day-to-day driving, just having a car which basically never needs to be fueled is clearly a win, so as long as you have a charger at home, it's hard to go wrong with an EV. You just have to remember to charge it every night.

When it comes to road trips, an ICE vehicle is more convenient, but you can close a lot of the gap with some good planning in terms of when you stop and charge. If you try to drive an EV the way you would an ICE vehicle by just driving until you are low on charge and then looking for a charger, you're going to have a much worse experience.

The good news is that the EV charging situation is getting rapidly better on all three fronts: (1) Batteries are getting bigger so you need to charge less frequently; (2) charging is getting faster so it's less of a hassle; and (3) more stations are being built so you have more options in terms of where to charge. As of today I'd feel comfortable doing most road trips on the West Coast in an EV, but there are still a few for which I'd want to rent something else, which seems like a reasonable tradeoff for the other ways in which an EV is better. If you buy an EV in five years or so, I expect there will be very few trips you won't be able to do in it.

[1] I'm talking here just about charging, but obviously there are a lot of ways in which EVs are just plain better, starting with dramatically better driving performance. I'm not here to sell you that, though.
[2] For example, the original BMW i3 had a range of less than 100 miles in 2014.
[3] The top 4 vehicles are all trucks.
[4] As a reference point, a reasonably fit person can put out around 300 W on a bike for an extended period of time.
[5] Note that this isn't some scenario where we're using goofy non-metric units. kWh are still defined in a sensible way from the base units, they're just not the SI official way of doing things. Calories (the amount of energy to heat a gram of water by 1 °C) are in a similar position of being a metric but not SI unit that is widely used in specific contexts.
[6] Exception: people who can charge at work.
[7] For non-Tesla owners, this mostly means you want Electrify America, which operates a lot of 150 kW and 350 kW DC chargers. The bad news is that it's not at all uncommon for EV chargers to be broken.
[8] Ultrarunners may be sensing a theme here.

Notes on Post-Quantum Cryptography for TLS 1.2

2024-05-24T00:00:00Z

As mentioned in previous posts, the IETF has decided not to add support for post-quantum (PQ) encryption algorithms to TLS 1.2. In fact, the TLS WG is taking a rather stronger position, namely that it's going to stop enhancing TLS 1.2 more or less entirely, including support for PQ algorithms:

While the industry is waiting for NIST to finish standardization, the IETF has several efforts underway. A working group was formed in early 2023 to work on use of PQC in IETF protocols, [PQUIPWG]. Several other working groups, including TLS [TLSWG], are working on drafts to support hybrid algorithms and identifiers, for use during a transition from classic to a post-quantum world.

For TLS it is important to note that the focus of these efforts is TLS 1.3 or later. TLS 1.2 WILL NOT be supported (see Section 5).

As I wrote previously, to some extent this is a political position:

One challenge with the story I told above is that PQ support is only available in TLS 1.3, not TLS 1.2. This means that anyone who wants to add PQ support will also have to upgrade to TLS 1.3. On the one hand, people will obviously have to upgrade anyway to add the PQ algorithms, so what's the big deal. On the other hand, upgrading more stuff is always harder than upgrading less. After all, the TLS working group could define new PQ cipher suites for TLS 1.2, and it's an emergency so why not just let people use TLS 1.2 with PQ rather than trying to force people to move to TLS 1.3. On the gripping hand, TLS 1.3 is very nearly a drop-in replacement for TLS 1.2. There is one TLS 1.2 use case that TLS 1.3 didn't cover (by design), namely the ability to passively decrypt connections if you have the server's private key (sometimes called "visibility"), which is used for server side monitoring in some networks. However, this technique won't work with PQ key establishment either, so it's not a regression if you convert to TLS 1.3.

In this post, I want to look at what it would actually take to add PQ support to TLS 1.2 and why we probably shouldn't do it (as well as "revise and extend" that last point). This requires going into some more detail about the cryptographic primitives we are working with here as well as the history of TLS key establishment.

Static RSA #

SSLv3 (and later TLS) originally supported two main key establishment modes:

Static RSA
Diffie-Hellman

For a long time by far the most common mode was static RSA, shown below:

TLS 1.2 static RSA mode

The way that this mode worked was that the server's certificate contained a public key for the RSA algorithm. The client then generated a random value (the premaster secret (PMS)) which it encrypted under the RSA public key. The server used its private key to decrypt the PMS, at which point both client and server knew it. They would each derive traffic keys from the PMS (as well as some other components of the handshake) which could be used to protect the traffic. Because the attacker doesn't have the private key it is unable to recover the PMS and therefore will not be able to communicate with the client.

This design has the property that if you know the RSA private key you can decrypt any connection protected with it. This means that an attacker who is able to obtain the private key, for instance by compromising the server, will be able to decrypt any connection that they have recorded, including connections months or years in the past (note that this is the same kind of attack we are worried about with a cryptographically relevant quantum computer (CRQC), except that a CRQC could recover the key from the handshake without compromising the server).
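To make the lack of forward secrecy concrete, here is a toy sketch using textbook RSA with tiny primes. It is not how TLS actually encodes the PMS (real TLS 1.2 uses padded RSA with a 48 byte PMS and a much larger key), and all of the names are illustrative. The point is only that one long-lived private key opens every recorded handshake.

```python
# Toy textbook RSA with tiny primes: insecure, for illustration only.
P, Q = 61, 53
N = P * Q                           # public modulus
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (Python 3.8+)


def encrypt_pms(pms: int) -> int:
    """Client side: encrypt the premaster secret under the server's public key."""
    return pow(pms, E, N)


def decrypt_pms(ciphertext: int) -> int:
    """Anyone holding the private exponent can recover the premaster secret."""
    return pow(ciphertext, D, N)


if __name__ == "__main__":
    # A passive attacker records the encrypted PMS from several handshakes.
    recorded = [encrypt_pms(pms) for pms in (42, 1234, 2024)]
    # Much later the attacker obtains D and recovers every one of them.
    print([decrypt_pms(c) for c in recorded])
```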

Ephemeral Diffie-Hellman #

SSLv3 also included a mode based on Diffie-Hellman key exchange:

TLS 1.2 ephemeral DH mode

In this mode, the server generates a Diffie-Hellman key share (public/private key pair) and sends it to the client. In order to authenticate the share, it signs the share using the RSA key, thus proving that the server controls the private key. An attacker who doesn't have the private key will not be able to sign the key share and therefore cannot impersonate the server.

As long as the client and server generate a fresh key share for each connection—which isn't strictly required by the specification, but is common practice—and then delete the private part of the key share after use, then even if an attacker subsequently compromises the server's private signing key, it still won't be able to decrypt connections that happened in the past. This property is called forward secrecy (sometimes "perfect forward secrecy").
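A toy version of the ephemeral exchange looks like the following. This is a sketch under loud assumptions: the group is far too small to be secure, the modulus and generator are picked for convenience rather than taken from any TLS specification, and the RSA signature that authenticates the server's share is omitted entirely.

```python
import secrets

# Toy finite-field group: far too small to be secure.
P = 2**127 - 1   # a Mersenne prime, used here only as a convenient modulus
G = 3


def new_key_share():
    """Generate a fresh (private, public) Diffie-Hellman key share."""
    private = secrets.randbelow(P - 3) + 2   # uniform in [2, P - 2]
    return private, pow(G, private, P)


def shared_secret(my_private, their_public):
    """Combine our private value with the peer's public share."""
    return pow(their_public, my_private, P)


def handshake():
    """One connection: both sides use fresh shares and then discard them."""
    c_priv, c_pub = new_key_share()
    s_priv, s_pub = new_key_share()
    client_key = shared_secret(c_priv, s_pub)
    server_key = shared_secret(s_priv, c_pub)
    assert client_key == server_key
    # Only the public shares cross the wire. The private parts go out of
    # scope when this function returns, which stands in for deleting them.
    return (c_pub, s_pub), client_key
```

Because nothing long-lived is needed to compute the key, later theft of the server's signing key does not help with a recorded `(c_pub, s_pub)` pair.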

In the early days of SSL/TLS deployment, there was a lot of concern about the performance cost of the cryptography and ephemeral DH mode is much more expensive than RSA key exchange (DH itself is expensive and you also have to do the RSA signature), so most servers did static RSA in order to save CPU. Over time, however, a number of factors combined to make forward secret key establishment more attractive:

New Diffie-Hellman variants based on elliptic curves were developed. Elliptic Curve Diffie Hellman Ephemeral (ECDHE) algorithms were much faster than the older finite-field based algorithms and so the marginal cost of doing ECDH was much less important.
Servers got faster so that the cryptography wasn't as big a deal overall.
There was increasing concern about the practical security of non-forward secret algorithms, in part due to the Snowden revelations.

Because of the design of TLS, it was possible to incrementally deploy ECDHE. As described in a previous post, TLS negotiates the key establishment algorithm and many clients already supported ECDHE key establishment, so as soon as the server turned on ECDHE, it would automatically be able to use it with compatible clients. Moreover, RSA has the interesting property that you can use the same key pair for both encryption/decryption and digital signature, so the server could use its existing RSA certificate to authenticate to the client; all it had to do is enable ECDHE.^[1]

Starting in 2013, TLS deployments increasingly used ECDHE for key establishment, as shown in the graph below.

TLS 1.2 key exchange modes over time. From Kotzias et al, 2018.

Because ECDHE is so much faster, it didn't make much of a difference in terms of cost to the server to do so; in fact if you also enabled EC-based signatures using ECDSA, the total cost to the server was actually less than using RSA, though as a practical matter many servers still use RSA certificates (which should also suggest to you that the performance issues are less of a factor now than when SSLv3 was first designed).

TLS 1.3 #

When TLS 1.3 was designed starting in 2013, we had a number of objectives:

1. Clean up: Remove unused or unsafe features
2. Improve privacy: Encrypt more of the handshake
3. Improve latency: Target: 1-RTT handshake for naïve clients; 0-RTT handshake for repeat connections
4. Continuity: Maintain existing important use cases
5. Security Assurance: Have analysis to support our work (added slightly later)

In order to address objectives (2) and (3) TLS 1.3 adopted a new handshake skeleton which reverses the order of the DH key shares, as shown below:

TLS 1.3 handshake overview

In TLS 1.3, the client supplies its key share in its first message (the ClientHello) and the server responds with its key share in its first message (ServerHello). As a result, the server is able to start encrypting messages to the client immediately upon receiving the ClientHello, starting with its own certificate (thus concealing the certificate from passive attackers on the wire).^[2] The client can start encrypting as soon as it gets the server's first flight of messages, so after one round trip, which is an improvement over TLS 1.2 in some situations.

This handshake flow is inconsistent with static RSA. Because the client sends its key share in its first message, it needs to be able to generate it without knowing the server's public key (or key share). This works fine with Diffie-Hellman (and elliptic curve Diffie-Hellman) because the key shares are generated independently of each other, but not with RSA because in RSA the sender has to use the recipient's public key to encrypt. Moreover, because the public key is in the certificate, a static RSA-based handshake makes encrypting the certificate much more difficult, as you need the certificate in order to learn the public key and hence to establish the encryption key.

Finally, static RSA is also quite difficult to implement correctly. There have been a series of adaptive attacks on the RSA implementations in TLS stacks. The general idea is that the attacker probes the server over and over by initiating handshakes and then observing the server's behavior. It can use this technique to gradually learn secret information from the server. For instance the attacker might take the encrypted PMS from some other handshake and send variants of the message until it has recovered the PMS itself.^[3] These attacks take advantage both of implementation issues with RSA and of the fact that the server uses the same RSA key over and over, which gives the attacker multiple opportunities to learn small bits of information that add up over time (another reason why it's attractive to use a fresh key for each handshake).

Aside: Forward Secrecy and Session Resumption #

You actually need more than just forward secret key establishment to make a forward secret protocol. TLS incorporates a feature called "session resumption" in which a key established in connection 1 can be reused in connection 2, thus saving some of the cost of the key establishment (and authentication). In TLS 1.2, that key is sufficient to decrypt connection 1, so if you implement resumption you don't have forward secrecy as long as the resumption key sticks around, but in TLS 1.3 the keys are generated in such a fashion that the key to connection 2 does not let you decrypt connection 1.

But wait, there's more: some stacks implement session resumption by encrypting the resumption key with a fixed secret and sending that value to the client as a "ticket", thus removing the need for a database. Obviously, as long as that key is around, you also have a forward secrecy issue: if the attacker compromises that key then it can decrypt any tickets it has observed and learn the keys. The impact of this depends on what TLS 1.3 modes you are using. TLS 1.3 has a resumption + DHE handshake mode that provides forward secrecy for resumption while still allowing you to omit the authentication. In addition, TLS 1.3 also includes a "zero-RTT" mode in which the resumption key is used to encrypt the first packet from the client; this doesn't benefit from the resumption + DHE handshake mode because it happens before the DH key establishment.

PQ TLS #

As mentioned previously, PQ is being added to TLS 1.3 by acting as if each PQ algorithm corresponds to a new elliptic curve (group). However, in reality our PQ key establishment algorithms are much more like RSA than Diffie-Hellman. Specifically, they're Key Encapsulation Mechanisms, as shown in the figure below.

KEM overview

As with RSA, in a KEM Bob starts by generating a public/private key pair, (K_pub, K_priv). He sends K_pub to Alice, who then uses a function called Encap and some randomness to produce two values:

A shared random secret value
An associated ciphertext value

She keeps secret and sends ciphertext to Bob, who can then use the Decap function and K_priv to compute secret; at this point Alice and Bob both know it.

Just like with DH, a KEM ends up with both Alice and Bob knowing the secret, but in DH Alice can generate her key share independently of Bob as long as she knows which curve (group) he supports. I.e., it doesn't matter who speaks first and—in protocols which support it—Alice's key share and Bob's could actually cross paths. By contrast with a KEM Alice needs to know Bob's public key first, which means that Alice can't send the ciphertext until she has received the first message from Bob. This is fine with TLS 1.2, but in TLS 1.3 it's a problem because the client speaks first, so we can't use the server's public key.

In order to use a KEM with TLS 1.3, we need to reverse the direction of the KEM, as shown below. The new elements are shown in red and I've omitted the DH elements of the hybrid mode for simplicity.

TLS 1.3 with a KEM

The client generates a public/private key pair and sends the public key to the server in the ClientHello
The server sends the ciphertext to the client in the ServerHello
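The KEM interface and the reversed TLS 1.3 flow can be sketched as follows. To keep the example self-contained it uses a toy RSA-based KEM with tiny primes, which is neither secure nor post-quantum; a real deployment would use something like ML-KEM through a proper library. The names `keygen`, `encap`, `decap`, and `tls13_style_exchange` are illustrative, not any library's API.

```python
import hashlib
import secrets

E = 65537  # public exponent; always coprime to phi for the tiny primes below

# Small primes found by trial division. Real keys use vastly larger primes.
SMALL_PRIMES = [n for n in range(1000, 2000)
                if all(n % d for d in range(2, int(n ** 0.5) + 1))]


def keygen():
    """Generate a toy key pair: returns (k_pub, k_priv)."""
    p, q = secrets.SystemRandom().sample(SMALL_PRIMES, 2)  # two distinct primes
    n = p * q
    d = pow(E, -1, (p - 1) * (q - 1))
    return n, (n, d)


def encap(k_pub):
    """Needs the peer's public key; returns (secret, ciphertext)."""
    r = secrets.randbelow(k_pub - 2) + 2        # fresh randomness in [2, n - 1]
    ciphertext = pow(r, E, k_pub)
    secret = hashlib.sha256(str(r).encode()).digest()
    return secret, ciphertext


def decap(k_priv, ciphertext):
    """Recover the same secret from the ciphertext using the private key."""
    n, d = k_priv
    r = pow(ciphertext, d, n)
    return hashlib.sha256(str(r).encode()).digest()


def tls13_style_exchange():
    # ClientHello: the client speaks first, so it generates the key pair
    # and sends the public key.
    k_pub, k_priv = keygen()
    # ServerHello: the server encapsulates to the client's public key and
    # returns the ciphertext.
    server_secret, ciphertext = encap(k_pub)
    # The client decapsulates; both sides now hold the same secret.
    client_secret = decap(k_priv, ciphertext)
    return client_secret, server_secret
```

Note that `encap` cannot run until the public key is available, which is exactly the ordering constraint that forces the key pair onto whichever side speaks first.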

Reversing RSA #

It's actually possible to deploy RSA this way as well by having the client generate a public key and provide it to the server. Because RSA encryption is very fast and decryption is much slower, this allows you to offload work from the server to the client, which is an advantage in Web scenarios because the clients have to establish far fewer connections. EC crypto has gotten fast enough that we didn't specify this mode for TLS 1.3, but Bittau et al used this trick in tcpcrypt.

This allows you to establish a shared secret in a single round trip. As with DH, the server authenticates to the client by signing the connection transcript, which includes the ciphertext value, thus binding the ciphertext to the server's key. Like DH key establishment, KEM key establishment in TLS 1.3 also offers forward secrecy if the client generates a fresh key pair for each connection (because the server's contribution depends on the client's key pair, the server automatically generates a fresh value).

TLS 1.2 #

This brings us to the topic of PQ for TLS 1.2.

If we wanted to add PQ support for TLS 1.2, we would presumably do more or less the same thing as with TLS 1.3, namely pretend that the PQ KEM is an elliptic curve group. Just as with TLS 1.2 DHE mode, this is in the reverse direction from TLS 1.3, with the server providing the first chunk of keying material (its public key) and the client generating the ciphertext and sending it to the server.

TLS 1.2 with a KEM

It's actually not clear that this would be safe as-is. The reason is that TLS 1.3 binds the entire handshake transcript to the resulting key by feeding the transcript into the key schedule along with the initial cryptographic shared secret. By contrast, TLS 1.2 only feeds in the random nonces in the ClientHello and ServerHello. The result is that in some circumstances an attacker can arrange that two connections (e.g., one from the client to the attacker and one from the attacker to another server) have the same cryptographic key. This property led to the Triple Handshake Attack by Bhargavan, Delignat-Lavaud, Fournet, Pironti, and Strub, which was one of the motivations for the more conservative design of TLS 1.3.
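The difference in what gets bound to the key can be illustrated with a heavily simplified sketch. Neither function below is the real key schedule (TLS 1.2 uses its own PRF and TLS 1.3 uses HKDF over several stages); a single HMAC stands in for both, and the handshake is reduced to a dictionary with three hypothetical fields.

```python
import hashlib
import hmac


def tls12_style_key(shared_secret: bytes, hs: dict) -> bytes:
    """Simplified: the key depends only on the secret and the two nonces."""
    seed = b"master secret" + hs["client_random"] + hs["server_random"]
    return hmac.new(shared_secret, seed, hashlib.sha256).digest()


def tls13_style_key(shared_secret: bytes, hs: dict) -> bytes:
    """Simplified: the key depends on the secret and the whole transcript."""
    transcript = hs["client_random"] + hs["server_random"] + hs["certificate"]
    transcript_hash = hashlib.sha256(transcript).digest()
    return hmac.new(shared_secret, b"derived" + transcript_hash, hashlib.sha256).digest()


if __name__ == "__main__":
    secret = b"\x01" * 32
    # Two connections where an attacker has arranged for the same secret and
    # nonces, but which involve different certificates.
    conn_a = {"client_random": b"C" * 32, "server_random": b"S" * 32,
              "certificate": b"certificate for attacker.example"}
    conn_b = dict(conn_a, certificate=b"certificate for server.example")
    print(tls12_style_key(secret, conn_a) == tls12_style_key(secret, conn_b))
    print(tls13_style_key(secret, conn_a) == tls13_style_key(secret, conn_b))
```

In the first style the two connections end up with the same key even though they involve different certificates; in the second style any difference in the transcript changes the key.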

As Deirdre Connolly describes in detail,^[4] KEMs have different properties than ECDHE (in some ways closer to RSA) and so we'd need to analyze precisely how to integrate them with TLS 1.2. I'm not saying it can't be done, but it's not necessarily just a simple matter of crossing out "X25519" in the specs and writing in "ML-KEM". Adapting TLS 1.3 to ML-KEM also requires some thinking but that thinking is already happening and is somewhat easier because of TLS 1.3's more conservative design. Obviously the TLS WG could do that work, but the question is whether it's worth doing, given that we are trying to transition everyone to TLS 1.3.

Why you might want to do PQ for TLS 1.2 anyway #

The basic argument for why you would want to do PQ for TLS 1.2 is that some people might find it difficult to upgrade their deployments to TLS 1.3 and much easier to upgrade their TLS 1.2 deployments to do PQ. I'm generally fairly skeptical of these arguments, but I want to walk through them anyway.

Sporadically Maintained Deployments #

The broad argument is that there are a lot of environments that aren't actively maintained, so upgrading is difficult in general and they are more or less stuck on TLS 1.2. For instance, they might be using a TLS library which is updated only for security issues, either because the library vendor updates it infrequently or because the library consumer is stuck on an old version.

Consider the (hypothetical) case of a TLS library which has current version 2.0 but also has a version 1.1 which is on long-term support. Version 2.0 supports TLS 1.3 but version 1.1LTS only supports TLS 1.2. A deployment which is on 1.1LTS might hope that the vendor would add PQ support to 1.1LTS even though they weren't going to upgrade it to support TLS 1.3, and that upgrading to 1.1.1LTS would be less disruptive than upgrading to version 2.0.

This doesn't apply to the Web which is generally quite up to date—and which is in the process of transitioning to TLS 1.3—but there are of course lots of environments which are much slower to upgrade and arguably might have more trouble upgrading (Peter Gutmann is one of the main advocates of this view.) I do have some sympathy for this perspective, but at the end of the day one of the costs of using software is you have to upgrade it—if only to fix the inevitable vulnerabilities—and I don't think it's unreasonable to expect people to upgrade in order to get a major change like PQ support rather than expecting the rest of the world to do a lot of work to make it slightly easier for them.

I want to emphasize that this is (almost) exclusively a software issue; as I said above TLS 1.3 is intended as a drop-in replacement for TLS 1.2, meaning that in most cases you should just be able to update your TLS stack, and get TLS 1.3 as soon as the other side updates.

Passive Decryption #

As I said, TLS 1.3 is intended to be a drop-in replacement for TLS 1.2. There is, however, one notable and high-profile exception, what's called TLS visibility. The problem statement goes something like this. Imagine you operate an encrypted Web server of some kind and you want to monitor traffic between users and your server. There are a number of reasons you might want to do this, such as:

Debugging problems with your server.
Looking for malicious activity (attacks by clients connecting to the server).
Measuring the performance of the server on live traffic.

It's possible to do all of these things by instrumenting the server, but not all servers have great instrumentation and what if the server is the source of the problem? Another approach is to capture the traffic as it goes over the network (e.g., via port mirroring) and then decrypt it using the RSA private key. This can be done entirely passively (i.e., without interfering with the connection) and you can decrypt either in real time or by recording the traffic and then decrypting only the connections of interest. This has the advantage that you don't need to touch the server beyond getting a copy of the private key and you get to dig as deep as you want into what's going on without trusting the server.
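As a toy illustration of why this works with static RSA, consider textbook RSA with tiny numbers. Real TLS uses keys of 2048 bits or more plus padding, so this is only a sketch and the helper names are assumptions. The point is that the one value the server is configured with, the private exponent, is also all a passive monitor needs to decrypt a recorded key exchange.

```python
# Textbook RSA with tiny primes (illustration only, NOT secure).
p, q = 61, 53
n = p * q                          # public modulus
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent, configured on the server


def client_encrypt_premaster(premaster):
    """Client: encrypt the premaster secret under the server's public key."""
    return pow(premaster, e, n)


def decrypt_with_private_key(ciphertext, private_exponent):
    """Anyone holding the private exponent can recover the premaster secret."""
    return pow(ciphertext, private_exponent, n)


premaster = 65
recorded_ciphertext = client_encrypt_premaster(premaster)  # captured off the wire

server_view = decrypt_with_private_key(recorded_ciphertext, d)
monitor_view = decrypt_with_private_key(recorded_ciphertext, d)  # monitor holds a copy of d
```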

However, these techniques don't work if you are using ephemeral Diffie-Hellman (whether of the ordinary or EC variety): knowing the server's private key allows you to impersonate the server but not to decrypt the traffic. Decrypting the traffic requires the DH private key share, which is usually generated internally by the server rather than stored on the disk the way that the long term private key is. Moreover, if the server uses a fresh DH share for every handshake—which is required for forward secrecy—then allowing decryption would require somehow sending the decryption device a copy of every key, which is obviously a lot more difficult than just a copy of a single key.

Although DH establishment became more common, even with TLS 1.2 (see above), that didn't interfere with the use of passive decryption because servers weren't required to enable it. The TLS key establishment mode is selected by the server, and as long as there was a significant population of servers which only did static RSA, clients had to support static RSA, which meant that servers could insist on it, thus allowing this kind of passive decryption to work fine with TLS 1.2. Of course, those servers wouldn't be following best security practice in terms of protecting user traffic, but it was still technically possible.

By contrast, because TLS 1.3 doesn't support static RSA at all, it's incompatible with naive passive inspection. Of course servers could just refuse to negotiate TLS 1.3, but staying on TLS 1.2 forever isn't really an answer, especially now that the IETF has decided not to add new features to TLS 1.2. When TLS 1.3 was being finalized, a number of organizations—especially high sensitivity sites like banks or health insurance companies—raised concerns about losing this tool, but at the end of the day the TLS working group felt that forward secrecy was an important security feature and that re-adding static RSA would have been way too disruptive to the resulting protocol.

It is possible to adapt TLS 1.3 to enable passive decryption even with Diffie-Hellman. There are at least three obvious approaches here:

Have the server re-use the same Diffie-Hellman key share for multiple connections. The server can then save a copy of the key somewhere (e.g., on disk) and the administrator can send a copy to the monitoring device.
Have the server send copies of the per-connection keys (hopefully in some secure fashion) to the monitoring device, which can use them to decrypt the connections.
Have the server deterministically generate the per-connection DH key shares based on a static secret and information in the connection. You then provision the monitoring device with the static secret and it can compute the DH key shares for itself.
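To give a feel for the third approach, here is a minimal sketch in which the server derives its per-connection DH private value from a static secret and the client's random nonce, so that a monitoring device holding the same static secret can recompute it. The derivation, the label, and the toy group are all assumptions made for illustration; this is not any specified scheme and the parameters are not secure.

```python
import hashlib
import hmac

# Toy group parameters (illustration only, NOT secure).
P = 2**127 - 1
G = 3


def derive_dh_private(static_secret, client_random):
    """Derive the per-connection DH private value from the static secret and nonce."""
    digest = hmac.new(static_secret, b"dh-share" + client_random, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (P - 2) + 1


def server_share(static_secret, client_random):
    """Server: the public DH share it sends for this connection."""
    return pow(G, derive_dh_private(static_secret, client_random), P)


def monitor_shared_secret(static_secret, client_random, client_public):
    """Monitor: recompute the server's private value, then the DH shared secret."""
    return pow(client_public, derive_dh_private(static_secret, client_random), P)


static = b"k" * 32          # provisioned on both server and monitoring device
nonce = b"r" * 32           # ClientHello random, visible on the wire
client_priv = 123456789     # known only to the client
client_pub = pow(G, client_priv, P)

client_secret = pow(server_share(static, nonce), client_priv, P)
monitor_secret = monitor_shared_secret(static, nonce, client_pub)
```

The monitor never sees a per-connection key in transit, but everything now hangs on the static secret and on getting the derivation right.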

None of these are particularly difficult to implement but they also require modifying the TLS stack in a way that isn't required to provide service to the client but only to provide the ability to passively decrypt. Moreover, options (2) and (3) also require specifying exactly how the keys will be transmitted (2) or computed (3), both of which have the potential to create severe vulnerabilities (up to perhaps complete compromise of every connection) if they are done incorrectly.

It's really important to understand at this point that what makes passive inspection work in the first place is basically just an idiosyncrasy of the way that static RSA mode works. Specifically, you need to configure the server with the private key and the private key is also what you need to decrypt the traffic passively. This means that the administrator usually already has the credential in hand and can easily transfer it to the monitoring device without any special affordance by the server or TLS stack implementor. What we've seen over the past 8 or so years is that the implementors are much less enthusiastic about building special features to enable passive decryption. So, for instance, BoringSSL and OpenSSL don't seem to implement any of them. However, NIST has been running an initiative around this, specifying techniques (1) and (2), and it seems like some big vendors (e.g., F5) are participating. I don't know if they are actually planning to ship anything.

PQ and Passive Decryption #

This brings us to the topic of passive decryption for PQ. The obvious way to use PQ—just swapping it for DH—is not really compatible with this kind of passive decryption.^[5]

This is most obvious with TLS 1.3: because the server generates its ciphertext based on the client's public key, there's simply no server private key to provide to the decryption device, because a fresh key is generated for each connection. Unlike with DH, it's not even possible to generate a single static key pair and reuse it (technique (1) above) because the Encap() operation depends on the client's public key.

The situation is slightly more complicated with TLS 1.2 because the server rather than the client generates the private key. In principle, the server could just generate a single ML-KEM key and use it indefinitely (similar to approach (1) above), but, as with approach (1), there is no real reason for the server to do this other than to enable visibility. Specifically:

Re-using the same ML-KEM key breaks forward secrecy, so it's less secure.^[6]
It's actually more programming work to remember the ML-KEM key between transactions rather than just generate a new one, especially in a multi-threaded system.
You need to build some mechanism to allow either export or import of the ML-KEM key, which you wouldn't otherwise need.

Moreover, because the most common way to deploy PQ algorithms is as a hybrid, you'd need to also do something for DH, which TLS 1.2 deployments that are set up to allow passive decryption don't usually do now, because they just do static RSA instead. The bottom line, then, is that it's not really significantly easier to support passive decryption ("visibility") for TLS 1.2 with PQ than it is for TLS 1.3, so that's not really a very good argument for porting PQ into TLS 1.2.

The Bigger Picture #

Designing and maintaining cryptographic protocols is a lot of work, and TLS 1.3 and TLS 1.2 are different enough that it's obviously desirable only to maintain one of them even if TLS 1.3 were no better than TLS 1.2. As should be clear at this point, it's technically possible to add support for PQ to TLS 1.2, but it's not trivial. That in and of itself doesn't mean it's not worth doing, but it has to pass the cost/benefit test. For instance, there might be some important application where it was hard to swap TLS 1.3 in for TLS 1.2. However, as far as I can tell that's not true. While there are deployments stuck on TLS 1.2, they should be able to move to TLS 1.3 without significant impact on their existing functionality, although it might involve some inconvenience in terms of software. It would be better if they were to do so rather than the IETF community needing to maintain TLS 1.2 indefinitely.

This is not considered good practice in modern systems, but it's very convenient in this particular situation. ↩︎
Of course, active attackers can replay the ClientHello with their own key and get the server to encrypt the certificate to them, but this is more work than passively snooping. TLS Encrypted ClientHello addresses this issue. ↩︎
There is also a version in which the attacker can extract a signature on a specific value. ↩︎
See Cremers, Dax, and Medinger for more on how to think about the security properties of KEMs. ↩︎
As an aside, there is actually a proposal called AuthKEM that adapts TLS 1.3 to use a handshake more like static RSA but with KEMs in the place of RSA. However, that doesn't change the situation for TLS 1.2, and everyone assumes that AuthKEM would be run in a forward secret mode where the client also provided a KEM public key. ↩︎
In addition, reusing the same key makes remote side channel attacks on the key (like those we see with RSA) easier. If you use a different key for each transaction, then the attacker only has one chance to learn about it so the side channel has to leak a lot more information. ↩︎

How to manage a quantum computing emergency

2024-04-15T00:00:00Z

Illustration by Kate Hudson with MidJourney and Photoshop AI.

Recently, I wrote about how the Internet community is working towards post-quantum algorithms in case someone develops a cryptographically relevant quantum computer (CRQC). That's still what everyone is hoping for, but nobody really knows when or even if a CRQC will be developed, and even in the best case the transition is going to take a really long time, so what happens if someone builds a CRQC well in advance of when that transition is complete? Clearly, this takes a situation that is somewhere between non-urgent and urgent to one that is outright emergent, but that doesn't mean that all is lost. In this post, I want to look at what we would do if a CRQC were to appear sooner rather than later. As with the previous post, this post primarily focuses on TLS and the Web, though I do touch on some other protocols.

Obviously there are a lot of scenarios to consider and "cryptographically relevant" is doing a lot of work here. For instance, we typically assume that the strength of X25519 is approximately 128 bits, meaning an attack costs about 2¹²⁸ operations. A technique which brought the cost down to 2⁸⁰ operations would be a pretty big improvement as an attack and would definitely be "cryptographically relevant" but would also still leave attacks quite expensive; it probably wouldn't be worth using this kind of CRQC to attack connections carrying people's credit cards, especially if each connection had to be attacked individually, at a cost of 2⁸⁰ operations each time. This would obviously be a strong incentive to accelerate the PQ transition, but probably wouldn't be an outright emergency unless you had particularly high value communications.

For the purpose of this post, let's assume that:

This is a particularly severe attack, bringing the existing algorithms within range of commercial attackers in a plausible time frame, whether that's days or real time.^[1]
It happens at some point in the next few years, while there is significant deployment but by no means universal deployment of PQ key establishment and minimal if any deployment of PQ signatures and certificates.

This is close to a worst-case scenario in that our existing cryptography is severely weakened but it's not practical to just disable it and switch to PQ algorithms. In other words ~~isn't~~ it's an emergency and leaves us with a fairly limited set of options. [Corrected, 2024-04-15]

Key Establishment #

The first order of business is to do something about key establishment. Obviously if you haven't already implemented a PQ-hybrid or pure PQ algorithm, you'll want to do that ASAP, selecting whichever one is more widely deployed (or potentially doing both if some peers do one and some the other).

Once you've added support for some PQ algorithm, the question is whether you should disable the classical algorithm. The naive answer is "no": even if the classical algorithm is severely weakened, any encryption is better than no encryption. In reality, the situation is a bit more complicated.

Recall that in TLS, the client proposes a set of algorithms and the server selects one, as shown below:

TLS handshake sketch

The idea here is that the server gets to see what algorithms the client supports and pick the best algorithm. As long as the client and server agree on the algorithm ranking, then this will generally work fine. However, it's possible that the servers and clients will disagree, in which case the server's preferences will win.^[2]
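A minimal sketch of this selection rule, with made-up algorithm labels and a hypothetical `negotiate` helper (real TLS stacks expose this through configuration rather than a function like this):

```python
def negotiate(client_offer, server_preference):
    """Return the first algorithm in the server's preference order that the client offered."""
    for algorithm in server_preference:
        if algorithm in client_offer:
            return algorithm
    return None  # no common algorithm: the handshake fails


# The client lists the PQ hybrid first, but the order of its offer does not
# bind the server.
client_offer = ["pq-hybrid", "classical"]

agreeing_server = ["pq-hybrid", "classical"]
disagreeing_server = ["classical", "pq-hybrid"]

chosen_when_agreeing = negotiate(client_offer, agreeing_server)
chosen_when_disagreeing = negotiate(client_offer, disagreeing_server)
```

Both peers support the PQ hybrid in the second case, yet the connection ends up on the classical algorithm because the server ranks it first.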

This actually happened during the transition away from the RC4 symmetric cipher. After a series of papers showed significant weaknesses in RC4, the browsers decided they preferred AES-GCM. Unfortunately, many servers preferred RC4, and so the result was that even when both clients and servers supported RC4 and AES-GCM, many servers selected RC4. In response, browsers (starting with IE^[3]) adopted a system in which they first tried to connect without offering RC4, and if that failed they then retried with it, as shown below:

TLS fallback to RC4

The result was that any server which supported AES-GCM would negotiate it, but if the server only supported RC4, the client could still connect. This also made it possible to measure the fraction of servers which supported AES-GCM, thus providing information about how practical it was to disable RC4.
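The fallback logic can be sketched as follows. The helper names are hypothetical and the "handshake" is reduced to picking from lists, but the two-attempt structure is the one described above.

```python
def negotiate(client_offer, server_preference):
    """Server picks the first of its preferred algorithms that the client offered."""
    for algorithm in server_preference:
        if algorithm in client_offer:
            return algorithm
    return None


def connect_with_fallback(server_preference, preferred, fallback):
    """Offer only the preferred algorithms first; on failure, retry including the fallback ones.

    Returns (negotiated_algorithm, used_fallback).
    """
    result = negotiate(preferred, server_preference)
    if result is not None:
        return result, False
    return negotiate(preferred + fallback, server_preference), True


# A server that prefers RC4 but also supports AES-GCM.
both = connect_with_fallback(["RC4", "AES-GCM"], ["AES-GCM"], ["RC4"])
# A server that only supports RC4.
rc4_only = connect_with_fallback(["RC4"], ["AES-GCM"], ["RC4"])
```

The second element of the result is what makes measurement possible: counting how often the fallback attempt was needed tells you what fraction of servers still lack the preferred algorithm.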

Downgrade Attacks #

So far we've only considered a passive attacker, but what about an active attacker? TLS 1.3 is designed so that the signature from the server protects the handshake, so as long as the weakest signature algorithm supported by the client is strong, an active attacker can't tamper with the results of the negotiation.^[4] The fallback system described above weakens this guarantee a little bit in that the attacker can forge an error and force the client into the fallback handshake. However, the client will still offer both algorithms in the fallback handshake, so the attacker can't stop the server from picking its preferred algorithm; it can just stop the client from getting the client's preferred algorithm by manipulating the first handshake.

Of course, if the server's signature isn't strong—or more properly the weakest signature algorithm the client will accept isn't strong—then the attacker can tamper with the negotiated key establishment algorithm. However, an attacker who can do that can just impersonate the server directly, so it doesn't matter what key establishment algorithms the client supports.

Maybe it's better to fail open #

The bottom line here is that as long as you're not under active attack, TLS will deliver the strongest^[5] algorithm that's jointly supported by the peers, and, if you're under active attack by an attacker who can break signature algorithms, then all bets are off. That's probably the best you can do if you're determined to connect to the server anyway. But the alternative is, don't connect.

The basic question here is how sensitive the communication with the site is. If you're just looking up some recipes or reading the news, then it's probably not that big a deal if your connection isn't secure (in fact, people used to regularly argue that it wasn't necessary at all, though that's obviously not a position I agree with). On the other hand, if you're doing your banking or reading your e-mail, you probably really don't want to do that unencrypted. This isn't to say that we don't want ubiquitous encryption—we do—or that it's not possible for even innocuous seeming communications to be sensitive—it is—but to recognize that this scenario would force us to make some hard choices about whether we're willing to communicate insecurely if that's the only option. These are hard choices for a human and even harder for a piece of software like a browser (it's much easier for a standalone mail client, obviously).

This is actually a situation where ubiquitous encryption makes things rather more difficult. Back when encryption was rare, it was a reasonable bet that if a site was encrypted then the operators thought it was particularly sensitive. But now that everything is encrypted, it's much harder to distinguish whether it's really important for this particular connection to be protected versus just that it's good general practice (which, again, it is!).

One thing that may not be immediately obvious is that an insecure connection can threaten not just the data that you are sending over it, but other data as well. For example, if you are reading your email, you're probably authenticating with either a password (with a normal mail client) or a cookie (with Webmail). Both of these are just replayable credentials, so an attacker who can decrypt your connection can impersonate you to the server and download all your email, not just the messages you are reading now. As discussed above, an attacker who recorded your traffic in the past might still be able to recover your password, but this is a lot more work than just getting it off the wire in real time.

Signature Algorithms #

Of course, none of this does anything to authenticate the server, which is critical for protecting against active attack. For that we need the server to have a certificate with a PQ algorithm and the client to refuse to trust certificates that either (1) are signed with a classical algorithm or (2) contain keys for a classical algorithm. Importantly, it's not enough for the server to stop using a classical [Fixed 2024-04-15] certificate, because the server doesn't have to be part of the connection at all. In fact, even if the server doesn't have a PQ certificate, attack is still possible because the attacker can just forge the entire certificate chain.

As described in my previous post, the first thing that has to happen is that servers have to deploy PQ certificates. Without that, there's not much the clients can do to defend themselves. In this case, I would expect there to be a huge amount of pressure to do that ASAP, despite the serious size overhead issues with PQ certificates noted by Bas Westerbaan and David Adrian. After all, it's better to have a slow web site than one that's not secure or that people can't connect to.

For the same reason, I would expect there to be a lot less concern about the availability of hardware security modules (HSMs) for the new PQ algorithms or whether the algorithms in question have gone through the entire IETF standards process [Added link 2024-04-15]. Those are both good things, but having PQ safe certificates is more important, so I would expect the industry to converge pretty fast on a way forward.

Once there is some level of PQ deployment, clients can start distrusting the classical algorithms (before that, there's not much point). However, as with key establishment, if the client distrusts classical algorithms then it won't be able to connect to any server that doesn't have a PQ certificate, which will initially be most of them, even in the best case. This is frustrating because it means that you have to choose between failure to connect or having protection against active attack. What you'd really like is to have the best protection you can get, i.e.,

Only trust PQ algorithms for sites that have PQ certificates (so you aren't subject to active attack).
Allow classical algorithms for sites without PQ certificates (so you at least get protection against passive attack).

Actually, there are three categories here:

Sites which are so sensitive that you shouldn't connect to them without a PQ certificate (e.g., your bank).
Sites which are known to have a PQ certificate and so you shouldn't accept a classical certificate (probably big sites like Google).
Sites that aren't that sensitive and so you'd be willing to connect to them with a classical certificate (e.g., the newspaper).
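If a client could somehow assign each site to one of these categories, the resulting policy is simple to state. A minimal sketch, with hypothetical names (`Category`, `should_connect`) and with the hard part, the categorization itself, assumed away:

```python
from enum import Enum


class Category(Enum):
    SENSITIVE = "sensitive"        # e.g., your bank
    KNOWN_PQ = "known_pq"          # known to have a PQ certificate
    LOW_SENSITIVITY = "low"        # e.g., the newspaper


def should_connect(category, cert_is_pq):
    """Decide whether to accept the certificate the server presented."""
    if cert_is_pq:
        return True  # a PQ certificate is acceptable for every category
    # Classical certificate: only tolerable for sites that are neither
    # sensitive nor already known to have a PQ certificate.
    return category is Category.LOW_SENSITIVITY
```

For example, `should_connect(Category.KNOWN_PQ, cert_is_pq=False)` is rejected because a classical certificate from such a site is more likely an attack than a misconfiguration.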

The problem is being able to distinguish which category a site falls into. Usually, we don't try to draw this kind of distinction, and just let the site tell us if it wants TLS, but this isn't a usual situation, so it's worth exploring some inconvenient things.

PQ Lock #

The most obvious thing is to have the client remember when the server has a PQ certificate and thereafter refuse to accept a classical certificate. Unfortunately, this idea doesn't work well as-is, because server configurations aren't that stable. For instance:

A site might roll out PQ and then have problems and disable it.
A site might have multiple servers and gradually roll out PQ certificates on one of them.
A site might be served by more than one CDN with different configurations.

Note that in cases (2) and (3) the client will not generally be aware that there are different servers, as they have the same domain name, and IP addresses aren't reliable for this purpose (and, in any case, are likely under control of the attacker because DNS isn't very secure). And in case (1) it's actually the same server.

In any of these situations the client could contact the server, get a PQ certificate, and then come back and get a classical certificate, so if the client just forbids any use of classical after PQ, this would create a lot of failures. Fortunately, we've been in this situation before with the transition to HTTPS from HTTP, so we know the solution: the server tells the client "from now on, insist on the new thing," and the client remembers that.

With HTTP/HTTPS, this is a header called HTTP Strict Transport Security (HSTS) and has the semantics "just do HTTPS from now on with this domain". It would be straightforward to introduce a new feature that had the semantics "just insist on PQ from now on with this domain". In fact, the HSTS specification is extensible, so if you wanted to also insist on HTTPS (a good idea!), you could probably just add a new directive saying "also require PQ". It would also be easy to add a new HTTP header that said "if you do HTTPS, require PQ", as HTTP is nicely extensible and unknown headers are just ignored.
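Here is a minimal sketch of the client-side state such a mechanism would need. The `require-pq` directive is purely hypothetical (no such directive is defined anywhere); only `max-age` is borrowed from HSTS. The `PQLockStore` class and its methods are likewise assumptions for illustration.

```python
import time


class PQLockStore:
    """Remembers which domains have asked the client to insist on PQ."""

    def __init__(self):
        self._expiry = {}  # domain -> UNIX time until which PQ is required

    def observe_header(self, domain, header_value, now=None):
        """Process a header received over a connection the client already trusts."""
        now = time.time() if now is None else now
        directives = [d.strip().lower() for d in header_value.split(";")]
        max_age = None
        for directive in directives:
            if directive.startswith("max-age="):
                max_age = int(directive.split("=", 1)[1])
        # "require-pq" is a hypothetical directive, not part of the HSTS spec.
        if "require-pq" in directives and max_age is not None:
            self._expiry[domain] = now + max_age

    def pq_required(self, domain, now=None):
        now = time.time() if now is None else now
        return self._expiry.get(domain, 0) > now

    def accept_certificate(self, domain, cert_is_pq, now=None):
        """Classical certificates are fine only while no PQ lock is in force."""
        return cert_is_pq or not self.pq_required(domain, now)


store = PQLockStore()
before = store.accept_certificate("example.com", cert_is_pq=False, now=1000)
store.observe_header("example.com", "max-age=3600; require-pq", now=1000)
during = store.accept_certificate("example.com", cert_is_pq=False, now=2000)
after_expiry = store.accept_certificate("example.com", cert_is_pq=False, now=5000)
```

Because the lock comes from the server rather than from the client's own observation, a site that rolls back PQ or serves it from only some machines simply never sends the directive and nothing breaks.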

One of the obvious problems with an HSTS-like header—and in fact with HSTS itself—is that it relies on the client at some point connecting to the server while not under attack. If the attacker is impersonating the server then they just don't send the new header. They can even connect to the real server and send valid data otherwise, but just strip the header. This is still a real improvement, though, as the attacker needs to be much more powerful: if the client is ever able to form a secure connection to the true server, then it will remember that PQ is needed and be protected against attack from then on, even if it's not protected from the beginning.

Preloading #

It's possible to protect the user from active attack from the very beginning by having the client software know in advance which servers support PQ. This is already something that browsers do with HSTS, where it's called "HSTS preloading". Chrome operates a site where server operators can request that their sites be added to the "HSTS preload list". The site does some checking to make sure that the server is properly configured and then Chrome adds it to their list. In principle, other browsers could do this themselves, but in practice, I think they all start from Chrome's list.

In principle, we could use a system like this for PQ preloading as well, but there are scaling issues. The HSTS preload list is fairly sizable (~160K entries as of this writing), but this only represents a small fraction of the domains on the Internet. For example, Let's Encrypt is currently issuing certificates for more than 100 million registered domains and over 400 million fully qualified domains. If we assume that sites which have moved to PQ are aggressive about preloading—which they should be for security reasons—we could be talking about 10s of millions of entries. The current Firefox download is about 134 MB, so we're probably looking at a nontrivial expansion in the size of a browser download to carry the entire preload list, even with compact data structures. On the other hand, it's probably not totally prohibitive, especially in the early years when there is likely to not be that much preloading.
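A rough back-of-the-envelope version of this estimate, under loudly stated assumptions: 20 million entries as one point within "10s of millions", an average domain name of about 20 bytes, and an 8-byte truncated hash as one possible compact encoding. None of these figures come from a real preload list; only the 134 MB download size is from the text above.

```python
MB = 1_000_000


def list_size_mb(entries, bytes_per_entry):
    """Size in MB of a flat list with a fixed number of bytes per entry."""
    return entries * bytes_per_entry / MB


ENTRIES = 20_000_000        # assumed point within "10s of millions"
FIREFOX_DOWNLOAD_MB = 134   # figure quoted in the text

names_mb = list_size_mb(ENTRIES, 20)   # assumed ~20-byte average domain name
hashes_mb = list_size_mb(ENTRIES, 8)   # assumed 8-byte truncated hash per domain

for label, size in [("raw names", names_mb), ("8-byte hashes", hashes_mb)]:
    print(f"{label}: {size:.0f} MB, {size / FIREFOX_DOWNLOAD_MB:.1f}x the download")
```

Under these assumptions even the compact encoding is in the same range as the browser itself, which is why the size of the list matters once preloading becomes common.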

There may also be ways to avoid downloading the entire database. For instance, you could use a system like Safe Browsing which combines an imperfect summary data structure with a query mechanism, so that you can get offline answers for most sites, but then will need to check with the server to be sure. The Safe Browsing database has about 4 million entries—or at least did back in 2022—so you probably could repurpose SB-style techniques for something like this, at least until PQ certificates got a lot more popular.^[6] The privacy properties of SB-style systems aren't as good as just preloading the entire list, so there's a tradeoff here, and it would be a matter of figuring out the best of a set of not-great options.
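A minimal sketch of the general shape of such a system, using short hash prefixes as the local summary. The class, the 4-byte prefix length, and the in-process `query_server` stand-in for a network call are all assumptions for illustration, not a description of how Safe Browsing itself is implemented.

```python
import hashlib

PREFIX_LEN = 4  # bytes of each SHA-256 hash kept on the client (assumed parameter)


def full_hash(domain):
    return hashlib.sha256(domain.lower().encode("ascii")).digest()


class PrefixList:
    def __init__(self, pq_only_domains):
        # The list operator holds full hashes; the client holds only prefixes.
        self._server_full = {full_hash(d) for d in pq_only_domains}
        self.local_prefixes = {h[:PREFIX_LEN] for h in self._server_full}

    def query_server(self, prefix):
        """Stand-in for a network query: all full hashes that share this prefix."""
        return {h for h in self._server_full if h[:PREFIX_LEN] == prefix}

    def is_pq_only(self, domain):
        """Return (listed, needed_query)."""
        h = full_hash(domain)
        if h[:PREFIX_LEN] not in self.local_prefixes:
            return False, False  # definitely not listed, answered offline
        # Possible match: confirm against the full hashes held by the server.
        return h in self.query_server(h[:PREFIX_LEN]), True


plist = PrefixList(["bank.example", "mail.example"])
listed_result = plist.is_pq_only("bank.example")
unlisted_result = plist.is_pq_only("news.example")
```

The privacy cost shows up in `query_server`: whenever a local prefix matches, the list operator learns that the client visited some site with that prefix.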

Of course, browser vendors don't need to wait for servers to ask to be preloaded; they could just add them proactively, for instance by scanning to see which sites advertise the PQ-only header, or even which sites just support PQ algorithms. Obviously there's some risk of prematurely recording a site as PQ-only, but there's also a risk in allowing non-PQ connections in this situation. The higher the proportion of servers that support these algorithms, the more aggressive browser vendors can be about requiring PQ support, and the more readily they can add servers to the list, even if the server hasn't really directly signaled that it wants to be included.

Site Categorization #

There are other indicators that can be used to determine whether a site is especially sensitive and so needs to be reached over a PQ-secure connection or not at all. This could happen either browser side or server side based on a variety of indicia such as requiring a password or being a medical or financial site. One could even imagine building some kind of statistical or machine learning model to determine whether sites were sensitive. This doesn't have to be perfect as long as it's significantly better than static configuration.

Reducing overhead #

Obviously, we would be in a better position if it weren't so expensive to use PQ signature algorithms. Mostly, this is about the size of the signatures. As noted in Bas's post, there are a number of possible options for reducing the size overhead, these include:

All of these mechanisms are designed to be backward compatible, meaning that the client and the server can detect if they both support the optimization and use it, but can fall back to the more traditional mechanisms if not. The first two mechanisms work with existing WebPKI certificates, and would work with PQ certificates as well, requiring only that the client and server software be updated to support the optimization.

The last mechanism ("Merkle tree certificates") replaces existing WebPKI certificates, and so would require servers to get both a PQ WebPKI certificate and a PQ Merkle tree certificate, and conditionally serve the right one depending on the client's capabilities. This is obviously more work for the server operator (the same for the browser user). On the other hand, if server operators are already going to have to change their processes to get both PQ and classical certificates, it would be a convenient time to also change to get a Merkle tree certificate.

HTTP Public Key Pinning #

Obviously, in addition to recording that the server supported PQ algorithms you could remember the server's PQ signature key and insist that the server present that in the future (this is how SSH works). In the past the TLS community explored more flexible versions of this approach with a technique called HTTP Public Key Pinning. HPKP was eventually retired, in part due to concerns about how easy it was to render your site totally unusable by pinning the wrong key and in part because mechanisms like Certificate Transparency seemed to make it less important.
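The SSH-style version of this is just "trust on first use" keyed by host name. A minimal sketch, with a hypothetical `PinStore` class and opaque byte strings standing in for real public keys:

```python
import hashlib


class PinStore:
    """Remembers the first public key seen for each host and checks later ones against it."""

    def __init__(self):
        self._pins = {}  # host -> SHA-256 fingerprint of the pinned public key

    def check(self, host, public_key_bytes):
        fingerprint = hashlib.sha256(public_key_bytes).hexdigest()
        pinned = self._pins.get(host)
        if pinned is None:
            self._pins[host] = fingerprint  # first contact: trust and remember
            return "pinned"
        return "ok" if pinned == fingerprint else "mismatch"


pins = PinStore()
first = pins.check("example.com", b"server-pq-public-key")
second = pins.check("example.com", b"server-pq-public-key")
changed = pins.check("example.com", b"some-other-key")
```

The `"mismatch"` case is exactly where the operational risk lives: a legitimate key rotation and an attack look the same to the client.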

One might imagine resurrecting some variant of HPKP for a PQ transition as a stopgap during a period where sites are prepared to deploy PQ but CAs can't issue PQ certificates yet. This wouldn't be quite the same because the server would have to authenticate with its classical certificate but then pin the PQ key, which would be accepted without a certificate chain, which HPKP doesn't support. My sense is that we could probably manage to get some issuance of PQ certificates faster than we could design a new HPKP type mechanism and get it widely deployed, but it's probably still an option worth remembering in case we need it.

What about TLS 1.2? #

One challenge with the story I told above is that PQ support is only available in TLS 1.3, not TLS 1.2.^[7] This means that anyone who wants to add PQ support will also have to upgrade to TLS 1.3. On the one hand, people will obviously have to upgrade anyway to add the PQ algorithms, so what's the big deal? On the other hand, upgrading more stuff is always harder than upgrading less. After all, the TLS working group could define new PQ cipher suites for TLS 1.2, and it's an emergency, so why not just let people use TLS 1.2 with PQ rather than trying to force people to move to TLS 1.3? On the gripping hand, TLS 1.3 is very nearly a drop-in replacement for TLS 1.2. There is one TLS 1.2 use case that TLS 1.3 didn't cover (by design), namely the ability to passively decrypt connections if you have the server's private key (sometimes called "visibility"), which is used for server side monitoring in some networks. However, this technique won't work with PQ key establishment either, so it's not a regression if you convert to TLS 1.3.

Non-TLS systems #

Much of what I've written above applies just as well to many other interactive security protocols such as IPsec or SSH,^[8] which are designed along essentially the same pattern. Any non-Web interactive protocol is likely to have an easier time because there will be a fairly limited number of endpoints you need to connect to, so you can more readily determine whether the other side has upgraded or not. As a concrete example, SSH depends on manual configuration of the keys (the server's key is usually done on a "trust on first use" basis when the client initially connects). Once that setup is done, you don't need to discover the peer's capabilities. By contrast, a Web browser has to be able to connect to any server, including ones it has no prior information about.
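As a rough illustration of the "trust on first use" pattern described above, here is a minimal sketch of a pin store. The `KnownHosts` class and its return values are invented for this example and are not how any particular SSH implementation stores its keys.

```python
import hashlib


class KnownHosts:
    """Toy trust-on-first-use store: host name -> pinned key fingerprint."""

    def __init__(self):
        self._pins = {}

    @staticmethod
    def fingerprint(public_key: bytes) -> str:
        return hashlib.sha256(public_key).hexdigest()

    def check(self, host: str, public_key: bytes) -> str:
        fp = self.fingerprint(public_key)
        pinned = self._pins.get(host)
        if pinned is None:
            # First contact: accept whatever key we see and remember it.
            self._pins[host] = fp
            return "pinned"
        # Later contacts: the key must match what we remembered.
        return "ok" if pinned == fp else "mismatch"


hosts = KnownHosts()
first = hosts.check("example.org", b"placeholder server key")
again = hosts.check("example.org", b"placeholder server key")
swapped = hosts.check("example.org", b"some other key")
print(first, again, swapped)
```

Once the pin exists, the client never has to discover what the server supports; it only has to check that nothing changed.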

There is a huge variety of other cryptographic protocols and our ability to recover from a CRQC would vary a lot. Especially impacted will be anything which relies on long-term digital signatures, as they are hard to replace. A good example here is cryptocurrency systems like Bitcoin which rely on signatures to effect the transfer of tokens: if I can forge a signature from you then I can steal your money. The right defense against this is to replace your classical key with a PQ key (effectively to transfer money to yourself), but we can assume that a lot of people won't do that in time, and as soon as a CRQC is available, any future transaction becomes questionable.

The situation around Bitcoin seems to actually be pretty interesting. The modern way to do Bitcoin transfers is to transfer them not to a public key but the hash of a public key (called pay to public key hash (p2pkh)). As long as the public key isn't revealed, then you can't use a quantum computer to forge a signature. The public key has to be revealed in order to transfer the coin, but if you don't reuse the key, then there is only a narrow window of vulnerability between the signature and when the payment is incorporated into the blockchain (which doesn't depend on public key cryptography). However, according to this study by Deloitte, about 25% of Bitcoins are vulnerable to a CRQC, so that's not a great situation.
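To make the "pay to the hash, reveal the key only when spending" idea concrete, here is a deliberately simplified sketch. It uses a single SHA-256 where real P2PKH uses SHA-256 followed by RIPEMD-160, it omits the signature check entirely, and the key bytes are placeholders.

```python
import hashlib


def key_hash(public_key: bytes) -> bytes:
    # Simplification: real P2PKH hashes with SHA-256 and then RIPEMD-160.
    return hashlib.sha256(public_key).digest()


def can_spend(locked_to: bytes, revealed_public_key: bytes) -> bool:
    # The spender reveals the public key and the network checks it against
    # the hash. A real node would also verify a signature made with the
    # matching private key; that step is not modeled here.
    return key_hash(revealed_public_key) == locked_to


public_key = b"placeholder public key bytes"
output_lock = key_hash(public_key)  # all that is visible before spending

print(can_spend(output_lock, public_key))
print(can_spend(output_lock, b"someone else's key"))
```

Until the spend, an attacker with a CRQC only sees `output_lock`, which gives them no public key to attack.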

What if the PQ algorithms aren't secure? #

All of the above assumes that we have public key algorithms that are in fact secure against both classical and quantum computers. In that case, our problem is "just" transitioning from our insecure classical algorithms to their more-or-less interface compatible PQ replacements. But what happens if those algorithms turn out to be ~~secure~~ insecure [Corrected 2024-04-15] after all? In that case we are in truly deep trouble. Obviously the world got on OK for centuries without public key cryptography, but now we have an enormous ecosystem based on public key cryptography that would be rendered insecure.

Some of those applications may just get abandoned (maybe we don't really need cryptocurrencies...) but it would obviously be very bad if nobody was able to safely buy anything on Amazon or use Google docs, or if your health care records couldn't be transmitted securely, so there's going to be a lot of incentive to do something. The options are pretty thin, though.

Signature #

We do have at least one signature algorithm which we have reasonably high confidence is secure: hash signatures, which NIST is standardizing as "SLH-DSA". Unfortunately, the performance is extremely bad (we're talking 8KB signatures). On the other hand, slow and big signature algorithms are better than no signature algorithms at all, so there are probably some applications where we'd see some use of SLH-DSA.

Key Establishment #

While the signature story is bad, the key establishment story is really dire. The main option people seem to be considering is some variant of what I've been calling intergalactic Kerberos. Kerberos is a security protocol designed at MIT back in the 80s and in its original form works by having each endpoint (user, server) share a pairwise symmetric^[9] key with a key distribution server (KDC).

A high level view of Kerberos

At a high level, when Alice wants to talk to Bob, she contacts the KDC using a message encrypted with her pairwise key K_a and tells it that she wants to contact Bob. The KDC creates a new random key K_ab and then sends Alice two values:

K_ab
A copy of K_ab encrypted under Bob's key (K_b), i.e., E(K_b, {Alice, K_ab}). In Kerberos terms this is called a "ticket".

Alice can then contact Bob and present the ticket. Bob decrypts the ticket and recovers K_ab. Now Alice and Bob share a key they can use to communicate. Note that this all uses symmetric cryptography, so it's not vulnerable to attacks on our PQ algorithms. You can wire up this kind of key establishment mechanism into protocols like TLS (TLS 1.2 actually has Kerberos integration, but it wasn't ported into TLS 1.3) and use them in something approximating the usual fashion, albeit in a much clunkier fashion.
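Here is a toy sketch of that exchange using only symmetric primitives. Everything in it is an assumption for illustration: the `seal`/`unseal` helpers are a homemade HMAC-based cipher (not something to use for real), the JSON message layout is invented, and encrypting the KDC's reply under K_a is my choice rather than something spelled out above. Real Kerberos has timestamps, realms, and a lot more.

```python
import hashlib
import hmac
import json
import secrets


def _keystream(key, nonce, length):
    out, counter = b"", 0
    while len(out) < length:
        block = b"stream" + nonce + counter.to_bytes(4, "big")
        out += hmac.new(key, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def seal(key, plaintext):
    """Toy authenticated encryption: XOR keystream plus an HMAC tag."""
    nonce = secrets.token_bytes(16)
    stream = _keystream(key, nonce, len(plaintext))
    body = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(key, b"tag" + nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def unseal(key, blob):
    nonce, body, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, b"tag" + nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad key or tampered message")
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


class KDC:
    def __init__(self):
        self.keys = {}  # endpoint name -> long-term pairwise key

    def register(self, name):
        self.keys[name] = secrets.token_bytes(32)
        return self.keys[name]

    def request(self, client, sealed_request):
        peer = unseal(self.keys[client], sealed_request).decode()
        k_ab = secrets.token_bytes(32)  # fresh key for this pair
        ticket = seal(self.keys[peer],
                      json.dumps({"client": client, "key": k_ab.hex()}).encode())
        reply = {"key": k_ab.hex(), "ticket": ticket.hex()}
        return seal(self.keys[client], json.dumps(reply).encode())


kdc = KDC()
k_a = kdc.register("alice")
k_b = kdc.register("bob")

# Alice asks the KDC for a key to talk to Bob.
reply = json.loads(unseal(k_a, kdc.request("alice", seal(k_a, b"bob"))))
alice_k_ab = bytes.fromhex(reply["key"])

# Alice forwards the ticket; Bob opens it with his own pairwise key.
ticket = json.loads(unseal(k_b, bytes.fromhex(reply["ticket"])))
bob_k_ab = bytes.fromhex(ticket["key"])

print(alice_k_ab == bob_k_ab, ticket["client"])
```

Note that the KDC generated `k_ab` itself, which is exactly why it can read everything Alice and Bob later say to each other.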

Merkle Puzzle Boxes #

It turns out that there actually sort of is a public key system that doesn't depend on any fancy math and so we can have reasonable confidence in how secure it is. In fact, it's the original public key system, invented by Ralph Merkle. This post is already pretty long, so if you're interested check out the Wikipedia page. The TL;DR is that it's probably not that practical because (1) public key sizes are enormous and (2) it only offers the defender a quadratic level of security (if the defender does work N the attacker does work N² to break it), which isn't anywhere near as good as other algorithms. There seem to be some quantum attacks on puzzle boxes (though I'm not sure how good they are in practice), but there is also a PQ variant.

This kind of design has a number of challenges. First, it's much harder to manage. In a public-key based system clients don't need to have any direct relationship with the CA, because they just need the CA's public key. In a symmetric key system, however, each client needs a relationship with the KDC in order to establish the shared key. This is obviously a huge operational challenge.

The basic challenge with this kind of design is that the KDC is able to decrypt K_ab and hence any traffic between Alice and Bob. This is because the KDC is providing both authentication and key establishment, unlike with a public key system like the WebPKI where the CA provides authentication but the endpoints perform key establishment using asymmetric algorithms. This is just an inherent property of symmetric-only systems, and it's what we're reduced to if we don't have any CRQC-safe asymmetric algorithms.

One potential mitigation is to have multiple KDCs and then Alice and Bob use a key derived from exchanges with those KDCs. In such a system, the attacker would need to compromise all of the KDCs in use for a connection in order to either (1) impersonate one of the endpoints or (2) decrypt traffic. Recently we've started to see some interest in symmetric key type solutions along these lines, including a draft at the IETF and a recent blog post by Adam Langley.^[10] My sense is that due to the drawbacks mentioned above, this kind of system isn't likely to take off as long as we have PQ algorithms, even if they're not that efficient. However, if the worst happens and we don't have asymmetric PQ algorithms at all, we're going to have to do something, and symmetric-based systems will be one of the options on the table.
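One way to picture "a key derived from exchanges with those KDCs" is to chain the per-KDC keys through HMAC, so that the output depends on every one of them. This is only a sketch under my own assumptions (the `combine` function and its context label are made up) and is not taken from the IETF draft or from Langley's design.

```python
import hashlib
import hmac
import secrets


def combine(kdc_keys, context):
    """Fold the keys obtained via each KDC into one connection key."""
    state = hashlib.sha256(context).digest()
    for key in kdc_keys:
        # Each step is keyed by one KDC-provided key, so an attacker who is
        # missing even one of them cannot compute the final state.
        state = hmac.new(key, state, hashlib.sha256).digest()
    return state


from_kdc_1 = secrets.token_bytes(32)
from_kdc_2 = secrets.token_bytes(32)
from_kdc_3 = secrets.token_bytes(32)

connection_key = combine([from_kdc_1, from_kdc_2, from_kdc_3], b"alice->bob")
print(connection_key.hex())
```

Alice and Bob both run the same computation over the same keys in the same order, so they arrive at the same connection key.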

The Bigger Picture #

As I mentioned in the previous post, we shouldn't expect the PQ transition to happen very quickly, both because the algorithms aren't all that we'd like and because even with better algorithms the transition is very disruptive. However, because the Internet is so dependent on cryptography and in particular public key cryptography, there would be enormous demand to do something if a CRQC were to be developed any time soon. When compared to the alternative of no secure communications at all, a lot of options that we would have previously considered unattractive or even totally non-viable would suddenly look a lot better, and I would expect the industry to have to make a lot of tough choices to get anything at all to work while we worked out what to do in the long term.

This distinction does matter for some attacks, but even if it's days, the situation is really bad. ↩︎
Unless the server decides to defer to the client, of course. ↩︎
Thanks to David Benjamin for help with the history of this technique. ↩︎
This is a new feature of TLS 1.3. In TLS 1.2, the security of the handshake depended on the weakest common key establishment algorithm, which left it vulnerable to attacks if the weakest algorithm was breakable in real-time. ↩︎
Again with the caveats above about preferences ↩︎
The worst case is when about 1/2 of the sites want to be preloaded; once you get to well over 50%, you can instead publish the list of non-preloaded sites, though this is logistically a bit trickier, as you'd need a list of every site. You can get this list from Certificate Transparency, though, which is what CRLite does. ↩︎
Obviously, there's an element of "we're trying to avoid maintaining TLS 1.2 and we want people to upgrade" going on here, but there's also a small technical advantage here: although TLS 1.2 and TLS 1.3 both authenticate the server by having the server sign something, in TLS 1.2 the signature only covers part of the handshake (specifically, the random nonces and the server's key), which means that the signature doesn't cover the key establishment algorithm negotiation. This means that an attacker who can break the weakest joint key establishment algorithm can mount a downgrade attack, forcing you back to that weakest algorithm. However, we could presumably address this by remembering that both key establishment and authentication are PQ only. ↩︎
Note: QUIC uses the TLS 1.3 handshake under the hood, so it has roughly the same properties as TLS 1.3 ↩︎
In original Kerberos, a DES key. ↩︎
Langley's design actually assumes that PQ algorithms work but are too inefficient to use all the time, so you use it to bootstrap the symmetric keys with the KDC. ↩︎

Design choices for post-quantum TLS

2024-03-30T00:00:00Z

It's a cruel irony that just as encryption is finally becoming ubiquitous, quantum computers threaten to tear it all down.

Firefox HTTPS usage

The technical details aren't that important (see here for some background), but the TL;DR version is that many of our cryptographic algorithms are designed to be difficult to break using "classical" computers (which is to say the kind we have now), but if you have a quantum computer, which takes advantage of quantum mechanical effects,^[1] then it might be possible to efficiently break these algorithms.

I say might because the situation is somewhat uncertain in that while people have built quantum computers, they are currently quite small, nowhere near what you would need to mount an attack on a modern cryptographic algorithm. There's a lot of money being invested in developing quantum computers, but nobody really knows when we'll have what's called a cryptographically relevant quantum computer (CRQC), which is to say one which could mount practical attacks on the cryptosystems in wide use, or whether it's possible to build one at all.

There was the time Blueshell had a humor fit at Pham’s faith in public key encryption, and Ravna knew some stories of her own to illustrate the Rider’s opinion.

— Vernor Vinge, "A Fire Upon The Deep"

However, if a CRQC were to exist, the impact would be catastrophic, potentially rendering nearly every existing use of cryptography insecure. Specifically, it would break the "asymmetric" algorithms we use to authenticate other Internet users and to establish cryptographic keys, so an attacker would be able to impersonate anyone and/or recover the keys used to encrypt data. A CRQC probably won't have that big an impact on the actual "symmetric" encryption used to encrypt the data itself, but if you have the key you can just decrypt it with a regular computer, so that's not much in the way of comfort.

For that reason, researchers have started developing what are often called post-quantum (PQ) cryptographic algorithms, which are designed to resist attack by quantum computers, or more properly, for which there are no known quantum algorithms which would allow you to break them (which isn't to say that those algorithms don't exist). After a fairly long competition, NIST published new standards for post-quantum key establishment (ML-KEM) and digital signature (ML-DSA)^[2] and protocol designers and implementors are starting to look at how to adapt their protocols to use them.

In this post, I want to look at the challenges around that transition, focusing on the situation for TLS and the WebPKI, though some of the same concerns apply to other settings.

Why not just convert right now? #

The obvious question is why not just convert now, as we did when changing from older algorithms like RSA to newer ones based on elliptic curves (EC). The reason is that the new PQ algorithms are not clearly better than the EC algorithms that dominate the space now. Specifically:

In many cases, performance is worse in terms of CPU, key, ciphertext, or signature size. For instance, ML-KEM is faster than X25519 (the most popular current EC key establishment algorithm) but the keys are much bigger, over 1000 bytes compared to 32 bytes. The situation is much worse for signatures, where there really isn't any standardized algorithm which isn't a big regression from EC-based signatures in one way or another, and due to the large number of signatures that need to be carried in a typical protocol exchange, the size issue is a big deal, especially as there appear to be compatibility issues. These posts by Bas Westerbaan from Cloudflare and David Adrian from Chrome do a good job of covering the state of play of the various algorithms, but in general none of them has a better overall performance profile than EC.
We're not sure that they're secure. There has been quite a bit of security analysis on the particular EC variants that are in wide use and while there has been a lot of work on the problems underlying ML-KEM and ML-DSA, my understanding is that there is still real uncertainty about how secure these systems are against classical computers. Daniel J. Bernstein (DJB) has been one of the biggest advocates for this view, but much of the industry is sort of antsy about the PQ algorithms.^[3]
We don't know if or even when we'll get a CRQC. Current quantum computers are very far away from being cryptographically relevant and of course progress is hard to predict. The Global Risk Institute has produced a report with estimates on when we will have a CRQC, with the results shown below:

Global Risk Institute Estimates for a CRQC

For these reasons, industry has generally been pretty cautious about rolling out PQ algorithms.

Threat Model #

Because classical key establishment and digital signature are based on the same underlying math problems, the impact of a CRQC on these algorithms is also the same, which is to say very bad. However, the security impact is very different.

Key Establishment #

When you encrypt traffic, you want that traffic to remain secret for the valuable lifetime of the data. For instance, if you are encrypting your credit card number, you want it to remain secret as long as that credit card is still valid. Lots of information has very long lifetimes during which people want it to remain secret; presumably you wouldn't be happy with people learning your medical history 6 months from now.

When you encrypt traffic using keys derived via an asymmetric key-based establishment protocol—as with TLS—this means that you need that key establishment algorithm to also be secure for the lifetime of the data. In this context, that means that data that is being sent now using keys established with EC algorithms—which is to say most of it—might be revealed in the future if someone develops a CRQC. An attacker might even deliberately capture a lot of traffic on the Internet, betting that eventually a CRQC will be developed and they can decrypt it (this is called a "harvest now, decrypt later" attack).

For this reason, doing something about the threat of a CRQC to the security of key establishment is a fairly high priority, because every day that you use non-PQ algorithms you're adding to the pile of data that might eventually be decryptable. This is especially true because transitions can take a really long time even in the best case. For example, TLS 1.2 first added support for modern AEAD algorithms such as AES-GCM in 2008, but Firefox and Chrome didn't even add support for TLS 1.2 until 2013, and AEAD cipher suites didn't outnumber the older CBC-based ciphers until 2015. So, even in the best case, we're still going to be sending a lot of non quantum safe traffic for years to come.

Negotiated cipher suites over time. From Kotzias et al., 2018

Digital Signature #

By contrast, a digital signature algorithm only needs to be secure at the time you make decisions based on the validity of the signature; in TLS this is at the time the connection is established. If a CRQC that breaks your signature algorithm is developed 30 seconds after your TLS connection is established, your data remains secure as long as you established a key using some non-vulnerable method (of course, your next connection won't be secure, so you'll want to do something about that).

Signatures for Object Security #

Note that the situation is different for signatures in object-based protocols like e-mail, because people want to be able to validate the signature long after the message was sent. Thus, having a PQ signature does help, even if paired with a classical signature, because it allows the signature to survive subsequent development of a CRQC.

It's also possible to allow a classical algorithm to survive the development of a CRQC by timestamping the signature to demonstrate that the classical signature was created prior to the development of the CRQC. For instance, you could arrange to register a hash of the signed document with some blockchain type system. You can then present the signed document paired with the timestamp proof (note that the timestamp service doesn't need to verify the signature itself; it's just vouching that it saw the document at time X). The relying party can verify that the signature was made prior to the development of the CRQC, in which case it is presumably trustworthy.
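The timestamping idea can be sketched with a toy append-only log. The `TimestampLog` class, the integer "times", and using a log index as the proof are all simplifications I've made up for illustration; a real system would use something like a Merkle inclusion proof, and the relying party would still verify the classical signature itself, which is omitted here.

```python
import hashlib


class TimestampLog:
    """Toy append-only log: each entry commits to the one before it."""

    def __init__(self):
        self.entries = []  # (time, document_hash, chain_hash)

    def register(self, time, document_hash):
        prev = self.entries[-1][2] if self.entries else b"\x00" * 32
        chain = hashlib.sha256(
            prev + time.to_bytes(8, "big") + document_hash).digest()
        self.entries.append((time, document_hash, chain))
        return len(self.entries) - 1  # the index stands in for a proof

    def registered_at(self, index, document_hash):
        time, stored, _ = self.entries[index]
        return time if stored == document_hash else None


def predates_crqc(log, index, signed_document, crqc_time):
    # The log never looks at the signature; it only vouches for when it
    # saw this exact sequence of bytes.
    digest = hashlib.sha256(signed_document).digest()
    seen = log.registered_at(index, digest)
    return seen is not None and seen < crqc_time


log = TimestampLog()
old_doc = b"message || classical signature (placeholder)"
late_doc = b"another message || classical signature (placeholder)"
old_proof = log.register(100, hashlib.sha256(old_doc).digest())
late_proof = log.register(300, hashlib.sha256(late_doc).digest())

print(predates_crqc(log, old_proof, old_doc, crqc_time=200))
print(predates_crqc(log, late_proof, late_doc, crqc_time=200))
```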

For this reason, doing something about digital signatures is generally considered to be a lower priority, although of course it will be really inconvenient if a CRQC is built and we have no deployment of any PQ signature algorithms, as everyone will be scrambling to catch up. It's of course possible that someone—most likely some sort of nation state intelligence agency—already has a CRQC and isn't telling, but even then that's a lot less bad than having your communications be vulnerable to anyone who can get a QC shipped to them overnight as long as they have Amazon Prime.

This asymmetry in the threat model is convenient, because, as noted above, nobody is that excited about the PQ signature algorithms, whereas the PQ key establishment algorithms seem fairly reasonable—assuming of course that they're secure. As a result, people are focusing on key establishment and mostly keeping their fingers crossed that the signature situation will improve before it becomes an emergency.

Cryptographic Algorithms in Transport Security Protocols #

For the purpose of this post, I want to focus on transport security protocols like TLS. These aren't the only kind of cryptographic protocols in the world, but they illustrate a lot of the issues at play, in particular how we transition from one set of algorithms to another.

It's clearly impractical to just wholesale switch over from the old algorithms to the new algorithms at some point in time (what's often called a "flag day"). It took years (decades, really) to deploy everything we have in the ecosystem and any big change will also take time. Instead, TLS—and most similar protocols—are explicitly designed to have what's called algorithm agility, the ability to support more than one algorithm at once so that endpoints can talk to both old and new peers, thus facilitating a gradual transition from old to new.

The diagram below provides a stylized version of the TLS handshake. The client sends the first message (ClientHello), which contains a set of "key shares", one for each key establishment algorithm that it supports.^[4] For Elliptic Curve algorithms, this means one key share for each curve. When the server responds with its ServerHello message, it will pick one of those groups and send its own key share with a key from the same group. Each side can then combine its key share with the other side's key share to produce a secret key that both sides know.^[5] This shared key can then be used to derive keys to protect the application data traffic.

Something kind of like the TLS handshake
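For intuition about how two key shares combine into a shared secret, here is a toy finite-field Diffie-Hellman exchange. This is a stand-in, not what TLS does on the wire: real deployments use elliptic curve groups like X25519, and the modulus below is far too small to be secure.

```python
import secrets

P = 2**127 - 1  # a Mersenne prime, chosen only to keep the example short
G = 3


def make_key_share():
    """Return (private value, public key share)."""
    private = secrets.randbelow(P - 2) + 1
    return private, pow(G, private, P)


def shared_secret(my_private, their_share):
    return pow(their_share, my_private, P)


client_private, client_share = make_key_share()  # share goes in ClientHello
server_private, server_share = make_key_share()  # share goes in ServerHello

client_view = shared_secret(client_private, server_share)
server_view = shared_secret(server_private, client_share)
print(client_view == server_view)
```

Only the shares cross the network; each side keeps its private value and still ends up with the same secret.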

Of course, we also need to authenticate the server. This happens by having the server present a certificate and then signing the handshake transcript (the messages sent by each side) using the private key corresponding to the public key in its certificate. But as noted above, there are multiple signature algorithms, so the ClientHello tells the server which signature algorithms the client supports so that it can pick an appropriate certificate. Of course, if the server doesn't have a certificate that matches any of the client's algorithms, then the client and server will not be able to communicate.

Note that there are actually several signatures here because the certificate both has a key for the server and is signed by some key owned by the CA. These keys may have different algorithms, and both have to be in the list advertised by the client. Moreover, the CA may have its own certificate and that signature also has to use an appropriate algorithm and then there are CT SCTs (I refer you again to David Adrian's post, which quantifies these).

Post-quantum algorithms fit neatly into this structure. Each PQ algorithm is treated like a new elliptic curve (even though they really don't have anything in common cryptographically) and signature algorithms just act the same (although, as noted above, the result is a lot larger). Even better, all of the generation and selection of key shares is done internally to the TLS stack,^[6] so it's possible to roll out new key establishment algorithms just by updating your software without any action on the user's part (this is how EC was deployed in the first place). Of course, this is a lot easier if your software is remotely updatable or at least updates regularly; if we're talking about the software in a lightbulb, the situation might be a lot worse.

By contrast, in order to deploy a new signature algorithm you need a new certificate, and even though certificate deployment is partly automated now, it's not so automated that people expect new signature algorithms and the corresponding certificates to just pop up in their servers. Moreover, some servers are not set up to have multiple certificates in parallel. Given these deployment realities, the performance gap, and the threat model difference mentioned above, it shouldn't be surprising that there's a lot more activity around deploying PQ key establishment than around signatures.

The Current Deployment Situation #

In the past few years, we have seen a number of experimental deployments of PQ algorithms, primarily for key establishment.

Key Establishment #

Most of the key establishment deployment has been in what's called a "hybrid" mode, which is to say using two key establishment algorithms in parallel.

A classical EC algorithm like X25519
A PQ algorithm like ML-KEM

For instance, Chrome recently announced shipping an X25519/Kyber-768 (Kyber is the original name for what ~~is now ML-KEM~~ became ML-KEM after some modifications [Updated 2024-03-30]) hybrid and Firefox is working on it as well.

The way that these hybrid schemes work is that you send key shares for both algorithms, then compute shared keys for both, and finally combine the shared keys into the overall cryptographic key schedule that you use to derive the keys used to encrypt the traffic. There are a number of ways to do this, but the way it's done in TLS 1.3 is simple: you just invent a new algorithm identifier for the pair of classical and post-quantum algorithms and the key share is the pair of keys. Similarly, the combined algorithm emits a new secret that is formed by combining the secrets from the individual algorithms. This works well with the modular design of TLS, because it just looks like you've defined a new elliptic curve algorithm, and the rest of the TLS stack doesn't need to know any better.
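A minimal sketch of the combining step, assuming the two secrets are simply concatenated and then fed into an HKDF-Extract step (which is HMAC keyed with the salt). The random byte strings stand in for the real X25519 and ML-KEM outputs, and the all-zero salt is a placeholder; in the actual TLS 1.3 key schedule the salt comes from an earlier stage.

```python
import hashlib
import hmac
import secrets


def hkdf_extract(salt, input_keying_material):
    # HKDF-Extract from RFC 5869 is HMAC with the salt as the key.
    return hmac.new(salt, input_keying_material, hashlib.sha256).digest()


def hybrid_shared_secret(classical_secret, pq_secret):
    # Both inputs have fixed lengths, so plain concatenation is unambiguous.
    return classical_secret + pq_secret


classical_secret = secrets.token_bytes(32)  # stand-in for the X25519 output
pq_secret = secrets.token_bytes(32)         # stand-in for the ML-KEM output

combined = hybrid_shared_secret(classical_secret, pq_secret)
handshake_secret = hkdf_extract(b"\x00" * 32, combined)
print(len(combined), len(handshake_secret))
```

The rest of the stack only ever sees `combined`, which is why the hybrid looks like just another group to TLS.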

The advantage of a hybrid design like this is that—assuming it's done right—it is resistant to a failure of either algorithm; as long as one of the two algorithms is secure then the resulting key will be secret from the attacker and the resulting protocol will be secure. This allows you to buy some fairly cheap insurance:

If someone develops a CRQC the connection is still protected by the PQ algorithm.
If it turns out that the PQ algorithm is weak after all, then the traffic is still protected with the classical algorithm.

Of course, if the PQ algorithm is broken, then the traffic isn't protected in the event that someone develops a CRQC, but at least we're not in any worse shape than we were before, except for the additional cost of the PQ classical [Updated 2024-03-30. oops.] algorithm, which, as noted above, is comparatively low.

All of this makes a rollout fairly easy: clients and servers can independently add support for PQ hybrids to their implementations and configure their clients to prefer them to the classical algorithms. When two PQ-supporting implementations try to connect to each other, they'll negotiate the hybrid algorithm and otherwise you just get the classical algorithm. Initially, this means that there will be very little use of hybrid algorithms, but as the updated implementations are more widely deployed, you'll have more and more use of hybrid algorithms until eventually most traffic will be protected against a CRQC. This is the same process we historically used to roll out new TLS cipher suites as well as new versions of TLS, like TLS 1.3.
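The rollout logic amounts to ordinary preference-based negotiation. In the sketch below, the group name `x25519_pq_hybrid` is a made-up label rather than a real code point, and the server-preference rule is an assumption (as noted earlier, a server can also defer to the client).

```python
HYBRID = "x25519_pq_hybrid"  # hypothetical name for a PQ hybrid group
CLASSICAL = "x25519"


def negotiate(client_groups, server_preferences):
    """Pick the server's most-preferred group that the client also offers."""
    for group in server_preferences:
        if group in client_groups:
            return group
    return None  # no overlap: the handshake fails


upgraded = [HYBRID, CLASSICAL]  # prefers the hybrid, still offers classical
legacy = [CLASSICAL]

print(negotiate(upgraded, upgraded))  # both upgraded
print(negotiate(upgraded, legacy))    # old server
print(negotiate(legacy, upgraded))    # old client
```

Dropping `CLASSICAL` from either list too early turns the last two cases into connection failures, which is the reason for the long tail described next.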

Of course, it won't be safe for clients or servers to disable support for the classical algorithms until effectively all peers have support for the PQ hybrids; if you disable support for them too early, then you won't be able to talk to anyone who hasn't upgraded, which is obviously bad. For many applications, this is a well-contained problem: for instance you can disable classical algorithms in your mail client as soon as your mail server supports the PQ hybrids. However, the Web is a special case because a browser has to be able to talk to any server and a server needs to be able to talk to any browser, so Web clients and servers are typically very conservative about when they disable algorithms. The standard procedure is to offer both new and old concurrently and then measure the level of deployment of the new algorithm and only disable the old algorithm when there are almost no peers who won't support the new algorithm. Unless there is some strong sign that CRQC is imminent, I would expect there to be a very long tail of clients and servers—especially servers—that don't support PQ hybrids, in part because PQ hybrid support is not present in TLS 1.2 but only TLS 1.3, and there are still quite a few TLS 1.2 only servers. This will also make it hard for browsers to disable the classical algorithms, even if they want to.

If a viable CRQC is developed, then it will be necessary for everyone else to switch over to post-quantum key establishment algorithms on an expedited basis, but that's not enough. If you accept classical algorithms for authentication, the attacker will be able to impersonate the server. This means that after the CRQC exists, you will also need to have everyone switch to PQ signature algorithms.

Signature #

By contrast, there has been very little deployment of PQ algorithms for signature, largely for the reasons listed above, namely that:

It's a lot harder to deploy a new signature algorithm than a new key establishment algorithm.
It feels less urgent because a future CRQC mostly affects future connections rather than current ones.
The signature algorithms aren't that great. And by "not that great" I mean that replacing our current algorithms with ML-DSA would result in adding over 14K of signatures and public keys to the TLS handshake. As a comparison point, I just tried a TLS connection to google.com and the server sent 4297 bytes.
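To see where a number like "over 14K" can come from, here is the arithmetic under some assumptions of mine: ML-DSA-44 sizes (1312 byte public keys, 2420 byte signatures) and a handshake carrying five signatures and two public keys. The exact count depends on the chain length and the number of SCTs.

```python
MLDSA44_PUBLIC_KEY = 1312  # bytes
MLDSA44_SIGNATURE = 2420   # bytes

# Assumed contents of one handshake; a real chain may differ.
signatures = [
    "handshake transcript signature",
    "signature on the server certificate",
    "signature on the intermediate CA certificate",
    "SCT 1",
    "SCT 2",
]
public_keys = ["server key", "intermediate CA key"]

total = (len(signatures) * MLDSA44_SIGNATURE
         + len(public_keys) * MLDSA44_PUBLIC_KEY)
print(total)  # 14724

observed_today = 4297  # bytes sent by google.com in the test above
print(total > 3 * observed_today)
```

So the signature material alone would be more than three times everything the server sent in that classical handshake.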

Before we can have any deployment, we first need to update the standards for signature algorithms for WebPKI certificates. From a technical perspective, this is fairly straightforward (aside from the performance and size issues associated with the certificates) in that you just assign code points for the signature algorithms. However, unlike the situation with key establishment, this is just the start of the process.

On the Web, certificate authority practices are in part governed by a set of rules (the baseline requirements (BRs)) managed by the CA/Browser Forum, which has historically been quite conservative about adding new algorithms. For instance, although much of the TLS ecosystem has shifted to new modern elliptic curves in the form of X25519, the BRs still do not support those curves for digital signature. So, the first thing that would have to happen is that CABF adds support for some kind of PQ algorithm or a PQ hybrid (more on this below). This probably won't happen until there are commercial hardware security modules that can do the PQ signatures.

Once the new algorithms are standardized, then:

The CAs have to generate new keys that they will use to sign end-entity certificates.
Those keys (embedded in CA certificates) need to be provided to vendors so they can distribute them to their users.
Certificate transparency logs need to also get PQ certificates.
Servers need to generate their own PQ keys and acquire new certificates signed by the PQ keys at CAs and CT logs.

Note that this transition is much worse than adding a new signature algorithm would ordinarily be. For instance, servers who wanted to use EC keys to authenticate themselves didn't necessarily need to wait for CAs to have EC keys themselves, because the CA could sign a certificate for an EC key with an RSA key, as RSA was still secure, just slower. This meant you could have a gradual rollout, and things got gradually better as you replaced the algorithms. But the whole premise of the PQ transition is that we don't trust the classical algorithms, so eventually you need to have the whole cert chain use the new algorithms. It's of course possible to have a mixed chain, but that's more useful for experimenting with deployment than providing actual security against a CRQC. In fact, as you gradually roll out, things get slower, but you don't get the security benefit until much later, which is actually the wrong set of incentives.^[7]

Once all this happens, when an updated client meets an updated server, then the updated server can provide its new PQ-only or PQ-hybrid certificate. Just as with key establishment, the client and server both need to support the classical algorithms until effectively every endpoint they might come into contact with has PQ support. This isn't a big deal for the client, but for the server it means that it needs to have both a regular certificate and a PQ certificate for a very long time.

However, unlike with key establishment, during this transition period neither client nor server is getting any security benefit from using PQ algorithms. This follows from the fact that the security of the signature algorithm in TLS is only relevant at connection establishment time. There are two main possibilities:

Nobody with a CRQC is trying to attack your connections, in which case the classical algorithm was just fine
Somebody with a CRQC is trying to attack your connections, in which case they will just attack the classical key rather than the PQ key.

In order to get security benefit from PQ signatures in this context, relying parties need to stop trusting the classical algorithms, thus preventing attackers from attacking those keys. In the Web context, this means that Web browsers need to disable those algorithms; until that happens PQ certificates don't make anything more secure, but do make it more expensive, which is not a very good selling proposition.
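The argument above is a weakest-link argument, and it can be written down as a toy model. This is only a sketch: the algorithm labels and the `BROKEN_BY_CRQC` table are illustrative assumptions, not anything a real TLS stack exposes.

```python
# Toy model: a relying party is only as strong as the weakest
# signature algorithm it still accepts.

# Illustrative labels; True means "forgeable by an attacker with a CRQC".
BROKEN_BY_CRQC = {
    "ecdsa": True,
    "rsa": True,
    "ml-dsa": False,
}


def forgeable_with_crqc(trusted_algorithms):
    """Return True if a CRQC attacker can impersonate a server to this client."""
    # The attacker chooses which key to attack, so a single
    # breakable algorithm in the trusted set is enough.
    return any(BROKEN_BY_CRQC[alg] for alg in trusted_algorithms)


transition_client = ["ecdsa", "rsa", "ml-dsa"]  # accepts classical and PQ
pq_only_client = ["ml-dsa"]                     # classical disabled

print(forgeable_with_crqc(transition_client))
print(forgeable_with_crqc(pq_only_client))
```

Adding `"ml-dsa"` to the trusted set changes nothing for the transition client; only removing the classical entries does.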

For this reason, what I would expect to happen is wide deployment of client side support for PQ signatures but much less wide deployment of PQ certificates. The vast majority of clients are produced by a small number of vendors (the four major browser vendors) and this is a fairly easy change for them to make. By contrast, while servers are to some extent centralized on big sites like Google or Facebook or big CDNs, there are a lot of long tail servers who will not be motivated to go to the trouble. In particular, I would be very surprised if anywhere near enough servers adopted PQ-based signatures to make it practical to disable classical signatures absent very strong pressure from the client vendors.

As a reference point, the first good attacks on SHA-1 were published in 2004, and SHA-1 wasn't deprecated in certificates until 2017. Moreover, even after Chrome announced that they would deprecate SHA-1, it still took three years to actually happen. The difference between SHA-1 and SHA-2 had no meaningful impact on performance or on certificate size, so this was really just a matter of transition friction. This isn't an atypical example: the vast majority of certificates contain RSA keys and are signed with RSA keys even though ECDSA is faster (for the server) and has smaller keys and signatures.

There have been some recent changes to the WebPKI ecosystem to make transitions easier (e.g., shortening certificate lifetimes), but transitioning to PQ certificates has much worse performance consequences, so we should definitely expect the PQ transition to be a slow process.

Hybrids vs. pure PQ #

One of the big points of controversy is whether to mostly support hybrid systems that combine both classical and PQ algorithms or pure PQ algorithms. As noted above, the industry seems to be trending towards hybrids for key establishment, but the question of signatures is more uncertain.

Looming over all of this is the fact that the US National Security Agency and the UK GCHQ are strongly in favor of pure PQ algorithms rather than hybrids. In November 2023, GCHQ put out a white paper arguing for pure PQ schemes rather than hybrid:

In the future, if a CRQC exists, traditional PKC algorithms will provide no additional protection against an attacker with a CRQC. At this point, a PQ/T hybrid scheme will provide no more security than a single post-quantum algorithm but with significantly more complexity and overhead. If a PQ/T hybrid scheme is chosen, the NCSC recommends it is used as an interim measure, and it should be used within a flexible framework that enables a straightforward migration to PQC-only in the future.

Similarly, the NSA's Commercial National Security Algorithms 2.0 (CNSA 2.0) guidance contains some text that many read as saying they will eventually not permit hybrid schemes:

Even though hybrid solutions may be allowed or required due to protocol standards, product availability, or interoperability requirements, CNSA 2.0 algorithms will become mandatory to select at the given date, and selecting CNSA 1.0 algorithms alone will no longer be approved.

This isn't the clearest language in the world, but it seems like the best reading is they don't want to allow hybrids. On the other hand, at IETF 119 last week, NIST's Quynh Dang stated that NIST was fine with hybrids.

The specific timeline varies by product, but most relevant for this post, they say they want to have Web browsers and servers be CNSA 2.0 only by 2033:

CNSA 2.0 timeline

It's a bit unclear what this means in practice for the Web, even if you read it as "pure PQ only". Recall that the way that TLS works is that the client offers some algorithms and the server selects one; this means that it should be possible for servers constrained by CNSA 2.0 ("national security systems and related assets") to select pure PQ algorithms as long as enough browsers support them, which seems somewhat likely, even though AFAICT no browser currently supports them. However, it's much less viable for a browser to only support PQ modes unless you never want to connect to servers on the Internet which, as noted above, are not likely to all support pure PQ. Are even government systems going to be configured to disable hybrids in 2035?
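A minimal sketch of that negotiation shows why a constrained server is fine but a constrained client is not. The labels `"pure-pq"`, `"hybrid"`, and `"classical"` are placeholders for illustration, not real TLS code points, and `select_algorithm` is a hypothetical helper rather than any library's API.

```python
def select_algorithm(server_preference, client_offer):
    """Return the first server-preferred algorithm the client offered, or None."""
    offered = set(client_offer)
    for alg in server_preference:
        if alg in offered:
            return alg
    return None  # no overlap, so the handshake fails


# A server restricted to pure PQ works with any browser that offers it.
browser = ["hybrid", "pure-pq", "classical"]
restricted_server = ["pure-pq"]
print(select_algorithm(restricted_server, browser))

# A browser that offers only pure PQ cannot talk to a typical server.
pq_only_browser = ["pure-pq"]
typical_server = ["hybrid", "classical"]
print(select_algorithm(typical_server, pq_only_browser))
```

The first call selects `"pure-pq"`; the second finds no common algorithm and returns `None`.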

The CNSA 2.0 guidance is relevant for two reasons. First, there are likely to be a number of applications which are going to feel strong pressure to comply with CNSA 2.0. It's of course possible that if vendors just decide to use hybrids, that NSA ends up giving in and approving that, but people are understandably reluctant to find out. Second, GCHQ and NSA offer a number of arguments for preferring pure PQ algorithms over hybrids. This post is already getting quite long, so I don't want to go through them in too much detail, but they mostly come down to two points: a hybrid has more moving parts (hence more complexity, cost, etc.), and if there is a good CRQC, then the classical part of the system isn't adding much if anything in the way of security.

Another concern about hybrids is performance. Obviously, hybrids are more expensive than pure PQ, but the difference isn't likely to be a big factor. PQ keys and signatures are much bigger, so the incremental size impact of having the classical algorithm is trivial. ML-KEM is quite a bit faster than X25519, but X25519 is already so fast that my sense is that people aren't worried about this. Similarly, ML-DSA is about twice as fast as EC for verification but looks to be about 4x slower for signing,^[8] and the verification advantage is a bit misleading because in most uses of TLS it's the server that has to worry about performance and that's where the signature happens, so again the incremental cost of EC isn't that big a deal.

I'm not sure how persuaded I am by these arguments, but I think at best they are arguments at the margin. In particular, there's no real reason to believe that deploying hybrids is inherently unsafe, even if the classical algorithm is trivially broken. Assuming that we've designed things correctly, the resulting system should just have the security of the PQ part of the hybrid. I've seen suggestions that severe enough implementation defects against the classical part of the system (e.g., memory corruption) could compromise the PQ part. This isn't out of the question, of course, but modern software has a pretty big surface area of vulnerable code, so it's hard to see this as dispositive.

Inside Baseball: Code point edition #

For a long time, the IETF used to make it quite hard to get code point assignments, for instance requiring that you have an RFC. The idea was that we didn't want people using stuff that hadn't been reviewed and that the IETF didn't think was at least OKish. The inevitable result was that a lot of time was spent reviewing documents (for instance national cryptography standards) which the IETF didn't care about but were just needed to get code point assignments. Worse yet, some people would just use as-yet unassigned code points—this was easy because they're generally just integers—and if there was any real level of deployment, that code point became unusable whether it was officially registered or not.

The more modern approach is to make code point assignment super easy (effectively "write a document of some kind which describes what it's for") but to mark which code points are "Recommended" by the IETF and which are not. The "Recommended=Y(es)" ones need to go through the IETF process, but "Recommended=N(o)" code points are free for the asking. This has significantly reduced the amount of time that WGs spend reviewing documents for bespoke crypto and has generally worked pretty well. More recently the WG is adding a "Recommended=D(iscouraged)" for algorithms which the WG has looked at and thinks are bad.

Key Establishment #

As noted above, most of the energy in key establishment is in hybrid modes. They're easy to deploy now and seem safer than pure PQ algorithms, at least for now. In TLS in particular, what seems likely to happen is the following:

The TLS WG will standardize a set of hybrid algorithms based on ML-KEM and recommend that people use them.
The IETF will assign a code point (algorithm identifier) for pure ML-KEM, but it won't be a standard and the IETF won't recommend (or disrecommend) its use.

The likely result is that there will be a lot of use of hybrids but people will be able to use pure ML-KEM if they want it. At some point, sentiment may shift towards pure ML-KEM, in which case the TLS WG will be able to take that document off the shelf and standardize it. However, as noted above, that isn't urgent even if there is a working CRQC: people can just burn a little more CPU and bandwidth and do hybrids while the hybrid → pure PQ transition happens.

Signatures #

The question of whether to use hybrids versus pure PQ for signature is still being hotly contested. As I mentioned above, it seems clear that servers will need both classical and PQ signatures for some time. The relevant question is exactly how they will be put together.

It seems likely that servers will have one certificate with a classical algorithm (e.g., ECDSA) as they do today and then have another certificate with a post-quantum algorithm. This could be in one of two flavors:^[9]

For a pure PQ algorithm (ML-DSA)
For both a classical (e.g., ECDSA) and a PQ algorithm (ML-DSA). As with key establishment, these would be packaged into a single key and a single signature that was the combination of the two algorithms, with the semantics being that both signatures have to be valid.
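The "both signatures have to be valid" semantics of the hybrid flavor can be sketched as follows. The verifiers here are stand-ins that compare against a fixed value, not real ECDSA or ML-DSA implementations, and `verify_hybrid` is a hypothetical name rather than a function from any specification or library.

```python
from typing import Callable, Sequence, Tuple

# A verifier takes (message, signature) and reports whether it is valid.
Verifier = Callable[[bytes, bytes], bool]


def verify_hybrid(message: bytes, parts: Sequence[Tuple[Verifier, bytes]]) -> bool:
    """Accept only if there is at least one component and every one verifies."""
    return bool(parts) and all(verify(message, sig) for verify, sig in parts)


def make_stand_in(expected_sig: bytes) -> Verifier:
    """Stand-in verifier for illustration only; NOT real cryptography."""
    return lambda message, sig: sig == expected_sig


classical = make_stand_in(b"classical-sig")
post_quantum = make_stand_in(b"pq-sig")
msg = b"handshake transcript"

# Both components valid: accepted.
print(verify_hybrid(msg, [(classical, b"classical-sig"), (post_quantum, b"pq-sig")]))

# Classical component forged but PQ component wrong: rejected.
print(verify_hybrid(msg, [(classical, b"classical-sig"), (post_quantum, b"forged")]))
```

Because the combination is a logical AND, an attacker who can break only one of the two algorithms still cannot produce an acceptable hybrid signature.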

For a while my intuition was that it was easier to just do PQ: because the PQ algorithms were so inefficient, clients and servers would largely favor the classical algorithms unless it became clear that the classical algorithms were insecure, and so it wouldn't matter much what was in the PQ certificates. And if it became clear that the implementations had to distrust the classical algorithms—which is going to be a super rocky transition anyway given the likely level of deployment of PQ certificates—then the classical part of the hybrid isn't doing much for you.

Now, consider the opposite case where instead the PQ algorithm is what's broken. At this point, you want to distrust that algorithm and fall back to classical algorithms. By contrast to distrusting the classical algorithms, distrusting the PQ algorithms is comparatively easy because everyone is going to still have classical certificates for a long time, so relying parties (e.g., browsers) will probably be able to just turn off the PQ algorithm, in which case you don't really need a hybrid certificate for continuity.

This is all true as far as it goes, but it's also kind of browser vendor thinking, because browser vendors have really good support for remotely configuring their clients, so it really is practical to turn off an algorithm within days for most users. However, this isn't true for all pieces of software, many of which take much longer to update, and for those clients and servers the world will be much more secure if the only two credentials they trust are classical (still OK) and PQ hybrid (now just as secure as the classical credential). Moreover, it's also possible that there will be a secret break of the PQ algorithm, in which case even browsers won't update (the only thing we can do for a secret CRQC is to stop trusting the classical algorithms). For these reasons, I've come around to thinking that hybrids are the best choice for PQ credentials in the short term.

The Bigger Picture #

Getting through this transition is going to put a lot of stress on the agility mechanisms built into our cryptographic protocols. In many ways, TLS is better positioned than many of the protocols in common use, both because interactive protocols are inherently able to negotiate algorithms and because TLS 1.3 was designed to make this kind of transition practical. Even so, the transition is likely to be very difficult. While TLS itself is designed to be algorithm agile, it is often embedded in systems which themselves are not set up to move quickly.

Many proprietary uses of TLS—such as applications talking back to the vendor—should be able to switch pretty quickly and seamlessly. For instance, Facebook can just update their app in the app store and their server and they're done.
The Web is going to be a lot harder because it's such a diverse system and there isn't much in the way of central control on the server side. On the other hand, the browsers are generally centrally controlled by the vendors, which means that most of the browser user base can change quickly. There is of course a long tail of browsers in embedded devices (TVs, kindles, etc.) which may be much harder to update.
Beyond these two cases, there is going to be a long tail of TLS deployments which are in much worse shape and which can't be easily remotely updated (e.g., many IoT devices). Depending on how the clients or servers these devices need to talk to behave, they may either be stuck in a vulnerable state (if the peers don't enforce PQ algorithms) or just unable to communicate entirely.

Unfortunately, a rocky transition is actually the best case scenario. The most likely outcome is that absent some strong evidence of weakening of classical algorithms as a forcing function, we have a long period of fairly wide deployment of PQ or hybrid key establishment and very little deployment of PQ signatures, especially if the PQ signature algorithms don't get any better. Even worse would be if someone developed a CRQC in the next few years—long before there is any real chance we will be ready to just pull the plug on classical algorithms—and we have to scramble to somehow replace everything on an emergency basis. Fingers crossed.

Acknowledgement: Thanks to Ryan Hurst for helpful comments on this post.

Lots of stuff in your computer (the transistors, LEDs, etc.) are based on quantum effects, but fundamentally there's nothing that your computer does that couldn't be done by clockwork. This is something different. ↩︎
The ML stands for "module-lattice", which refers to the mathematical problem that the algorithms are based on. ↩︎
The situation here is a bit complicated. NIST is standardizing three schemes: ML-KEM, ML-DSA, and SLH-DSA. ML-KEM and ML-DSA are based on lattices, which have a fairly long history of use in cryptography. SLH-DSA is based on hash signatures which are also quite old, but has unsuitable characteristics for a protocol like TLS. Quite a few of the initial inputs to the NIST PQ competition have subsequently been broken (see this summary by Bernstein), including SIKE, which turns out to be totally insecure, which is disappointing because it had some favorable properties in terms of key size. There have also been some improvements in attacking lattices in the past few years, though they are not known to break either ML-DSA or ML-KEM. In addition to algorithmic vulnerabilities, some of the implementations of Kyber (the predecessor to ML-KEM) had a timing side channel, dubbed "KyberSlash". All in all, you can see why people might want to engage in some defense in depth. ↩︎
I'm simplifying here a bit, in that the client can actually advertise curves it doesn't send key shares for, but we can ignore that for the moment. ↩︎
Simplifying again. Each side actually generates a secret value and then computes their key share from that secret value. The shared secret is computed from the local secret and the remote key share. ↩︎
Although of course users can reconfigure it, at least in some systems. ↩︎
See my post on how to successfully deploy new protocols, coming soon. ↩︎
The numbers I have here are from Westerbaan and are for Ed25519, which isn't in wide use on the Web, but, at least in OpenSSL, EdDSA and ECDSA seem to have similar performance. ↩︎
There is actually another option in which you have a single certificate with the classical key in the normal place (subjectPublicKeyInfo) and the PQ key in an extension. This certificate will be usable with both old and new clients, with new clients signaling that they supported PQ and then the server signing with both algorithms. This has the advantage of only needing a single certificate but otherwise is kind of a pain because it requires a lot more changes to TLS. In the naive way I've described it, it also involves sending a lot more data for every client, but there are ways around that. ↩︎

Sean O'Brien 100K Race Report (2024)

2024-03-13T00:00:00Z

On Saturday 1/27 I ran the Sean O'Brien (SOB) 100K in Southern California. I ran this same race back in 2021 and got my 100K PR, so I knew the course and felt like it was an opportunity to do better. My training had been going well and I was dropping PRs on my local courses, so I was looking forward to a strong race and taking off a bunch of time, with an overall target of 12:00 to 12:25, so ~30-50 minutes off of 2021. This did not happen, though I did PR slightly.

It actually turned out to be a bit of a mixed result. On one hand, I finished about 7 minutes faster than last time (more on the "about" later), and much higher up in the standings (8th overall out of a starting field of 96) but all of the improvement was being more efficient at aid stations and I actually was a little over 2 minutes slower in the running part. My working theory is that it was warmer this year, and so times were slower, but this is a bit harder to verify than one might like.

To orient yourself, here is the course and the hill profile. The circles on the course are mile markers, so you start at the far right, go all the way to the left, around the loop counter-clockwise, then backtrack. There's an out-and-back down to Bulldog and then you backtrack to the finish. The circles on the profile are "climb score", Runalyze's estimate of how hard the climb was.

[Screenshots from Runalyze, 2021 data]

Overall Logistics #

I've been doing most of my training with Tailwind and Maurten drink mix, but KH races uses Gu Roctane, which I don't particularly like—especially because races have a tendency to offer the caffeinated version—so I decided to use drop bags extensively. To make this easier to manage I mapped out a regular eating schedule, that targeted 320-360 cal/hr, effectively:

1 500 ml bottle of Maurten 160 drink every hour, with 250ml each 30 min
Some mix of Maurten solid and Maurten gel aiming for ~100-200 cal/hr.

I use a 30 minute timer to manage all this, so I have to do something every 30 minutes. I started with Maurten solid and then moved onto a mix of regular gel and the caffeinated gels. This got a little complicated to manage due to Maurten's non-orthogonal lineup: Maurten's solid bar is 225 cal, so effectively I was eating 1/3 bar when the timer went off, which is fine. Ideally I would have just used Maurten 160 gels every hour for 320 cal/hr, but I wanted to take caffeine every 2 hrs after 6 hrs and Maurten's caffeinated gel is only 100 cal, so I decided to aspirationally add a Maurten 100 at the 30 minute mark, though I wasn't sure I could reliably do 360 cal/hr. This mostly worked out, especially once I got past the solid phase.
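The hourly totals above are easy to sanity check. This sketch assumes the number in each product name is its calorie count (so one bottle of Maurten 160 drink mix is 160 cal), which matches the 320 and 360 cal/hr figures; the constant names are just for illustration.

```python
# Calories per item (assumed from the product names, except the
# solid bar, which is stated as 225 cal).
DRINK_160 = 160      # one 500 ml bottle per hour
GEL_160 = 160
GEL_100 = 100
CAF_GEL_100 = 100
SOLID_BAR = 225

# Solid phase: 1/3 bar at each 30 minute timer, so two thirds per hour.
solid_phase = DRINK_160 + 2 * (SOLID_BAR // 3)

# Plain gel hour: one bottle plus one 160 cal gel.
gel_hour = DRINK_160 + GEL_160

# Caffeine hour: one bottle, the caffeinated gel, plus a 100 cal gel
# at the 30 minute mark.
caffeine_hour = DRINK_160 + CAF_GEL_100 + GEL_100

print(solid_phase, gel_hour, caffeine_hour)
```

Under these assumptions the solid phase comes in slightly under the 320-360 cal/hr target, and the caffeine hours sit at the top of it.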

Everything laid out

To make all this easier I bagged up what I needed for each aid station in a ziploc (with two bags for Kanan, because you hit it twice). The way this works is you get to the AS, you (theoretically) dump out everything from your pack, and then shove in whatever is in the ziplocs in. I labeled the ziploc both with where it was needed and my 2021 time for the AS, so as soon as I picked up the bag I could see if I was ahead or behind schedule. This part worked well and was a lot easier than a pace sheet.

Start to Corral Canyon [7.3 mi, +2270/-846 ft, 1:20:43, -2:25] #

Race start was at 5:30 AM and sunrise at a bit after 7 so I expected to run the first 60-90 minutes in the dark. I got to the start with plenty of time and was able to drop off my drop bags and then just chilled in the car for a while, before heading over to the start line about 10 minutes early, planning to use the bathroom.

This is where things started to go wrong because there was a much longer than expected bathroom line: apparently the portapotties just never got delivered so we just had the park's bathrooms, which really weren't enough for 100 runners. For some reason, the RD decided not to delay the start even though a number of people—including me—were still waiting. I decided that it was better to use the bathroom than to be right at the start, and ended up missing the start by about a minute (not the first time this has happened to me, TBH). I think this was the right decision overall: a minute isn't much for a race this long, and I wasn't expecting to win, but the result is that I started essentially at the back of the race. There are some early sections of single track and so I spent quite a bit of time trying to get past people who were going a lot more slowly than me. It's important to conserve energy early, so I tried not to get too aggro, but it still slows you down.

New for this year there was a real water crossing 2 miles in (I had heard rumors about this but no details because I missed the briefing at the start), where you actually had to wade through almost knee deep water with a rope for stabilization. I'm never a huge fan of this, but it was already fairly warm (never a good sign) and my shoes dry quickly, so it wasn't uncomfortable. Eventually I made it past most of the people slower than me and then things opened up into fire roads so it wasn't a problem to get past people any more. I felt like I was running pretty comfortably, and, as with last time, opted to run as much as I could.

I finally hit Corral Canyon at 1:21, about 3 minutes ahead of 2021 (all times here are from my watch, not gun time), which seemed pretty good considering the start. I was trying to be conscious of aid station time, and was in and out in 1:05. This is about the best you can do if you're drinking regularly and using your own nutrition because you have to pour the powder into the bottles and then add water, but I see now it was 40s slower than last year, so I think that's just the price you pay for bringing your own nutrition.

Kanan Road [6.3 mi, +1010/-1444 ft, 1:06:55, -2:13] #

This next section is mostly rolling single track and fire road. I was feeling reasonably good on this section, but it was a bit hard to get into my rhythm, as there were a lot of rocky sections and stream crossings, and I actually tripped a couple of times, which wasn't great. Fortunately, the dirt was soft, so I didn't get hurt, but it's kind of discouraging. Other than that, this section went reasonably fast.

The first drop bag is at Kanan Road, so I was able to grab my nutrition refill and check my time (about 4 minutes ahead). I lost some time here because I'd taped up the bag too much and had trouble untying it and then had to refill my nutrition but still got out reasonably quickly (3:57). Only after I left did I realize I still had my headlamp in my pack, but no way was I going back to drop it off. It's not that heavy, right?

Zuma Edison Ridge 1 [5.4 mi, +1260/-997 ft, 1:00:12, -0:58] #

This next section is a rolling descent on single track followed by a moderate climb on fire road to the top of the ridge line. The fire road was pretty smooth and as with last time, I felt good and pretty much ran this whole section. There is a nice moderate descent that was longer than I remembered and a bit rocky but I felt really comfortable on. By this point in 2021 my knee had already started to hurt, but everything was still good, so that felt pretty promising. The next aid station (Bonsall) is all downhill so I chugged some water, refilled my bottle, and just headed out. I forgot to hit my lap timer on this one, but I know that the aid was pretty fast.

Bonsall [3.4 mi, +0/-1706 ft, 26:34, -2:49] #

As noted above, this next section is a 3.4 mile descent down to the Bonsall aid station. Pretty much this whole thing is on fire road so I was able to take it pretty fast (~7:46/mi, 50s/mile faster than 2021). With that said, I was apparently overcompensating for it feeling short before, because I expected it to go really fast, and, well, it kind of didn't; I kept thinking "OK, we must be at the bottom", but I wasn't. On the plus side, I was passing people, which doesn't usually happen for me on the descent, so I was feeling like all that training for downhill was paying off.

I hit the aid station (second drop bag), swapped out my food, and filled my bottles. It was only at this point that it started to sink in that I had nearly 2 hrs of exposed mostly climbing, it was starting to get hot, and I only had two bottles. I compensated by chugging some water and salt caps and crossing my fingers. A while after I left the AS I realized I was still carrying my headlamp, but once again, I wasn't going back.

Zuma Edison Ridge 2 [7.76 mi, +2910/-1184 ft, 1:56:17, +4:43] #

There's a long climb out of Bonsall back to Zuma Edison Ridge. This is actually two climbs, ~1600 ft, followed by a descent of around ~1000 ft and then another climb of ~1300 ft. As I rolled out of the aid station, someone came by me with 3 bottles and one bouncing in his pack and I started to think I had made a serious mistake in terms of fluid but it was too late to fix it.

This section is mostly hiking and there were 3-4 people ahead of me, including a guy named Colton who I'd run part of the way with earlier and I'd been sort of going back and forth with (he eventually finished one place behind me). I was able to mostly keep them in sight, but not make much progress. This section is super exposed and I was really starting to feel the heat and actually worried that I wouldn't have enough water. I didn't really think it would take me more than two hours (two bottles by my drinking schedule) but in the heat I really needed to be drinking more water than dictated by my calorie needs. Worse yet, my knee started to hurt (same place as last time!) whenever I ran, but as I wasn't doing much running, I just tried to ignore it.

I would say this section was harder than 2021: I felt like it was hotter and I felt like I was struggling more. Partway through the second climb, the eventual first woman passed me and she just looked a lot lighter on her feet, running parts that I only barely had enough energy to hike. So, I was pretty glad to finally get to the Zuma aid station, but this leg was about 5 minutes slower than 2021. I burned through the aid station this time and just kept going.

Kanan Road [5.4 mi, +1037/-1283 ft, 1:03:45, -2:40] #

At this point we're just backtracking down the backbone trail to a previous aid station. This means a ~600ft climb followed by a steep descent and some rolling terrain. I started to feel somewhat better here and was trying to focus on moving well on the downhill. Along the way, I passed Colton again, for the last time, and just kept moving. By now I figured I was probably around top 15. I made it to Kanan OK, grabbed my next nutrition refill, and finally remembered to drop my headlamp into my drop bag.

Whatever was wrong with my knee seemed to have fixed itself, so I was less worried about not being able to finish, and I had a pacer meeting me at Bulldog (mile 50), so my approach was just to treat this like a 50 miler and figure the last 12 would take care of themselves. This really meant one more modestly hard segment back to Corral Canyon and then the long downhill to Bulldog which was pretty runnable, so I was really just counting down to Corral Canyon at this point.

Corral Canyon [6.4 mi, +1453/-974 ft, 1:30:55, +1:41] #

We're still retracing our steps back to the first aid station, so this is mostly on single track and generally uphill. There was definitely a fair amount of hiking here, but I was really trying to keep solid running where I could. By this point in the race I was starting to pass people doing the 50K (almost nobody seemed to be doing the 50 mile), which is kind of nice, but I imagine pretty unpleasant for them, given that I was running a lot faster after a lot further in. This part didn't feel that bad, but nevertheless I was glad to hit the aid station, and was looking forward to the long downhill to Bulldog.

Bulldog [5.9 mi, +486/-1946 ft, 1:03:19, -0:07] #

This section is a long out and back, with the aid station being at the bottom. Fortunately, this time I had a better picture of the course and I was prepared for the mile long climb to the downhill, so it wasn't as demoralizing this time. I was almost to the top of the climb when someone came tearing the other way. I asked him if he knew what place he was in and he said first, which was reassuring in terms of where I was at in the standings but also meant I could just count off people going the other way to see where I was.

I tried to push this downhill a bit within the limits of not falling, and felt more in control than last year, though actually the overall pace for this leg was nearly identical to 2021. I was most of the way down before I saw #2, who turned out to be Ian Sharman, who has 9 Western States Top 10 finishes, so I felt like things were going pretty well, even if he was probably having a bad day (I eventually finished around 83 minutes behind him).

Eventually I hit the bottom of the hill and it was onto the flat/rolling section, which I'd remembered as ~1 mile but is actually more like 2 miles. About a mile from the turnaround there is a concrete bridge/overpass over a small river, which you have to get over somehow. It's maybe 3 ft above the trail and someone had put a small stepladder so you could get onto it, but even so it was a bit of a struggle, which wasn't a really good sign in terms of my legs being fresh. By the time I had made it to the aid station, I counted off 6 men and 1 woman before me, which seemed pretty good. I grabbed my last nutrition bag, my headlamp, and headed back out.

Corral Canyon [5.8 mi, +1906/-495 ft, 1:32:06, +6:42] #

My pacer Kate and I ran the flat mile or two modestly hard—the bridge was even worse on the way back because I sort of had to scoot down the two whole feet onto the ladder—and then just settled in for the long hike up to the top. I was trying to push this pretty hard but definitely wasn't feeling amazing. Still, it was pretty nice to see everyone behind me going the other direction.

I'd hoped to make up time on this segment, but actually I was almost 7 minutes down for this leg (still about 8 minutes ahead overall) by the time I hit the aid station. I actually thought I was more like 14 minutes ahead because I misremembered my target time (note to self: also do a pace sheet). It didn't really matter, though, because my plan was just to push the pace as much as I could on the way down.

Finish [7.3 mi, +833/-2277 ft, 1:27:27, +0:30] #

The way to the finish is some rolling single track followed by a really long descent, first on fire roads (remember, we're backtracking again, though I'd done this section entirely in the dark on the way out) and then on single track. At this point, I was hiking most of the climbs but trying to run the downhill as much as I could.

Unfortunately, due to the shorter day and the later start, I had to run a lot of this in the dark, unlike 2021, when I finished in the light. I did have a headlamp (Petzl Actik Core), but I was really wishing I had something brighter, especially when we got to the single track. If I'd just carried my Lupine another 10 miles or so, I could have had it with me for this, which might have made a difference, as I wasn't able to go as fast as my legs would have supported because I couldn't see very well.

After a long downhill there is a mile or so of uphill, which I knew about this time (pretty much right after the water crossing) and was actually looking forward to, both as a break from having to pick my way through things and an opportunity to push the pace some. I did that and was rewarded by getting to listen to Kate breathing a bit harder behind me. This felt a little longer than I expected, but I'd been doing plenty of climbing in training so I was comfortable with it.

After the peak of the hill, it's back to the single track descent followed by about a half mile of nice flat fire road, which gave me an opportunity to open up a little bit towards the finish. We were still passing people but they were not in the 100K so it doesn't really count.

Analysis #

As I mentioned at the top, it's hard to compare year to year, so this section is mostly me thrashing around trying to get a better sense of it. The chart below shows my performance against 2021 (watch time, not gun time):

Leg	Distance	Vert	Time	vs 2021	vs 2021 (cum)
Corral	7.29 mi	+2,270/-846 ft	1:20:43	-2:25	-2:25
Aid	-	-	1:05	+41	-1:44
Kanan	6.34 mi	+1,010/-1,444 ft	1:06:55	-2:13	-3:57
Aid	-	-	2:37	-1:20	-5:17
Zuma	5.42 mi	+1,260/-997 ft	1:00:12	-58	-6:15
Bonsall	3.43 mi	+0/-1,706 ft	26:34	-2:49	-9:04
Aid	-	-	2:56	-1:30	-10:34
Zuma	7.76 mi	+2,910/-1,184 ft	1:56:17	+4:43	-5:51
Aid	-	-	2:05	-3:22	-9:13
Kanan	5.40 mi	+1,037/-1,283 ft	1:03:45	-2:40	-11:53
Aid	-	-	3:26	+23	-11:30
Corral	6.37 mi	+1,453/-974 ft	1:30:55	+1:41	-9:49
Aid	-	-	?	-2:13	-12:02
Bulldog	5.91 mi	+486/-1,946 ft	1:03:19	-7	-12:09
Aid	-	-	3:22	-29	-12:38
Corral	5.84 mi	+1,906/-495 ft	1:32:06	+6:42	-5:56
Aid	-	-	1:51	-2:11	-8:07
Finish	7.32 mi	+833/-2,277 ft	1:27:27	+30	-7:37
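The cumulative column is just a running sum of the per-leg deltas. Here is a minimal Python sketch of that arithmetic over the first seven rows of the table. The helper names `parse_delta` and `format_delta` are made up for this illustration, and I'm assuming a bare number such as `+41` means seconds.

```python
from itertools import accumulate

def parse_delta(text: str) -> int:
    """Convert a signed delta such as '-2:25' or '+41' into seconds."""
    sign = -1 if text.startswith("-") else 1
    seconds = 0
    for part in text.lstrip("+-").split(":"):
        seconds = seconds * 60 + int(part)
    return sign * seconds

def format_delta(seconds: int) -> str:
    """Format signed seconds as '-1:44', or '+41' when under a minute."""
    sign = "-" if seconds < 0 else "+"
    minutes, secs = divmod(abs(seconds), 60)
    return f"{sign}{minutes}:{secs:02d}" if minutes else f"{sign}{secs}"

# Per-leg deltas vs 2021, copied from the first seven rows of the table.
deltas = ["-2:25", "+41", "-2:13", "-1:20", "-58", "-2:49", "-1:30"]
cumulative = [format_delta(s) for s in accumulate(parse_delta(d) for d in deltas)]
print(cumulative)
```

Running the same sum over the remaining rows is a quick way to catch a transcription error in either column.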

As seems pretty clear here, I was just faster through Bonsall both on the running legs and in the aid stations, and then I lost a lot of time on the climb out of Bonsall and then again on the climb out of Bulldog, but was still about the same as 2021 on the rest of the legs and was better on the aid throughout.

The graph below compares my paces on each grade from 2021 to this year with one graph for each hour.

Speed versus grade, faceted by hour.

For the first 5 hours, I was just plain faster both on the climbs and the descents. In hours 5 and 6 (the climb out of Bonsall) I started to slow down, especially on the climbs. I recovered again on 7 and 8 when it was just straight running, and then struggled again on the climb out of Bulldog but was pretty solid towards the finish.

It's a bit hard to know exactly what to make of this, but my working theory was that it was hotter this year and so when I had to exert a lot of effort on the climbs, I slowed down but when I was able to just run comfortably, I was still faster because heat wasn't as much of a factor. It's of course possible I have gotten worse at climbing or I wasn't pushing as hard, but I don't think that's true. I was definitely pushing pretty hard on the climb out of Bonsall and I felt like I was pushing on the climb out of Bulldog and that was Kate's impression as well. I've generally been hiking pretty well this season, and as noted above, I was doing well on the climbs early in the race, so I don't think I've just suddenly gotten a lot worse in this area.

Beyond my own performance, there are some other reasons that suggest that this year was harder and that it was at least in part due to heat:

Runalyze's estimate of the weather is 72° this year versus 63° for 2021 (though more humid in 2021) and Garmin's somewhat confusing sensor (which seems to integrate skin and air) also shows things 5-10° hotter in 2024.
The drop rate in 2021 was 3/33 (9%), whereas this year it was 23/96 (24%).
In 2021 there were 5 people under 12:00 and this year there were 4 even with a much larger field.
While Kate was waiting at the aid station, she kept hearing how people were underperforming because it was hot.
While the winner's time was the same, the median times were a lot worse (~28 minutes overall, 67 minutes including DNFs), as shown below:

Year	Notes	Mean Time	Median	Median excl DNFs	DNF rate
2024	Same as 2021, 2020 course	14:45:48	15:29:40	15:08:40	23/96 (24%)
2023	Short course (~2-3 miles)	13:44:00	14:04:51	13:42:59	4/94 (4%)
2022	Short course (~3-4 miles): reroute due to rockslide	13:37:04	14:01:46	13:45:56	7/69 (10%)
2021	In October instead of January	14:04:24	15:01:14	14:21:13	3/33 (9%)
2020		13:41:24	14:11:57	13:43:33	24/154 (16%)
2018		13:43:56	14:09:24	14:09:24	131 finishers, no DNFs listed
2017	Short course due to weather (~2 miles)	12:44:49	12:46:20	12:46:20	137 finishers, no DNFs listed

Figure thanks to Kate Hudson

It may also be the case that I and others aren't as heat adapted because the race was in the winter rather than the fall.

I do think I faded a bit in the last 13 miles or so. I don't have splits, but I estimated that the female winner was maybe 1-1.5 miles ahead of me at Bulldog and she finished 45 minutes ahead, so she must have put at least 20 minutes on me from there. That's consistent with how fresh she looked when I saw her earlier: I definitely think those miles would have been a lot faster if I had been fresh and running more than hiking (they would also have been faster in the light!).

Retrospective #

Times aside, I felt like I followed the game plan pretty well. I ran when I could and hiked when I felt like I had to. I think there were maybe a few places towards the end that I could have run if I had to, specifically the up part of the rollers at the beginning of Bulldog and after Corral Canyon, but I felt like I was hiking pretty fast, so I'm not sure I would have run it much faster; I think I was in part just limited by what I had left in the tank. I'm quite pleased that I was legit faster on the downhills most of the race. This is something I was working on and so it's nice to see that pay off. I'm not sure why I kept tripping, but I guess I still have more agility work to do.

Missing the start really sucked because of having to work my way through everyone. I think this was the right decision, as I definitely had to go and made it through the race without issue but I wish I'd made it to the toilets earlier, so I could have started with everyone else. I might have pushed a bit too hard at the start, but I think I did a reasonable job of holding back.

My nutrition strategy worked well. It was pretty easy to stick to an every 30 minutes schedule and I didn't have any major GI issues: I felt fine until after Corral Canyon and then just a little nauseated afterwards, and even then I was still able to eat, just not as many calories per hour as I wanted (mostly I ditched the extra Maurten 100 in the hour when I had caffeine). Having a caffeinated gel on the half hour was easy to manage. The two things I might change here are:

I want to try to just do 360 cal/hr, so I could do a Maurten every 30 minutes
I should have brought an extra bottle for the Bonsall climb and just had electrolyte or swapped out another Maurten 160 bottle for a gel, because I think I did get dehydrated there.

As noted above, I wish I'd had a better light for the finish. I think I got optimistic because I finished in the light in 2021 and didn't properly account for the later start and earlier sunset.

Overall #

12:46:25 (gun time), 12:45:37 (hand time). 8th/73 overall, 7th/59 (male), 1st 50-59

A hard look at Certificate Transparency: CT in Reality

2023-12-25T00:00:00Z

This is part II in my series about Certificate Transparency (CT) and transparency systems. In part I, we looked at how to build a simple transparency system that guaranteed that each certificate was published and that each participant in the system has the same view of the list of certificates. This prevents covert misissuance of certificates and makes it possible—at least in principle—to detect when misissuance has occurred. In this post, I want to look at CT as it is actually deployed on the Internet.

Writing on the face of the moon, but nobody's looking. Image by Kate Hudson with components from Midjourney and Adobe AI.

[Update: 2023-12-25. After I posted this, I had a long discussion with Chrome's Emily Stark and Ryan Hurst (formerly Google Core Security and Google Cloud) on X/Twitter. I've made some revisions below in light of that discussion. Big thanks to Emily and Ryan for the critique and detailed discussion.]

Deployment Compromises #

In the previous post, we designed a greenfield system without worrying too much about deployment. Unfortunately for CT, the WebPKI was already well established, with all its faults, by the time CT was developed. You run into a number of challenges when you go to retrofit it to the existing WebPKI, starting with the fact that it was a lot of work for CAs and didn't bring them any value. Importantly, deploying CT doesn't make a CA's customers any more secure because the attacker can just try to get a certificate for those customers from another CA. What it mostly does is make it harder for your CA to misbehave, but that's not really a selling point, and after all, mistakes are something that happen to other people!

Google's plan for overcoming these deployment hurdles came in two parts:

(Eventually) Require CAs to use CT in order to be trusted by Chrome, thus forcing universal deployment of CT.
Make a bunch of technical compromises designed to make CT easier for CAs to deploy.

Obviously, part (1) of this plan kind of involved playing chicken with the CAs. Chrome is by far the most popular browser, but it wouldn't be for long if it didn't work with a lot of Web sites. In order to make requiring CT a credible threat, Google needed to get enough CAs onboard that the number of sites with certificates not published in CT was very small, thus making it possible to break them without making Chrome useless, hence the need for the technical compromises to make it more palatable. The remainder of this section talks about some of those compromises.

Transparency Logs #

Previously I talked about the CA publishing the Merkle tree of certificates, but there's no technical reason the CAs have to do it themselves; the certificates just have to be published somewhere. CT separates the job of running the CA from the job of publishing the certificates by creating the role of a transparency log, which is responsible for building the tree. The CAs don't have to operate a log (though some do); they just register their certificates with the log.

This design has several advantages. First, it makes life easier for the CAs, who don't have to run logs. This may not seem like a big deal, but it turns out that running a log is a lot of work for reasons we'll get into below, and indeed very few CAs actually run their own logs today. Instead, some entity with a lot of operational resources and experience (i.e., Google) could run a log that supports multiple CAs, hopefully making it easier for the CAs to deploy.

Second, having a relatively small number of logs improves the scaling properties of the system somewhat: much of the overhead for the clients comes in the form of getting an authentic copy of the signed root (what CT calls a signed tree head (STH)), and if each CA has its own tree, that means one root for each CA. If there's just a small number of logs then you need a correspondingly smaller number of roots. Similarly, in order to ensure that no certificates have been misissued, sites need to have a copy of the database for every CA; it's easier if those databases are all aggregated into a small number of logs than to have to retrieve them independently.

Finally, the log design makes it possible to publish certificates even for CAs which don't participate because the log can just unilaterally ingest those certificates. Consider what happens if most CAs publish their certificates in CT but some don't, but Chrome wants to require CT. They could use the Google crawler to collect certificates for non-cooperating CAs and put them in the log, thus potentially making it easier to require CT. This doesn't help as much as you'd think because you still have the problem of how the client gets the inclusion proof for the certificate, but there are some (not great) options here.

Signed Certificate Timestamps #

The big problem with the design as I described it in part I is that it inserts a delay in the certificate issuance process: if you are going to provide the inclusion proof at the time of certificate issuance, then you need to collect all the certificates that go into the Merkle tree before you can issue the certificates to the site. If you publish one signed tree a day, this means that on average it will take 12 hours between the certificate request and issuance, which also means that it takes on average 12 hours and up to a day in the worst case to bring a site online. This might have been acceptable if we were starting from scratch, but certificate issuance times are measured in ~~minutes~~seconds [Updated 2023-12-25. Per Ryan Hurst], and so this would have represented an unacceptable regression, especially for sites which didn't have a valid certificate and so would have to wait up to 24 hours to deploy (not such a big deal the first time, but an absolute emergency if you had a live site and you let your certificate expire).
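To make the 12-hour figure concrete, here is a small sketch that assumes requests arrive evenly through the day and that the tree is signed once every 24 hours. The function name and the one-minute sampling grid are arbitrary choices for the illustration.

```python
def issuance_delays(publish_interval_hours: float, samples_per_hour: int = 60):
    """Waiting time until the next tree is signed, for evenly spaced requests."""
    total = int(publish_interval_hours * samples_per_hour)
    # A request arriving t hours after the last tree waits (interval - t) hours.
    return [publish_interval_hours - i / samples_per_hour for i in range(total)]

delays = issuance_delays(24)
average_delay = sum(delays) / len(delays)
worst_delay = max(delays)
print(f"average {average_delay:.1f} h, worst case {worst_delay:.1f} h")
```

Halving the publication interval halves both numbers, but even an hourly tree would leave issuance far slower than the seconds people are used to.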

In order to address this issue, Google introduced a new concept, the signed certificate timestamp (SCT). An SCT is a signed promise that the log will add the certificate to their tree soon, even though they haven't yet. The figure below shows the issuance flow with SCTs.

Certificate issuance with SCTs

The way this works is that the CA produces what's called a "pre-certificate", which is a data structure that has all the information that would be in a real certificate. It then sends that to the log, which returns an SCT that covers the pre-certificate. The CA then takes the SCT and adds it to the certificate before issuing it to the site. This has the big advantage that the site doesn't need to know about CT; because the SCT is part of the certificate, it can use the certificate as before without changing anything, which is obviously a big deal for incremental deployment. In fact, the CA can deploy CT entirely on its own one day and sites will just automatically have CT-enabled certificates.

Because SCTs can be generated immediately by the log, CAs can deploy CT without significantly slowing down their issuance process; they just retrieve the SCT and it's the log's responsibility to eventually publish the pre-certificate in its own Merkle tree ("eventually" is doing a lot of work here, as we'll see below). The resulting certificate is immediately usable because the client checks for the SCT rather than checking the Merkle tree.
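The flow is easier to see as code. This is a toy model, not the CT wire format: an HMAC with a shared key stands in for the log's public-key signature (a real client verifies with the log's public key and never holds a secret), and the names `log_issue_sct`, `ca_issue_certificate`, and `client_accepts` are invented for the sketch.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass

LOG_KEY = b"hypothetical-log-signing-key"  # stand-in for the log's private key

@dataclass(frozen=True)
class SCT:
    timestamp: int
    signature: bytes

def _signed_message(precert: bytes, timestamp: int) -> bytes:
    return timestamp.to_bytes(8, "big") + hashlib.sha256(precert).digest()

def log_issue_sct(precert: bytes, timestamp: int) -> SCT:
    """Log side: promise to include `precert`; the tree is not touched yet."""
    message = _signed_message(precert, timestamp)
    return SCT(timestamp, hmac.new(LOG_KEY, message, hashlib.sha256).digest())

def ca_issue_certificate(precert: bytes, sct: SCT) -> dict:
    """CA side: embed the SCT in the final certificate handed to the site."""
    return {"body": precert, "sct": sct}

def client_accepts(cert: dict) -> bool:
    """Client side: check the log's promise; no Merkle tree lookup happens."""
    sct = cert["sct"]
    message = _signed_message(cert["body"], sct.timestamp)
    expected = hmac.new(LOG_KEY, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, sct.signature)

precert = b"subject=example.com;public-key=placeholder"
cert = ca_issue_certificate(precert, log_issue_sct(precert, int(time.time())))
print(client_accepts(cert))
```

Note that nothing in `client_accepts` consults the published tree, which is exactly the property the next section worries about.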

Trust is a bad word #

The good news is that CT with SCTs is minimally disruptive while also allowing the browser to enforce the use of CT. The bad news is that it has totally different and much weaker security properties from the system we started with. The problem is that the SCT is just a promise that the log will incorporate the certificate into their Merkle tree, rather than a proof that it actually did, so you're reduced to trusting the log not to lie.

Recall the security logic of a transparency system, as described in Part I:

The CA publishes every certificate (i.e., identity/public key pair) that it issues.
The owner of a given identity—and potentially other people—ensures that it recognizes every certificate that was published.
Relying parties check that a certificate is in the log before accepting it.

The use of SCTs breaks part (3) of this system, because the client is just checking that the log promised to incorporate the certificate, rather than that it actually did. Consider what happens if you have a malicious CA that colludes with a malicious log. The CA would misissue a certificate for example.com, along with an SCT from the malicious log, but the log would omit the certificate from its published tree. The client will accept the certificate because it has the SCT, but because the log never publishes the certificate, example.com has no opportunity to detect the misissuance.
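The collusion scenario can be sketched in a few lines. This is a deliberately simplified model: the string comparison in `client_accepts` stands in for verifying the log's signature on the SCT, and the `Log` class and its fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Log:
    honest: bool = True
    published: set = field(default_factory=set)  # what appears in the public tree

    def issue_sct(self, cert: str) -> str:
        if self.honest:
            self.published.add(cert)  # an honest log follows through on its promise
        return f"sct-for:{cert}"      # a malicious log hands out the promise anyway

def client_accepts(cert: str, sct: str) -> bool:
    """The client only checks that a promise exists for this certificate."""
    return sct == f"sct-for:{cert}"

def owner_detects(cert: str, log: Log) -> bool:
    """The domain owner can only audit what the log actually published."""
    return cert in log.published

rogue_cert = "example.com, issued by a colluding CA"
for log in (Log(honest=True), Log(honest=False)):
    sct = log.issue_sct(rogue_cert)
    print(log.honest, client_accepts(rogue_cert, sct), owner_detects(rogue_cert, log))
```

With the honest log the misissuance is accepted but visible; with the malicious log it is accepted and invisible.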

What's happened here is that we've taken a system which was publicly verifiable and turned it into a system in which we have to trust the logs not to cheat by issuing SCTs for certificates they don't actually publish, potentially with some double checking, as described below [Updated 2023-12-25]. This is still better than where we started because a successful attack requires that both the log and the CA be malicious, but it's a much weaker set of properties than not having to trust the log at all.

This design also means that not just anyone can run a log; instead, logs have to be vetted to be trustworthy and to conform to browser policy. This trust decision has to be encoded into the browser, which decides whether to accept a given SCT. At present, Chrome accepts logs from only six operators:

Google itself
Cloudflare
DigiCert
Sectigo
Let's Encrypt
TrustAsia

When Google originally launched the CT requirement in Chrome, they actually required that at least one of the logs be Google's log, which meant that the policy effectively came down to "we (Chrome) trust Google's log not to lie". This had some obvious problems from an openness perspective, as it meant that realistically CAs had to use Google's log. They have since changed the policy and now you can use any two accepted logs (for certificates valid for 180 days or less) or three logs (for certificates valid for more than 180 days). This means that in order to covertly misissue you need a malicious CA and two malicious logs to collude.
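As a sketch, the log-count rule described above looks like this. It models only the two thresholds mentioned in the text; the real policy has further conditions that aren't captured here, and the log identifiers are made up.

```python
def required_log_count(validity_days: int) -> int:
    """Number of distinct accepted logs needed, per the rule described above."""
    return 2 if validity_days <= 180 else 3

def satisfies_policy(validity_days: int, sct_log_ids: list, accepted_logs: set) -> bool:
    """True if the SCTs come from enough distinct logs on the accepted list."""
    distinct = {log_id for log_id in sct_log_ids if log_id in accepted_logs}
    return len(distinct) >= required_log_count(validity_days)

accepted = {"log-a", "log-b", "log-c"}  # hypothetical log identifiers
print(satisfies_policy(90, ["log-a", "log-b"], accepted))
print(satisfies_policy(365, ["log-a", "log-b"], accepted))
```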

Update: 2023-12-25: Ryan Hurst argues that the requirement for policy compliance is more about ecosystem health than about the need to trust the logs (assuming I understand him correctly) and that Chrome's auditing allowed them to verify inclusion, and thus to relax their log policy. As noted below, I think this has some force for Chrome, but mainly because it's effectively making Google the guarantor that a certificate has actually been published.

Closing the Loop #

Because the source of the problem is that the client isn't verifying inclusion of the certificate (by checking the inclusion proof) but only that the log says it would include it (by checking the SCT), the obvious fix is to have the client somehow verify that the certificate actually was included. This turns out to be somewhat challenging and there have been a number of attempts, none of which really work.

The first problem is that we will not always be able to enforce inclusion in real time for the same reason that we need SCTs in the first place: the certificate might have just been issued very recently. For these certificates the client has to trust the SCT to establish the connection and at best can check that the certificate was subsequently included by the logs. This is actually worse than it sounds because the CA has complete freedom about what timestamp to put in its certificates, and so—assuming it can collude with two logs—it can always have a misissued certificate appear to be recent. The result is that the attacker will succeed in impersonating the server and at best the client will be able to detect the cheating at some later time when it determines that the certificate was never logged.^[1]

Verifying Inclusion #

Even once you are past the time when the certificate should have been logged, verifying that it actually was is tricky. For obvious performance reasons we don't want to have to download the entire database. The inclusion proof is nicely compact, but when the client contacts the log and asks for the inclusion proof, that tells the log which certificate the client is checking and hence which site the client is visiting; together with the client's IP address, this allows the log to track the client's activity. Obviously, this problem is worse if there are only a small number of logs and was even worse when Google had to be one of them.
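For reference, this is roughly what checking an inclusion proof involves. The sketch follows the hashing and verification rules in the CT RFCs as I understand them (a `0x00` prefix for leaves, `0x01` for interior nodes, split at the largest power of two below the tree size), but treat it as an illustration rather than a reference implementation; the function names are mine.

```python
import hashlib

def leaf_hash(data: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + data).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()

def split_point(n: int) -> int:
    """Largest power of two strictly less than n (for n >= 2)."""
    k = 1
    while k * 2 < n:
        k *= 2
    return k

def tree_root(leaves: list) -> bytes:
    if len(leaves) == 1:
        return leaf_hash(leaves[0])
    k = split_point(len(leaves))
    return node_hash(tree_root(leaves[:k]), tree_root(leaves[k:]))

def inclusion_path(index: int, leaves: list) -> list:
    """Sibling hashes from the leaf up to the root (what the log would return)."""
    if len(leaves) == 1:
        return []
    k = split_point(len(leaves))
    if index < k:
        return inclusion_path(index, leaves[:k]) + [tree_root(leaves[k:])]
    return inclusion_path(index - k, leaves[k:]) + [tree_root(leaves[:k])]

def verify_inclusion(leaf: bytes, index: int, size: int, path: list, root: bytes) -> bool:
    """Client side: recompute the root from the leaf and the sibling hashes."""
    if index >= size:
        return False
    fn, sn, r = index, size - 1, leaf_hash(leaf)
    for p in path:
        if sn == 0:
            return False
        if fn & 1 or fn == sn:
            r = node_hash(p, r)          # sibling is on the left
            while fn and not fn & 1:     # skip levels where this node has no sibling
                fn >>= 1
                sn >>= 1
        else:
            r = node_hash(r, p)          # sibling is on the right
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root

leaves = [f"cert-{i}".encode() for i in range(7)]
root = tree_root(leaves)
proof = inclusion_path(5, leaves)
print(verify_inclusion(leaves[5], 5, len(leaves), proof, root), len(proof))
```

The proof is only a handful of hashes even for a very large tree, which is why the cost problem here is privacy rather than bandwidth.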

In order to prevent this form of tracking, we need some way for the client to retrieve the inclusion proof anonymously. There are a number of possible options here, such as VPNs, proxies, or Private Information Retrieval (PIR). As far as I know, no log deploys any kind of PIR (it would probably be quite expensive), and while proxies or VPNs are technically feasible, they're not free to run. There are similar problems with clients reporting certificates which are not included but should have been. I'm not aware of any major browser which verifies certificate inclusion proofs [Update 2023-12-25] by default (Chrome had some ideas about using DNS,^[2] but seems to have abandoned them.^[3]), though see below.

Distributing Inclusion Proofs #

One way to minimize the privacy risk of retrieving the inclusion proofs is to have the server distribute them to the client. Of course, if you're not willing to wait for the next STH, then you still have to deal with SCTs, but at least after the STH was issued the server could somehow get a copy of the inclusion proof and send that to the client, thus preventing the client from having to retrieve the inclusion proof for older certificates. This seems like a good idea in principle but ran into several problems.

First, it was never really clear how you would distribute the STH to the server, which, after all, already has the certificate. One possibility is to incorporate the STH into a new certificate, which the server would then retrieve a day or two later and thereafter serve to the client; this seemed kind of impractical when CT was originally designed, but in the intervening 10 years, automatic certificate issuance has become far more common (specifically, a protocol called ACME, originally developed for Let's Encrypt), and so it wouldn't be that hard to imagine modifying ACME to send an updated certificate. Importantly, this is something that could be deployed incrementally, because clients have to be able to fall back to SCTs anyway. However, it doesn't seem to be something that's happening.

There were also ideas about using what's called OCSP stapling. Because certificates have a long lifespan, they might be revoked while still otherwise valid. The OCSP protocol allows clients to check whether a certificate is still valid, but introduces latency and has its own privacy problems. For a while, there was interest in having servers pre-retrieve OCSP responses (they're signed by the CA) and give them to clients proactively, thus letting them skip the OCSP checks, and it would be straightforward for the CA to put the inclusion proof in the OCSP response. This has similar deployment properties to the new certificate idea, except that it requires servers to actually do OCSP stapling. However, at the end of the day browsers adopted a different set of mechanisms for handling revocation, centered around centrally distributed revocation lists, so OCSP stapling never really took off.

All of these ideas about providing inclusion proofs to the client were made more complicated by ambiguity about which STH the inclusion proof was supposed to apply to. In the system I described in part I, there was a new Merkle tree every day, but the way CT is actually designed is that there is an ever-growing Merkle tree and STHs are issued at whatever intervals are convenient for the log, as long as they aren't too far apart. This means that it's possible for the browser to have an STH for 5 PM but the server to have an inclusion proof for 4 PM. CT has a way of handling this with a mechanism called a "consistency proof" that bridges between these two versions of the tree, but retrieving the consistency proof requires contacting the log, which creates new privacy problems.

This is actually a solvable problem if the logs provide a more predictable mapping from certificates to STHs (a technique called STH discipline which Richard Barnes and I worked on), but by the time this was all worked out, there wasn't that much energy for changes to CT.

Gossip Doesn't Work #

Even if we did have some mechanism for verifying the inclusion proof, we still have the problem of getting consensus on the STHs. The original CT design assumed a flood fill technique (what they called "gossip") like I described in part I, but was frustratingly short on specifics:

All clients should gossip with each other, exchanging STHs at least; this is all that is required to ensure that they all have a consistent view. The exact mechanism for gossip will be described in a separate document, but it is expected there will be a variety.

Needless to say, this is some vigorous handwaving, and actually building a system like this is fairly hard. In particular, there's no obvious way for browser clients to discover and communicate with each other (see my post on ICE to see some of the challenges here), as this isn't something they otherwise normally do.^[4] Eventually the IETF did try to produce a document with some ideas, but it was quite complicated and the IETF abandoned it and as far as I know, no browser ever implemented gossip.

An alternative to gossip is to have the software vendor just provide the STHs. This arguably is less secure than gossip because the vendor can lie, but as I noted previously, the vendor also controls software updates and the trust anchor list, so browser vendors are reasonably comfortable with designs that require trusting them, at least for now. This is something Richard Barnes and I looked at in concert with STH discipline, but ultimately it wasn't worth it without some way to actually get the inclusion proofs on the servers, which remained largely an unsolved problem. As things stand today, clients don't really do anything to retrieve or double-check STHs.

Update: 2023-12-25 Note that what I'm referring to here is that it's hard for clients to gossip. It's obviously not a problem for services which are verifying each certificate that was issued (monitors) to gossip, as discussed below.

Chrome CT Auditing #

Added 2023-12-25

As Emily Stark pointed out to me on X/Twitter, Chrome actually does some auditing, which I had somehow managed to miss. Specifically, it checks to see if Google is aware of a given SCT. Joe DeBlasio has a summary here:

No Safe Browsing protections -> no SCT auditing

Default Safe Browsing protections -> SCT auditing logic selects a small proportion of TLS connections and performs a k-anonymous lookup on an SCT. If that privacy-preserving SCT lookup reveals that the SCT is not known to Google but should be, the client uploads the certificate, SCTs, and hostname to Google (but no other information).

Enhanced Safe Browsing protections -> SCT auditing logic selects a small proportion of TLS connections and uploads the certificate, SCTs, and hostname to Google (but no other information).
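One common way to build a k-anonymous lookup is a hash-prefix query: the client sends only a short prefix of a hash, the server returns every known entry with that prefix, and the client does the full comparison locally. The sketch below shows that general technique; it is not a description of Chrome's actual protocol, and the prefix length and class names are assumptions.

```python
import hashlib

PREFIX_BYTES = 2  # hypothetical; shorter prefixes hide the query among more SCTs

def sct_digest(sct: bytes) -> bytes:
    return hashlib.sha256(sct).digest()

class AuditServer:
    """Holds digests of every SCT the auditor knows about, bucketed by prefix."""

    def __init__(self, known_scts):
        self.buckets = {}
        for sct in known_scts:
            digest = sct_digest(sct)
            self.buckets.setdefault(digest[:PREFIX_BYTES], set()).add(digest)

    def lookup(self, prefix: bytes) -> set:
        # The server sees only the prefix, which many different SCTs share.
        return self.buckets.get(prefix, set())

def client_sct_is_known(sct: bytes, server: AuditServer) -> bool:
    digest = sct_digest(sct)
    candidates = server.lookup(digest[:PREFIX_BYTES])
    return digest in candidates  # the full comparison happens on the client

server = AuditServer([b"sct-1", b"sct-2"])
print(client_sct_is_known(b"sct-1", server), client_sct_is_known(b"sct-unlogged", server))
```

In this model the client only reveals the full certificate when the lookup comes back negative, which matches the shape of the default-protection behavior quoted above.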

This is an interesting design and gets around some of the problems that I've discussed above.^[5] The security properties it provides are:

Google can learn which certificates have been issued by other logs and do whatever checks it wants on whether they should have been issued.
Google can check that other monitors are seeing the same thing as it does (by gossiping between monitors, as in the previous section), thus allowing them to independently check for misissuance.
Under certain assumptions about the attacker's capabilities, Google will eventually learn about any certificate which wasn't logged. What I mean by "certain assumptions" is that (1) the attacker has to use the certificate reasonably often to have a high probability of report and (2) a powerful attacker might be able to impersonate the server to a client and then block the client's subsequent network access to Google so that it can't make the report.
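Assumption (1) can be made quantitative. If each connection is sampled independently with probability p, the chance that at least one of n uses of the certificate gets reported is 1 - (1 - p)^n. The sampling rate below is a made-up number for illustration, not Chrome's real setting, and the calculation ignores the blocking attack in assumption (2).

```python
def detection_probability(sampling_rate: float, uses: int) -> float:
    """Chance that at least one of `uses` independent connections is sampled."""
    return 1.0 - (1.0 - sampling_rate) ** uses

# The sampling rate here is a made-up illustration, not Chrome's real setting.
for uses in (1, 100, 10_000):
    print(uses, round(detection_probability(0.001, uses), 4))
```

A certificate used a handful of times against a single target is therefore unlikely to be caught, whereas one used at scale almost certainly will be.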

This isn't nothing, but I think it also falls short of public verifiability in several respects. First, it still leaves clients vulnerable to accepting certificates which were never published; it just makes it possible—modulo the caveats in point (3) above—to detect the compromise after the fact. Second, it fundamentally depends on Google acting as the guarantor that certificates were published because they're the ones who run the auditing service.

Overengineering #

As a result of all this, CT has more or less given up on public verifiability. As soon as you allow for SCTs, clients have no way of ensuring that certificates have been logged before accepting them, and without some mechanism for verifying retrospectively that certificates were logged, there's not even any way for clients to detect that they accepted an unlogged certificate, and CT just reduces to a system where the clients trust the logs not to lie about whether they are going to publish a given certificate.

Updated 2023-12-25, in light of conversation with Emily and Ryan: As a result of all this, CT provides fairly limited public verifiability. At the time of acceptance, clients have no way of ensuring that certificates have been logged before accepting them, because the certificate might have just been issued and not yet incorporated into a log. Chrome's CT auditing provides a partial mechanism for retrospectively detecting that an unlogged certificate was accepted, but this really depends on trusting Google, because Google has to see a copy of every certificate to make this work.

~~If we're just trusting the logs, though~~ Why then do we need all the machinery of Merkle trees? The logs could just take in pre-certificates, issue SCTs, and publish the certificates on their sites as soon as possible (effectively immediately). This doesn't provide public verifiability, of course; instead the logs act as what's called a "countersignature", in which the signature from the logs isn't attesting that they verified the certificate's trustworthiness themselves, just that they've seen it. To a first order, the answer is that what we actually have is a countersignature scheme and that the Merkle tree machinery is unnecessary overhead, or, perhaps, more charitably, futureproofing against some future world where we solve the engineering problems described above.

The problem is that it's expensive futureproofing, both in terms of protocol complexity and in terms of operational brittleness. A fairly large fraction of the CT RFC is concerned with specifying the Merkle trees, the machinery of Merkle tree proofs, and the like. All of this could just go away if we were to just treat CT as a "countersign + publish" protocol, leaving a dramatically simpler protocol that would be a thin layer on top of HTTP.

Worse yet, CT logs turn out to be hugely operationally complex to run correctly. I haven't personally operated one, but the basic problem seems to be tight timing requirements combined with the immutability of the Merkle tree structure. Recall that an SCT is a promise to include the certificate into the Merkle tree, which has to happen within a finite period of time called the maximum merge delay (MMD) (which Chrome requires to be no more than 24 hours). The reason for this is so that the clients can check that the log fulfilled its promise in the SCT to actually put the certificate in the log. If the log just had to eventually put it in, then whenever the client checked it could just say "not right now", hence the MMD. But this means that if you have any kind of glitch (say a precertificate gets lost in some queue or you have an outage of more than 24 hours), you're suddenly out of compliance. Running a big production service with no glitches is no easy task and it shouldn't be surprising that we've seen issues.
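The compliance rule itself is trivial to state, which is part of what makes it so unforgiving. A minimal sketch, with made-up timestamps and a hypothetical `within_mmd` helper:

```python
from datetime import datetime, timedelta, timezone

MMD = timedelta(hours=24)  # Chrome's upper bound, as stated above

def within_mmd(sct_time: datetime, merged_time: datetime, mmd: timedelta = MMD) -> bool:
    """True if the certificate was merged into the tree before the deadline."""
    return merged_time - sct_time <= mmd

sct_time = datetime(2023, 11, 1, 12, 0, tzinfo=timezone.utc)  # made-up timestamps
print(within_mmd(sct_time, sct_time + timedelta(hours=3)))
print(within_mmd(sct_time, sct_time + timedelta(hours=25)))
```

There is no partial credit: a single entry that merges an hour late is a violation, no matter how many millions merged on time.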

Some examples: In August, DigiCert's log was retired because they had a bit flip in one of the entries in the tree and just in November, Cloudflare's log had an outage in which they failed to include thousands of certificates within the MMD. Even Google has had outages and at least one resulted in an MMD violation.^[6] The difficulty of running a log is a direct result of the requirements introduced by the combination of SCTs and trying to maintain the infrastructure that would support public verifiability, even though public verifiability doesn't exist in practice. Running them would be far simpler if those requirements were relaxed, and, as far as I can tell, it would have no material impact on user security.

Why then, do we have this overengineered design? The history is a little fuzzy, and I wasn't there at the beginning, but my sense is that when CT was originally designed the intention was not to have SCTs and instead to have just Merkle trees and inclusion proofs delivered with certificates (more or less the design I described in Part I). Despite some challenges, this design probably could have been made to work in a greenfield setting, albeit at the cost of high issuance latency, but eventually the designers were forced to add SCTs for deployability reasons. By the time it was clear we would be stuck with SCTs indefinitely, there was a huge amount of inertia behind the Merkle tree design, which was widely deployed and people were reluctant to climb down from it and from the hope of future public verifiability. So, instead we have a system with the complexity of public verifiability with the security of countersignatures.

Despite all this, the CT RFC (both the original 2013 version and the 2021 update) still claims that logs don't need to be trusted:

Certificate transparency aims to mitigate the problem of misissued certificates by providing publicly auditable, append-only, untrusted logs of all issued certificates. The logs are publicly auditable so that it is possible for anyone to verify the correctness of each log and to monitor when new certificates are added to it. The logs do not themselves prevent misissue, but they ensure that interested parties (particularly those named in certificates) can detect such misissuance. Note that this is a general mechanism, but in this document, we only describe its use for public TLS server certificates issued by public certificate authorities (CAs).

I suppose at the time it was written (2013) this could be read as aspirational language in the hope that some way could be found to deal with the issues described above. From the perspective of 2023, however, it looks more like wishful thinking.

CT: Still Useful #

Despite everything I've said above about the limitations of CT verifiability, it's still proven to be exceedingly useful. There is a robust set of logs and quite a few services, and CT has helped detect a number of serious incidents, in several cases leading to CAs being distrusted. [Updated: 2023-12-25]

First, a lot of CA issues are simple mistakes rather than intentional misbehavior that the CA is trying to conceal. Forcing CAs to publish all of their certificates makes this kind of error easier for third parties to detect, which happens with some frequency. This benefit doesn't require browsers to check SCTs at all, just that CAs be required to log certificates. In addition, the requirement to log certificates means that it's possible to construct a database of all the valid certificates, which is a very useful research tool.

Second, CT requirements make it harder to cheat because not only does the CA have to intentionally misbehave, it has to collude with logs to do so. Obviously, finding one or more malicious logs is harder than just having the CA be malicious, especially given the relatively small number of logs, so CT provides a real security benefit even with no public verifiability.

Finally, CT is a really useful tool for gaining visibility into the overall state of the WebPKI ecosystem; because every certificate has to be published, CT makes it much easier to understand the system as a whole.

The Bigger Picture #

What we have here is yet another case of how the Internet is built on "good enough".

It's a commonplace that the WebPKI is a cobbled together mess and at the time that CT was designed, it was even more so. At roughly the same time CT was published there was a fair amount of interest in replacing the WebPKI with something based on DNSSEC/DANE which looked like it might have a better attack profile, in particular because there weren't a large number of actors able to attest to a given name. In practice, though, DANE deployment for the Web totally stalled, largely because it was basically a forklift upgrade.

By contrast, CT is yet another patch on top of the WebPKI, but was incrementally deployable. Imperfect though it is, it has gone a long way towards improving the system, both by making undetected misissuance harder and by making simple misbehavior easier to spot and address. I know there are still people who want to replace the WebPKI with something based on totally different principles, but in 2023, that looks fairly implausible.^[7]

Similarly, while CT is overcomplicated, hard to operate, and a lot more than we really needed, it's also what's deployed and people aren't really excited about changing it. In fact, while there was an extensive effort to produce a revision of CT ("Certificate Transparency v2"), eventually everyone just kind of ran out of energy and while it did get published as an RFC, as far as I know nobody implements it. If we were starting from scratch, we'd probably do it differently (see "good enough", supra), but that's not where we are, and it's easier to just stick with what we have.

None of this is to say that transparency and public verifiability aren't good ideas, and now that end-to-end encrypted messaging has become so popular there is increased interest in transparency for those systems. The requirements here are somewhat different and the result is a rather fancier system called "key transparency", which will be the subject of the next post in this series.

This is also the reason why clients requiring that servers provide inclusion proofs for sufficiently old certificates doesn't help. ↩︎
The reasoning here is that your DNS server already knows what sites you are visiting and so if you could also retrieve the STH over DNS, this would provide privacy. ↩︎
This state of knowledge paper by Meiklejohn, DeBlasio, O'Brien, Thompson, Yeo, and Stark provides a good survey of the alternatives and the present situation. ↩︎
Apple's recent deployment of Key Transparency for iMessage does gossip but this is much more natural because iMessage clients already talk to each other. ↩︎
As an aside, this has some undesirable privacy properties, similar to those of Safe Browsing, and worse if the client actually reports a suspicious certificate. ↩︎
See Andrew Ayer's excellent writeup of CT log failures, though Ayer is a bit more sanguine about failures than I am. ↩︎
Benjamin, O'Brien, and Westerbaan have a proposal to replace the combination of X.509 and CT with something called "Merkle Tree Certificates", but conceptually this is the same trust architecture as the WebPKI. ↩︎

A hard look at Certificate Transparency, Part I: Transparency Systems

2023-12-13T00:00:00Z

Identifying the communicating endpoints is a key requirement for nearly every security protocol. You can have the best crypto in the world, but if you aren't able to authenticate your peer, then you are vulnerable to impersonation attacks. If the peers have communicated before, it is sometimes possible to authenticate directly, but this doesn't work in many common situations, such as when you are given the address of a Web site and need to connect to it securely.

Nearly every major communications security protocol has the same basic authentication design:

Endpoints have human-readable identities (e.g., domain names, e-mail addresses, phone numbers, etc.)
A trusted authentication service attests to the binding between an identity and the endpoint's public key.
The endpoint uses its private key to prove its identity.

For example, in the HTTPS/Web context, sites are authenticated by having certificates which are issued by a certificate authority (CA).^[1] These CAs are in turn vetted by browser vendors, who decide which CAs their browsers will trust. This entire system is called the "WebPKI" (see here for more background on this.)

The key word in this system is trust: the endpoints need to trust that the authentication service doesn't falsely attest to a binding for the wrong person (technical term: "misissuing"). If an authentication service makes a mistake or deliberately cheats, then this could allow the attacker to impersonate a valid user of the system, which is obviously bad. This is not merely a hypothetical issue. In the WebPKI alone, there have been a series of high profile certificate authority failures, perhaps most famously in 2011 when the Dutch CA DigiNotar was subverted and issued a series of bogus certificates, including one for Google. The bottom line is that an authentication service of this type represents a single point of failure for the system as a whole. The WebPKI is especially bad here because there are a large number of CAs, nearly all of which can attest to any domain name, so there are multiple entities, each of which is a single point of failure.

There are a number of potential approaches for defending against this problem but the one that the community seems to have settled on is what's called a transparency system. The basic concept of such a system is that you retain the idea of a trusted authentication service but add on a layer in which it publishes the bindings it is attesting to so that anyone can check that it's not misissuing. The first transparency system, and still the most widely deployed, is Certificate Transparency (CT), designed by Ben Laurie, Adam Langley, and Emilia Kasper (all at Google at the time) in the wake of the DigiNotar incident. CT was designed to bring transparency to the famously mismanaged WebPKI. More recently, there has also been a lot of interest in CT-like (but fancier) systems for non-WebPKI applications, such as "key transparency" for messaging systems, but in this post I want to focus on CT.

As you can see from the diagram below, CT is a very complicated system, in part because it had to be retrofitted onto the existing WebPKI design and in part due to some technical decisions which in retrospect look like they were mistakes (I'll get into those in the next post in the series).

[Overview of Certificate Transparency from transparency.dev]

What I want to do in the rest of this post is to try to gradually build up to a sort of idealized version of CT from first principles. In a future post, I'll look at actually existing CT, some of the compromises that it made in the name of deployment, and the implications of those compromises.

Transparency Systems #

The basic idea behind a transparency system is not to prevent misissuance but to detect it. At a high level, this works as follows:

The CA publishes every certificate that it issues.
The owner of a given identity—and potentially other people—ensures that it recognizes every certificate that was published.
Relying parties check that a certificate is in the log before accepting it.

The figure below provides an overview of the verification pieces of this process in the Web context:

Conceptual overview of a transparency system

At some point, example.com gets a certificate (1234) from the CA, which publishes that certificate. Then, when Alice wants to connect to example.com, it presents that certificate (step 1). Alice then checks with the published certificate list to verify that the certificate is actually on the list (step 2). Separately, example.com periodically checks the list to be sure that only certificates it knows about are on the list.

There are a lot of moving pieces, so it's worthwhile working through the logic here for why this works.

Is it possible to prevent misissuance? #

While detecting misissuance is good, it would be better to prevent it entirely. Unfortunately, this turns out to be a very challenging problem because the authentication service has to determine who owns a given name (e.g., example.com), and that determination isn't directly verifiable by third parties. There are designs which bind name issuance to authentication (often using some kind of blockchain), but the problem with these systems is that they don't allow for any discretion on the part of the authentication service, so, for instance, if I register example.com and then lose my keys I still want to be able to reclaim it. This may require some kind of manual intervention. More on this here. If you're going to allow for discretion to handle this kind of case, then you need to worry about that discretion being abused.

Misissuance Detection #

Because every issued certificate is published, if the CA misissues a certificate, then it will also be published and can then be detected, either by the true owner of the identity or by a third party who notices something fishy (why is some CA I've never heard of issuing a certificate for Google?).

In the Web context, this is all somewhat harder than it sounds: if you're a big and well-operated site, then you may well know every certificate that you have requested, but that's not necessarily true for smaller sites.^[2] Similarly, third party verifiers won't necessarily be able to check that the issued certificates are what is expected. The result is that while you should expect that misissuance of high profile sites will likely be detected, misissuance of smaller sites could easily go unnoticed.

Managing Misissuance #

OK, so you've detected a certificate that was misissued, now what? The general story is that you report it. What happens then depends on how the certificate was misissued. In the simple case of unintentional misissuance—which definitely happens—you would expect the CA to revoke the certificate, investigate what happened, and if possible address whatever issue led to the misissuance.^[3]

However, it's also possible that the CA is not well operated or the misissuance is more than a simple mistake. In this case, browsers might decide to distrust the CA, with the effect that all certificates issued by the CA stop being accepted. This is a disruptive step, but it does happen, even to large CAs. For instance, in response to a series of operational issues the browsers distrusted Symantec (very gradually) between 2016 and 2018.^[4]

Much of the value of a transparency system like this is that it works together with the threat of distrust as an incentive to good behavior. As noted above, it's possible for misissuance for the names of small sites to go undetected, but once there is some evidence of some misbehavior—perhaps of a single site—the transparency system allows for easier investigation of the other certificates issued by the CA. It is also possible to use the transparency system to detect other kinds of CA misbehavior than misissuance which can then prompt further investigation.

Incompetence versus Malice #

If all we are worried about is mistakes by the authentication service, then just publishing all the certificates is mostly enough; even if the CA inadvertently issues a certificate to the wrong person, it will still be published and so the mistake can potentially be detected. But what if the CA is intentionally misissuing? In this case, it can just provide the certificate to the attacker without publishing it, in which case the fraud isn't readily detectable.

This is the reason for requiring the relying parties (clients) to enforce that the certificate has been published (point 3 above). This prevents attacks where the AS doesn't publish the certificate because the relying parties just won't accept it, making the attack pointless. If relying parties don't check for the presence of the certificate on the published list then nothing requires the CA to publish every certificate.

Partitioned Views #

The description above just covers the logic of a transparency system but doesn't tell you how one actually works and in fact I've glossed over an important technical problem, which is how to ensure that the published list of certificates is the same for everyone. The obvious thing to do is for the AS to just publish the list of certificates it has issued on its Web site, but this isn't secure. Consider what happens if the AS gives different answers to different people, like so:

Partitioning attacks

In this scenario the attacker has obtained a misissued certificate from the CA (not shown), which creates two lists of certificates:

List 1, which has the attacker's certificate
List 2, which has the legitimate certificate

When example.com goes to check the list of certificates, the CA provides List 2, containing the correct certificate (1234) so everything looks OK. On the other hand, when Alice connects to the attacker (impersonating example.com), it presents the fake certificate (ABCD). Alice then connects to the CA, which provides List 1, containing ABCD, so everything looks OK here too, and the attack goes undetected.

The point here is that the authentication server needs to publish the certificate list in some way that everyone has the same view and that they can verify that they have the same view (technical term: consensus). As long as this is true, then we know that the owner of the identity has had a chance to check any certificate which the relying party might treat as valid.

The analogy I like to use for this kind of consensus (I'm not sure who originated it) is that the authentication server publishes each binding by using a giant laser to inscribe each binding onto the face of the moon. This allows anyone with a telescope to look up—at least during the night—and see what bindings have been created.

Writing on the face of the moon. Image by Kate Hudson with components from Midjourney and Adobe AI.

This is what is known as a "publicly verifiable" system in that it doesn't require trust. Anyone can see for themselves what is written on the face of the moon, so you aren't depending on the CA not to cheat.

Unfortunately, the giant laser is physically impractical, and so we need some other technology for providing consensus. Much of the complexity in transparency systems derives from this requirement.

Manufacturing Consensus #

As noted above, the basic challenge we have here is ensuring that every client has the same view of the certificate database.

The obvious thing to do is for people—really client software—to share copies of the database with each other so that you effectively flood fill the database to everyone and eventually everyone has a copy of the whole database. Alternately, if you have a piece of software like a browser which has an update channel, the vendor can send a copy of the database to all its users. Of course in this case you're trusting the browser vendor not to send a fake database, but as a practical matter you're also trusting them not to send you malicious updates anyway, so it's not clear how much worse this makes the situation. More on this in a future post. Whichever design you are using, if the attacker has mounted a partitioning attack as described above, then the site will eventually get a copy of the correct database from some other element, thus allowing for detection of misissuance when it sees a certificate it doesn't recognize.

One thing that's very important to realize is that it doesn't matter if some—or even most—of the endpoints in the system are malicious; if the flood fill system is working, then eventually^[5] each endpoint will talk to someone who isn't malicious, so they will eventually get a copy of every certificate. And because certificates are publicly verifiable (you just check the signature), it's easy to store every certificate that is valid and discard the ones that aren't. A malicious node can remove certificates from the database they send you, but they can't insert certificates that don't exist or prevent other endpoints from sending you valid certificates.

Moreover, it's not really required that everyone get a full copy of the database: consider the case where we have a fake certificate for example.com. If the operators of example.com see it, then they can publish it and report it to the browser vendors, who can then investigate, as described above. The point here is that the system doesn't need to work perfectly in order to detect attacks; it just needs to work well enough that (1) any relying party will be able to validate that a certificate has been published in the database and (2) the attacker cannot reliably prevent parties trying to verify database correctness from getting a copy of misissued certificates.

With the right data structure, it's also possible to make partition attacks easier to detect. For instance, if each CA publishes one database a day and signs the entire database, then any element which receives two databases for a single day can immediately detect that there has been cheating.

The problem, obviously, is that this kind of flood fill is incredibly inefficient: Let's Encrypt alone has about 300 million valid certificates; at 1K each, this would be a database of 300GB, not something you want to be storing on your phone, let alone having to send to everyone else you come into contact with—ignoring for the moment the question of how you're going to transmit the database around. Clearly, this simple system is not practical.

Of course, you don't actually need to send a copy of the database to everyone, you just need to verify that you have the same database as everyone else, which you can do by exchanging hashes of the database, but this doesn't get us very far because (1) you still need to keep a copy of the database on your computer and (2) the database isn't static, but instead new certificates are constantly being issued (Let's Encrypt issues over 3 million certificates a day). Addressing this requires some new technology, specifically something called a "Merkle Tree".

Background: Merkle Trees #

The idea behind a Merkle Tree is to allow a way to efficiently commit to a set of values without actually publishing any of the values.

As an intuition pump, suppose I run a streaming service which sends movies over the Internet and I want people to be confident that they are getting the right movie and not some content generated by an attacker. In the real world, we just carry all the data over a TLS connection, but let's assume I'm too cheap for that. Instead, what I could do is send the hash of the content over the TLS connection and then let the client retrieve the rest over HTTP (there used to be a time when people really worried about the cost of encryption). The problem with this is that the hash is computed over the entire movie, but we obviously want people to be able to verify that there hasn't been any tampering as they are watching it. The obvious solution here is to break the movie up into chunks—you want to do this anyway so that people can easily scroll forward or backward—and then send a hash for each chunk over the TLS connection. Then, when the client retrieves each chunk, they can verify the hash before they play it.
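The per-chunk scheme can be sketched in a few lines. This is only an illustration under some assumptions of mine: SHA-256 as the hash, a tiny 4-byte chunk size so the example stays readable, and made-up function names (split_chunks, make_manifest, verify_chunk).

```python
import hashlib

CHUNK_SIZE = 4  # bytes; hypothetical and deliberately tiny

def split_chunks(data, size=CHUNK_SIZE):
    """Break the movie into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def make_manifest(chunks):
    """The list of hashes the server would send over the TLS connection."""
    return [hashlib.sha256(c).digest() for c in chunks]

def verify_chunk(manifest, index, chunk):
    """Client-side check before playing a chunk fetched over plain HTTP."""
    return hashlib.sha256(chunk).digest() == manifest[index]

movie = b"frame-data-for-a-very-short-movie"
chunks = split_chunks(movie)
manifest = make_manifest(chunks)
print(verify_chunk(manifest, 0, chunks[0]))   # genuine chunk
print(verify_chunk(manifest, 0, b"evil"))     # substituted chunk
```

The manifest grows by one hash per chunk, which is the cost the rest of this section is trying to get rid of.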

This still involves sending a fair amount of data over the TLS connection, though: suppose each chunk is 5s long, then a 2 hr movie will be 1440 chunks and require sending something like 46KB over the TLS connection. It turns out that there is a more efficient strategy, using one of the computer scientist's favorite tools, the binary tree. The basic idea is that we hash each chunk and then arrange the chunks in a binary tree, like so:

A Merkle tree

The leaves of the tree are the hashes of the individual chunks and then each interior node is the hash of its two children.^[6] This way, the root of the tree includes the hashes of all of the leaves, so if any leaf changes then it would also change the hash of the root. This means you can publish only the root hash over the TLS connection and anyone can verify the leaves by just hashing them up to the root.
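Here is a rough sketch of computing the root for a four-chunk tree like the one in the figure. It assumes SHA-256 and, to keep things short, a chunk count that is a power of two. It also uses plain concatenation of the two child hashes, exactly as described here, so it has the leaf/interior ambiguity that a real design would need to fix.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_root(chunks):
    """Root of a binary hash tree over chunks (count must be a power of two)."""
    if not chunks or len(chunks) & (len(chunks) - 1):
        raise ValueError("this sketch only handles 1, 2, 4, 8, ... chunks")
    level = [h(c) for c in chunks]          # leaves: H(C1), H(C2), ...
    while len(level) > 1:
        # Each interior node is the hash of its two children, concatenated.
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

chunks = [b"C1", b"C2", b"C3", b"C4"]
root = merkle_root(chunks)
tampered = merkle_root([b"C1", b"C2", b"XX", b"C4"])
print(root.hex())
print(root != tampered)   # changing any leaf changes the root
```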

Well, sort of. What I just described requires having all the chunks, but remember we want to be able to verify a chunk without other chunks. Fortunately, there is an easy way to arrange this: when you send a chunk, you also send enough nodes in the tree to let the receiver reconstruct the tree. Specifically, you send the nodes next to the nodes on the path between your chunk and the root. For example, suppose I just sent chunk 1. The receiver can compute H(C1) for themselves, but they can't compute the parent node without knowing H(C2), so I have to send that. Similarly, they can't compute the root without knowing H ( H(C3) + H(C4) ) so I have to send that as well. I don't have to send H(C3) or H(C4) because they don't need that to compute the root.

The figure below illustrates what I'm talking about:

The co-path of a Merkle tree

The sender has to transmit everything in blue, specifically:

The chunk C1 itself so that the receiver can compute H(C1), though of course it was transmitting this anyway.
H(C2) so that the receiver can compute the parent node H( H(C1) + H(C2) )
H( H(C3) + H(C4) ) so that the receiver can compute the root

The receiver computes everything in black for themselves and then compares it with the root hash it received over the secure channel. If everything checks out, then this proves that the tree was computed over C1 (and that it was in that position in the tree) and therefore that it's a legitimate chunk. The technical term here is an "inclusion" proof, because it proves that the chunk was included in the computation for the tree.
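The sender and receiver sides of this can be sketched as follows, again assuming SHA-256, a power-of-two number of chunks, and plain concatenation of child hashes. The function names are mine, and indices are zero-based, so C1 is index 0.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_levels(chunks):
    """All levels of the tree, leaves first (chunk count must be a power of two)."""
    levels = [[h(c) for c in chunks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def inclusion_proof(levels, index):
    """Sender side: sibling hashes on the path from leaf `index` to the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])   # the sibling is the other node in the pair
        index //= 2
    return proof

def verify_inclusion(chunk, index, proof, root):
    """Receiver side: rebuild the path and compare against the trusted root."""
    node = h(chunk)
    for sibling in proof:
        # Even index: we are the left child; odd index: the right child.
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

chunks = [b"C1", b"C2", b"C3", b"C4"]
levels = build_levels(chunks)
root = levels[-1][0]                     # received over the secure channel
proof = inclusion_proof(levels, 0)       # proof for C1
print(len(proof))
print(verify_inclusion(b"C1", 0, proof, root))
print(verify_inclusion(b"XX", 0, proof, root))
```

For C1 the proof is exactly the two blue hashes from the figure: H(C2) and H( H(C3) + H(C4) ). Because the index determines whether each sibling goes on the left or the right, the proof also pins down the chunk's position in the tree.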

The key thing to realize is that the number of extra hashes that the sender has to include in order to let the receiver verify a chunk is less than the number of total chunks. Specifically, it's the depth of the tree, which is to say the logarithm base 2 of the number of chunks. In this case, that's 2 hashes, which is only half the number of chunks, but if there were thousands of chunks then this would be a huge difference.
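As a quick check on how slowly the proof grows, this sketch computes the number of sibling hashes for a few tree sizes. It assumes a balanced tree padded up to the next power of two, and proof_hashes is a name I made up. The sizes are the four-chunk example, the 1440-chunk movie, and the 3 million certificates a day that Let's Encrypt issues.

```python
def proof_hashes(num_leaves):
    """Sibling hashes needed for one inclusion proof: ceil(log2(num_leaves))."""
    if num_leaves < 1:
        raise ValueError("need at least one leaf")
    # (n - 1).bit_length() is an exact integer ceil(log2(n)) for n >= 1.
    return (num_leaves - 1).bit_length()

for n in (4, 1440, 3_000_000):
    print(n, proof_hashes(n))
```

Going from 4 leaves to 3 million multiplies the number of leaves by 750,000 but the proof only goes from 2 hashes to 22.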

A Transparency System with Merkle Trees #

It should now be apparent what we are going to do next, which is to put the certificates into a Merkle Tree. As a starting point, let's say that each CA takes all the certificates and makes them the leaves of the tree.^[7] With Let's Encrypt's 3 million certificates a day, this tree will be of around depth 22 for a day's certificates. The figure below provides an overview of how this fits together:

Certificate issuance with Merkle trees

When example.com wants to get a certificate, it contacts the CA as usual. The CA does whatever procedure it wants to validate the request and then waits for other certificate requests to come in. After some period (in this case daily), the CA generates all the certificates and then builds a Merkle tree out of them. It publishes the whole Merkle tree on the Internet and then sends each site its certificate, as well as the inclusion proof that the certificate was included in today's tree. The inclusion proof is comparatively small; using Let's Encrypt as our reference point, it will be about 600-700 bytes.

When the client subsequently contacts the site, the site provides both its certificate and the inclusion proof.^[8] The certificate can be verified in the usual fashion, but the client also needs to verify the inclusion proof in order to ensure that the certificate was actually published. In order to do this, it needs both the inclusion proof itself and the root of the Merkle tree that was published at the time of certificate issuance. Instead of flood filling the tree itself, we instead arrange to flood fill the signed root of the tree (or, more likely for a browser, to distribute it in the update channel). The client verifies the signature on the root to ensure that it's valid and then checks the inclusion proof in order to be sure that the certificate was really included in the tree.

This is a big improvement in the amount of information the client needs to store and retrieve. The signed root itself is very small (~100 bytes) and then on each connection it needs to retrieve ~600-700 bytes of inclusion proof for each certificate, which is around the size of your typical certificate, so this perhaps doubles the overhead of the TLS connection, which isn't that bad.

Note that in order to verify that there are no unexpected certificates for its domain names, the site still needs to download the entire certificate database, or more likely use some service which does it for it. However, sites typically have significant resources, and the database isn't that big, so this is a much smaller burden than requiring every browser to retrieve a copy. Moreover, a service which does this kind of checking just needs to download the database once for all of its clients, which lets it amortize the cost.

Security Properties #

This system does a reasonable job of providing the security guarantees we asked for at the start.

Because the client verifies the inclusion proof for a certificate, it is able to ensure that it chains up to a signed root. While the CA can technically make more than one tree with different contents, that requires signing two tree roots, which then have to be published somehow in order to be useful. As there is supposed to be only one root per day, as soon as any endpoint sees two different roots for the same period, it knows that the CA is cheating and can prove it to any third party just by publishing both signed roots.
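The detection logic itself is very simple, as this sketch suggests. It assumes the signatures on the roots have already been verified (the standard library has no public-key signature support, so that step is left out), and the CA name, dates, and root strings are placeholders.

```python
def find_equivocation(observations):
    """Return (ca, day, root_a, root_b) for the first conflict found, else None.

    observations: iterable of (ca, day, root) tuples whose signatures are
    assumed to have been verified already.
    """
    seen = {}
    for ca, day, root in observations:
        key = (ca, day)
        if key in seen and seen[key] != root:
            # Two different roots for one CA and one day: proof of cheating.
            return (ca, day, seen[key], root)
        seen[key] = root
    return None

obs = [
    ("ExampleCA", "2023-12-13", "root-1234"),
    ("ExampleCA", "2023-12-14", "root-5678"),
    ("ExampleCA", "2023-12-13", "root-abcd"),   # second root for the same day
]
print(find_equivocation(obs))
```

In a real system the returned pair would be the two signed roots themselves, since it is the CA's signatures on both that let a third party confirm the cheating without trusting whoever reported it.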

If we're doing simple peer-to-peer flood fill, not every client will be able to see both roots, but it's likely that one will. If clients are getting their copy of the signed root from their vendor, then the situation is even simpler: every client from the vendor will have the same root and as long as vendors check that their roots match and sites/services that want to check the database verify that their roots match the vendors' roots, there's no real way to publish two roots without being immediately detected.

The result is a system that is publicly verifiable in that everyone has the same view of the certificates that have been published. This isn't perfect in that you still have to actually detect misissuance, which isn't always straightforward for the reasons I discussed above, but at least it's not possible to have covert misissuance. This means that misissuance for big sites will probably be detected, and if any kind of misissuance is detected it's much easier to investigate because you have a permanent record.

Next Up: Real World Certificate Transparency #

At the beginning, I said that I was going to try to build an idealized version of Certificate Transparency, and that's what we have here. There are still a fair number of moving pieces, but the result has strong and fairly straightforward security properties. Unfortunately, CT as actually deployed involved quite a few technical compromises and the result was something more complicated and with quite different security properties. I'll be talking about those compromises and their consequences in the next post.

Yes, yes, I know it's technically a "certification authority", but at this point, can we just agree that it's "certificate authority". ↩︎
It's also not always possible to outsource this job to a CDN or hosting provider, because you might have your site hosted across more than one service, so no single service can check that it recognizes every certificate. For instance, suppose your site is both on Cloudflare and Fastly; both services will have certificates for your domain and if Cloudflare goes to check for certificates that weren't issued to it, it will find the ones issued to Fastly. ↩︎
The processes used to validate domains for certificate issuance are far from perfect, so even a well-operated CA can still misissue. ↩︎
Note that because certificates are signed by the CA, anyone can verify that they really issued it, without the cooperation of the CA after the fact. The certificate itself plus the claim by the domain's operator is prima facie evidence that something is wrong. ↩︎
"Eventually" is doing a lot of work here, but this isn't the system we're going to build, so I'm just going to handwave past it. ↩︎
Technical note: you actually don't want to use exactly this structure because it creates ambiguity between an interior node with children H(A) and H(B) and a leaf node with value H(A) + H(B), but that's easy to fix. ↩︎
Actual CT uses one big tree that grows over time, but this is conceptually easier to describe. ↩︎
For deployment reasons, we'd actually like the inclusion proof to be included in the certificate, so we don't need to modify the TLS stack. This is technically possible but doesn't matter at the moment. ↩︎

Adventure Run Report: Northern Yosemite 50

2023-10-29T00:00:00Z

After a kind of disappointing—but still the right call—decision to DNF at Teanaway 100, I found myself with a big pile of fitness, nothing planned for the rest of the year, but not really ready to just call it a season and start thinking about 2024. There weren't any races left I wanted to do, so instead I decided to try one of the adventure run loops that I had been eyeing for the summer but had to put off because the record snows in the 2022/2023 season kept the Sierras impassable late into the season, specifically another Leor Pantilat route that he called the Northern Yosemite 50.

My stuff ready to go.

Fortunately, my former training partner Chris Wood (now sadly reduced to running loops around Central Park in NYC) was in town, so we were able to do it together. Chris was actually already in Yosemite Valley, so we drove out separately and stayed at the Willow Springs Resort (a bit rustic but friendly and reasonably nice), which is about 30 minutes away from the start of the route at Annett's Mono Village. The "village" itself is basically a paid RV campground, but there is a big parking lot on the lake which seemed to be free. There was a "No overnight parking" sign, but we didn't expect to be there too far into the next morning in the worst case, so just left a note saying we were out trail running and hoped it would be OK.

The route is a big lollipop of just around 50 miles and 10000 ft of climbing, starting at Twin Lakes in the Hoover Wilderness at around 7000 ft, then quickly climbing above 9000 and mostly staying above there until the last 5 miles or so, with a high point of 10300 ft. Pantilat did it in 15:28 and so I figured we were looking at at least 16 hrs and probably more than this, which is a pretty long day this late in the season (sunset is around 6:00), so we decided on a 5 AM start, which meant that both the beginning and end would be on headlamp.

Start to the Loop [7.15 mi, +2246/-157 ft, 2:22:16, 19:56/mi, 15:41/mi GAP] #

The first 7 miles or so are the "stem" of the lollipop, a steady climb of about 2500 feet. It was really cold at the start (~45 F) so I was in gloves and a jacket and Chris was in a warm shirt and gloves, both of which were pretty much the uniform for the rest of the day as it never really warmed up. We got a little scare right as we were passing through the campground at the start when we saw a black bear rummaging through the garbage cans. The bear seemed more interested in trying to find food than in bothering us but we gave it a wide berth anyway.

As usual for the start of a run, we were fresh, and the trail itself is in pretty good shape for the Sierras and was still easy to find in the dark, so it went pretty smoothly. In any case, we got to the junction quickly and were definitely thinking that this was going to be a fast day. As it turned out, we were shortly to be punched in the face by reality.

To the PCT [8.47 mi, +846/-997 ft, 2:50:17, 20:06/mi, 18:27/mi GAP] #

This next segment is comparatively flat and pretty early on we passed Peeler Lake, which is spectacular:

The view of Peeler Lake

40+ miles to go but at least it warmed up

This section is a bit interstitial in that it's pretty flat but you know you have some big climbs ahead of you. We would have expected to be able to move pretty fast on this segment, but due to a combination of the terrain and trying to take it conservatively we actually slowed down a fair bit. There were two factors here. First, even though a lot of this was smooth trail, it was also frequently really narrow single track cut into a meadow which I found hard to run without hitting my legs against the side. It was also somewhat gently rolling and we were very deliberately walking anything that went uphill at all.

It had finally started to warm up to mid 60s and I was starting to be able to feel my hands again. It never got much warmer than this, and so we managed to stay pretty well hydrated. We'd started out with plenty of fluid (2l for me and 2.5l for Chris), with a target of about 500ml/hr, which meant that we had to start filtering water in this segment. Fortunately, there was water everywhere, whether in lakes or streams, so from here on in we kept to about 1l each. We only had one filter (Hydrapak 42mm), and were both using sports drink—you have to filter into the bottle and then add the powder, not the other way around—so the whole filtering/mixing thing slowed us down.

On the PCT [14.94 mi, +3258/-3793 ft, 5:32:15, 22:14/mi, 18:32/mi GAP] #

Finally we hit the junction for the Pacific Crest Trail. From here it's a sharp downhill to the low point of the loop (~7600ft) followed by two steep climbs, first to around 9500 ft and then above 10000 ft. The first of these is 1795 ft over 2.77 mi, for an average of 12.3%, so we made pretty extensive use of our poles.

I was really starting to feel the altitude and was definitely ready for the half-way point, which is basically at the dip between the next two climbs. We hit the halfway at just under 9 hrs, so it seemed like we were still on track for a <18 finish, especially as the last 7-10 miles were downhill. This is where I was planning to start with the caffeine and so I sucked down a Maurten Gel CAF 100. These take about 30 min to kick in but after that I started to feel a lot better. From here on, it was caffeine every 2 hrs to the finish.

From the pass at about 26 miles, it's a long downhill to around 30, where we leave the PCT again. This was another one of those sections where you would have hoped to be moving a lot faster, but in practice it was all pretty rocky and/or rutted single track so instead was a lot of hike/jogging where you'd run a bit and then have to walk to avoid some rocks, so this turned out to be a slog. At this point, Chris and I were both really hoping for the next climb to start, both so we could get it over with and because I actually find it more fun to go up in this kind of terrain because you wouldn't be running anyway, so it's not as frustrating that you can't.

Chris was also feeling altitude and had a bad headache, so when we stopped around here to filter water, he took some ibuprofen. There's been a movement away from ibuprofen in ultra, but the concern here is mostly about stress on the kidneys, and with only 20 miles to go in the cold, this didn't seem like that big a concern.

Last Two Climbs [10.21 mi, +2864/-1234 ft, 3:48:21, 22:22/mi, 18:08/mi GAP] #

We finally hit the bottom and then it was time for the last big push. As you can see from the elevation profile, it's about 2100 feet over 6.4 miles, but it's really more like 850 ft over 4.4 miles (very gentle) and 1300 ft over 2 miles (quite steep), so there was a lot of hiking up shallow slopes and waiting for the real climbing to begin.

Throughout the whole approach to the pass, we could see dark clouds gathering ahead of us. The weather reports had been for some light rain in the mid afternoon, but that was for Yosemite generally and of course any weather forecast in the mountains has to be treated with some skepticism, so we mostly just crossed our fingers and pushed on. It never really rained on us, but by this time the sun had started to go down again and we were starting to get cold again.

Dark clouds over the pass

These dudes do not look very happy.

From the first pass, it's a steep descent and then the final climb of 1000+ feet over 2 miles. We'd expected to do some of this climb on headlamp, but actually we needed light a bit earlier, towards the end of the descent. I was carrying my ridiculously bright Lupine Neo (~170 lumens but still blinding on the second step out of 4), so for a while it was just me leading the way with Chris not even needing his own light.

Climbing in the dark is nice and peaceful, even if you're also getting cold and a touch of altitude sickness, and we were still happy to hit the top of the pass, telling ourselves it was just an easy cruise in. Of course, it's really an easy cruise of around 11 miles in the dark, which isn't actually so easy.

Back to the Start [10.99 mi, +591/-3609 ft, 4:10:55, 22:50/mi, 22:08/mi GAP] #

Conceptually, this last segment comes in two pieces:

Around 4 miles back to the junction of the loop
The 7 or so miles to the start, which we'd already been on

With typical runner psychology, our thinking here was that we "just need to get to the junction" because from there it's straightforward. In practice, however, this turned out to be one of the trickiest sections because (1) it was in the dark, (2) the trail was really rocky, and (3) the trail was faint and in places it was covered in snow. We got off-trail a number of times and had to spend a long time trying to figure out where it was. The process here should be familiar:

Watch shows you're off trail
Go in some direction to see if you can find it.
Watch shows you're going in the wrong direction
Pull out the phone with the better map (optional)
Finally spot a section of rock and dirt that looks more heavily used and head towards it
Obsessively look at your watch for the next two minutes to see if you're really back on trail

This obviously takes some time and caused us to really slow down in this segment. This is also where I started to fall off my nutrition; I had been pretty religiously doing 500ml of either Maurten or Tailwind + a gel or a bar every hour, but as we got closer to the finish I started thinking I didn't need to drink as much and didn't want to stop to filter. We had been drinking so much earlier that I was still well hydrated, but I got behind on my calories a bit. Fortunately with only a few hours to go I still had some buffer.

Finally, we hit the trail junction and the last descent. To be honest, this seemed a lot easier coming up, and we had both remembered it as being quite smooth and theoretically runnable, but in practice it wasn't really that runnable, so there was a lot more hike/jogging than was really ideal. The last 3 miles or so are genuinely runnable, even on headlamp, and we did run those, especially the last 1.5 miles, which are pretty much fire road.

It wasn't all smooth sailing, though: about .5 miles out Chris rolled his ankle on a rough piece of ground/rock/whatever. He walked it off but then did it again in another 200m or so. At this point our priority was avoiding injury (he's racing JFK 50 in less than a month) so we jogged it in nice and easy, at least until we got back to the campground where—again!—we saw a bear. Two, actually, a cub and what we assumed was its mother:

Fortunately, not cocaine bears.

We did the usual thing where we gave them space and made a lot of noise and fortunately they didn't chase us. From here it's an easy jog to the finish, the car, and a five hour drive home to Palo Alto.

Retrospective #

Map of the course. From Gaia GPS

Map of the course. From Runalyze

This was rather harder than I expected. For comparison, when I did Tenaya last year I averaged 19:54/mi as opposed to 21:43/mi here. I attribute that to several factors:

We had to do a lot more of this in the dark. I did Tenaya in midsummer and it was light almost the whole way. We were at least a minute faster (20:53) for the first 35 miles and clearly slowed down a lot as it got darker.
This was much rockier. Tenaya had a lot of smooth downhill sections (e.g., the run down from Glacier Point) where you could really open up, but there's basically nothing like that here.
With two of us, the filtering takes twice as long. If you have to filter every 2 hrs (about 6 miles at this pace) and it takes 5 additional minutes to filter, then that's almost an additional minute per mile right there.

The main thing I would really change is that I wish I had brought warmer gloves, because the tips of my fingers were cold for the last 5 hrs or so. It was never so cold that I was really worried about damage, but it also wasn't pleasant. I have a lightweight pair of waterproof mittens that I wore at Desolation Wilderness, and I wished I'd brought those.

With that said, this was overall a pretty good day. This was really long but we finished strong and uninjured. I managed my nutrition well and managed to maintain about 300 cal/hr except for the last couple hours and I never bonked or felt thirsty. The altitude got to me a bit but I was able to manage it OK, even above 10000 ft. Given how I felt going into this, especially after Teanaway, I'm going to call it a success.

Overall: 51.8 mi, 9790 ft, 18:44:03, 21:43/mi

Maybe someday we'll actually be able to search the Web privately

2023-10-02T00:00:00Z

The privacy of Web search is tragically bad. For those of you who haven't thought about it, the way that search works is that your query (i.e., whatever you typed in the URL bar) is sent to the search engine, which responds with a search results page (SERP) containing the engine's results. The result is that the search engine gets to learn everything you search for. The privacy risks here should be obvious because people routinely type sensitive queries into their search engine (e.g., "what is this rash?", "Why do I sweat so much", or even "Dismemberment and the best ways to dispose of a body"), and you're really just trusting the search engine not to reveal your browsing history.

In addition to learning about your search query itself, browsers and search engines offer a feature called "search suggestions" in which the search engine tries to guess what you are looking for from the beginning of your query. The way this works is that as you start typing stuff into the search bar, the browser sends the characters typed so far to the search engine, which responds with things it thinks you might be interested in searching for. For instance, if I type the letter "f" into Firefox, this is what I get:

Everything in the red box is a search suggestion from Google. The stuff below that is from a Firefox-specific mechanism called Firefox Suggest which searches your history or, depending on your settings, might ask Mozilla's servers for suggestions. The important thing to realize here is that anything you type into the search bar might get sent to the server for autocompletion, which means that even in situations where you are obviously just typing the name of a site, as in "facebook", the search engine may still see what you typed.^[1]

Your privacy in this setting basically consists of trusting the search engine; even if the search engine has a relatively good privacy policy, this is still an uncomfortable position. Note that while Firefox and Safari (but not Chrome!) have a lot of anti-tracking features, they don't do much about this risk because they are oriented towards ad networks tracking you across sites, but all of this interaction is with a single site (e.g., Google). There are some mechanisms for protecting your privacy in this situation, primarily concealing your IP address, but they're clunky, generally not available for free, and require trusting some third party to conceal your identity.

This situation is well-known to most people who work on browsers, and to pretty much anyone who thinks about it for a minute: of course you have to send your search queries to the search engine, because if it doesn't have your query, it can't fulfill your request. But what if it could?

This is the question raised by a really cool new paper by Henzinger, Dauterman, Corrigan-Gibbs, and Zeldovich about an encrypted search system called "Tiptoe".^[2] Tiptoe promises fully private (in the sense that the server learns nothing about what you are searching for) search for the low low price of 56.9 MiB of communication and 145 core-seconds of server compute time. Let's take a look.

Background: Embeddings #

In order to understand how Tiptoe works, we need some background on what's called an embedding. The basic idea behind an embedding is that you can take a piece of content such as a document or image and convert it into a short(er) vector of numbers that preserves most of the semantic (meaningful) properties of the input. The key property here is that two similar inputs will have similar embedding vectors.

As an intuition pump, consider what would happen if we were to simply count the number of times the 500 most common English language words appear in the text. For example, look at this sentence:

I went to the store with my mother

This contains the following words from the top 500 list (the numbers in parentheses are the appearance on the list with 0 being the most common):

the(0)
to(2)
with(12)
my(41)
went(327)

We can turn this into a vector of numbers by just making a list where each entry is the number of times the corresponding word is present, so in this case it's a vector of 500 components (dimension 500), as in:

1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

That's a lot of zeroes, so let's stick to the following form which lists the words that are present:

[the(0) to(2) with(12) my(41) went(327)]
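The counting procedure above can be sketched in a few lines of Python. This is only an illustration: `RANKS` is a hypothetical stand-in for a real top-500 word list and contains just the ranks quoted in this post, and `embed` and `sparse` are names I made up for this sketch.

```python
# Hypothetical stand-in for a real top-500 word list: only the ranks
# quoted in this post are included (0 = most common word).
RANKS = {"the": 0, "to": 2, "with": 12, "have": 19, "your": 23, "my": 41,
         "been": 60, "going": 140, "am": 157, "website": 321, "went": 327,
         "create": 345}
DIM = 500

def embed(sentence):
    """Count how often each listed word appears in the sentence."""
    vec = [0] * DIM
    for word in sentence.lower().split():
        if word in RANKS:
            vec[RANKS[word]] += 1
    return vec

def sparse(vec):
    """Compact form: (index, count) pairs for the nonzero components."""
    return [(i, c) for i, c in enumerate(vec) if c]

v = embed("I went to the store with my mother")
print(sparse(v))  # [(0, 1), (2, 1), (12, 1), (41, 1), (327, 1)]
```

The dense vector `v` has 500 components, but only the five listed in the compact form are nonzero.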

Let's consider a few more sentences:

Number	Sentence	Embedding
1	I went to the store with my mother	`[the(0) to(2) with(12) my(41) went(327)]`
2	I went to the store with my sister	`[the(0) to(2) with(12) my(41) went(327)]`
3	I went to the store with your sister	`[the(0) to(2) with(12) your(23) went(327)]`
4	I am going to create the website	`[the(0) to(2) going(140) am(157) website(321) create(345)]`

As you can see, sentences 1 and 2 have exactly the same embedding, whereas sentence 3 has a similar but not identical embedding, because I went with your sister rather than with my (mother, sister). This nicely illustrates several key points about embeddings, namely that (1) similar inputs have similar embeddings and (2) that embeddings necessarily destroy some information (technical term: they are lossy). In this case, you'll notice that they have also destroyed the information about where I went with (your, my) (mother, sister, friend). By contrast, sentence (4) is a totally different sentence and has a much smaller overlap, consisting of only the two common words "the" and "to".^[3]

Once we have computed an embedding, we can easily use it to assess how similar two sentences are. One conventional procedure (and the one we'll be using for the rest of this post) is to take what's called the inner product of the two vectors, which means that you take the sum of the pairwise product of the corresponding values in each vector (i.e., we multiply component 1 in vector 1 times component 1 in vector 2, component 2 times component 2, and so on). I.e.,

$$ P = \sum_i V_1[i] * V_2[i] $$

The way this works is that we start by looking at the most common word ("the"). Each sentence has one "the", so that component is one in each vector. We multiply them to get 1. We then move on to the second most common English word (which happens to be "and"). Neither sentence has "and", so in both vectors this is a 0, and 0*0 = 0. Next we look at the third-most common word ("to"), and so on. We can draw this like so, for the inner product of S1 and S2.

$$ \begin{matrix} the \\ and \\ to \\ ... \\ with \\ ... \\ my \\ .... \\ went \\ \end{matrix} \begin{bmatrix} 1 \\ 0 \\ 1 \\ ... \\ 1 \\ ... \\ 1 \\ ... \\ 1 \\ \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ 1 \\ ... \\ 1 \\ ... \\ 1 \\ ... \\ 1 \\ \end{bmatrix} = (1 + 1 + 1 + 1 + 1) = 5 $$

By contrast, if we take S1 and S3 we get:

$$ \begin{matrix} the \\ and \\ to \\ ... \\ with \\ ... \\ your \\ ... \\ my \\ .... \\ went \\ \end{matrix} \begin{bmatrix} 1 \\ 0 \\ 1 \\ ... \\ 1 \\ ... \\ 0 \\ ... \\ 1 \\ ... \\ 1 \\ \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ 1 \\ ... \\ 1 \\ ... \\ 1 \\ ... \\ 0 \\ ... \\ 1 \\ \end{bmatrix} = (1 + 1 + 1 + 0 + 0 + 1) = 4 $$

This value is lower because one sentence has "your" and the other has "my" but neither has both "your" and "my". Finally, if we take S1 and S4, we get:

$$ \begin{matrix} the \\ and \\ to \\ ... \\ with \\ ... \\ my \\ .... \\ went \\ \end{matrix} \begin{bmatrix} 1 \\ 0 \\ 1 \\ ... \\ 1 \\ ... \\ 1 \\ ... \\ 1 \\ \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 0 \\ 1 \\ ... \\ 0 \\ ... \\ 0 \\ ... \\ 0 \\ \end{bmatrix} = (1 + 1 + 0 + 0 + 0) = 2 $$

What you should be noticing here is that the more similar (the more words they have in common) the embedding vectors are, the higher the inner product. The conventional interpretation is that each embedding vector represents a d-dimensional vector where d is the number of components and that the smaller the angle between the two vectors (the more they point in the same direction) the more similar they are. Conveniently, for vectors scaled to length 1, the inner product is equal to the cosine of the angle, which is 1 when the angle is 0 and 0 when the angle is 90 degrees, and so can be used as a measure of vector similarity. Personally, I don't think well in hundreds of dimensions so I've never found this interpretation as helpful as one might like, but maybe you will find it more intuitive, and it's good to know anyway.
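To make the arithmetic concrete, here is a small Python sketch that computes the inner products for the example sentences. The word ranks are the ones listed in the table of sentences; `to_vector` and `inner_product` are helper names I'm assuming for illustration.

```python
def inner_product(v1, v2):
    """Sum of the pairwise products of corresponding components."""
    return sum(a * b for a, b in zip(v1, v2))

def to_vector(indices, dim=500):
    """Build a count vector with a 1 at each listed word rank."""
    vec = [0] * dim
    for i in indices:
        vec[i] = 1
    return vec

# Word ranks taken from the table of example sentences.
S1 = to_vector([0, 2, 12, 41, 327])         # ... with my mother
S2 = to_vector([0, 2, 12, 41, 327])         # ... with my sister
S3 = to_vector([0, 2, 12, 23, 327])         # ... with your sister
S4 = to_vector([0, 2, 140, 157, 321, 345])  # I am going to create the website

print(inner_product(S1, S2))  # 5
print(inner_product(S1, S3))  # 4
print(inner_product(S1, S4))  # 2 (only "the" and "to" are shared)
```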

Normalization #

I've cheated a little bit in the way I constructed these sentences, because using this definition sentences which have more of the common English words (e.g., longer sentences) will tend to look more similar than those which do not. For instance, if instead I had used the sentences:

S5: I have been to the store with my sister

S6: I have been to the store with your sister

Number	Sentence	Embedding
2	I went to the store with my sister	`[the(0) to(2) with(12) my(41) went(327)]`
5	I have been to the store with my sister	`[the(0) to(2) with(12) have(19) my(41) been(60)]`
6	I have been to the store with your sister	`[the(0) to(2) with(12) have(19) your(23) been(60)]`

You'll notice that sentences 2 and 5 have four words in common (the, to, with, my), whereas 5 and 6 have five words in common (the, to, with, have, been), even though they (at least arguably) have quite a different meaning (who I went to the store with) rather than just differing in grammatical tense (have been versus went).

The standard way to fix this is to normalize the vectors so that the larger the values of components in aggregate, the less the value of each individual component matters. For mathematical reasons, this is done by setting the magnitude of the vector (the square root of the sum of the squares of each component) to 1, which you can do by dividing each component by the magnitude. When we do this, we get the following result:

Sentence Pair	Un-normalized Inner Product	Normalized Inner Product
S2 and S5	4	0.73
S2 and S6	3	0.55

This matches our intuition that sentences 2 and 5 are more similar than sentences 2 and 6.
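The normalization step is easy to check numerically. The following is a minimal sketch, again using the word ranks from the table; `normalize`, `inner_product`, and `to_vector` are illustrative names, not part of any library.

```python
import math

def inner_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def normalize(vec):
    """Divide each component by the magnitude so the result has length 1."""
    mag = math.sqrt(sum(x * x for x in vec))
    return [x / mag for x in vec]

def to_vector(indices, dim=500):
    vec = [0] * dim
    for i in indices:
        vec[i] = 1
    return vec

S2 = to_vector([0, 2, 12, 41, 327])      # the, to, with, my, went
S5 = to_vector([0, 2, 12, 19, 41, 60])   # the, to, with, have, my, been
S6 = to_vector([0, 2, 12, 19, 23, 60])   # the, to, with, have, your, been

for name, a, b in [("S2 and S5", S2, S5), ("S2 and S6", S2, S6)]:
    raw = inner_product(a, b)
    norm = inner_product(normalize(a), normalize(b))
    print(f"{name}: raw={raw} normalized={norm:.2f}")
# S2 and S5: raw=4 normalized=0.73
# S2 and S6: raw=3 normalized=0.55
```

With 0/1 vectors like these, the normalized value is just the number of shared words divided by the square root of the product of the two word counts, e.g., 4/√(5·6) ≈ 0.73.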

Real-world Embeddings #

Obviously, I'm massively oversimplifying here and in the real world an embedding would be a lot fancier than just counting common words. Typically embeddings are computed using some fancier algorithm like Word2vec, which itself might use a neural network. However, the cool thing here is that however you compute the embedding, you can still compute the similarity of two embeddings in the same way, which means that you can just build a system that depends on having some embedding mechanism and then work out that embedding separately. This is very convenient for a system like Tiptoe where we can just assume there is an embedding and work out cryptography that will work generically for any embedding.

Tiptoe #

With this background in mind, we are ready to take a look at Tiptoe.

Naive Embedding Based Search #

Let's start by looking at how you could use embeddings to build a search engine. The basic intuition here is simple. You have a corpus of documents (e.g., Web pages) $D_1, D_2 ... D_n$. For each document, you compute a corresponding embedding for the document $Embed(D_1), Embed(D_2), ... Embed(D_n)$. When the user sends in their search query $Q$ you compute $Embed(Q)$ and return the document(s) that are closest to $Embed(Q)$, which is to say have the highest inner products.^[4] Naively, you just compute the inner product of the embedded query against every document embedding and then take the top values, though of course there are more efficient algorithms.

The figure below shows a trivial example. In this case, the client's embedded query is most similar to $Embed(D_4)$, and so the server sends $D_4$ (or, in the case of search, its URL) in response.

This is actually a very simplified version of how modern systems such as ColBERT work.
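As a rough sketch of the naive approach in Python: the three-dimensional embeddings below are made up for illustration (real embeddings have hundreds of dimensions), and `search` is a hypothetical helper that scores every document and returns the best matches.

```python
def inner_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def search(query_embedding, doc_embeddings, top_k=1):
    """Score every document against the query and return the best indices."""
    scores = [inner_product(query_embedding, d) for d in doc_embeddings]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores

# Made-up unit-length embeddings standing in for Embed(D_1) ... Embed(D_5).
docs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.6, 0.8],
]
query = [0.7, 0.7, 0.1]  # stand-in for Embed(Q)

best, scores = search(query, docs)
print(best[0] + 1)  # 4, i.e., D_4 is the closest document
```

Note that the server here sees `query` in the clear, which is exactly the problem.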

Of course the problem with this system is the same as the problem we started with, because you have to send your query to the server so it can compute the embedding. There are two obvious ways to address this:

Compute the embedding on the client and send it to the server.
Send the entire database to the client

The first of these doesn't work because the embedding contains lots of information about the query (otherwise the search engine couldn't do its job). The second doesn't work because the embedding database is far too big to send to the client. What we need is a way to do this same computation on the server without sending the client's cleartext query or its embedding to the server.

Naive Tiptoe: Inner Products with Homomorphic Encryption #

Tiptoe addresses this problem by splitting it up into two pieces. First, the client uses a homomorphic encryption system to get the server to compute the inner product for each document without allowing the server to see the query.

The client then ranks the results by their inner products, which gives it a list of the results that are most relevant (e.g., results 1, 3, 9). The indices themselves aren't useful: the client needs the URL for each result, so it uses a Private Information Retrieval (PIR) scheme to retrieve the URLs associated with the top results from the server.

The reason for this two-stage design is that the URLs themselves are fairly large, and so having the server provide the URL for each result is inefficient, as most of the results will be ranked low and so the user will never see them. The server can also embed the type of preview metainformation that typically appears on the SERP (e.g., a text snippet) if it wants to, but because PIR is expensive, you want the results to be as small as possible. Once the client has the URLs, it can just go directly to whichever site the user selects.

I already explained PIR in a previous post, so this post will just focus on the ranking system. This system uses some similar concepts to PIR, so you may also want to go review that post. You may recall from that post that a homomorphic encryption scheme is one in which you can operate on encrypted data. Specifically, if you have two plaintext messages $M_1$ and $M_2$ and their corresponding ciphertexts $E(M_1)$ and $E(M_2)$ then the encryption is homomorphic with respect to a function $F$ if

$$ F(E(M_1), E(M_2)) = E(F(M_1, M_2)) $$

So, for instance, if you were to have an encryption function which is homomorphic with respect to addition, that would mean you could add up the ciphertexts and the result would be the encryption of the sum of the plaintexts. I.e.,

$$ E(A) + E(B) = E(A + B) $$

Homomorphic encryption allows you to give some encrypted values to another party, have it operate on them and give you the result, and then you can decrypt it to get the same result as if they had just operated on the plaintext values, but without them learning anything about the values they are operating on.
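To see additive homomorphism in action, here is a toy Python sketch using the Paillier cryptosystem. This is my choice for illustration only and is not the scheme Tiptoe uses; the primes are far too small to be secure, and the function names are assumptions. In Paillier the "+" on ciphertexts is actually implemented as multiplication modulo N².

```python
import math
import random

# Toy Paillier parameters: these primes are far too small to be secure.
P, Q = 1009, 1013
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)  # private key
MU = pow(PHI, -1, N)     # PHI^-1 mod N (needs Python 3.8+)

def encrypt(m):
    """E(m) = (1 + N)^m * r^N mod N^2 for a fresh random r."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2

def decrypt(c):
    """Recover m from c using the private values PHI and MU."""
    return ((pow(c, PHI, N2) - 1) // N * MU) % N

def add(c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % N2

a, b = 20, 22
print(decrypt(add(encrypt(a), encrypt(b))))  # 42
```

The party calling `add` only ever handles ciphertexts; it needs neither `PHI` nor `MU`.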

We can apply homomorphic encryption to this problem as follows. First, the client computes the embedding of the query, giving it an embedding vector $V$ whose $i$th element is $V_i$. The client then encrypts each element of $V$ with a homomorphic encryption system. Call this $E(V)$ and each element $E(V_i)$. The client sends $E(V)$ to the server.

The server iterates over each URL $U_j$ and its corresponding embedding value $D_j$ and computes the inner product of $D_j$ and $E(V)$. Specifically, for each element $i$, it computes the pairwise product $I_{j, i}$:

$$ E(I_{j,i}) = D_{j,i} * E(V_i) $$

It then sums up all these values, to get the encrypted inner product for URL $j$.

$$ E(I_j) = \sum_i E(I_{j,i}) = \begin{matrix} E(V_1 * D_{j,1}) \\ + \\ E(V_2 * D_{j,2}) \\ + \\ E(V_3 * D_{j,3}) \\ + \\ E(V_4 * D_{j,4}) \\ + \\ E(V_5 * D_{j,5}) \end{matrix} $$

Written in pseudo-matrix notation, we get:

$$ \begin{bmatrix} E(V_1) \\ E(V_2) \\ E(V_3) \\ E(V_4) \\ E(V_5) \\ \end{bmatrix} \cdot \begin{bmatrix} D_{j,1} \\ D_{j,2} \\ D_{j,3} \\ D_{j,4} \\ D_{j,5} \\ \end{bmatrix} \rightarrow \begin{bmatrix} E(V_1 * D_{j,1}) \\ E(V_2 * D_{j,2}) \\ E(V_3 * D_{j,3}) \\ E(V_4 * D_{j,4}) \\ E(V_5 * D_{j,5}) \\ \end{bmatrix} \rightarrow \sum_i E(V_i * D_{j,i}) $$

The server then sends back the encrypted inner product values to the client (one per document in the corpus). The client decrypts them to recover the inner product values (again, one per document). It can then just pick the highest ones which are the best matches and retrieve their URLs via PIR (effectively, "give me the URLs for documents 1, 3, 9", etc.). It then dereferences the URLs as normal. Because this is all done under encryption, the server never learns your search query, the matching documents, or the URLs you eventually decide to dereference (though of course those servers see when you visit them). Importantly, these guarantees are cryptographic, so you don't have to trust the server or anyone else not to cheat. This is different from proxying systems, where the proxy and the server can collude to link up your searches and your identity.
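The whole exchange can be sketched end to end in Python. As a loud caveat, this uses toy Paillier encryption as a stand-in (not Tiptoe's actual scheme), insecurely small parameters, and small non-negative integer embeddings where a real system would quantize floating-point values. `server_inner_product` is a hypothetical name. The PIR step for fetching URLs is omitted.

```python
import math
import random

# Toy Paillier (additively homomorphic); far too small to be secure.
P, Q = 1009, 1013
N, PHI = P * Q, (P - 1) * (Q - 1)
N2, MU = N * N, pow(PHI, -1, N)

def encrypt(m):
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2

def decrypt(c):
    return ((pow(c, PHI, N2) - 1) // N * MU) % N

def server_inner_product(enc_query, doc):
    """Compute E(sum_i D_{j,i} * V_i) without ever decrypting the query."""
    acc = 1  # 1 is a valid (unrandomized) encryption of 0
    for enc_v, d in zip(enc_query, doc):
        term = pow(enc_v, d, N2)  # E(V_i)^d = E(d * V_i)
        acc = (acc * term) % N2   # multiplying ciphertexts adds plaintexts
    return acc

# Client: encrypt each element of the query embedding.
query = [3, 0, 1, 2, 0]
enc_query = [encrypt(v) for v in query]

# Server: plaintext document embeddings, one row per URL.
docs = [[1, 1, 0, 0, 2], [2, 0, 1, 3, 0], [0, 4, 0, 1, 1]]
enc_scores = [server_inner_product(enc_query, d) for d in docs]

# Client: decrypt the scores and pick the best match.
scores = [decrypt(c) for c in enc_scores]
print(scores)                                          # [3, 13, 2]
print(max(range(len(docs)), key=scores.__getitem__))   # 1
```

The server performs one multiplication per embedding element per document, and returns one ciphertext per document, which is what the rest of the post tries to improve on.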

Ciphertext Size Matters #

For instance:

If each value in the embedding vector is a 32-bit floating point number and the embedding vector has dimension 700ish, then the embedding values for each document are around 2800 bytes.
If we naively use ElGamal encryption, then each ciphertext will be around 64 bytes (512 bits).

This is an improvement of a factor of 40 or so, but at the cost of doing $N$ encryption operations, which is quite a lot.

Clustering #

Let's take stock of where we are now. The client sends a relatively short value, consisting of $T$ ciphertexts where $T$ is the number of elements in the embedding vector. The server responds with $N$ ciphertexts, where $N$ is the number of URLs in its corpus and has to do $T*N$ multiplications. Depending on the homomorphic encryption algorithm, this might or might not be an improvement on the total communication bandwidth, but it's still linear in the number of documents, which is quite bad.

It's not really possible to reduce the number of operations on the server below linear. The reason for this is that the server needs to operate on the embedding for each document; otherwise the server could determine which embeddings the client isn't interested in by which ones it doesn't have to look at in order to satisfy the client's query. However, it is possible to significantly reduce the amount of bandwidth consumed by the server's response.

The trick here is that the server breaks up the corpus of documents into clusters of approximately $\sqrt N$ size (hence there are approximately $\sqrt N$ clusters). These clusters are arranged so that they have nearby embedding vectors, and hence the documents are theoretically similar. The server publishes the embedding vector for the center of each cluster, and this allows the client to request only the inner products for the closest cluster. This reduces the amount of data that the server sends by a factor of $\sqrt N$ to order $\sqrt N$. There's just one problem: if the client only queries one cluster, then doesn't the server know which cluster the client is interested in?

We fix this by having the client send a separate encrypted query for each cluster, like so:

$$ \begin{bmatrix} E(0) & \color{red}{E(V_1)} & E(0) \\ E(0) & \color{red}{E(V_2)} & E(0) \\ E(0) & \color{red}{E(V_3)} & E(0) \\ E(0) & \color{red}{E(V_4)} & E(0) \\ E(0) & \color{red}{E(V_5)} & E(0) \\ \end{bmatrix} $$

In this diagram, each column represents one cluster (and hence there are $\sqrt N$ columns), and each row is a different embedding component. The column corresponding to the cluster (column $q$) of interest (in red) contains the encryption of the client's actual query embedding vector, whereas the rest of the columns just contain the encryption of 0 (the encryption is randomized so that they are not readily identifiable).

The server takes each column of the client's query and computes the inner product for each document in the corresponding cluster, as before. I.e., for document $j$ in cluster $c$, it computes $E(I_{c, j})$. Then, however, the server adds up the inner product values across the clusters, with one report for the sum of the inner products for the 1st URL in each cluster, one for the sum of the inner products for the 2nd URL in each cluster, and so on, so that the server still only returns the same number of ciphertexts as before. I.e., it reports:

$$ E(I_j) = \sum_c E(I_{c, j}) $$

Ordinarily the sum of these would be useless, but the trick^[5] here is that because the other columns (corresponding to the clusters that are not of interest) are the encryption of 0, their inner products are also zero, which means that the result sent back to the client only includes the inner products for the column of interest (column $q$).
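The bookkeeping is easier to see with the encryption stripped away. The sketch below works on plain integers as a stand-in for ciphertexts (so it hides nothing at all); the cluster layout, the toy vectors, and the function names are made up for illustration. It just checks that summing the per-position inner products across clusters gives back exactly the inner products for the cluster of interest.

```python
def build_query(embedding, num_clusters, q):
    """Client: one column per cluster; only column q holds the real embedding.

    In the real scheme every entry would be encrypted, so the all-zero
    columns would not be distinguishable from the real one.
    """
    zero = [0] * len(embedding)
    return [list(embedding) if c == q else list(zero) for c in range(num_clusters)]


def inner(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def server_respond(query_columns, clusters):
    """Server: I_j = sum over clusters c of <column c, j-th document of cluster c>."""
    size = max(len(docs) for docs in clusters)
    out = [0] * size
    for column, docs in zip(query_columns, clusters):
        for j, doc in enumerate(docs):
            out[j] += inner(column, doc)
    return out


# Hypothetical corpus: 3 clusters of 3 documents, each with a 3-element embedding.
clusters = [
    [[1, 0, 2], [0, 3, 1], [2, 2, 2]],
    [[4, 1, 0], [1, 1, 5], [0, 2, 3]],
    [[3, 3, 0], [1, 0, 1], [2, 5, 1]],
]
embedding = [2, 1, 3]
q = 1  # the cluster the client actually cares about

response = server_respond(build_query(embedding, len(clusters), q), clusters)
direct = [inner(embedding, doc) for doc in clusters[q]]
```

Here `response` and `direct` come out the same, even though the server did the same amount of work on every cluster and returned only one value per position.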

The resulting scheme has much better communication overhead:

The server sends the list of the centers of the embeddings ($\sqrt N$).
The client sends a list of $d$ encrypted components for each cluster ($d \sqrt N$).
The server sends a single encrypted inner product value for each document in the cluster ($\sqrt N$).

This is dramatically better than the naive scheme in which the client sends $d$ values and the server sends $N$, although at the cost of pushing some of the transmission cost onto the client, for a total transmission that scales as $(d+2)\sqrt N$. Of course, that's still pretty big and the constant factor is also pretty big (~512 bits per document for ElGamal). The Tiptoe paper uses some clever tricks to bring the size down some, but the end result is still fairly large (see the cost numbers below).
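To get a feel for the scaling, the following sketch just plugs numbers into the two formulas above. It counts values exchanged rather than bytes, and the dimension (700) and corpus size (360 million, roughly Common Crawl) are borrowed from figures mentioned elsewhere in this post; the function names are made up and this is not a model of Tiptoe's actual wire format.

```python
import math


def naive_items(d: int, n: int) -> int:
    """Naive scheme: client sends d ciphertexts, server sends n."""
    return d + n


def clustered_items(d: int, n: int) -> int:
    """Clustered scheme with about sqrt(n) clusters of about sqrt(n) documents.

    sqrt(n) cluster centers + d * sqrt(n) query ciphertexts
    + sqrt(n) response ciphertexts = (d + 2) * sqrt(n).
    """
    root = math.isqrt(n)
    return (d + 2) * root


d, n = 700, 360_000_000
naive = naive_items(d, n)
clustered = clustered_items(d, n)
ratio = naive / clustered
```

With these inputs the clustered total is a bit over 13 million values versus about 360 million for the naive scheme, but almost all of the clustered traffic is the client's upload rather than the server's response.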

Performance #

As should be clear from the previous section it's possible to build privacy-preserving search, but how well does it actually do? This actually comes down to two questions:

How good are the answers?
How much does it cost?

Accuracy #

First, let's take a look at accuracy. Obviously, a private search mechanism will be no better than a non private search mechanism, because if it were you could just remove the privacy pieces and get the same accuracy. However, realistically we should expect worse accuracy, just on general principle (i.e., we are hiding information from the server). In this specific case we should expect worse accuracy because the server is just operating on the (encrypted) embedding of the query, rather than the whole query, and computing the embedding destroys some information.

The metric the authors use for performance is something called "MRR@100", which stands for "mean reciprocal rank at 100". The way this works is that for each query you determine which result people would have ranked at number 1 and then ask what position the search algorithm returned it in. You then compute a score that is the inverse of that position, so, for instance, if the document were found in position 5, then the score would be $1/5$. The "mean" part is that you average out the results over the document corpus. The "at 100" part is that if the search algorithm doesn't return the result in the top 100 values, you get a score of zero. In other words:

$$ MRR = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} \frac{1}{Rank_i} & \text{if } Rank_i \leq 100 \\ 0 & \text{otherwise} \end{cases} $$

Note that this score really rewards getting the top result, because even getting it in second place only gets you a per-document score of $1/2$.
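As a quick sanity check on the definition, here is a small sketch of the computation. The function name and the sample ranks are made up; `None` is my own convention for "the best result wasn't returned at all".

```python
def mrr_at_k(ranks, k=100):
    """Mean reciprocal rank with a cutoff at k.

    ranks[i] is the 1-based position at which the best result for item i
    was returned, or None if it was not returned at all.
    """
    if not ranks:
        raise ValueError("need at least one rank")
    total = sum(1.0 / r for r in ranks if r is not None and r <= k)
    return total / len(ranks)


# Hypothetical results: found 1st, 5th, 2nd, 150th, and not found.
example = mrr_at_k([1, 5, 2, 150, None])
```

The last two entries both contribute zero, so the example averages $1 + 1/5 + 1/2$ over five entries, for a score of about 0.34.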

The results look like this:

Source: Tiptoe paper.

The graph on the left provides MRR@100 comparisons to a number of algorithms, including:

A modern search algorithm (ColBERT)
Two somewhat older systems (BM25 and tf-idf)

As you can see, ColBERT performs the best and Tiptoe gets pretty close to tf-idf but is still significantly worse than BM25. The "embeddings" line is the result if you don't use the clustering trick described above. Notice here that "embeddings" does very well, and in fact is better than BM25, so the clustering really does have quite a significant impact on the quality of the results.

The graph on the right shows the cumulative probability that the best result will be found at an index less than $i$ (i.e., that it is found in the top $i$ results). The dotted line shows the chance that the best result is in the cluster Tiptoe receives at all (about 1/3 of the time), which is the best Tiptoe could do even if it always picked the best result out of the cluster.

On the one hand, this is a fairly large regression from the state of the art, but on the other hand, it means that there is a lot of room for improvement just by improving the clustering algorithm on the server. Obviously, there's also room for improvement in terms of ranking within the cluster. With the current design the client just gets the inner product so all it can do is rank them, but there might be some things you could do, such as proactively retrieving the first 10 documents or so (there is a very steep improvement curve within the first 10) and running some local ranking algorithm on their content.

Cost #

So how much will all this cost? The answer is "quite a bit but not as much as you would think". Here's Figure 8, which shows the estimated cost of Tiptoe for various document sizes:

Source: Tiptoe paper.

The server CPU cost is linear in the number of documents in the corpus and would require around 1500 core seconds for something like Google.

The communication cost is sublinear in the number of documents but has a very high fixed cost of around 55 MiB for a query on a corpus the size of the Common Crawl data set (~360 million documents) and around 125 MiB for a Google-sized system (~8 billion documents). Tiptoe uses a number of tricks to frontload this cost; most of the communication isn't dependent on the query, so the client and server can exchange it in advance without it being in the critical path. The server also has to send the client the embedding algorithm, which can be quite large (e.g., 200+ MiB), but that is reused for multiple queries and so can be amortized out.

Using Amazon's list price costs, the overall cost is around 0.10 USD/query for a system the size of Google. Google doesn't publish their numbers but 9to5Google estimates it at .002 USD/query. This is 50 times less, which is a big difference, but actually that probably overstates the difference because Google isn't paying list price for their compute costs, so the difference is probably quite a bit less. In either case, this is actually a smaller difference than you would expect given the enormous improvement in privacy.

The Bigger Picture #

The lesson you should be taking home here is not that Tiptoe is going to replace Google search tomorrow. Not only are private search techniques like this more expensive than non-private techniques, they are inherently less flexible. Google's SERP is a lot more than just a list of results. For instance here's the top of the search page for "tiptoe":

Note the dictionary definition as the first entry, the info-box on the right, and the alternate questions. The first website result is below all that. Obviously, one could imagine enhancing a system like Tiptoe to provide at least some of these features, though at yet more cost.

There are two stories here that are true at the same time. The first is about technical capabilities: in most cases, private systems are inherently less flexible and powerful than their non-private counterparts. It's always easier to just tell the server everything and let it sort out what to do, both because the server can just unilaterally add new features without any help from the client and because it's often difficult to figure out how to provide a feature privately (just look at all the fancy cryptography that's required to provide a simple list of prioritized URLs). This will almost always be true, with the only real exception being cases where the data is so sensitive that it's simply unacceptable to send it to the server at all, and so private mechanisms are the only way to go. However, I think the lesson of the past 20 years is that people are actually quite willing to tell their deepest secrets to some computer, so those cases are quite rare.

The other story is about path dependence. Google search didn't get this fancy all at once; the original search page was much simpler (basically a list of URLs with a snippet from the page) and features were added over time. If we imagine a world in which privacy had been prioritized right from the start, then we would have a much richer private search ecosystem, though most likely not as powerful as the one we have now. The entry barrier to increased data collection for slightly better features would most likely be a lot higher than it is today. But because we started out with a design that wasn't private, it led us naturally to where we are today, where every keystroke you type in the URL/search bar just gets fed to the search provider.

I'm not under any illusions that it will be easy to reverse course here: even in the much simpler situation of protecting your Web traffic in transit, it's taken decades to get out from under the weight of the early decisions to do almost everything in the clear and we're still not completely done. Moreover, that was a situation where we had the technology to do it for a long time, and it was just a matter of deployment and cost. However, the first step to actually changing things is knowing how to do it, and so it's really exciting to see people taking up the challenge.

Acknowledgement #

Thanks to Henry Corrigan-Gibbs for assistance with this post. All mistakes are of course mine.

Firefox, at least, does make some attempt to omit pure navigational queries, so if you type "http://" in the Firefox search box, this gets sent to the server, but "http://f" does not. ↩︎
Disclosure: this work was partially funded by a grant from Mozilla, in a program operated by my department. ↩︎
In a real-world example, one might well prune out these common not-very-meaningful words. ↩︎
Note that you might use a different algorithm to compute the embeddings on the documents as on the queries, for instance if you are doing text search over images. For the purposes of this post, however, this is not important. ↩︎
Note that this is basically the same trick that PIR schemes use. ↩︎

Desolation Wilderness Seven^H^H^H^H^HTwo Summits

2023-09-05T00:00:00Z

My two races this season were to be Broken Arrow Skyrace and then a hundred to be named later. I'd originally planned to do Whistler Alpine Meadows 100 but then it was cancelled in February and I spent a long time procrastinating but finally settled on Teanaway Country 100. Teanaway is about the opposite of UTMB: a tiny low-key race (59 entrants so far), but with pretty similar topline stats, with 32000 feet over 100 miles.

I've had several solid training blocks this year, but I wanted to try to get in one more adventure run this summer. Unfortunately, due to last winter's ridiculous snow season, most of the routes I was interested in doing in the Sierra were snowed in in midsummer, so I didn't start looking seriously till a few weeks ago, eventually deciding to take a crack at the Desolation Wilderness Seven Summits Loop, which I first saw on Leor Pantilat's fantastic site. As the name suggests, this route covers the seven named summits in the Desolation Wilderness. Technically speaking, the fastest known time for this just requires you to hit all the peaks, however you get there, but there's a common loop linked above. The loop is 29 miles long with 10000+ ft of climbing including a fair amount of off-trail terrain, so I figured it would be a nice scaled down warmup for Teanaway. I'd actually intended to do a slightly longer variant of about 40 miles/15kft the week of August 13, but then I got sick and so had to defer to last weekend, and with only two weeks to Teanaway, decided to stick to the normal version.

Logistics #

This loop starts at a parking lot off US 50 en route to Tahoe a bit East of Kyburz. I stayed at the Sierra Inn On the River, which is conveniently situated about 15 minutes away. I was planning to start at about 5:30-6 AM (sunrise is at about 6:40), so I was able to sleep in till 4:30 and then drive over.

My stuff laid out for the next day. I ended up not bringing the remote control.

Desolation Wilderness requires permits which are self-issued at the trailhead—overnight stays require a separate permit—but even though the parking lot is at an official trailhead, I was unpleasantly surprised to see that there wasn't any kind of kiosk either at this trailhead or on the trailhead on the other side of the highway. This actually isn't the trailhead you enter the Wilderness from; instead you run down the highway for a few miles, so I figured I'd just head out and hope there was a kiosk at the other trailhead.

Start to Trailhead [3.3 mi, +249ft/-774ft] #

The first two miles or so is downhill on 50, and even though it was starting to get lighter, I did this on headlamp (Petzl Actik Core), both to make sure of my own footing and for visibility. I took this pretty easy at 8:15/mile so I could warm up.

There was no bathroom at the start, and predictably I'd only run a mile or so before I really needed to go. Fortunately, the Pyramid Creek trailhead is right along the highway and has flush toilets. They also have a pay parking lot but still no place to issue your own permit. I walked to the start of the trailhead and found a sign saying that there was permit issuance at the Wilderness boundary about a quarter mile up, so I went down the trail a bit hoping to find it, but despite going past the sign for the boundary and up to the top of a little ridge, I never found it and just gave up and headed back down the road. I did manage to lose my sunglasses, though, not, as it turned out, that I needed them.

Pyramid Peak Trail 10.33 [7.03 mi, +4262ft/-4196ft] #

This route involves a climb to the top of Pyramid Peak followed by a bunch of traversing of the high country, tagging the rest of the peaks, and then a descent to the bottom. The Pyramid Peak Trail doesn't actually have a real official trailhead, so much as a small parking area across the highway from a cut-out in the embankment. There were two cars there already and apparently it gets full later, but as I was on foot, it wasn't a problem for me.

The first summit is a monster climb right from the start, ascending almost 4000 feet in 3.3 miles. I didn't even bother to try to run any of it, but just pulled out my poles and started hiking. This is kind of an unofficial trail and isn't really marked but is in OK shape and so I was mostly just able to follow the tread pattern, occasionally checking the GPS to make sure I was on the right route.

The footing is pretty reasonable but it's still slow going because it's so steep. It also was starting to get windy so I decided to throw on my rain jacket. I have the Inov-8 Raceshell half-zip and I bought a size up with the idea that I could put it on over my pack so that you can get it on and off quickly, but this works a lot better in theory than practice, as it's a pullover and gets caught on the bulge of the pack, so I fought with it for a few minutes and then finally just took my pack off. The jacket is comfortable and breathes well, though.

Eventually, the trail just kind of ends and you get to the final 500ft or so of climb, which are just one giant talus pyramid. I forgot to take a photo here, but this shot gives the idea:

[Source: Charles Jenkins]

There didn't seem to be any obvious trail up to the top, so I just started to scramble up. As I was doing so, I saw what looked like a runner at the top starting to come down and then I ran into two hikers. They told me that it was really windy at the top (it was already quite windy where I was) and that it was safer to stay towards the right (the way they had come down). I followed their advice and sure enough it started to get quite bad to the point where I wasn't that comfortable just standing up and had to use my hands more than usual. This last 500 feet of climbing and maybe a half mile probably took me like 30+ minutes and I almost turned back once because it was so sketchy.

I finally made it to the top and found somewhere that was a little sheltered and managed to take some photos. I didn't really want to stand too much on the rock ledges surrounding the hollows people had opened up at the top (presumably for shelter), and it wasn't really that clear, but there are still some great views.

This last one really lets you see the rock slope you have to descend. Sketchy!

At this point, you're supposed to head down the back side of Pyramid Peak and head offtrail to Agassiz Peak, but when I looked down it was pretty unclear where the trail was and I really wasn't thrilled about the idea of being exposed to that much wind for the next 10 or so miles, so I made the (in retrospect correct) decision to turn back.

As is commonly the case, coming down that rockpile was actually worse than going up: you've got gravity trying to pull you down and because you're facing forward, you can't really use your hands, so I slipped and fell on my ass a bunch of times. Because I was trying to stay out of the wind I veered way off course and ended up kind of skirting the edge of the peak and then had to bushwhack my way back to the trail. From there it was a pretty straightforward descent to the bottom and I was able to run a fair bit of it.

Back to the Car 12.24 [1.91 mi, +443ft/-39ft] #

From the bottom, I needed to climb another 500 feet or so on 50 to get back to the car, which gave me some time to regroup. At this point I was about 11 miles (though 4000+ ft) and 5 hrs in, so I had plenty of time and even though the whole route was out of the question it seemed silly to drive all the way here for what was basically a medium long run. I decided the right thing to do was to head up the trail in the opposite direction to Ralston Peak. By this point I had gone through most of my fluid, so I stopped off at the Pyramid Creek parking lot to use the bathroom and refill my bottles (I didn't have extra water in my car). From there, it's an easy run back to the car.

One nice thing about doing the route this way is that your car is a sort of impromptu aid station, so I decided to change my shoes. I do most of my running in Salomon Sense Ride 5s, but I started the day in a pair of Salomon S/LAB Ultra 3s (what I used for UTMB). I like Ultra 3s, but when I put them on for the first time in months on Friday morning I didn't feel like they were giving me quite as much support as I wanted, and I was kind of disappointed in the traction I was getting on the loose rock, so I decided to swap them for the Sense Rides, in part so I could compare them back to back on similar terrain.

I was also starting to get a bit of a hot spot on my right heel, and sure enough when I took my sock off, I had a blister that had formed and popped. There's only one thing you can really do at that point, which is to tape it up, and fortunately I had some strips of kinesio tape, so I slapped one on, carefully pulled my sock back over it so it didn't peel off, and put the Sense Rides on.

By this time it had really started to rain so I swapped out my wind pants (warmish but not waterproof) for a pair of Raidlight rain pants (the old version of these). I also grabbed my waterproof mittens which go on nicely over my regular gloves. With that, I was ready to head up to Ralston Peak.

Ralston I [6.86 mi +3159ft/-2943ft] #

The Ralston climb is pretty straightforward: 2700ish feet up over a bit more than three miles. It starts out as fire road but you quickly come to a single track trail marking the wilderness boundary, where I also found a kiosk for you to register for a permit (finally). I took a moment to do that and headed up.

The climb to Ralston is a lot easier than Pyramid. The footing is about the same, except for the top, but it's only about 900 feet per mile rather than 1300, and that makes a big difference. Of course, that's in equivalent conditions and by now it was really starting to rain and I was getting pretty cold. Starting from the bottom when I was in a rain jacket alone, I gradually ended up in glove liners, rain gloves, and rain pants, and I would have put on my arm warmers too but I wasn't able to get them on under my rain jacket (because of the cuffs) and wasn't willing to take the jacket off in order to put them on.

Partway up Ralston right after I put my pants on. Not quite above the treeline

The trail situation is a little confusing as there is a spur trail to the top but also a trail that bypasses the peak, and it appears that when Leor Pantilat did this he actually went cross-country. I opted for the spur trail, which is still pretty passable, with only a bit of climbing over rocks at the very end.

Even with all this stuff on, and working hard, I was starting to get cold as I got near the top and it got windier. A lot windier, though not as windy as Pyramid. I don't have any pictures from the summit however, or rather, I have this:

Me on the summit of Ralston Peak. You can get a sense of the wind in this clip.

This isn't really white out conditions in that you can see around you just fine at least to see the trail in front of you, etc; it's just that I'm at the top of a mountain and so everything you would otherwise be able to see is miles away and visibility is a lot less than that.

The run down is pretty easy: it's steep but good footing and as soon as you got off the peak there was more wind cover and I started to warm up again. By the time I was close to the bottom I was closing in on 19 miles and 7500 ft and runner brain took over and I started to think "maybe I should do just a bit more", so I decided to turn around at the wilderness boundary and go up "some of the way".

Ralston II [3.59 mi, +1207ft/-1348ft] #

My original plan was just to go up about .5 miles to make it a round 20 miles, but as I started to get closer to the turnaround I was like "maybe 21", then "maybe 22", and finally "maybe 9000 ft total". All this seemed fine and then my GPS started to act up and was getting stuck at a given elevation before jumping 50-100 feet. 9000 feet did come eventually at about 1.8 miles, and so I turned around and headed down, somewhat regretfully, as I was feeling quite good, but two factors pushed me to play it safe: (1) I had to race a hundred in two weeks and I really didn't want to dig myself too deep a hole, and (2) it was still going to be cold and rainy at the top and I didn't want to take a chance on getting hypothermic.

I made it down to the car with no issues. As before, this isn't super fast terrain and I didn't want to fall, so I just took it easy and focused on my footing. It was still raining pretty hard, so then I got the fun of having to get out of my wet clothes while trying to stay modestly dry. As usual, by the time I had my clothes on I was super cold and had to run the heater on full for the next hour or so of the drive back, but otherwise I felt fine.

Nutrition #

I did this all on Maurten, which is what I plan to mostly use for Teanaway, as my stomach can be a bit finicky and I've found Maurten works pretty well. This was a low intensity effort, which is easier on your stomach, but I never really felt any stomach distress.

The table below shows what I brought and what I used.

	Brought	Consumed	Calories
Maurten 160 drink	10	6	960
Maurten Solid	3	1.5	338
Maurten Gel 100	3	2	200
Maurten Gel CAF 100	4	2	200
Maurten 320 drink	2	0	0
Spring Speednut gel	1	0	0
Total	-	-	1698

As usual, I overpacked quite a bit, carrying more calories out than I consumed. Some of this is attributable to not being out on the trail as long as I expected, but it's also less calories/hr than I did at Tenaya last year. In part this is because I got distracted in the first 90 minutes and didn't eat or drink much of anything and then also kind of lost focus on my nutrition at the top of Pyramid. Generally, I did OK but not great once I got to Ralston. With that said, I also clearly brought too much stuff; it's good to have some for emergencies, but you don't need to have enough of everything for emergencies. In retrospect I should have probably dropped the Spring gel and one of the Maurten 320s, which would have given me a reasonable buffer even if I had been out longer and eaten according to plan.

This is the first time I had tried using Maurten Gel CAF (100 mg caffeine) on something extended like this and I think that went well. It's easier than having to juggle caffeine pills and you can just take one every 2-3 hrs. I brought salt tablets (you can see them in some of the pictures above) but you don't need them in these cool temperatures.

Retrospective #

Map and profile via Runalyze

Obviously this didn't go as intended, which I attribute about 20% to not being prepared and 80% to weather. I should have taken more time to really recon the course and realize that the approach to Pyramid was iffy; then I would have been more ready for it and felt better when I hit the top. On the other hand, if the weather hadn't been as bad, I would have been a lot more comfortable at the top and more willing to try to find my way down the back half of Pyramid. As it is, I think I made the right decision not to go it alone, especially in light of how rainy it got later. I have good gear and experience in the mountains so I think I would have been fine, but being out that far alone^[1] in bad weather is no fun. Moreover, while I did want to do an adventure run, this was primarily a training exercise and a strategy checkout, and from that perspective, it didn't matter that much which sections of the trail I ran.

Other than course recon, I was prepared pretty well. I had the right gear—though if I had kept going around the loop I might have been pretty sad about not having my rain pants and rain gloves—and everything worked well. I did get to try out some options and I've now concluded that the "pull the jacket over the pack" thing isn't going to work so I'll be going back to a normal sized zip-up jacket. Based on this experience I'm not planning to race in the Ultra 3s: the traction on the Sense Ride 5s is better and I like having the more modern bouncy foam instead of the more solid Ultra 3 foam; Salomon seems to have really dialed in the ride now on the newer foam so it feels stable and yet bouncy.

Fitness wise, this actually went quite well. This is an absurd amount of vert over 22 miles, over 25% more per mile than Teanaway and UTMB. Obviously it's not as long as either, but feeling like I'm not even really that tired at 22 miles and 10 hrs is about what I would want. Usually after something this long I would be like "when will I be done" but this time I had to really restrain myself from going all the way to the summit on the second lap.

Overall: 22.7 mi, 9308 ft, 9:48:52

And I do mean alone. I only saw three people on the trail the whole day, at the top of Pyramid. ↩︎

Private Access Tokens, also not great

2023-08-29T00:00:00Z

In my post on Chrome's Web Environment Integrity (WEI) proposal I briefly mentioned Apple's Private Access Tokens (PAT) mechanism, which, as Tim Perry observes, is already deployed. The stated use case for Private Access Tokens is to reduce the need for CAPTCHAs (the little puzzles you get asked to solve to prove that you are a human).

This is a good objective because (1) CAPTCHAs suck (I can never decide whether the post holding up the stoplight is part of the stoplight!) and (2) they increasingly don't work because captcha solving bots have gotten very good and humans aren't getting any smarter.

Source: Searles, Nakatsuka, Ozturk, Paverd, Tsudik and Enkoji "An Empirical Study and Evaluation of Modern CAPTCHAs"

This is of particular relevance for Apple, which is also leaning in hard to privacy technologies like iCloud Private Relay, which conceals your IP address. The problem here is that a lot of anti-abuse mechanisms rely heavily on IP address reputation. If your IP address is concealed, it's hard for those mechanisms to build up a reputation (either positive or negative) for your IP address. This is especially true if you are also browsing with settings that reduce the effectiveness of cookies, for instance if you are using Tor Browser or any regular browser in Private Browsing Mode/Incognito mode, because that also prevents the site from building up reputation via the cookie. (See Matthew Prince's post on this for more background.)

One response by sites is just to show CAPTCHAs whenever they see a "new" user who doesn't have a cookie, or whose IP address doesn't have a reputation, has a bad reputation, or is used by an anonymity service. This is obviously annoying to users and not really what sites want either, because they want people to visit their site, not bounce off the CAPTCHA. What you really want is some way to attach a positive reputation to someone without tracking them.

Privacy Pass #

When you look at the problem this way, the broad shape of a solution presents itself, at least if you're a cryptographer: you need anonymous tokens. The basic idea here is that you solve a CAPTCHA and in return get an anonymous token which lets you prove that you solved it so you can skip the CAPTCHA next time. This is what is specified in the IETF's Privacy Pass protocol. In Privacy Pass, tokens are issued by working with a pair of entities called the "Attester" and the "Issuer", and are consumed by the "Origin" (the Web server) as shown below:^[1]

[Source: Privacy Pass Draft]

In this scenario, the Attester is responsible for ensuring you solved the CAPTCHA (or enforcing whatever other properties one might be interested in, as we'll see shortly) and then conveys some kind of attestation to the issuer that it has done so.^[2] The issuer then issues an anonymous token (see here for an overview of how this works) to the client. The client can then use the token to prove to the Origin (the actual site) that it is approved. It can also be used for other forms of anonymous authentication, for instance iCloud Private Relay uses a similar technique to allow users to anonymously prove that they are customers.
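As a rough illustration of how an issuer can hand out a token that it can't recognize later, here is a toy blind-signature sketch. It uses textbook RSA with a tiny hard-coded key and is not the Privacy Pass wire protocol; the function names, the key, and the hashing step are assumptions for illustration only, the attester is left out entirely, and none of this is secure.

```python
import hashlib
import math
import secrets

# Textbook RSA key with tiny primes (61 * 53): illustration only, NOT secure.
N, E, D = 3233, 17, 2753


def hash_to_int(nonce: bytes) -> int:
    """Map a token nonce to an integer mod N (toy full-domain hash)."""
    return int.from_bytes(hashlib.sha256(nonce).digest(), "big") % N


def client_blind(nonce: bytes):
    """Client: hide the hashed nonce behind a random blinding factor r."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            break
    blinded = (hash_to_int(nonce) * pow(r, E, N)) % N
    return blinded, r


def issuer_sign(blinded: int) -> int:
    """Issuer: sign the blinded value without seeing what is underneath."""
    return pow(blinded, D, N)


def client_unblind(blind_sig: int, r: int) -> int:
    """Client: divide out r to get an ordinary signature on the nonce."""
    return (blind_sig * pow(r, -1, N)) % N


def origin_verify(nonce: bytes, sig: int) -> bool:
    """Origin: check the token using only the issuer's public key (N, E)."""
    return pow(sig, E, N) == hash_to_int(nonce)


nonce = secrets.token_bytes(16)
blinded, r = client_blind(nonce)                      # client -> issuer
token_sig = client_unblind(issuer_sign(blinded), r)   # issuer -> client
accepted = origin_verify(nonce, token_sig)            # client -> origin
```

The issuer only ever sees `blinded`, which is randomized by `r`, so it can't match the `(nonce, token_sig)` pair the origin later sees against any particular issuance.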

Obviously, I'm oversimplifying here and a huge amount of work has gone into trying to make Privacy Pass have the right security and privacy properties. There are also still some pieces which need work,^[3] but for the purpose of this post we can ignore the details and assume that it functions as advertised.

Private Access Tokens #

The important thing to realize here is that Privacy Pass is a generic technology which just transports the fact that you satisfied the attester. The important operational question, however, is what you had to do to satisfy the attester. The original design of Privacy Pass was built around the idea that what you did was solve a CAPTCHA, but Privacy Pass is agnostic on this point, and in principle the attester can demand anything. This brings us to Private Access Tokens, Apple's implementation of Privacy Pass using Apple as the attester, as shown below:

[Source: Apple]

Based on the description in the video, Apple is checking for the following properties:

This is a valid piece of [Apple] hardware
The user's iCloud account is in good standing (i.e., you have to be signed in with an Apple ID).
[Optional] performs rate limiting to limit the use in bot farms

If these checks pass then you will be able to get a token from the issuer.
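As a rough sketch of what an attester-side decision like this might look like, the following models the three checks as a single function. Everything here is hypothetical: the `Device` fields, the `attest` function, and the ten-per-hour limit are invented for illustration and are not taken from Apple's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical view of a client as seen by the Attester."""
    genuine_hardware: bool
    account_in_good_standing: bool
    request_times: list = field(default_factory=list)


def attest(device: Device, now: float, max_per_hour: int = 10) -> bool:
    """Return True if the Attester would vouch for this device right now."""
    if not device.genuine_hardware:
        return False
    if not device.account_in_good_standing:
        return False
    # Optional rate limit: only count attestations from the last hour.
    recent = [t for t in device.request_times if now - t < 3600]
    if len(recent) >= max_per_hour:
        return False
    device.request_times = recent + [now]
    return True


phone = Device(genuine_hardware=True, account_in_good_standing=True)
print(attest(phone, now=0.0))
```

Note that nothing in this check involves the user doing anything, which is exactly why it is so much more convenient than a CAPTCHA.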

iOS Browser Engines #

One thing that a lot of people don't know is that Chrome and Firefox on iOS are quite different from Chrome and Firefox on desktop. The reason for this is that Apple requires everyone to use their WebKit browser engine (the thing that actually renders the Web page) on iOS; in fact you have to use the copy of WebKit built into iOS. Chrome and Firefox each have their own engines (Blink and Gecko respectively), but they aren't allowed to use these on iOS. As a result, both Chrome and Firefox on iOS behave a lot more like Safari—at least from the perspective of how they interact with the Web—than they do like their desktop counterparts. This is not true for Android, where these browsers use the same engine as on desktop.

As I understand the situation, this will just work if you are on Safari but doesn't work on other browsers such as Chrome and Firefox, at least on desktop. This is partly because Apple doesn't seem to provide generic APIs that allow you to use Private Access Tokens but instead only makes them available via their own networking APIs (WebKit and URLSession). This means that they work in every browser on iOS, because Apple requires you to use their browser engine on iOS. However, on desktop Firefox and Chrome use their own networking stacks, so this doesn't work for them, really,^[4] though I suppose Apple could provide APIs that those browsers could use. Of course, those browsers could also negotiate their own deal with attesters.

Policy: Browsers, Issuers, Attesters, and Origins #

This is a complicated system with four separate players and that makes it hard to sort out the various policies in play:

The Origin server (i.e., the Web site) gets to decide which Issuers they accept.
The Issuer gets to decide what Attesters it trusts and which policies it expects them to enforce.
The Attester gets to decide what policies they actually enforce.
The Browser gets to determine which Attesters and Issuers they are actually willing to work with.

The result is that what policies you are actually subject to is determined by the interaction of the preferences of all of these parties, with the Browser and the Origin being the most important, because the Origins know what they are demanding and the Browser knows which Issuers and Attesters they will work with. The Origins can always find new Issuers/Attesters, and the Browsers can always blocklist them.
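One way to see how these preferences interact is to treat each party's policy as a set and look for the Issuer/Attester combinations that survive all of them. The sketch below does that; the function name and the party names ("issuer-a", "attester-x", and so on) are made up for illustration.

```python
def usable_paths(origin_accepts, issuer_trusts, browser_issuers, browser_attesters):
    """Return sorted (issuer, attester) pairs that every party agrees on.

    origin_accepts:    set of Issuers the Origin accepts tokens from
    issuer_trusts:     dict mapping each Issuer to the Attesters it trusts
    browser_issuers:   set of Issuers the Browser is willing to talk to
    browser_attesters: set of Attesters the Browser is willing to talk to
    """
    paths = []
    for issuer in origin_accepts & browser_issuers:
        for attester in issuer_trusts.get(issuer, set()) & browser_attesters:
            paths.append((issuer, attester))
    return sorted(paths)


paths = usable_paths(
    origin_accepts={"issuer-a", "issuer-b"},
    issuer_trusts={"issuer-a": {"attester-x"}, "issuer-b": {"attester-y"}},
    browser_issuers={"issuer-a", "issuer-b"},
    browser_attesters={"attester-x"},
)
print(paths)
```

If the result is empty, the user has no token they can present and falls back to whatever the Origin does for unknown clients, such as a CAPTCHA.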

In the actual existing Apple system, the Attester and the Browser are both Apple, and Apple's policy is that you need to have an Apple device and an iCloud account. The current issuers they have announced are Cloudflare and Fastly. Moreover, Cloudflare and Fastly can also act as the origin servers (web sites) in this case, which means that if you use them to serve your Web site they can automatically consume PATs. Because of the way the crypto is designed, it's fine to have the Issuer and the Origin be the same, as they cannot link the client's behavior; in fact the Issuer, Attester, and Origin can all be the same.

The General Equilibrium #

From a technical perspective, this is all pretty reasonable stuff, but the thing to understand is that this is a generic system which is compatible with any policy the Attesters and Issuers want to enforce. As we saw with WEI, the question is then what policies they will choose to enforce. The policy enforced by the combination of Apple's attesters and the issuers they have chosen is that you paid Apple for a device and have an iCloud account. This is very different from "the person solved a CAPTCHA" because that policy works just as well for people who don't have Apple devices.

This is actually a pretty reasonable proxy for "is a person and not a bot", but the bigger picture consequences aren't great, as I don't really want to live in a world where everyone who hasn't bought an Apple device has to solve CAPTCHAs all the time. Of course, most people don't use Apple devices and many of those who do still use Chrome or Firefox, so that limits how aggressive sites can be about requiring repeated CAPTCHA solving for people who don't have those devices. But what happens if similar functionality gets added to Android^[5] and Windows and now suddenly the vast majority of devices have some kind of PAT-like functionality? In that case, sites will be able to be much more aggressive about requiring CAPTCHAs or just refuse to serve other users at all, as they will only be annoying a fairly small fraction of their users.

Of course, the situation will become even worse as AI gets better at solving CAPTCHAs. The basic problem here is that we don't really have a good, cheap, signal for "is a human" that doesn't require somehow buying into some bigco ecosystem, whether it's buying a device from a given manufacturer, having an account with some big service, or both. But the consequence of that is risking making using the Internet a lot harder for people who don't want to do one of those things.

Stepping back, I worry about the equilibrium steady state: the more that people are able to authenticate with these technologies the more attractive it is for sites to basically require them, to increase the level of scrutiny (as in WEI), and to provide a massively inferior experience to those who can't. Ironically, this is actually a direct consequence of Privacy Pass being well-designed so that it's seamless and provides a good level of privacy, because that makes it seem less objectionable to require, as opposed to (say) making everyone log in with a Google account.^[6] At the end of the day, though, the risk is further entrenching the existing big players.

This split architecture is intended to be flexible but is a bit confusing pedagogically. ↩︎
As I understand the situation, despite this somewhat confusing diagram, the browser talks to the issuer through the attester. ↩︎
In particular there are concerns about metadata smuggling by using different keys to sign different people's tokens, and there are efforts to address that. ↩︎
This is a feature of a lot of Apple's networking technologies, which they like to bake into the operating system. This is very convenient for small shops but less so for big implementors like browsers who would prefer to control networking themselves. ↩︎
Chrome does have a similar technology called Private State Tokens but as far as I can tell it's not tied into a Google-operated attestation system the way that PAT is. ↩︎
I owe this observation to Kate Hudson. ↩︎

The endpoint of Web Environment Integrity is a closed Web

2023-08-18T00:00:00Z

Chrome's Web Environment Integrity (WEI) proposal for remote Web browsing attestation is being justly criticized from a broad variety of perspectives (Mozilla Standards Position, Brave, EFF). I certainly agree that WEI is bad news, and I'll get to that part eventually, but first I'd like to situate it in the broader context, both of the Web and the Internet, starting with some history.

The Bell System #

The first communications network available to regular people was the telephone. Of course, the telegraph already existed, but regular people didn't have telegraphs: you went down to the telegraph office to send messages. By contrast, you could have a telephone in your home and use it to call other people who had phones in their homes. Miraculous!

Bring your own phone #

I didn't know until I started writing this post that it was sort-of possible to buy your own phone and install it but you had to first transfer the phone to AT&T and then rent it back from them.

From the early 1900s until 1983, telephone service in the United States was essentially a monopoly (the Bell System) operated by AT&T. The telephone network included not only the wires and switches that the phone company operates today but also the wire in your house and the phone in your hand, all the way up to your ear. Customers rented phones from a subsidiary of AT&T called Western Electric, and they generally looked something like this:

[Source: Wikipedia]

If you wanted to connect something else not made by Western Electric to the phone network, you were mostly out of luck. This doesn't just mean no cooler looking phones, but also no cordless phones, answering machines, or modems; basically anything other than a Western Electric brick. Unsurprisingly, there was not a huge amount of innovation in this market, though Western Electric would sell you a somewhat cooler looking "Princess Phone":

[Source: Wikipedia]

It's important to understand that there wasn't any real technical obstacle to connecting your own phone to the AT&T network. Regular telephones (what people used to call POTS for "plain old telephone service") are actually quite simple devices to build, mostly consisting of analog signals over two copper wires; you just weren't allowed to, by which I don't just mean that AT&T would be mad at you but that it was actually prohibited by the FCC:

No equipment, apparatus, circuit or device not furnished by the telephone company shall be attached to or connected with the facilities furnished by the telephone company, whether physically, by induction or otherwise except as provided in 2.6.2 through 2.6.12 following. In case any such unauthorized attachment or connection is made, the telephone company shall have the right to remove or disconnect the same; or to suspend the service during the continuance of said attachment or connection; or to terminate the service.

That changed in 1968 with the Carterfone decision in which the FCC struck this provision and allowed consumers to connect their own equipment^[1] to the network as long as it did not cause harm to the network itself. This opened the door for customers to attach their own equipment to the phone network and more importantly for innovation that didn't come out of New Jersey.

Naturally, the first things people wanted to install were local improvements to their experience that worked with standard voice phones on the other end (cordless phones, answering machines, etc.), but the Carterfone decision also implicitly allowed the use of the phone network for data transmission—effectively encoded in sound,^[2] because that's all the phone network could carry—which meant fax machines and eventually modems (originally for primitive computer networking like BBSes and eventually for the Internet). Of course, you were still tied to the phone network, which—at least until 1984—was entirely owned by AT&T, but as long as you were calling someone with a compatible system and could cram your data into an 8 kHz channel, you could do anything you wanted without getting permission from the phone company.^[3] If you were really fancy, you could even get the phone company to sell you a leased line that would carry data, but that's not something regular people did.

In which the phone company was sort of right #

Ironically, while the phone company was wrong about consumer devices like Carterfone presenting a threat to the telephone network, they were sort of right about the threat of letting anybody interconnect. The basic problem is that the telephone network was designed under the assumption that all the constituent parts were operated by the same people and that those people were trustworthy. When this is not true the security of the system breaks down.

Probably the best publicized example of this is the widespread exploitation of the phone network by phreaks for free phone calls—especially long distance—and general exploration of the phone system. The details of this kind of exploitation are out of scope of this post, but the general problem was that the system wasn't designed to be robust to compromised endpoints, or even, famously, to someone who could inject the right tones into the network. Less famously, the network is still vulnerable to impersonation attacks in which the caller generates a fake number and the callee's network just trusts its representation. These attacks are finally being fixed by a set of technologies known as STIR/SHAKEN.

From the perspective of someone who works on Internet protocols, all of these issues just look like design flaws in the system: we just assume that other components of the system are malicious unless proven otherwise. But from the perspective of the original designers, these were closed systems consisting of trusted elements, and when one of the elements misbehaved then you had problems.

The Internet #

At around the same time all this was happening, the first primitive computer networks were being constructed (the first ARPANET nodes went online in 1969). From nearly the beginning, the ARPANET and then the Internet was conceived of as an open system, a "network of networks" in which each network was independent.^[4] All that was required to be part of the Internet was to (1) speak the right protocols and (2) find someone willing to connect with you and route your traffic.^[5] And the protocols were of course public, being published in the earliest Requests For Comments (RFCs). This applied not just to the basic protocols like IP itself, but also to the application protocols on top like e-mail (SMTP, RFC 822) and remote access (Telnet). From very early on there were multiple implementations of these systems that would talk to each other; as long as your implementation could send and receive the right messages, everything would work right.

Electronic Mail: The Original Killer App for the Internet #

As an example, let's look at the original Internet communications app: electronic mail.

When the Internet was first developed, personal computers were uncommon and instead what people mostly had was access to bigger computers (e.g., owned by their company or university) in what's called a "time sharing" system, which just meant that multiple people could use the same computer at once, with everyone having their own account and workspace.

The diagram above shows how mail works in this environment. Each computer has a single system process called a mail transfer agent (MTA), which is responsible for sending and receiving e-mail with other computers. The historical program was called Sendmail. In order to use the system, the user logs into the system (more on this below) and then uses a program called a mail user agent (MUA) (traditionally just a program called "mail").

Alice can send mail to Carol using the MUA, which contacts the MTA^[6] and asks it to send it to Carol. The MTA then contacts the MTA—using a protocol called SMTP—on Carol's computer and asks it to deliver it. Carol's MTA then stores it on the disk in Carol's mail file (this is just a single big file with all the messages in it). Carol can then use her MUA to read her messages.
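The "single big file" part of this is easy to try for yourself. The sketch below uses Python's standard `mailbox` module to play the role of Carol's MTA appending to her mail file and then the role of her MUA reading it back. It assumes the traditional mbox layout for the mail file, skips the SMTP hop between the two MTAs entirely, and the addresses, function names, and temporary file path are made up for the example.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def deliver(mail_file: str, sender: str, recipient: str, subject: str, body: str) -> None:
    """Stand-in for the receiving MTA: append one message to the mail file."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    box = mailbox.mbox(mail_file)  # creates the file if it does not exist
    try:
        box.add(msg)
        box.flush()
    finally:
        box.close()


def list_subjects(mail_file: str) -> list:
    """Stand-in for the MUA: read every message back out of the same file."""
    box = mailbox.mbox(mail_file)
    try:
        return [message["Subject"] for message in box]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as spool:
    carol_mail = os.path.join(spool, "carol")
    deliver(carol_mail, "alice@atlanta.org", "carol@example.net", "Hello", "Hi Carol!")
    deliver(carol_mail, "alice@atlanta.org", "carol@example.net", "Lunch?", "Noon?")
    print(list_subjects(carol_mail))
```

Because the file format is well-defined, any MUA that understands it can read the mail, which is the property the next paragraph relies on.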

Importantly, both the MTA and MUA are readily replaceable: the system administrator can replace the MTA (other popular MTAs include postfix and qmail) and users can choose their own MUAs (writing new MUAs was a very popular pastime in the early days of the Internet). In fact, two users on the same computer can run different MUAs without interfering with each other. What makes this work is that both the protocol that the MTAs use to talk to each other and the interface between the MUA and MTA are stable and well-defined. The end result is that people are able to customize their own e-mail experience, including the look and feel, filtering, etc.

Remote Mail #

Back in the really old days, you would log directly into the server, either by using a terminal directly connected to it or over a modem. In either case, you're running the MUA directly on the server, which, recall, you are sharing with others. That computer is just displaying stuff on your screen. This typically looked something like this (if you were lucky):

Mailbox is '/usr/mail/mymail' with 15 messages  [Elm 2.4PL22]
        ->   N     1   Apr 24   Larry Fenske   (49)    Hello there
             N     2   Apr 24   jad@hpcnoe     (84)    Chico?  Why go there?
             E     3   Apr 23   Carl Smith     (53)    Dinner tonight?
             NU    4   Apr 18   Don Knuth      (354)   Your version of TeX...
             N     5   Apr 18   games          (26)    Bug in cribbage game
              A    6   Apr 15   kevin          (27)    More software requests
                   7   Apr 13   John Jacobs    (194)   How can you hate RUSH?
              U    8   Apr 8    decvax!mouse   (68)    Re: your Usenet article
                   9   Apr 6    root           (7)
             O    10   Apr 5    root           (13)

       You can use any of the following commands by pressing the first character;
       d)elete or u)ndelete mail, m)ail a message, r)eply or f)orward mail, q)uit
       To read a message, press <return>.  j = move down, k = move up, ? = help
        Command : @

[Source: ELM user's guide]

This is from a relatively modern UNIX mailer called ELM.

This was fine back in the day, but as people started to get more powerful personal computers, it became increasingly unsatisfactory, for a number of reasons, but principally because it was slow and ugly. Slow because every time you wanted to do anything it required a round trip to the server. This included when you were composing an email and every character you typed had to go up to the server before it was echoed on your screen. Ugly because it was only this kind of text-based display and people (1) wanted a GUI and (2) wanted to be able to display rich content such as emails containing images.^[7]

POP versus IMAP #

The major conceptual difference between POP and IMAP is that POP is designed for a scenario where the user downloads all of their new messages and then deletes them from the server. This works fine if you only have one mail client but if you have multiple devices (say a laptop and a phone) then once one device has downloaded the messages, they won't be available for the other device, which is obviously bad. By contrast, IMAP is designed to leave all of the messages on the server, which means that multiple devices can be used to access the same mail account. IMAP also has support for storing a lot of state (e.g., folders, read versus unread, etc.) on the server, thus providing a more seamless experience for the user.
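The difference is easy to see in a toy model. The class below is not an implementation of either protocol (there is no networking and none of the real commands), and its name and methods are invented for this sketch. It only mimics the two storage behaviors: download-and-delete versus leave-on-server with shared state.

```python
class ToyMailServer:
    """In-memory stand-in for a mail server holding one user's messages."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.seen = set()  # indexes of messages some client has marked read

    def pop_fetch_all(self):
        """POP-style: hand over everything, then delete it from the server."""
        downloaded, self.messages = self.messages, []
        return downloaded

    def imap_fetch_all(self):
        """IMAP-style: return copies; the messages stay on the server."""
        return list(self.messages)

    def imap_mark_seen(self, index):
        """IMAP-style: read/unread state lives on the server, not the client."""
        self.seen.add(index)


pop_server = ToyMailServer(["msg 1", "msg 2"])
laptop_pop = pop_server.pop_fetch_all()   # laptop gets both messages
phone_pop = pop_server.pop_fetch_all()    # phone gets nothing

imap_server = ToyMailServer(["msg 1", "msg 2"])
laptop_imap = imap_server.imap_fetch_all()
imap_server.imap_mark_seen(0)             # laptop reads the first message
phone_imap = imap_server.imap_fetch_all() # phone still sees both messages
print(len(phone_pop), len(phone_imap), imap_server.seen)
```

Real POP clients can be configured to leave messages on the server, so the contrast here is sharper than in practice, but it captures the design intent described above.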

The obvious fix is to run the MUA on the user's machine and instead have it retrieve the mail from the server and display it locally. In principle, the MUA could just log in as Alice, download all the messages, and process them locally, but that would be inconvenient and slow; what you want is some network protocol that allows you to retrieve messages one at a time. The first popular such protocol was called Post Office Protocol (POP) but POP has been to some extent superseded by Internet Message Access Protocol (IMAP). In either case, there is some program running on the mail server machine which runs POP or IMAP. The MUA on the user's machine contacts that server and uses the relevant protocol to retrieve the user's messages, as shown in the figure below:

Importantly, nothing had to change on Carol's side in order to allow Alice to read her mail remotely like this. atlanta.org just had to install an IMAP server^[8] and then Alice could download an appropriate MUA and use it to talk to the server. Moreover, it's possible for some people on atlanta.org to use remote mail and some to read their mail by logging in as before, as we see Bob doing in the picture above. Of course, the mail provider can choose to offer remote only service without offering the ability to run programs on their servers at all. This is an important operational and security advantage and is how most big mail providers (e.g., Gmail) operate now. However, all of this is invisible to the other side.

Moreover, once atlanta.org has installed an IMAP (or POP) server Alice is free to use any MUA she wants as long as it speaks IMAP (or POP). Because the protocols are published anyone can just write their own MUA that conforms to the protocols. Again, this is critically important because it allows for new mail software to innovate and for Alice to choose the interface and features she likes the best (or even to write her own mail software!). You want all the images suppressed or rendered in black and white? No problem. You want to read your email in a different font? Sounds good. You want it read out loud to you in the voice of Malcolm Tucker? Simple matter of programming. The client is in total control of how things are rendered because it's an open, interoperable system.

In principle, of course, it was always possible to build a totally closed mail system—Microsoft Exchange was like this to some extent—but once an interoperable ecosystem had been developed it had a tremendous advantage because it was easy to unilaterally roll out a new mail client or server without changing every other part of the system. Even mail systems which had proprietary elements were still forced to speak standard protocols to some extent, especially for the mail format and delivery parts of the system.

Other Applications #

Of course, e-mail isn't the only application that can run on the Internet. The way the Internet protocols were designed is inherently flexible, providing transport protocols that can carry any kind of traffic, so if you want to build a new application and it can run over IP (these days, TCP and UDP), you can carry it over the Internet, with no need to stuff it into an 8 kHz voice channel. Moreover, you don't need any cooperation from the network itself; you just need to upgrade the endpoints to support your new application, which is a huge deployment advantage. The result of these design choices was an explosion of innovation, starting in around 1992 with the Web, that is still happening today.

The Web #

This brings us to the topic of the Web, which is probably still the most important single application on the Internet. For all that, it's technically just another networked application.

When the Web was designed, it was built on similar principles to the Internet as a whole, with published—though initially without really clear specifications—interoperable protocols that anyone could implement. More or less independent implementations of Web clients and servers started to appear quite soon after Tim Berners-Lee's initial announcement of the Web and everyone just expected that they would talk to each other. In fact, that's what it meant to be part of the Web. Here's how we described this in Mozilla's Web Vision (Emphasis mine):

A key strength of the Web is that there are minimal barriers to entry for both users and publishers. This differs from many other systems such as the telephone or television networks which limit full participation to large entities, inevitably resulting in a system that serves their interests rather than the needs of everyone. (Note: in this document "publishers" refers to entities who publish directly to users, as opposed to those who publish through a mediated platform.)

One key property that enables this is interoperability based on common standards; any endpoint which conforms to these standards is automatically part of the Web, and the standards themselves aim to avoid assumptions about the underlying hardware or software that might restrict where they can be deployed. This means that no single party decides which form-factors, devices, operating systems, and browsers may access the Web. It gives people more choices, and thus more avenues to overcome personal obstacles to access. Choices in assistive technology, localization, form-factor, and price, combined with thoughtful design of the standards themselves, all permit a wildly diverse group of people to reach the same Web.

As of the mid 2000s, the Web was the dominant paradigm for application delivery: if you wanted to build some kind of networked application—and often a non-networked one—you stood up a Web site. This paradigm was so powerful that it even started to absorb standalone applications like e-mail. A full account of this phenomenon would be too long to include in this post, but it seems clear that a huge part of it is due to how easy it is to deploy Web applications to users; there's nothing for them to download or install, they just go to your Web site and the application runs right in the browser. Better yet, when you release a new version you don't need to update the user, they just get the new version whenever they go to your site again.

As with other interoperable applications, the design of the Web allows the client to control how content is rendered and how the user interacts with it. Some important examples of this kind of user control include:

Accessibility features such as screen readers
Automatic password and credit-card form-fill
Ad blocking
Translating Web pages into a different language
"Reader" modes
Downloading pieces of the page (e.g., images) or the whole page
Developer tools which allow the user to inspect the Web page contents

The Web differs from e-mail in one very important respect, which is that the Web allows the server to run programs on the user's computer and those applications can talk back to the server. The vast majority of Web pages have some dynamic content in the form of JavaScript. By contrast, e-mail content is largely static. This makes the Web a much more powerful deployment platform but also limits the ability of the client to strictly control every aspect of the user's experience.

A good example of this phenomenon is Web-based mail systems like Gmail. The diagram below shows the high level architecture of this kind of system.

Conceptually, this is exactly the same architecture we had before, with a MUA talking to a server, except that instead of being a standalone app, the MUA is a JavaScript program running in the browser. However, there's one big difference: because the Webmail service controls both the Webmail server and the JavaScript-based MUA they don't have to use a standardized protocol like IMAP; they can just build a proprietary protocol. And because deploying new JS code on the Web is so close to frictionless, they can change it whenever they want. So even though it's all running on a standardized substrate of HTTP and HTML/JS/CSS, systems like this are actually fairly closed because all the important stuff is happening in the downloaded JS code rather than in the standardized pieces.^[9]

Even so, the browser itself still maintains a fair amount of control over how the application behaves. Aside from the examples above, there are features such as Firefox Picture-in-Picture and add-ons such as YouTube Enhancer, which modify the behavior of popular sites such as YouTube even though those sites are to a great degree JS applications.

Mobile Apps and App Stores #

In the early 2000s it looked like the Web model had totally won and native apps were toast but that changed in 2008 with the opening of the iOS app store.^[10] The app store standardized the process of downloading, installing, and updating mobile applications—at least on iOS—resulting in a system almost as frictionless as the Web and with a number of important technical advantages. The result was a rapid takeoff of the use of mobile apps to the point where they are the dominant mode of mobile usage.

[Source: Wikipedia]

Because of the app store, mobile apps have many of the deployment advantages of the Web but are far less open. Just like a Web app, the vendor controls both the client and the server, but unlike on the Web, there is no browser intermediating the app's interaction with the user, and so there's no opportunity to modify the behavior of the app, e.g., for ad blocking or translation. Of course, the operating system could in principle decide to do this kind of stuff—and the mobile OSes do do some technical enforcement of their policies—but the platform just isn't engineered for this kind of user agent the way the Web is.^[11] As a practical matter, then, if you want to use some network-based service that hasn't gone out of their way to open their interfaces you're mostly going to be using their app without any real opportunity to control your own experience except in ways designed into the app. This is why, for instance, you have to have five different apps on your Roku, one for each streaming service (including separate ones for Disney and Hulu, even though they are owned by the same company!), rather than a single app which will work with any streaming service.

Closed versus Open #

There are a number of reasons why application vendors might prefer closed versus open systems:

Flexibility: If you control both ends of the system, then you can evolve it much more quickly because you don't need to wait for anyone else to change. This is the argument made by Jonathan Rosenberg and also in this post by Moxie Marlinspike on why Signal isn't federated.
Barriers to entry: In an open system a potential competitor can enter the market by standing up a new endpoint (e.g., a new client) without having to displace the entire ecosystem. As a concrete example, when Google launched Chrome they didn't have to displace every Web server in the world because Chrome automatically worked with them.
Control: If you control the clients then you know that they behave the way you want them to. To some extent this is just a matter of system stability and not having to deal with potential problems from broken clients, but it's also a way to enforce your preferences when they might differ from those of the users.

The important point for the purposes of this post is "control". There are a number of situations in which the user's preferences and those of the site aren't in alignment, such as:

Ad blocking: Sites and apps make money by showing ads, but users don't like to see ads, which is why they often run ad blockers. Obviously, the providers would prefer that users actually saw the ads.
Access to content (digital rights management): Web pages can of course play audio and video, but historically the providers of that content have been very concerned about unauthorized downloading and reproduction. In an open system, however, nothing stops the client from storing the raw media.

Encrypted Media Extensions #

This last issue was responsible for the one major case in which the Web has deviated from the principle of openness, namely HTML Encrypted Media Extensions (EME). In the early days of the Web, media was largely played through Adobe Flash, which had Digital Rights Management (DRM) mechanisms designed to prevent exporting content. These mechanisms took in encrypted media and decrypted and displayed it, but were designed to resist user tampering to exfiltrate the media.

Starting in the early 2010s browsers gradually started to deprecate Flash, both in response to concerns about security and as more and more of its capabilities started to be added to the Web platform. One of those capabilities was the ability to play video, but the large video streaming services (especially Netflix) were concerned about people using the browser to save media and so were unwilling to use the HTML5 <video> tag as-is. Instead they proposed a new technology called Encrypted Media Extensions (EME), in which a closed DRM Content Decryption Module (CDM) was embedded in the browser to decrypt and display the media.

EME was highly controversial but eventually every major browser included it. I can't speak for other browsers, but I was at Mozilla when they decided to implement EME in Firefox and the conclusion was that given that other browsers were going to implement EME it was better to have people able to watch videos—which we knew they wanted to do—in Firefox than that they switch to another browser. The implementation of EME in Firefox was designed to limit the capabilities of the CDM, so that it had limited access to the user's computer and couldn't be used to track users.

Back to Web Environment Integrity #

This all brings us back to WEI, which is a proposal for attestation for the Web. For more background on attestation see here, but briefly the idea with attestation is that you have some "trusted" piece of hardware on the user's device (in this case "trusted" means "not controlled by the user but rather by the manufacturer", so it's trusted by the web site, not by the user) which is able to vouch for the software that runs on the user's computer. Most modern mobile devices and many if not most laptop devices now have such a piece of hardware.

The motivation for the proposal is described as follows:

Users like visiting websites that are expensive to create and maintain, but they often want or need to do it without paying directly. These websites fund themselves with ads, but the advertisers can only afford to pay for humans to see the ads, rather than robots. This creates a need for human users to prove to websites that they're human, sometimes through tasks like challenges or logins.

Users want to know they are interacting with real people on social websites but bad actors often want to promote posts with fake engagement (for example, to promote products, or make a news story seem more important). Websites can only show users what content is popular with real people if websites are able to know the difference between a trusted and untrusted environment.

Users playing a game on a website want to know whether other players are using software that enforces the game's rules.

Users sometimes get tricked into installing malicious software that imitates software like their banking apps, to steal from those users. The bank's internet interface could protect those users if it could establish that the requests it's getting actually come from the bank's or other trustworthy software.

The high level idea is that there would be a JS API that the site could call which would cause the browser to ask the OS (and presumably, transitively, the aforementioned trusted hardware) to attest to some properties of the browser.^[12] The spec is silent on what is being attested to and the Explainer is pretty fuzzy:

The proposal calls for at least the following information in the signed attestation:

The attester's identity, for example, "Google Play".

A verdict saying whether the attester considers the device trustworthy.

These two pieces of information basically serve to guarantee that the code is running on some device made by a manufacturer that the Web site trusts. This already means that we don't have a completely open system, because it's not possible to build a new piece of hardware yourself that will be able to provide the correct attestation: you instead need to have some closed third party module. You probably also need a trusted and locked-down operating system, because otherwise the OS can tamper with the behavior of the browser, so good luck if you want to run Linux!

Moreover, this attestation isn't very useful in and of itself: the first three use cases are ones in which the browser connecting to the server is controlled by the attacker, and so all they demonstrate is that the attacker was able to afford a single device made by such a manufacturer. However, they could be running any software they want on it. They don't even need to be using the device to run their browser. They can use a single trusted device to generate an arbitrary number of attestations up to the performance of the device—and modern hardware is very very fast—so the effectiveness of this limited attestation seems fairly low. In order to effectively address these use cases, you need the attester to provide more information.

The explainer goes on to propose two other types of information:

The platform identity of the application that requested the attestation, like com.chrome.beta, org.mozilla.firefox, or com.apple.mobilesafari.

Some indicator enabling rate limiting against a physical device

The basic intuition behind rate limiting is that it prevents the kind of large-scale attacks I mentioned above in which the attacker has a lot of browsers connected to a single trusted device. This might be useful in terms of preventing ad fraud attempts where the attacker pretends to have a large number of devices representing a large number of legitimate users, though it could be tricky to set the rate limits correctly: some people do a lot of browsing and you don't want them to suddenly run up against a rate limit. So at best this multiplies the attacker's costs by making them buy more trusted devices.

Rate limits do not, however, address the game anti-cheating use case because the problem isn't that the user is doing an unreasonable number of attestations but rather that they are running cheating software on a legitimate device. The only way to address this is to have the attestation cover the software itself, in this case the Web browser. This is where the proposal to indicate the identity of the application (e.g., com.chrome.beta) comes in. Presumably the relier would have a list of browser software that it trusts to behave correctly and would reject any requests from other pieces of software, or at least flag them for special handling (and inconvenience). This means that if you want to run something other than a major browser or even build your own, you're totally out of luck.

Moreover, in order for this to work, the software—and probably the operating system—needs to be unmodified and not to have affordances that allow the user to adjust its behavior in an undesired fashion. This is an incredibly strong condition because a browser is a very complex and configurable piece of software. For instance Firefox has hundreds of configuration parameters that users can set, some supported and some unsupported; it's very likely that some of them would let users modify behavior in ways the site wouldn't want. Beyond configuration, most browsers allow you to install extensions/add-ons which substantially change the behavior of the browser, so any add-ons need to be part of the trusted list. The WEI proposal says that this should be fine because:

Web Environment Integrity attests the legitimacy of the underlying hardware and software stack, it does not restrict the indicated application’s functionality: E.g. if the browser allows extensions, the user may use extensions; if a browser is modified, the modified browser can still request Web Environment Integrity attestation.

I don't see how this can be the case, though. I suppose it's possible that as a technical matter, you could get an attestation (e.g., "This is a version of Firefox with unknown modifications" or "This is a version of Firefox with the 'I am cheating at this game' add-on"), but the site clearly can't treat this attestation as meaningful without defeating the security guarantees of the system.

Of course, you might decide to abandon the anti-cheating use case (and any others that don't involve pretending to be a lot of different devices) but that would be a much more limited system than this, more similar to Apple's Private Access Tokens, which are supposed to just attest to the device itself (this is also bad, but not as bad as WEI). However, if you want to ensure that individual users' machines behave in some specific way, you need the attestation to cover the software on the user's machine, not just to attest that they had some limited amount of control of a trusted device.

I know a lot of people care about cheating in games, but it's a bit of a niche use case. However, the elephant in the room here is advertising: a lot of people use ad blockers and many sites try to detect this case and refuse service to them. One potential application of WEI is forcing users to prove that they're not running an ad blocker. The explainer doesn't list this as a use case, but also doesn't really disclaim it, and once remote attestation exists there is going to be a huge financial incentive to deploy it for this purpose. Obviously, preventing ad blocking in the browser would require attesting to the whole browser stack, not just that the browser is running on a trusted device: ad blocking is typically a modification, or sometimes a feature, of the browser, so a user who controls their browser can just disable ad display.

The bigger picture #

The basic property of an open system like the Internet and the Web is that you can only be assured of the properties of the elements you directly control. The elements that belong to other people work for them and not you. In a closed system, by contrast, the software on the end user device works for the provider, not for them, whether it is officially owned by the user (as in mobile apps) or it actually belongs to the provider (as with the old Bell System monopoly).

WEI and similar attestation technologies represent an attempt to impose an alien model, that of a closed system, onto the open system of the Web. As with any closed system, the net impact will be that users don't control their own experience of the Web but rather have only the experiences that sites are willing to let them have. That seems bad.

Ironically, the Carterfone didn't actually plug into the wall socket. Instead, it used an acoustic coupler that tied into the phone handset. However, the decision was broad enough to allow for electrical interconnection. ↩︎
Yes, I'm simplifying here, because the phone network just carries analog signals in a given frequency and amplitude range. ↩︎
Obviously, the phone company could tell that this wasn't voice traffic, they just had to pass it through anyway. ↩︎
The jargon in routing is "autonomous system". ↩︎
I'm simplifying a bit because for some time there were actually restrictions on commercial use, but these were gone by the early 1990s. ↩︎
Actually, back in the day, it just executed sendmail directly. ↩︎
And yes, I do know about X, but remote X is not the answer. ↩︎
In principle Alice could have installed one just for herself, but that's not how it's typically done. ↩︎
See this 2011 presentation by VoIP pioneer Jonathan Rosenberg (JDR) and this Internet Draft by Tschofenig, Aboba, Peterson, and McPherson for an argument that this phenomenon meant the end of application-layer standards. ↩︎
Ironically, Steve Jobs initially didn't want an app store and instead had in mind something more like what you'd now call a Progressive Web App but demand for real apps was overwhelming and here we are. ↩︎
In addition, because of the way that the Web evolved, many JS applications operate by changing elements on the Web page (e.g., "now render this new piece of HTML") which means that the browser can generally figure out what the page is doing; a property called "semantic transparency". In principle, those applications could just write pixels onto an HTML canvas but that's more difficult and not the standard approach. ↩︎
This might also involve calling out to some server, but everything here is rooted in the trusted hardware on the device. ↩︎

How NATs Work, Part IV: TURN Relaying

2023-07-17T00:00:00Z

As discussed earlier there are some configurations where it is not possible to establish a direct connection between two endpoints. For instance, if Alice has a NAT with address-dependent mapping and Bob has a NAT with address-dependent filtering, then the packets from Alice will never match any filter on Bob's NAT and will just be dropped. Similarly, the packets from Bob will not match any mapping on Alice's NAT and will be dropped. The only way to send data between these two endpoints is with the assistance of a server, as shown in the blue path in the diagram below.

There are any number of possible protocols one might use to send data through a server. For instance, you could connect through a VPN or even send each individual packet as an HTTP request to the server. However, the IETF has standardized a specific protocol which is designed to be used with ICE, Traversal Using Relays Around NAT (TURN).

TURN #

Conceptually, TURN is an application layer relay protocol: the TURN client (i.e., the user's device) sends packets to the TURN server addressed to the other side and the server forwards them, as shown below:

In this example, Alice is communicating with Bob through her TURN server (generally each client will have an associated TURN server, as described below):

When she wants to send a packet to Bob, she sends it to the server's address (198.51.100.1) but with a label telling the server to forward it to Bob. The server removes the label and sends the packet to Bob.
When Bob wants to send a packet to Alice, he sends it to the TURN server, which forwards it to Alice. The packet will arrive at Alice's machine with the TURN server's IP address, so the TURN server has to add a label telling Alice that it originally came from Bob. Otherwise Alice wouldn't be able to distinguish between packets from Bob and Charlie when they come through the TURN server.

It's important to see that there is an asymmetry here: Alice has a relationship with the TURN server and is explicitly communicating with it. From Bob's perspective, however, it's just as if the packets came from the TURN server, and unless he has some external knowledge, he has no way of seeing that he's actually communicating with Alice through the TURN server, rather than the server itself (because from an IP layer perspective that's actually what's happening).

The opacity of the TURN server from Bob's perspective has an important consequence, which is that the server has to keep state in order to distinguish multiple endpoints that Alice is talking to. Consider what happens if the server has two clients, Alice and Charlie. The packets from Alice and Charlie are labeled with where to send them, but the packets from Bob are not, so do they go to Alice or Charlie? The only way for the TURN server to know is to keep some state. For instance, it can assign outgoing packets from Alice one port and packets from Charlie a different port, so that when Bob replies it can look up incoming port and know where to send it. If this sounds familiar, it's because this is exactly what a NAT does and for the same reason: it has more than one client sharing the same external IP address, in this case the address of the TURN server. All application relays have to do something like this, because otherwise they wouldn't be able to talk to unmodified peers, which is a hard requirement for incremental deployment.
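To make the NAT analogy concrete, here is a minimal sketch of the state a relay might keep. This is not how any real TURN server is implemented; the class name, the client labels, and the starting port of 50000 are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RelayPortTable:
    """Toy relay state: one relay-side port per client, like a NAT mapping table."""

    next_port: int = 50000  # arbitrary starting port (assumption)
    port_by_client: dict = field(default_factory=dict)
    client_by_port: dict = field(default_factory=dict)

    def outgoing(self, client: str) -> int:
        """Return the relay port used as the source of this client's packets."""
        if client not in self.port_by_client:
            port = self.next_port
            self.next_port += 1
            self.port_by_client[client] = port
            self.client_by_port[port] = client
        return self.port_by_client[client]

    def incoming(self, relay_port: int):
        """Map an unlabeled reply arriving on relay_port back to a client."""
        return self.client_by_port.get(relay_port)


table = RelayPortTable()
alice_port = table.outgoing("alice")
charlie_port = table.outgoing("charlie")
# Bob's reply carries no label, so the port it arrives on is the only clue.
reply_owner = table.incoming(alice_port)
```

Bob's reply to the port assigned to Alice is looked up in `client_by_port`, which is the same reverse lookup a NAT performs on its external port.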

Allocations and Permissions #

In order for Alice to send and receive data from Bob, TURN requires that she explicitly create state on the relay (unlike a NAT where the state is implicitly created by sending packets). This is done using two transactions, allocating an address and creating a permission, as shown below:

The first thing Alice does is to allocate an address (really a port, because the server probably only has one address, or maybe one each for IPv4 and IPv6) on the TURN server that she will be using to send and receive packets. The TURN server replies with the address and port that has been allocated. Alice can immediately send this entry to peers so they know what it is.

Alice can use this address to send to multiple peers, as described above, but it's not yet associated with any individual peer. In order to actually send packets, Alice needs to next create a permission entry for a specific peer. Until Alice has created a permission for a given peer, packets to or from that address will just be dropped by the TURN server. With ICE, Alice learns peer addresses because those peers send their candidates, and she then creates a permission for each candidate address before sending packets to it.

Note that this is effectively an address-independent mapping with an address-dependent filtering policy: Alice uses the same address and port to talk to everyone but the TURN server blocks incoming packets from anyone that Alice hasn't explicitly identified. This analogy isn't perfect because the permission is explicitly created and Alice can't send packets to those endpoints either until she has sent a permission request, but it's close enough as a mental model. However, this isn't port-dependent filtering; the TURN server will accept packets from any port once a permission has been created for a given address. This produces better results with endpoints which have address-dependent mappings.
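Here is a rough sketch of the allocation and permission bookkeeping described in this section. It models only the state, not the STUN messages, and everything except the server address 198.51.100.1 (taken from the example above) is hypothetical: the class and method names, the port numbers, and the peer addresses.

```python
class ToyTurnServer:
    """Toy model of TURN allocations and permissions (state only, no packets)."""

    def __init__(self, relay_ip="198.51.100.1", first_port=50000):
        self.relay_ip = relay_ip
        self.next_port = first_port  # arbitrary starting port (assumption)
        self.allocations = {}  # client -> relayed port
        self.permissions = {}  # client -> set of permitted peer IP addresses

    def allocate(self, client):
        """Step 1: give the client a relayed address and port."""
        port = self.next_port
        self.next_port += 1
        self.allocations[client] = port
        self.permissions[client] = set()
        return (self.relay_ip, port)

    def create_permission(self, client, peer_ip):
        """Step 2: permit one peer address. Note there is no port here."""
        self.permissions[client].add(peer_ip)

    def accept_from_peer(self, relay_port, peer_ip, peer_port):
        """Would a packet from peer_ip:peer_port to relay_port be forwarded?"""
        for client, port in self.allocations.items():
            if port == relay_port:
                # peer_port is deliberately ignored: filtering is by address.
                return peer_ip in self.permissions[client]
        return False


server = ToyTurnServer()
relay_addr = server.allocate("alice")
bob_ip = "203.0.113.7"  # hypothetical peer address
before = server.accept_from_peer(relay_addr[1], bob_ip, 4000)
server.create_permission("alice", bob_ip)
after = server.accept_from_peer(relay_addr[1], bob_ip, 4000)
other_port = server.accept_from_peer(relay_addr[1], bob_ip, 4001)
stranger = server.accept_from_peer(relay_addr[1], "192.0.2.99", 4000)
```

Bob's packet is dropped before the permission exists and forwarded afterwards, from any source port, while a peer Alice never named stays blocked.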

To put this all together, here is what TURN looks like as part of an ICE transaction, showing a complete connectivity check.

The initial part of this example is the same as the previous one: Alice contacts the TURN server, gets an allocation, and sends it to the signaling server. That signaling server forwards it to Bob, who sends back his own candidate. At the same time, Bob also tries to do a connectivity check to Alice's candidate, just as he would any other candidate. However, this fails because Alice hasn't created a permission for Bob. Once Alice creates that permission, then she sends her own check to Bob, which succeeds, as does Bob's in the other direction. Note that there is a race condition here: it's possible for Alice's permission request to complete before Bob's connectivity check arrives, in which case that packet would get delivered, even though Alice hadn't sent a connectivity check to Bob. Either way, ICE will eventually succeed.

You should notice that Bob doesn't need to be aware of the fact that Alice's candidate is actually from a TURN server; he just sends to it as if it were any other candidate. In ICE, candidates are actually labeled by type, but this isn't necessary for ICE to work.

I can't believe it's STUN #

Believe it or not, TURN is actually an extension for STUN: TURN data is encapsulated in STUN packets. For instance, you do allocation by sending a STUN message of type "Allocate" and you send packets by sending a message of type "Send". This is actually not quite as strange a design decision as it might initially appear, for several reasons:

You really really want to run TURN over UDP rather than TCP (see below).
Because UDP is unreliable you need some transaction mechanism to allow the client to make requests from the server, retransmitting those requests when lost. STUN already has this.
ICE implementations already have STUN stacks. As one nice side effect, though, the TURN server will actually tell you your server reflexive address, so you don't need to do a separate request to a STUN server to learn it.
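To give a feel for the transaction mechanism mentioned in the second point, here is a small sketch of a doubling retransmission schedule. The defaults (500 ms initial timeout, 7 requests, final wait of 16 times the initial timeout) are the values I recall from the STUN RFC; treat them, along with the function and parameter names, as assumptions rather than a reference.

```python
def stun_retransmit_schedule(rto_ms=500, max_requests=7, final_wait_multiplier=16):
    """Return (send_times_ms, timeout_ms) for a request that never gets a reply."""
    send_times = []
    t = 0
    interval = rto_ms
    for _ in range(max_requests):
        send_times.append(t)
        t += interval
        interval *= 2  # wait twice as long before each further retransmission
    # After the last request, wait a fixed multiple of the initial timeout.
    timeout = send_times[-1] + final_wait_multiplier * rto_ms
    return send_times, timeout


send_times, timeout = stun_retransmit_schedule()
# send_times == [0, 500, 1500, 3500, 7500, 15500, 31500], timeout == 39500
```

TURN gets this machinery for free by riding on STUN, which is part of why the design is less strange than it looks.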

If you were designing this protocol today, you would probably base it instead on some protocol that added reliability to UDP (e.g., QUIC), but TURN was originally designed in 2010, so things were different back then.

Channels #

One real drawback of using STUN is bloat. Sending a single packet with a Send (outgoing) or Data (incoming) indication adds 36 bytes of overhead. Here's an example packet diagram, based partly on the one from the STUN RFC:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
|0 0|     STUN Message Type     |         Message Length        |\
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 
|                         Magic Cookie                          | | 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Header
|                                                               | |
|                     Transaction ID (96 bits)                  | |
|                                                               | /
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Type=XOR-PEER-ADDRESS |            Length=8           | \
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Peer
|0 0 0 0 0 0 0 0|    Family     |         X-Port                | | Address
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
|                X-Address (32 bits for IPv4)                   |/
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Type=Data             |            Length             |\
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Data
|                       Variable data ....                      |/
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Most of this is overhead. First, every packet has a fixed 20 byte header, which mostly acts to identify it as STUN and tell you what message type it is (e.g., Send indication). Then you have the peer address and the data encoded in an inefficient tag-length-value format. None of this overhead really mattered for STUN's original application, where you just sent a few messages, but when you have to absorb it for every packet you're sending (at a rate of maybe 20-50 per second) it adds up quickly. The remote address and port is also sort of redundant because there are only a few addresses in use, so you could compress them by just sending a short address ID.
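If you want to check the 36-byte figure yourself, here is a sketch that builds a Send indication matching the diagram above and measures it. The numeric constants (message type, attribute types, magic cookie) are the values I believe the STUN and TURN RFCs assign; the function name, peer address, and 160-byte payload are made up for illustration.

```python
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442       # fixed value in every STUN header
SEND_INDICATION = 0x0016        # assumed message type for a Send indication
ATTR_XOR_PEER_ADDRESS = 0x0012  # assumed attribute type
ATTR_DATA = 0x0013              # assumed attribute type


def build_send_indication(peer_ip: str, peer_port: int, payload: bytes) -> bytes:
    """Encode payload for an IPv4 peer as a STUN Send indication."""
    # XOR-PEER-ADDRESS: port and address are XORed with the magic cookie.
    x_port = peer_port ^ (MAGIC_COOKIE >> 16)
    x_addr = struct.unpack("!I", socket.inet_aton(peer_ip))[0] ^ MAGIC_COOKIE
    peer_attr = struct.pack(
        "!HHBBHI", ATTR_XOR_PEER_ADDRESS, 8, 0, 0x01, x_port, x_addr
    )
    # DATA attribute: type, length, value, padded to a 4-byte boundary.
    padding = b"\x00" * (-len(payload) % 4)
    data_attr = struct.pack("!HH", ATTR_DATA, len(payload)) + payload + padding
    attrs = peer_attr + data_attr
    # 20-byte header: type, attribute length, cookie, 96-bit transaction ID.
    header = struct.pack("!HHI", SEND_INDICATION, len(attrs), MAGIC_COOKIE)
    header += os.urandom(12)
    return header + attrs


payload = b"\x00" * 160  # hypothetical audio frame
packet = build_send_indication("203.0.113.7", 4000, payload)
overhead = len(packet) - len(payload)  # 20 + 12 + 4 = 36
```

For a payload whose length isn't a multiple of four, the attribute padding adds up to three more bytes on top of the 36.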

TURN includes a mechanism called "channels" which does exactly this. The client can send a request to the TURN server to allocate a two-byte channel ID to a given remote address and port (the same information as would be needed for a permission). Once the channel is allocated, packets can then be sent or received by just prefixing them with the channel ID and length,^[1] like so:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Channel Number        |            Length             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
/                       Application Data                        /
/                                                               /
|                                                               |
|                               +-------------------------------+
|                               |
+-------------------------------+

If you're a real protocol engineering nerd, you might ask how you distinguish a message containing channel data from a STUN message, as they are carried on the same host/port quartet. The answer is that STUN message types always have the first two bits as zero and channel IDs are required to be between 0x4000 and 0x4fff.
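As a sketch of how little is left once a channel is bound, here is the framing from the diagram above together with the two-bit check just described. The function names are mine and the channel range is the one given in the text; this covers the UDP case only and ignores everything else a real implementation would need.

```python
import struct


def encode_channel_data(channel: int, payload: bytes) -> bytes:
    """Prefix payload with the 4-byte channel number and length header."""
    if not 0x4000 <= channel <= 0x4FFF:
        raise ValueError("channel number outside the allowed range")
    return struct.pack("!HH", channel, len(payload)) + payload


def decode_channel_data(packet: bytes):
    """Return (channel, payload) from a channel data message."""
    channel, length = struct.unpack("!HH", packet[:4])
    return channel, packet[4:4 + length]


def looks_like_stun(packet: bytes) -> bool:
    """STUN messages have the top two bits of the first byte set to zero."""
    return (packet[0] >> 6) == 0


frame = encode_channel_data(0x4001, b"hello")
channel, data = decode_channel_data(frame)
overhead = len(frame) - len(b"hello")  # 4 bytes, versus 36 for a Send indication
```

Because every allowed channel number starts with the bits 01, `looks_like_stun` is false for any channel data frame.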

You might also wonder at this point why STUN conveniently has a range of message types which can't be allocated: the reason is that when STUN was designed people wanted to make sure that it could be easily demultiplexed (i.e., distinguished) from RTP and RTCP, which always have the first bit of the first byte set to 1.^[2] There has actually been quite a bit of hackery around easily demultiplexing various types of messages in real-time multimedia. Some of this was due to intentional design and some was just fortuitous design choices that people—by which I partly mean me—took advantage of. For instance, DTLS has record types as the first byte, but these are always low numbers and so easy to distinguish from RTP and RTCP. At this point there are actually five separate types of protocol message which can be carried over the same host/port quartet: (1) STUN (2) ZRTP (3) DTLS (4) TURN channels and (5) RTP/RTCP. Someone had to write a whole RFC to systematize how to do it.
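Here is roughly what the demultiplexing looks like in code. The first-byte ranges are the ones I recall from the RFC that systematizes this, so treat the exact boundaries as an assumption; the function name is mine.

```python
def classify_first_byte(packet: bytes) -> str:
    """Guess the protocol of a packet from its first byte alone."""
    if not packet:
        return "empty"
    b = packet[0]
    if 0 <= b <= 3:
        return "STUN"
    if 16 <= b <= 19:
        return "ZRTP"
    if 20 <= b <= 63:
        return "DTLS"
    if 64 <= b <= 79:
        return "TURN channel"  # channel IDs 0x4000 to 0x4fff start 0x40 to 0x4f
    if 128 <= b <= 191:
        return "RTP/RTCP"      # first two bits are 10
    return "unknown"


examples = {
    "stun": classify_first_byte(bytes([0x00, 0x01])),
    "dtls": classify_first_byte(bytes([22])),
    "channel": classify_first_byte(bytes([0x40, 0x00])),
    "rtp": classify_first_byte(bytes([0x80])),
}
```

The receiver never needs to look past the first byte to route a packet to the right stack, which is the whole point of the hackery.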

No incoming connections? #

One side effect of the requirement to create a permission for a specific peer address is that it is not possible to use TURN to run a generic server behind a NAT or firewall. A typical server, such as for Web or mail has a fixed address and port which anyone can use to connect to it, but because TURN requires that the TURN client create a specific permission for each peer, arbitrary clients on the Internet cannot just connect.

This limitation is not an oversight but rather a deliberate design choice. Recall that it's common for firewalls to enforce an "outgoing connections only" security policy. Without this limitation it would be straightforward for clients to bypass this policy by just connecting to a TURN server on the Internet. The TURN designers were concerned that if TURN enabled this kind of policy bypass, enterprise administrators would respond by blocking TURN entirely (recall from the previous section that TURN is trivial to identify). The idea was that if TURN could only be used for outgoing connections, then administrators would be more likely to allow it through the firewall.

What about when STUN or UDP is blocked? #

Despite the "no-incoming" compromise embodied in the permissions design, it is still sometimes the case that STUN over UDP is blocked. The reasons for this vary, but include:

Firewalls that block all UDP traffic.
Firewalls that do so-called "deep packet inspection" and block any packets from protocols they don't recognize.

Data from the initial deployments of QUIC suggest that somewhere around 5% of clients can't use an arbitrary new UDP-based protocol, though it's unclear how often this is due to UDP blocking or just to blocking unrecognized protocols. In order to get around this kind of blocking, it is also possible to run TURN over TCP as well as over TLS. If you have a firewall which just blocks UDP, then running TURN over TCP will often work. If you have a firewall which blocks unknown protocols then running TURN over TLS^[3] might work.^[4] The idea here is that there are other protocols that firewall administrators want to support (e.g., HTTP or HTTPS) that run over TCP and/or TLS and if they haven't configured their firewall rules too strictly, then TURN may also work.
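In practice a client ends up with an ordered list of things to try. The sketch below builds that list in the order described above, using the `turn:` and `turns:` URI forms and the default ports 3478 and 5349 as I understand them; the hostname is a placeholder and the function name is made up.

```python
def turn_fallback_uris(host: str, port: int = 3478, tls_port: int = 5349) -> list:
    """Return TURN server URIs in the order a client would prefer them."""
    return [
        f"turn:{host}:{port}?transport=udp",       # best case: plain UDP
        f"turn:{host}:{port}?transport=tcp",       # if UDP is blocked
        f"turns:{host}:{tls_port}?transport=tcp",  # if unknown protocols are blocked
    ]


uris = turn_fallback_uris("turn.example.com")
```

With ICE you typically don't try these strictly one after another; each one just produces another relayed candidate and the connectivity checks sort out which ones actually work.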

It's important to understand that it's still quite easy to recognize TURN in these situations:

By default STUN uses a different port number than HTTP.
If TLS isn't used you can just look at the TCP packets to see if something is STUN.
When TLS is used, the TLS ALPN extension indicates that TURN is in use.

Again, this is by design and reflects an attempt to take a compromise approach to blocking of TURN in which network operators can block TURN if they want to but in cases where they just configured their rules in a way that incidentally blocks TURN (in some cases before TURN was even designed), then TURN should work. The history of new protocol development is full of this sort of uneasy compromise: on the one hand we want to deploy new stuff and there are lots of network elements which are very hostile to that, often unintentionally. On the other hand, a situation in which the applications are just at constant war with the administrators is a recipe for breakage.

With that said, in the past few years attitudes towards network-based blocking have changed a fair bit, as shown by technologies like DNS over HTTPS, QUIC, and TLS Encrypted Client Hello, which are intended to make it harder to selectively block traffic unless you have control of one of the endpoints. If TURN were being designed today, I'm not sure the same choices would be made.

Why not TCP? #

While it's possible to run TURN over TCP, you really don't want to if you can avoid it because performance will generally be bad. Covering this topic fully is out of scope for this post (though stay tuned for my long-delayed posts about transport protocol performance), but here is a brief sketch to help you build some intuition.

Head-of-line Blocking #

The first problem derives from the fact that TCP delivers packets to applications in order. However, this means that if a packet is dropped, then every packet received after that is held by the receiving TCP implementation until that packet is received, as shown in the following diagram:

In this case, the sender sends packet 1, which arrives at the receiver and is delivered to the app immediately. However, packet 2 is dropped and so packets 3 and 4 are just buffered until packet 2 is retransmitted, at which point all three are delivered. This phenomenon is called head-of-line blocking (HOLB). For more on this topic see my introductory post about transport protocols.
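The same scenario can be reproduced with a few lines of code. This is a toy model of in-order delivery, not real TCP: there are no retransmission timers or byte streams, just sequence numbers arriving in some order, and the function name is mine.

```python
def deliver_in_order(arrivals):
    """For each arriving sequence number, list what gets handed to the app."""
    next_expected = 1
    held = set()
    log = []
    for seq in arrivals:
        held.add(seq)
        delivered = []
        # Release packets only while there is no gap in the sequence.
        while next_expected in held:
            held.remove(next_expected)
            delivered.append(next_expected)
            next_expected += 1
        log.append((seq, delivered))
    return log


# Packet 2 is lost and only shows up (retransmitted) after 3 and 4.
log = deliver_in_order([1, 3, 4, 2])
# log == [(1, [1]), (3, []), (4, []), (2, [2, 3, 4])]
```

Packets 3 and 4 were sitting in the receiver the whole time, but the application sees nothing until the retransmitted packet 2 arrives.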

HOLB is fine for applications where everything happens in order but less good for audio and video (A/V). A/V consists of a series of independent pieces of media, short sound snippets of 20-50ms in the case of audio, and frames in the case of video. In order to have a good experience, these need to be played out at regular intervals or the media will look and/or sound choppy. Of course, the network doesn't deliver them at exactly the right time, so the receiving implementation delays them a little bit in what's called a jitter buffer before playing them out.

The key word here is "a little bit": media latency of more than 200 ms or so is intensely undesirable. However, it's not uncommon for TCP implementations to wait far longer than this for retransmission, during which all the media would be delayed. In these cases, it's better to just drop the missing frame and play the next frames at the appropriate times. Fancier implementations use packet loss concealment techniques to fill in the missing data, but even if you just play the next frames it's better than waiting. With UDP, packets are delivered to the application at the time of receipt, but the TCP logic is all in the operating system, so there's no way to get any data until all earlier data is received.
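To see why dropping beats waiting, here is a toy playout model. Each frame has a deadline set by its position in the stream plus the jitter buffer delay; anything that misses the deadline is skipped rather than stalling everything behind it. All the numbers (20 ms frames, 60 ms buffer, the arrival times) are made up, as is the function name.

```python
def playout_decisions(arrival_ms, frame_interval_ms=20, buffer_ms=60):
    """Decide per frame whether it arrived in time to be played."""
    decisions = []
    for i, arrival in enumerate(arrival_ms):
        deadline = i * frame_interval_ms + buffer_ms
        if arrival is not None and arrival <= deadline:
            decisions.append("play")
        else:
            decisions.append("skip")  # lost or late: conceal and move on
    return decisions


# Frame 2 is lost outright; frame 4 turns up at 400 ms, far past its deadline.
decisions = playout_decisions([10, 35, None, 75, 400])
# decisions == ["play", "play", "skip", "play", "skip"]
```

With UDP the application can make this call itself. With TCP, frame 3 would be stuck in the operating system behind the missing frame 2 even though it arrived on time.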

Rate Control #

The second problem is that TCP is designed to adapt its sending rate to match network conditions, in part by buffering data until it thinks it's safe to send. The problem here is that unless the media sender is also adapting its rate to network conditions, then it's sending data to TCP faster than it can be transmitted, which creates buffering and/or packet loss. Rate control for real-time protocols is a complicated topic, but the TL;DR is that you really only want to have one rate control regime, which should be at the media layer, and then the network protocols just transmit whatever they are asked to right away. Sending over TCP prevents that. Obviously sending over TCP is better than not being able to make a call at all, but if at all possible you want to send your media over UDP.

TURN Server Deployment Scenarios #

In ICE, both sides will generally have TURN servers, in which case each side will offer relayed candidates. Depending on the properties of each network, ICE might end up using neither relayed candidate, having one of the sides talk directly to the other side's relayed candidate, or having the traffic go through both relays. Because TURN's mapping and filtering model is fairly permissive, it will generally not be necessary to go through both TURN servers unless both sides have really unfortunate networking configurations.

Note that with WebRTC generally both sides will use the same TURN server. When TURN was first designed, real-time communications over IP mostly meant people with softphones or hardware IP phones. Those devices were associated with some provider, whether it was an enterprise system or a consumer VoIP provider. In either case, the provider would supply the TURN server (recall from part III that running TURN servers isn't cheap). If someone from provider A is calling someone from provider B—though SIP federation was never as common as people were hoping—then you might have a situation where each user had a different TURN server. By contrast, most WebRTC deployments are in settings where there is only one provider and so everyone uses the same TURN server.

Note that most conferencing systems are deployed in a star configuration in which each participant sends their media to a central multipoint control unit (MCU) or selective forwarding unit (SFU).^[5] Because both kinds of server sit on the open Internet, it's much less likely you will need to use a TURN server. Because you don't need to get through a NAT or firewall on the server side, it should work even if you have a really uncooperative NAT. The main time you would need a TURN server in this environment is if you were behind a firewall which blocked all media (e.g., because it blocked UDP). Note that if the MCU/SFU and TURN server are operated by the same entity, there is an opportunity to integrate them closely, though I don't know if people actually do this.

Final Thoughts #

Out of the whole IETF NAT traversal protocol suite, TURN probably feels the oldest, even though it was designed at about the same time. It's a bespoke application relaying protocol built on top of a protocol which was originally designed for a totally different job, namely discovering your reflexive IP address. In the modern era, we'd probably build something fairly different and more like MASQUE, which is a generic UDP proxying protocol built on top of HTTP/3 and QUIC. On the other hand, STUN and TURN are a lot simpler than QUIC, they get the job done, and they're already built into browsers and softphones, so I imagine we'll be using them for some time.

You could actually omit the length field as well if you restricted yourself to UDP and only sent one packet per UDP datagram. ↩︎
The reason for the magic cookie is to ensure that it could easily be demultiplexed from any protocol, whether it had this distinguishing first byte or not. The cookie is just a fixed 4 byte value that is at the same position in every STUN packet. It's unlikely that it will be in the same position in other protocols and so helps identify STUN. ↩︎
Note that it's not necessary to run TURN over TLS in order to protect the media, which needs to be encrypted anyway. ↩︎
It's also possible to run TURN over DTLS, but this isn't much more likely to work than regular TURN. ↩︎
These are different, but the difference doesn't matter for these purposes. ↩︎

Broken Arrow Triple Crown Race Report

2023-07-10T00:00:00Z

This year has turned out to be light on racing in part because I was kind of wiped out after last year and in part because I had signed up for the Broken Arrow Skyrace in Tahoe in June. Broken Arrow isn't actually one race but a race festival that takes place over three days. All of the races are relatively short compared to what I usually do (the longest is nominally 46 km/29 mi), but they offer what's called the "Triple Crown", which consists of the following three races over three days:

Race	Distance	Vert
Vertical Kilometer (VK)	4.8 km/3 mi	914 m/3000 ft
46K	42.5 km/26.5 mi	2774 m/9100 ft
23K	21.75 km/13.5 mi	1443 m/4700 ft

The 46K is supposed to be two loops of the 23K, but you'll notice that the distance and vert don't quite line up and of course the distances don't actually match the names. This is in part because of rerouting due to the huge amount of snow that dropped in the Sierra this summer (also preventing me from doing the warmup adventure run in the Sierras that I had planned). In the event, the 23K got totally rerouted on race day anyway.

Anyway, naturally I decided to do the Triple Crown, both because it sounded fun and because I wasn't really willing to drive to Tahoe for a 46K. Also, they gave out a massive amount of swag. My overall plan was to push the VK moderately hard, race the 46K, and then see what I could do on the 23K.

Flagstaff #

The race start is at Palisades Tahoe (6253 ft) and goes up from there, so you're at significant altitude the whole time. I've gone directly from sea level to altitude and raced before, with mixed results (OK at Tahoe 100K, awful at Tushars 70K), but people often actually feel worse on the second or third day at altitude (see Corinne Malcolm's excellent article on altitude adaptation at iRunFar.com). I didn't want to try to race three days in a row without any adaptation, so I decided to spend two weeks in Flagstaff (altitude ~7000 ft) beforehand.

On balance, I think this was a good choice. As usual, I felt lousy the first few days at altitude but by the time I had been there a couple of weeks I was feeling mostly adapted. I flew back on Wednesday and on Tuesday, my friend Kate, my son (3200m PR: 10:52), and I went to the Grand Canyon to do the Bright Angel–Tonto–South Kaibab loop. This was a bit of a hot dry slog on the way up, but I generally felt OK, so I figured I was ready for Broken Arrow, which of course is actually cold and snowy rather than hot and dry.

VK (results, finish video) #

Kate and I drove out to Tahoe Thursday morning where we were staying with Lisa and Stephen who were both doing the 46K. Kate was doing the VK and the 23K, so I was the only one doing the Triple Crown. We got there around 6 PM, but fortunately the race didn't start until 10 AM, so we were able to go out and grab some pasta and still get enough sleep.

The profile for the VK is shown above. Looks gentle, but that's just a trick of perspective because it's stretched out; it's actually about 1000 feet per mile.

I'd never done a VK before, so I wasn't sure what to expect. The pros do it in about 30 minutes (winning time was 39) so I was expecting an hour or so, which means you're going at a fairly high intensity right from the start. On the other hand I knew I had to save for the 46K the next day, so it's a bit of a balancing act.

The initial climb was quite steep but on trail with good footing so I was moving pretty fast. I decided to start about midway through the field, which in retrospect was a bit of a mistake, as I immediately had to make my way through people moving slower than me. I was of course hiking at this point, but so was basically everyone else. Quickly, though, the climb turned into a snow slope, where things were quite a bit more challenging. At this point in the day, the snow was already quite slippery and even with poles (LEKI Fx.One Superlight^[1]), I slipped a fair bit. The trick seems to be to step where others have stepped, where the snow is packed and you have a little more traction. It's very hard to pass people on this section because there are only a few lines up the slope and if you get outside the packed down areas you're slipping a lot. There were a couple places where super helpful volunteers had carved out snow steps and those were a lot easier.

Once you get over the first climb, there's a downhill of about half a mile, starting with snow and then moving onto rocky trail. This was the first part of the race where you had to run downhill on snow. I was a bit unstable and managed to trip and fall on the transition to dirt, jamming my 2nd and 3rd fingers on the left hand (but fortunately not breaking either of them like I did to my right 3rd finger in the Grand Canyon at the beginning of May).

From there on it's another climb mostly on trail until you drop off on a sort of fire road. I passed quite a few people on this stretch as the footing was good and so it's just a matter of your ability to power up the climb, something I'm good at. After the fire road, there's maybe 400 m of a fairly rocky (as in almost scrambling) traverse, at which point you get to the "stairway to heaven", which is this sketchy looking metal ladder that you really do not want to fall off of:

There was actually a bit of a backup at the ladder and I had to wait for some others to get over it. In retrospect I should have stowed my poles at this point because they get in the way of climbing and the finish is right after the ladder.

The ladder is obviously single file, and so at this point I figured the finish order was fixed, but there are actually some snow steps and a short flattish stretch of snow before the finish and someone passed me right after the steps before I realized I should sprint, which I tried to do, which resulted in slipping and falling again, but I eventually made it to the line.

Unlike other races, however, the VK just finishes at the top of the hill so there's not much of a finish line, just the arch and a few race staff standing around to give you your medal. Even the finish line drop bags are about a half mile away. I opted to wait around for Kate to finish, but I hadn't brought a jacket and it was super windy, so when she got to the top I was getting cold. We then headed down to the "Siberia" aid station to get our drop bags with jackets. From there it's about a mile to the top of the gondola for the ride down.^[2]

That afternoon, Tailwind Nutrition was having a "meet and greet" with ultra great Courtney Dauwalter to introduce their new Courtney-inspired flavor Dauwaltermelon. Back when I did Tahoe 100K in 2018, while my family was waiting for me at the finish line, Courtney rolled through en route to her second place overall at Tahoe 200, and spent a few minutes talking to my then 11 year old son, which he found really inspiring, so I got a chance to thank her for that. Courtney went on to absolutely shatter the women's Western States Endurance Run record the next weekend.

Kate and Courtney talking about ultra

I had brought a pair of the Kahtoola NANOspikes for the snow but didn't use them, in part because it never got super bad and in part because I didn't want to take the time to put them on. However, the trip down to the gondola was mostly snow so I did try them out and they seemed to help a bit, though they're Kahtoola's lightest and shortest spikes and the snow was about 6 inches deep, so they're not magic.

Overall: 1:07:02, 142/395 finishers, 7/39 M50-59

46K (results) #

The 46K was on day two and my plan was to push the pace a bit and then try to hang on for day 3.

As I said earlier, this is two loops, arranged as follows:

A runnable rolling but gradually uphill section, partly on the Western States Trail.
A series of steep climbs on dirt and snow up to the Snow King aid station.
A semi-rocky traverse followed by a climb up to KT-22 where it rejoins the VK course.
From the top of the VK course there's a gradual descent on snow followed by a series of very steep descents.
A climb of about a quarter mile and 400 feet, again on snow.
A fast descent of about 1.5 miles on snow, followed by a mile on dirt road back more or less to the start.

And then you do it all over again. Simple. I didn't really know what to expect on this timewise, but I was thinking something like 7 hours.

After the VK, I was kind of worried about traction, so on Friday afternoon I dropped by Alpenglow Sports and bought a pair of the slightly more aggressive Kahtoola EXOspikes.^[3] They're not that heavy and I figured I could carry them in my pack. Lisa was also doing the 46K and broke the rule about not buying new stuff for a race to get a pair of purple Hoka Torrents.

Lap 1 #

After having to fight my way through people on the VK, I decided to start out more towards the front. This turns out to have been a good plan because you first run across a parking lot and then there is a short section of fire road for a total of maybe 400 m and then you're into single track, so there was kind of a rush for position. I hadn't really warmed up (I usually don't before ultras as you can just warm up in the first few miles) and so I probably wasn't as fast as I should have been and things got bunched up in the single track. It didn't help that there was a lot of snow runoff so you were literally running through a stream a lot of the way (no chance of keeping your feet dry!). Eventually I settled into my position, as usual being passed some on the downhills and passing people on the climbs.

After about 3.5 miles, you hit the first climb, which is a steep dirt section, so it was time to pull out the poles. The "trail" part of this climb was pretty rough anyway, so it didn't make much difference if you took a slightly different line and I pulled to the left of the line of climbers and passed a number of people en route to the top. After this, it's another climb mostly on snow up to the Snow King aid station, where I made my first mistake of the day.

As I mentioned, I had broken my finger in the Grand Canyon about 6 weeks before and while I was finally out of a splint, I was still supposed to "buddy tape" the broken finger to the next finger. Anyway, I'd started out wearing gloves but it was starting to get hot and so I wanted to take them off, but then I had to retape the finger and the coban I had been using didn't want to re-stick once it got wet, so I had to get one of the medics to do it with some medical tape. All of this must have taken like 3-5 minutes and I know a lot of people passed me. As they say, when you're stopped you're going infinity minutes per mile.

From Snow King it's a short downhill followed by a bunch of up and down (but mostly up), including a knife edge traverse over a bunch of scree. I took this really tentatively and a bunch of people passed me, but after the Canyon I was mostly focused on making sure I didn't fall and hurt anything, so I was willing to live with it. The climb up to KT-22 is steep and rocky, so I started passing people again.

From here it's the VK course and once I hit the snow traverse I decided it was time for the spikes. They're easy to get on, so it probably only took a minute or two. I do think this helped some as I felt like I was passing some people who were slipping, but it wasn't dramatic the way (I imagine) it would be with crampons. Everything was smooth to the top of the VK and I felt a bit more comfortable on the ladder this time, though I wasn't looking forward to having to do it two more times (the next loop and then the 23K).

The descent from the top of Washeshu Peak starts out straightforward: it's rock and then snow, but then right when I was expecting a nice flattish descent down to the gondola (and then what? not sure) there was a marshal telling me to take a left turn onto, well, I guess you'd call it a slope, but it was straight down and I remember saying something to the effect of "holy shit". The whole slope is something like -15%, and was about mid-calf deep in snow, so I spent the first part of it just desperately trying not to fall until I saw some of the chutes where people had been glissading. I took the hint and sat down and sledded down them (cold!). This got me to the bottom pretty fast and then I turned and saw something else I wasn't expecting: a 400 foot climb. I trudged up the climb, which actually wasn't so bad and then it's a short downhill to the aid station. I stopped and took off my spikes, as they didn't seem to help much on the snowy downhill, and I never used them again. This whole section was also deepish snow for another 1.5 miles or so and then it was onto fire road back to the start.

Split: 3:11:09

Lap 2 #

I wasn't feeling real good about having to do all this again, but I blew through the half-way aid station (split 1: 3:11:24) and headed back out for loop 2. There was definitely more power hiking on the Western States Trail this time, but I still managed to run a fair bit of it. By the time I got to Snow King again I was quite tired and was glad to see that they had Coke (caffeine + sugar = performance) which I used to fill up one of my bottles.

Once I got past Snow King, this loop seemed a lot easier, probably due to some combo of the caffeine and knowing that I was over halfway done. Also, as mentioned above, I'm a lot better on the steep climbs than I am on descents, so once we got past the opening rollers, I knew I just needed to push through those sections fairly hard and then survive the downhill. I did spend some time talking to one of the other runners who was doing her first trail race but had been a collegiate 10K runner and had done a lot of mountaineering and she gave me some tips on how to descend in the snow (heels first!), which seemed to help some.

Things were pretty uneventful from here: I made it to the top and felt a lot more comfortable on the glissading portions and on the final snowy downhill. I didn't need the poles on the downhill but at this point my coordination was starting to go and I couldn't quite get them into the quiver (the typical thing is that one end doesn't quite make it in), so I ended up just folding them and carrying them. By the time I hit the fire road I was mostly alone so I settled in at a comfortable but not all out pace, remembering that I had to race again on Sunday. Coming through the final stretch to the finish I just focused on trying to finish strong.

Split: 3:31:26

I had a bit of time before Lisa and Stephen finished, so I decided to go back to the VRBO and shower and change, but still made it back in time.

All of us after the finish of the 46k

Analysis #

Segment	Overall	Gender	Division
Snow King	136	103	4
Siberia	175	130	8
High Camp	177	135	8
Village	184	139	9
Snow King	169	128	8
High Camp	155	117	7
Finish	167	127	9

The chart below tells about the pattern you would expect from the narrative above (though I hadn't actually looked at the chart before I wrote it). Specifically:

I was doing well on the climbs but badly on the downhills.
I lost a lot of time screwing around at Snow King. Several of the people who finished ahead of me passed me between Snow King and Siberia on the first loop.

With that said, things were tight: 4th was 6:25:27 (17 minutes ahead of me) and I was less than 10 minutes behind 7th. It's possible I went out a bit hard and faded, but my sense is I was actually stable and that I ran a solid but conservative race. Probably the biggest loss is between High Camp and the Finish on the last downhill, where if I'd just been better on snow I might not have lost as much time or as many places.

Overall: 6:42:35, 167/542, 9/46 M50-59

23K (results, finish video) #

My initial plan for the 23K had just been to kind of hold on, but given that I actually felt OK after the 46K, I knew the course, and the 23K start time was fairly late (8:00) so I could get some rest, my coach Emily Torrence and I decided it was worth going for it.

Kate and I lined up at the start only to hear the RD announce that because of very high winds at the summit they were rerouting the course from the original 23K loop to be twice the 11K loop and that they would be starting the race at 9:30 to give them time to set things up. In retrospect we should have just gone back to the VRBO to chill out, but instead we ended up just sitting in chairs out front of one of the local restaurants for the next 90 minutes.

Eventually, though, we lined up at the start. The 11K course followed some of the same sections of the WS trail but skipped a bunch of the rollers in favor of the climb to KT22 and then a fast descent on snow back down to the road, then to the finish and repeat. Given the 46K experience, I figured it was a good idea to start near the front and push the pace at the beginning so I didn't have to fight past too many people.

The first loop went quickly (only 10K after all). After the first mile you're basically climbing the entire time up to KT22 and then it's straight back down. The downhill snow section was steep and slippery with fewer snow chutes on this course so I mostly had to just try to stay on my feet and get down as fast as possible. After the 46K I felt a lot more comfortable with the glissading this time and managed to navigate it reasonably well. Then it was onto the road and the second loop.

Split: 1:18:01

With only 11 km to go in the weekend (officially; it was really more like 10 km, though with ~2400 ft of climbing), I felt like it was safe to push the pace more on the last lap, and I ran more of the trail portions. Of course, I still had to hike the main climb, but really let myself take some chances on the final snow descent (full send!). The final mile long stretch of road is moderately steep and while I pushed the pace as fast as I felt comfortable consistent with being reasonably sure I wouldn't fall, two men and one woman passed me on this stretch. I was able to keep one of them, a man in a red shirt that I'd been back and forth with all day, in sight but the other two dropped me.

At the bottom of the road the course turns flattish and then there are a few turns and then into the chute. As soon as I hit this section I knew that it was more about power than about the ability to run downhill and I could see that I was gaining on the man in red in front of me, and I eventually caught him right as we entered the chute. I was actually expecting a sprint finish as I went on by, but he didn't respond so I ended up comfortably beating him by five seconds.

Overall: 2:38:58, 169/671, 5/56 M50-59

Analysis #

Overall I think this was my best race of the three both in terms of results and how I felt: my place was highest both overall and in my division and I almost felt stronger going into lap 2 than lap 1, and this is confirmed by the even splits. I'm still doing a lot better on the climbs than the descents, but that gap seems to have narrowed from the 46K. You always look a bit worse in the finish videos than you feel inside, but I'm moving well and passing people at the very end is generally good.

Segment	Overall	Division	Time
Snow King	164	3	34:15
Village	183	8	1:18:01
Snow King	160	4	1:53:38
Finish	169	5	2:38:58

Overall #

Broken Arrow also keeps Triple Crown standings, computed by the sum of all your times. This tends to really overweight the 46K, where I was just OK, but even so my result isn't bad. I was 37/100 overall and 4th/17 in M50-59, with a time of 10:28:38. Third was 10:22:15, which seems plausibly in reach if things had turned out differently.

Generally, this seems like a successful weekend. I had never had three days of racing before and was worried that I would be super tired but I seem to have gotten stronger as the weekend went on and wasn't even that tired after the 23K. I attribute this to a combination of a strong training block right before—including the two weeks in Flagstaff—and really paying attention to nutrition and recovery post-race on Friday and Saturday. The snow was definitely a real obstacle and I clearly would have been quite a bit faster if I'd had more practice on snow, but I felt like I got the hang of it after a few days and while people were still passing me it wasn't anywhere near as bad. I think I also handled nutrition well both during the race and after: I never had much GI distress (thanks, Maurten!) and only felt bonky a bit midway through the 46K, which Coke fixed up. That may also have just been the "I've got to do this loop another time???" feeling.

I'm not sure if I'd do Broken Arrow again: it's a generally well-run event and I had a good time, but I think on balance I gravitate more towards the longer events, especially those where you're covering a lot of ground rather than repeating the same part of the course. On the other hand, it was a great experience and I definitely recommend giving it a shot if you've been mostly racing standard trail ultras.

P.S. I'd been having some trouble with the engagement on my poles and the LEKI guys at the expo just swapped out the gloves. Great customer service. ↩︎
The Web site actually says you might need to run down, but that didn't happen. ↩︎
I also tried on a pair of the NNormal Kjerags. I've been looking for a new pair of race shoes and I'd heard good things about the Kjerags, but they're way too wide in the forefoot for me. This was actually kind of surprising, because NNormal is a partnership between Kilian Jornet and Camper and the shoes that Salomon made for Kilian were all narrow. ↩︎

How NATs Work, Part III: ICE

2023-07-02T00:00:00Z

The Internet is a mess, and one of the biggest parts of that mess is Network Address Translation (NAT), a technique which allows multiple devices to share the same network address. This is part III in a series on how NATs work and how to work with them. In part I I covered NATs and how they work, and part II covered the basic concepts of NAT traversal. If you haven't read those posts, you'll want to go back and do so before starting this one, which describes the main standardized technique for NAT traversal, Interactive Connectivity Establishment (ICE).

As you may recall from part II, there are many circumstances where two endpoints (clients) want to communicate directly rather than through a server. However, your typical Internet client is also behind a NAT or firewall, which means that you can't just publish your address and have people connect to you as they would with a Web server. Instead, you need some NAT traversal mechanism. When the IETF originally set out to address the problem of NAT traversal, the idea was that you would characterize the NAT (i.e., figure out what its behavior was) and use that information to publish an address that would work via a signaling server. Once each side has the other side's address, it can try to transmit to it, as in the diagram below:

Unfortunately, there was too much diversity in NAT behavior to make this work reliably, so we needed something else. Enter ICE.

Multiple Addresses #

Recall that the client will generally have multiple addresses, as shown in the diagram below [Updated for clarity 2023-07-02]:

In this case, the client has two addresses:

The host address (10.0.0.3:1111): which is the one assigned to its own network interface and which it is directly aware of.
The server reflexive (srflx) address (192.0.2.1:5678): on the outside of the NAT. The client can typically only learn this by connecting to the STUN server and asking it what address it sees.

Now what happens if two clients with this kind of topology want to talk to each other? There are two main scenarios, as shown in the diagram below.

The clients can be on different networks (probably the normal case on the Internet)
The clients can be on the same network (as is common in Enterprise or gaming scenarios, for instance if you have multiple players in the same house and hence the same network)

Clients on different networks

Clients on the same network

The reason that this matters is that neither the host address nor the server reflexive address will work all the time. For obvious reasons, if Alice and Bob are on different networks and Alice sends Bob her host address, Bob won't be able to address it from his own network (in this case, they actually share the same address range, but those addresses are actually on different networks, so there might be another host with Alice's address on Bob's network). On the other hand if they are on the same network and Alice sends Bob her server reflexive address, this may not work if the NAT doesn't support hairpinning.

What you want is for the media to take different paths (shown in red) depending on the topology: if Alice and Bob are not [corrected, 2023-07-02] on the same network, the media should flow between the server reflexive addresses (on the outside of the NAT) and if they are on the same network it should flow between the host addresses (on the local network interfaces). The problem is determining which of these address pairs to use, because it's not practical to determine which scenario you are in.^[1] If neither address is guaranteed to work, the only option is for each side to send both addresses. In this case, Alice would send Bob two addresses (ICE calls these "candidates"):

10.0.0.3:1111 (host)
192.0.2.1:1234 (server reflexive)

Bob would send Alice:

10.0.0.2:1111 (host)
198.51.100.1:5678 (server reflexive)

Once Alice sees Bob's addresses, she tries to transmit to both of them, as shown below:

In this case, Alice and Bob are on different networks, so Alice's attempt to transmit to Bob's host candidate (10.0.0.2:1111) doesn't work, but her attempt to transmit to his server reflexive candidate (198.51.100.1:5678) does, though it goes through two layers of translation along the way. If we drew Bob's side of the exchange, it would look similar.

If you look at this diagram closely, you will notice something potentially surprising: Alice only sends two packets, even though there are four pairs of addresses (host/host, host/server reflexive, server reflexive/host, and server reflexive/server reflexive). Why doesn't Alice try to send from her server reflexive address? The answer is that there is no way for her to do so. Alice can only send packets from her host address: if they go through the NAT, it will translate them into the server reflexive (or maybe some other address) and if they don't go through the NAT they won't be translated, but Alice can't control this. In either case, Alice just needs to send one packet to each address from the other side.
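To make the "two packets, not four" point concrete, here is a minimal Python sketch using the example candidates from above. The `Candidate` type and the `checks_to_send` helper are names made up for illustration; they are not part of ICE or of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kind: str  # "host" or "srflx" (server reflexive)
    addr: str  # "ip:port"


# Example candidates from the text.
alice = [Candidate("host", "10.0.0.3:1111"), Candidate("srflx", "192.0.2.1:1234")]
bob = [Candidate("host", "10.0.0.2:1111"), Candidate("srflx", "198.51.100.1:5678")]


def checks_to_send(local, remote):
    """Return the (source, destination) pairs the local side can actually send on."""
    # Only host candidates are usable as a source: the NAT, not the client,
    # decides whether a packet ends up rewritten to the server reflexive address.
    sources = [c for c in local if c.kind == "host"]
    return [(s.addr, r.addr) for s in sources for r in remote]


all_pairs = [(l.addr, r.addr) for l in alice for r in bob]
alice_sends = checks_to_send(alice, bob)

print(len(all_pairs))  # four nominal address pairs
for src, dst in alice_sends:
    print(src, "->", dst)  # but only one packet per remote candidate
```

The filter on `kind == "host"` is the whole trick: the server reflexive candidate is something Alice advertises to Bob, not something she can choose to send from.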

Connectivity Checks #

Sending to both of Bob's addresses lets Alice get traffic through, but we obviously don't want to have to send two copies of every packet (or worse, if Bob has more addresses, as discussed below). What we need is a mechanism for Alice to determine which of the packets got through so that she can send only on that address pair. As you might expect if you read my post on reliable transports, we do this by having Bob acknowledge Alice's packet in what's called a connectivity check.

Instead of sending media to Bob, Alice sends a STUN check^[2] to Bob (much like she would if she were trying to learn her address from a STUN server) and waits for the response. If Bob doesn't answer, she can infer that that address pair won't work. If he does, then she knows that this is a valid address pair and can then use it to send media (Alice knows which checks worked and which ones didn't because the check and the acknowledgment contain an identifier, which I haven't shown in the diagram to keep things simple).

This process is shown below:

I'm obviously simplifying quite a bit here. In particular, because packets can get lost, Alice has to retransmit her STUN checks for a while; otherwise a single packet on a valid address pair might get lost. For instance, if packet 2 got lost, and Alice didn't retransmit, then Alice would be left with no valid pairs.^[3] Moreover, as discussed in the next section, there are reasons besides network failure why one of the packets might be dropped.
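As a rough sketch of the bookkeeping involved, the Python below matches responses to checks by identifier and retransmits when nothing comes back. The 20-byte header layout, the magic cookie value, and the Binding message types are standard STUN. Everything else is an assumption for illustration: the `Checker` class, the `send` callback (which stands in for "transmit and wait for a reply or a timeout"), and the fixed attempt count. A real ICE check also carries authentication and priority attributes and uses timer-driven retransmission, all omitted here.

```python
import os
import struct

MAGIC_COOKIE = 0x2112A442  # fixed value at the same offset in every STUN message
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101


def build_message(msg_type, txid):
    # 20-byte STUN header with no attributes:
    # type (2 bytes), length (2), magic cookie (4), transaction ID (12).
    return struct.pack("!HHI12s", msg_type, 0, MAGIC_COOKIE, txid)


def parse_message(data):
    msg_type, _length, cookie, txid = struct.unpack("!HHI12s", data[:20])
    if cookie != MAGIC_COOKIE:
        raise ValueError("not a STUN message")
    return msg_type, txid


class Checker:
    """Hypothetical helper: tracks outstanding checks and which pairs are valid."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.pending = {}  # transaction ID -> address pair being checked
        self.valid = []    # address pairs that got a matching response

    def run_check(self, pair, send):
        # send(packet) is assumed to return the reply bytes, or None on loss/timeout.
        txid = os.urandom(12)
        self.pending[txid] = pair
        request = build_message(BINDING_REQUEST, txid)
        for _ in range(self.max_attempts):
            reply = send(request)
            if reply is None:
                continue  # no answer yet: retransmit the same request
            msg_type, reply_txid = parse_message(reply)
            if msg_type == BINDING_SUCCESS and reply_txid in self.pending:
                self.valid.append(self.pending.pop(reply_txid))
                return True
        self.pending.pop(txid, None)  # gave up: treat the pair as not working
        return False
```

Because the retransmission reuses the same transaction ID, a late response to an earlier copy still matches the pending check.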

Bidirectional checks #

First, as discussed in part II, if Bob doesn't transmit at all but just responds to Alice's checks, then Alice's checks may never get through. If Bob's NAT has address/port-dependent filtering, then it will drop any incoming packets on a given NAT binding until Bob has sent an outgoing packet; this requires Bob to initiate his own checks, as shown below:

To walk through this a bit, Alice starts by sending a check (msg 1), but because Bob's NAT has address/port-dependent filtering, it drops the packet. When Bob initiates his own check (msg 2), it creates a binding on his own NAT on the way out and gets delivered to Alice (this works even if Alice also has address/port-dependent filtering because her outgoing packet created a binding). Alice receives the packet and sends an ACK (msg 3) which is able to traverse Bob's NAT because of the aforementioned binding. At this point, Bob knows that the pair B:b -> X:x works and that it's safe to transmit on that address pair.

When Alice's client retransmits its check (msg 4) it is able to get through Bob's NAT (again because of the outgoing binding created by message 2). Bob receives it and sends an ACK, and at this point Alice knows that the pair A:a -> Y:y works and it's safe to transmit on it. Note that this would have worked perfectly well if Bob had transmitted first (just flip the diagram around), and of course each side is retransmitting anyway.

At this point you might ask why Alice needs to do a second round of connectivity checks after receiving; after all, she knows that Bob can successfully transmit on the Y:y -> X:x path and she can receive it. However, she does not know that messages on the return path (X:x -> Y:y) work. For instance, Bob might have a firewall that blocks all incoming UDP packets, in which case Alice's ACK would be blocked (which she wouldn't learn about) as well as her own connectivity checks. If she sends her own checks, then she will learn that that path doesn't work and can try something else. In practice, however, this scenario is reasonably uncommon and it's quite likely that when Alice received Bob's check that her check in the reverse direction will also work.
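The message sequence walked through above can be reproduced with a toy model of address/port-dependent filtering. This is a deliberately simplified sketch: `FilteringNat` and `deliver` are invented names, address translation is not modeled, and `X:x` and `Y:y` simply stand for Alice's and Bob's external addresses as in the walkthrough.

```python
class FilteringNat:
    """Toy NAT: inbound packets are allowed only from remotes we already sent to."""

    def __init__(self):
        self.sent_to = set()

    def outbound(self, remote):
        # Sending to a remote address opens the filter for that remote.
        self.sent_to.add(remote)

    def inbound_allowed(self, remote):
        return remote in self.sent_to


def deliver(src_nat, src_addr, dst_nat, dst_addr):
    """Send one packet; return True if the destination's NAT lets it in."""
    src_nat.outbound(dst_addr)
    return dst_nat.inbound_allowed(src_addr)


alice_nat, bob_nat = FilteringNat(), FilteringNat()
X, Y = "X:x", "Y:y"  # Alice's and Bob's external addresses

log = [
    ("msg 1: Alice's check", deliver(alice_nat, X, bob_nat, Y)),
    ("msg 2: Bob's check", deliver(bob_nat, Y, alice_nat, X)),
    ("msg 3: Alice's ACK", deliver(alice_nat, X, bob_nat, Y)),
    ("msg 4: Alice's retransmitted check", deliver(alice_nat, X, bob_nat, Y)),
    ("Bob's ACK", deliver(bob_nat, Y, alice_nat, X)),
]

for label, delivered in log:
    print(label, "delivered" if delivered else "filtered")
```

Only msg 1 is filtered. Swapping who sends first just moves the single dropped packet to the other side, which is why retransmission on both sides is enough.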

Relayed Candidates #

As mentioned in part II, there are situations in which it is not possible for Alice and Bob to directly send traffic to each other, for instance if both of them have NATs with address-dependent mapping. In that case, getting a successful connection requires using a relay, which is just a public server on the Internet that will forward traffic to and from a machine, like so:

In standard ICE, clients speak to the relay over a protocol called Traversal Using Relays Around NAT (TURN). Because the TURN server is on the public Internet and not behind a firewall or NAT, it will almost always be possible for the client to connect to it—assuming that it's possible for the client to connect to any other network element at all. Note, however, that the client may have to use TCP if the local network blocks UDP.

It's quite cheap to run a STUN server because it just has to respond to a small number of packets per client, and there are a number of free public STUN servers. However, a TURN server has to be able to relay all of the media between the clients, which can be quite a bit of bandwidth. For this reason TURN servers are usually not free but rather are provided by the calling service people are using. Because a modest fraction (single digit percentages) of people cannot connect without a TURN server, this means that there is a certain minimum cost to running a video calling service even if you prioritize peer-to-peer media.

Picking the best path #

At a high level, then, there are (at least) three potential paths data can take between Alice and Bob, as shown below:

It's also quite possible that there will be multiple viable paths. As noted above, a path through a relay will almost always work, but a direct path between Alice and Bob is also quite often available.

These paths are not all created equal. Latency is a key performance property for real-time voice and video. If the delay between you speaking and the other side hearing you is too long it creates a really jarring experience. If you've ever been on such a call you may have noticed that you and the other person end up interrupting each other a lot because the pauses in the conversation that leave room for the other person to talk get delayed as well, with the result that both people try to talk at the same time. In general, shorter (fewer hops) network paths will have better latency, both because more hops will often mean more meters of cable/fiber to traverse and because the hops themselves take time.^[4] In particular, if you can send media directly rather than going through a relay, you really want to do that, both for performance and cost reasons.

Lots of Candidates #

This is really the simplest possible scenario. In practice the client might have many more addresses. For instance, the client might have:

Both a WiFi interface and a mobile phone interface, each of which will have their own address.
Both IPv6 and IPv4 addresses.
A VPN, which has its own address.
Multiple NATs between it and the Internet (e.g., if it is served by a carrier grade NAT), each of which will have its own server reflexive IP addresses.
One or more relayed connections through TURN relays.

What ICE does is (approximately) to try the combination (Cartesian product) of all of the candidates from Alice and all of the candidates from Bob until it identifies a set of candidates that work (the "valid set"). Of course, some candidate pairs will not be possible (e.g., mixed IPv4 and IPv6), but it's still possible to have quite a few compatible candidates and hence quite a few candidate pairs. As a concrete example, the machine I am writing this on has two interfaces (wired and wireless), each with local IPv4 and IPv6 addresses, but not IPv6 connectivity, so that gives me 4 host candidates, 2 server reflexive candidates (for v4 only), plus at least one relayed candidate. If I'm connecting to another similar machine, we're potentially looking at something like 15 IPv4 pairs (remember, you don't pair up the server reflexives locally) plus 4 IPv6 pairs. It's a lot!
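To make the arithmetic concrete, here is a rough sketch that reproduces those counts. The `gather` and `form_pairs` helpers and the candidate labels are hypothetical; the only rules encoded are the two mentioned above (no mixed IPv4/IPv6 pairs, and no pairing of local server reflexive candidates).

```python
from itertools import product


def gather(prefix):
    """Hypothetical candidate list for one machine: (type, ip_version, label)."""
    cands = []
    for iface in ("wired", "wifi"):
        cands.append(("host", 4, f"{prefix}-{iface}-v4"))
        cands.append(("host", 6, f"{prefix}-{iface}-v6"))
        # No IPv6 connectivity, so server reflexive candidates are v4 only.
        cands.append(("srflx", 4, f"{prefix}-{iface}-srflx"))
    cands.append(("relay", 4, f"{prefix}-relay"))
    return cands


def form_pairs(local, remote):
    pairs = []
    for loc, rem in product(local, remote):
        if loc[1] != rem[1]:
            continue  # mixed IPv4/IPv6 pairs are not possible
        if loc[0] == "srflx":
            continue  # local server reflexives are not paired up
        pairs.append((loc, rem))
    return pairs


pairs = form_pairs(gather("alice"), gather("bob"))
v4_pairs = [p for p in pairs if p[0][1] == 4]
v6_pairs = [p for p in pairs if p[0][1] == 6]
print(len(v4_pairs), len(v6_pairs))  # 15 4
```

On the IPv4 side that is 3 usable local candidates (2 host, 1 relayed) times 5 remote candidates (2 host, 2 server reflexive, 1 relayed).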

Peer-Reflexive Candidates #

You may recall from part II that some NATs have address and port-dependent mappings, in which case the candidate gathering process will find a different external mapping (the server reflexive address) for a given internal address/port than is observed by the peer (the peer reflexive address). What this looks like to the peer is that it receives a check from an address that it doesn't have a candidate for. Fortunately, there is enough information in the STUN check to determine what is going on, and the endpoint responds by synthesizing a remote peer reflexive candidate, pairing it to its local candidate, and starting checks to it. The other side doesn't have to do anything special here, because—as with server reflexive candidates—it automatically sends requests from the peer reflexive address just by sending to the peer.

Prioritizing Checks #

A naive implementation of ICE would just send all the connectivity checks at the same time. This turns out not to work well because you can overload the Internet link or the NAT, causing them to drop packets, thus making ICE take longer to converge. Instead, you need to space out the checks over some time. However, you also want ICE to find a viable path as soon as possible because while ICE is running the user is just sitting there waiting, perhaps (depending on the design) listening to a ringtone.

In order to optimize the time to convergence, ICE uses a prioritization scheme designed to provide two main properties:

The most direct candidate pairs are checked first. As discussed above, you want media to traverse the most direct path. ICE is designed so that it also checks the most direct paths first. I'm actually not so sure about this design decision (in particular, the host/host paths often will not work), but it's what ICE does.
Checks are roughly synchronized between both sides. Remember that in many cases, in order for Alice's checks on a given candidate pair to succeed, Bob also needs to run a check in order to create a binding in his NAT. If Alice checks that candidate pair first and Bob checks that pair last, then (at best) Alice's check won't succeed till the very end of the ICE process. At worst, by the time Bob's check runs Alice's NAT binding will have timed out and both checks will fail. This isn't that likely in most networks; in practice the ICE process would just be slower than ideal.
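For the curious, here is a short sketch of how both properties fall out of the priority arithmetic. The formulas and type preference values are the recommended ones from RFC 8445 as I read it; the function names are my own, and the local preference of 65535 assumes a single interface.

```python
# Recommended type preferences from RFC 8445 (higher means more direct).
TYPE_PREF = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(ctype, local_pref=65535, component_id=1):
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)."""
    return (TYPE_PREF[ctype] << 24) + (local_pref << 8) + (256 - component_id)


def pair_priority(g, d):
    """g: controlling side's candidate priority; d: controlled side's.

    Defined by role rather than by local/remote, so both endpoints
    compute the same number for the same pair.
    """
    return (min(g, d) << 32) + 2 * max(g, d) + (1 if g > d else 0)


host = candidate_priority("host")
srflx = candidate_priority("srflx")
relay = candidate_priority("relay")

ranked = sorted(
    [
        ("relay/relay", pair_priority(relay, relay)),
        ("host/srflx", pair_priority(host, srflx)),
        ("host/host", pair_priority(host, host)),
    ],
    key=lambda item: item[1],
    reverse=True,
)
print([name for name, _ in ranked])  # ['host/host', 'host/srflx', 'relay/relay']
```

Because the pair priority is dominated by the lower of the two candidate priorities, a pair is only ranked as "direct" if both ends of it are direct. And because each side sorts by the same value, both sides walk the list in roughly the same order.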

Of course, synchronization is only loose. Let's look at the case where both sides run checks again:

Recall that in this scenario Alice runs her checks, which fail but open a binding in her NAT, allowing Bob's check to succeed. Eventually, Alice would retransmit her checks, but this might take some time because retransmits, like the checks themselves, need to be paced to avoid overflowing the network. Because it's very probable that Alice's check will work, ICE includes an optimization called triggered checks in which an endpoint immediately (well, mostly immediately) schedules a check in the reverse direction upon receiving a check. This allows Alice to quickly discover that the path that is likely to work actually does work in the common case where it is valid.
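A minimal sketch of the scheduling idea, assuming a toy pacer that sends one check per timer tick (the `CheckScheduler` class and the pair labels are hypothetical, and real implementations track far more state):

```python
from collections import deque


class CheckScheduler:
    """Toy paced scheduler: next_check() is called once per timer tick."""

    def __init__(self, ordinary_pairs):
        self.ordinary = deque(ordinary_pairs)  # already sorted by priority
        self.triggered = deque()               # prompted by incoming checks

    def on_incoming_check(self, pair):
        # The peer just reached us on this pair, so the reverse direction
        # probably works: move it ahead of the ordinary checks.
        if pair in self.ordinary:
            self.ordinary.remove(pair)
        if pair not in self.triggered:
            self.triggered.append(pair)

    def next_check(self):
        if self.triggered:
            return self.triggered.popleft()
        if self.ordinary:
            return self.ordinary.popleft()
        return None


sched = CheckScheduler(["host/host", "host/srflx", "host/relay"])
first = sched.next_check()
sched.on_incoming_check("host/relay")  # the peer's check arrives on this pair
second = sched.next_check()
third = sched.next_check()
print(first, second, third)  # host/host host/relay host/srflx
```

The triggered check still waits for the next tick, which is the "mostly immediately" part.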

Multiple Media Paths/Frozen #

There's an additional complication. When ICE was first designed it was standard practice to use different address pairs for different streams of media. For instance, if you had an audio and video call, you would use different ports for them. Moreover, you needed twice as many ports because the media protocol in use here, the Real-time Transport Protocol (RTP), has an associated control protocol (RTCP) that is used for measuring packet delivery and that also used its own ports. In other words, a simple two person A/V call could need as many as four separate address/port pairs, which means that you need four times as many candidate pairs (two each for audio and video), and hence four times as many checks. ICE's term for these flows is "components".

This may be hard to visualize, so imagine a simplistic case in which we only have host and server reflexive candidates and we only want to establish two components. If we go back to our example above, Alice would have the following candidates:

Type	Address	Usage
Host	10.0.0.3:1111	Audio
Server Reflexive	192.0.2.1:1234	Audio
Host	10.0.0.3:1112	Video
Server Reflexive	192.0.2.1:1235	Video

And Bob would have:

Type	Address	Usage
Host	10.0.0.2:1111	Audio
Server Reflexive	198.51.100.1:5678	Audio
Host	10.0.0.2:1112	Video
Server Reflexive	198.51.100.1:5679	Video

Looking at it from Alice's perspective, she has four candidate pairs to check (recall that Alice doesn't need to pair her srflx candidates with Bob's candidates).

Local	Remote	Type	Usage
10.0.0.3:1111	10.0.0.2:1111	Host ↔ Host	Audio
10.0.0.3:1111	198.51.100.1:5678	Host ↔ Srflx	Audio
10.0.0.3:1112	10.0.0.2:1112	Host ↔ Host	Video
10.0.0.3:1112	198.51.100.1:5679	Host ↔ Srflx	Video

In order to optimize these checks, ICE takes advantage of the observation that NAT behavior is likely to be consistent, so if a set of candidates works for the audio component then a set of similar candidates (though of course with different addresses) is likely to work for the video component. In order to exploit this, ICE initially only checks one set of candidate pairs for each type and sets the others as frozen. If the first candidate pair succeeds, then ICE unfreezes the others. This avoids doing redundant checks in parallel. In this case, at the start of ICE, we would have a situation like this:

Local	Remote	Type	Usage	State
10.0.0.3:1111	10.0.0.2:1111	Host ↔ Host	Audio	Checking
10.0.0.3:1111	198.51.100.1:5678	Host ↔ Srflx	Audio	Checking
10.0.0.3:1112	10.0.0.2:1112	Host ↔ Host	Video	Frozen
10.0.0.3:1112	198.51.100.1:5679	Host ↔ Srflx	Video	Frozen

ICE would first check the pairs listed as "checking".^[5] Then if the audio host ↔ host candidate pair works, ICE would unfreeze the corresponding video candidate pair.

Local	Remote	Type	Usage	State
10.0.0.3:1111	10.0.0.2:1111	Host ↔ Host	Audio	Succeeded
10.0.0.3:1111	198.51.100.1:5678	Host ↔ Srflx	Audio	Checking
10.0.0.3:1112	10.0.0.2:1112	Host ↔ Host	Video	Checking
10.0.0.3:1112	198.51.100.1:5679	Host ↔ Srflx	Video	Frozen

The result of this is that once you determine that a given type of candidate pair works, you start checking the rest of the pairs of that type; as with triggered checks the idea here is to converge to a working set of candidate pairs as fast as possible.

As I said above, I'm simplifying a bunch and there's more to candidates being "similar" than just the types of the candidates. For instance, if I have both wired and WiFi network interfaces, each of those would have a candidate. If the wired candidate pairs succeed, I would just unfreeze those but not the wireless pairs. The way this is captured in ICE is by assigning each candidate a "foundation" that characterizes the candidate (based on IP address, type, etc.). The foundation of a candidate pair is the pair of local and remote foundations.
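Here is a rough sketch of the unfreezing step applied to Alice's four pairs. It assumes the simplified state names used in the tables, and it collapses the foundation down to a "local type/remote type" label, which is much cruder than what ICE actually computes.

```python
def make_pairs():
    """Alice's candidate pairs at the start of ICE (simplified states)."""
    return [
        {"local": "10.0.0.3:1111", "remote": "10.0.0.2:1111",
         "foundation": "host/host", "usage": "audio", "state": "Checking"},
        {"local": "10.0.0.3:1111", "remote": "198.51.100.1:5678",
         "foundation": "host/srflx", "usage": "audio", "state": "Checking"},
        {"local": "10.0.0.3:1112", "remote": "10.0.0.2:1112",
         "foundation": "host/host", "usage": "video", "state": "Frozen"},
        {"local": "10.0.0.3:1112", "remote": "198.51.100.1:5679",
         "foundation": "host/srflx", "usage": "video", "state": "Frozen"},
    ]


def on_success(pairs, winner):
    """Mark `winner` succeeded and unfreeze pairs sharing its foundation."""
    winner["state"] = "Succeeded"
    for pair in pairs:
        if pair["state"] == "Frozen" and pair["foundation"] == winner["foundation"]:
            pair["state"] = "Checking"


pairs = make_pairs()
on_success(pairs, pairs[0])  # the audio host/host check succeeds
print([p["state"] for p in pairs])
# ['Succeeded', 'Checking', 'Checking', 'Frozen']
```

The video host/srflx pair stays frozen until the audio pair with the same foundation succeeds.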

This is clearly not a great situation, but remember, we're not building from scratch. VoIP systems are built out of technologies designed back in the 1990s when people had different ideas about how to design networking protocols (and in particular when NATs and firewalls were less ubiquitous). Eventually, the IETF worked out how to multiplex multiple flows on the same address/port quartet using a pair of technologies called RTCP-mux and BUNDLE. This actually represents years of engineering work to retrofit the protocol mechanisms without causing backwards compatibility issues, but fortunately it mostly works now, so if you're on a modern system you're back to only needing a lot of checks rather than an absurd number.

Selecting Pairs #

OK, we're almost to the end now. Alice and Bob are running checks, some of which succeed and some of which fail. As noted above, it's quite common for more than one candidate pair to succeed for each path because the host ↔ srflx candidate pair will often work and one of the relayed candidate pairs will almost always work. This means you have multiple paths that might work, so now what?

You could just have each side independently pick its favorite candidate pair and send on it, but this turns out to be a bad idea. Remember that many NATs time out their bindings after a short period (10-30 seconds) of inactivity and that it's outgoing packets that keep the binding alive. If Alice and Bob use different paths, then Alice may not be sending the packets that keep the binding open for Bob's incoming packets. If Alice and Bob use the same candidate pair, then the path will be symmetrical and the binding will stay alive. This means we need some mechanism for picking which pair the endpoints will use.

In modern ICE, this works by having one endpoint (the "controlling" side)^[6] pick which pair to use. The controlling endpoint runs checks for each component until one succeeds that it wants to use (the actual logic here is unspecified, but typically you'd do something like wait until one of the direct pairs worked or they had all failed and one of the relayed pairs had succeeded) and then it sends another check on the same pair with the USE-CANDIDATE flag (this is called "nominating" the pair). When the (controlled) peer sees that flag it knows to use that candidate pair going forward. When the controlling side's check succeeds (which should always happen if the pair is already successful), it knows it is safe to use the pair as well, and from here forward both sides will just use that pair.
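Since the selection logic is unspecified, the following is just one plausible policy matching the "typical" behavior described above, not what any particular implementation does. The `pick_nominee` function and the pair dictionaries are hypothetical, and the list is assumed to be sorted by priority.

```python
def pick_nominee(pairs):
    """Return the pair to nominate, or None to keep waiting.

    pairs: priority-ordered dicts with 'name', 'relayed' (bool), and
    'state' ('Checking', 'Succeeded', or 'Failed').
    """
    direct = [p for p in pairs if not p["relayed"]]
    relayed = [p for p in pairs if p["relayed"]]
    # Prefer the highest-priority direct pair that has succeeded.
    for pair in direct:
        if pair["state"] == "Succeeded":
            return pair
    # Fall back to a relay only once every direct pair has failed.
    if all(p["state"] == "Failed" for p in direct):
        for pair in relayed:
            if pair["state"] == "Succeeded":
                return pair
    return None


pairs = [
    {"name": "host/host", "relayed": False, "state": "Failed"},
    {"name": "host/srflx", "relayed": False, "state": "Checking"},
    {"name": "relay/relay", "relayed": True, "state": "Succeeded"},
]
waiting = pick_nominee(pairs)        # a direct pair is still being checked
pairs[1]["state"] = "Failed"
fallback = pick_nominee(pairs)       # all direct pairs failed: use the relay
print(waiting, fallback["name"])     # None relay/relay
```

A real implementation would presumably also put a time limit on how long it waits for the direct pairs.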

Of course, it might take some time for the controlling endpoint to run enough checks to feel comfortable picking one, and you want to have media start flowing right away. To accommodate this, ICE allows endpoints to start sending media as soon as they have a valid pair, even before one has been nominated.^[7] This shortens setup latency while allowing time for the controlling endpoint to nominate the optimal pair. Usually this will happen quickly enough that you don't need to worry about the bindings timing out. It does mean, however, that the path the media takes may change as the ICE checking process proceeds.

Trickle ICE #

Classic ICE is a sequential process:

Gather all your candidates and send them to the other side.
Receive the other side's candidates
Run checks

This all works fine if candidate gathering is fast, but what if it's not? For instance suppose you are behind a firewall which blocks UDP and you have to use TCP to connect to the relay server? If the firewall just drops the packets without sending you errors, you're waiting for the candidate gathering process to time out. This might take several seconds (potentially more, depending on your timers) to discover. In the meantime, people are just waiting, which isn't ideal.

To deal with this, the Google Hangouts team invented a technique called trickle ICE,^[8] in which each side sends candidates as soon as it has them, so that they "trickle" in over time. This creates some additional complexity because you have to incrementally pair new local or remote candidates, but has the potential to significantly decrease the time to connection establishment. This is especially useful in the context of WebRTC, when the Web site doesn't necessarily know in advance which of the various STUN or TURN servers it is offering will actually be reachable by the client.
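The incremental pairing is less scary than it sounds. A bare-bones sketch, assuming a hypothetical `TrickleAgent` class that ignores priorities, pruning, and signaling entirely:

```python
class TrickleAgent:
    """Toy agent that pairs candidates as they arrive from either side."""

    def __init__(self):
        self.local = []
        self.remote = []
        self.pairs = []

    def add_local(self, cand):
        # In a real agent, `cand` would also be signaled to the peer here.
        self.local.append(cand)
        new_pairs = [(cand, rem) for rem in self.remote]
        self.pairs.extend(new_pairs)
        return new_pairs  # checks can start on these right away

    def add_remote(self, cand):
        self.remote.append(cand)
        new_pairs = [(loc, cand) for loc in self.local]
        self.pairs.extend(new_pairs)
        return new_pairs


agent = TrickleAgent()
agent.add_local("local-host")             # available immediately
early = agent.add_remote("remote-host")   # first pair: checks can begin
late = agent.add_local("local-relay")     # slow TURN allocation finishes later
print(early, late)
# [('local-host', 'remote-host')] [('local-relay', 'remote-host')]
```

The host pair gets checked while the relay allocation is still timing out or completing, which is the whole point.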

Backwards Compatibility #

As described above, ICE has been through a number of iterations, and so it's possible that a modern endpoint will end up talking to an older endpoint. For example:

An endpoint that supports RFC 8445 ICE might need to talk to an endpoint that supports RFC 5245 ICE.
An endpoint that supports trickle ICE might talk to a non-trickle endpoint.
An endpoint that supports component multiplexing (BUNDLE) might talk to one that does not.

In the classic SIP softphone setting, there's no real way to know what the peer supports, so you need to send ICE information that is compatible with the other endpoint.^[9] For instance, if you support trickle but you don't know what the other side supports, then you need to gather all the candidates you will need anyway, but you can say in your message that you support trickle, and so the other side can use it (this is called "half trickle").

Similarly, if you support component multiplexing, but you don't know if the other side does, then you may need to gather candidates for all the components, even if the other side is going to throw most of them away. This can get quite expensive, however, and the default for WebRTC is what's called "balanced" mode, in which you gather candidates only for the first stream of each type (e.g., the first audio channel). If the other peer supports bundling components, then this works fine, and if it doesn't, then only the first stream connects. Of course, actually designing something that fell back gracefully in this situation instead of just freaking out because there were no candidates available for the later components took some doing.

The situation is a bit better if you know you are doing a call that is WebRTC on both ends—e.g., because both ends are browsers or one end is a modern conference server—for two reasons. First, the WebRTC specifications (specifically JSEP) require support for multiplexing (both BUNDLE and RTP/RTCP) and for trickle ICE, so you know you have a modern endpoint on the other side. Second, the server can use JS APIs to determine the capabilities of each endpoint, so it has a better chance of getting an interoperable configuration.

Security #

The threat model for ICE is confusing for a number of reasons:

A full network attacker will generally be able to manipulate packets (e.g., drop them, send them with a bogus IP, etc.) and so you have limited protection against such an attacker.
If you use media encryption between the endpoints—as was uncommon back in 2010 when ICE was first designed but is mandatory in WebRTC—then even an attacker who sees all the packets has limited abilities.
In the WebRTC case, the Web site actually invoking the APIs may be an attacker, though they probably do not control the network.

In general, then, we have three main objectives:

That an attacker who can't see your packets can't interfere with connection formation or reroute traffic to themselves.
That an attacker who can see packets can't just forge arbitrary content (more on this below).
That a non-network attacker driving the WebRTC API (or a SIP peer, though this is a weaker attacker) can't force you to connect to someone besides themselves by providing their address in a candidate.

Most of these attacks are prevented by two security mechanisms found in STUN:

Each STUN message is cryptographically protected (via an authentication tag that prevents tampering with the message) with a username and password exchanged along with the ICE parameters.
Each STUN check has a unique 96-bit transaction identifier which must be echoed in the response.

These two mechanisms work together.

Because the credentials are not known to network attackers, they are unable to forge requests or responses. This is not a complete defense: as noted above, a full network attacker can take a valid packet and send it from a fake IP address, causing the receiver to think it came from somewhere else (as with a peer reflexive address), though they can't tamper with the contents. Still, it prevents a number of attacks. The username mechanism also prevents cases of ambiguity in which a STUN check arrives at another endpoint which just happens to be doing STUN. Because the username will be different, it will not respond to the check.

However, the username and password mechanism does not prevent attacks by a Web site using WebRTC, because that site knows the username and password. What stops such a site is the transaction ID: because it is unpredictable and, importantly, not revealed to the site's JavaScript, the site can't forge a response to any check it doesn't receive. Thus, ICE establishes that the receiver of the traffic has consented to receive it.
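Here is a small sketch of how the two mechanisms combine. This is not the STUN wire format: the message is just a dictionary, the helper names are invented, and the only details borrowed from STUN are the HMAC-SHA1 integrity tag and the 96-bit transaction ID.

```python
import hashlib
import hmac
import secrets


def _tag(username, txid, password):
    # Authentication tag over the message contents, keyed by the password.
    return hmac.new(password, username.encode() + txid, hashlib.sha1).digest()


def make_check(username, password):
    txid = secrets.token_bytes(12)  # 96-bit unpredictable transaction ID
    return {"username": username, "txid": txid,
            "tag": _tag(username, txid, password)}


def make_response(request, password):
    # A response must echo the transaction ID from the request it answers.
    return {"username": request["username"], "txid": request["txid"],
            "tag": _tag(request["username"], request["txid"], password)}


def accept_response(request, response, password):
    expected = _tag(response["username"], response["txid"], password)
    authentic = hmac.compare_digest(expected, response["tag"])
    return authentic and response["txid"] == request["txid"]


password = b"password-from-signaling"  # hypothetical value
request = make_check("alice:bob", password)
genuine = make_response(request, password)

# A Web site knows the password but never saw the check, so it has to
# guess the transaction ID.
guess = {"username": "alice:bob", "txid": secrets.token_bytes(12)}
forged = make_response(guess, password)

print(accept_response(request, genuine, password),
      accept_response(request, forged, password))  # True False
```

The tag stops anyone without the password; the transaction ID stops someone who has the password but never received the check.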

Final Thoughts #

It's important to remember how we got here. In part I I wrote:

NATs provide a particularly good example of the way the Internet evolves, which is to say workaround upon workaround. The reason for this is what Google engineer Adam Langley calls the "Iron law of the Internet", namely that the last person to touch anything gets blamed. The people who first built and deployed NATs had to avoid breaking existing deployed stuff, forcing them to build hacks like ALGs and unpredictable idle timeouts. Now that NATs are widely deployed, new protocols have to work in that environment, which forces them to run over UDP and to conform to the outgoing-only flow dynamics dictated by the NAT translation algorithms.

As we can see with ICE, it's not just a matter of working with existing NATs but of working with all the systems that were deployed before ICE was available, as well as working with previous versions of ICE. The result is a system of extreme complexity which almost nobody really understands, which has to run before even the first byte of media is delivered. And yet, it mostly works, as you can see for yourself if you use any WebRTC-based calling system such as Meet or Teams.

The astute reader may have noticed that in the "different network" scenario, Alice and Bob's server reflexive addresses have different IPs whereas in the "same network" scenario they have the same IP address. You might think you could compare the addresses to determine which situation you were in. Unfortunately, this isn't dispositive because Alice and Bob might be behind a carrier grade NAT which had a pool of multiple IP addresses that it assigned from. Because of the way that IP addresses are assigned, it's not generally possible to determine whether two addresses belong to the same network. ↩︎
When all you have is a hammer, everything looks like a nail. ↩︎
Technical note: Bob doesn't retransmit his ACKs; he just responds to Alice's retransmissions. This is a pretty typical reliability design because otherwise you end up worrying about whether ACKs were delivered and having ACKs of ACKs, which is a mess. ↩︎
This is of course not always true, but it's a good rule of thumb. ↩︎
I'm simplifying the algorithm here as they would actually start in Waiting and then move to In-Progress. ↩︎
Don't make me explain how we decide which is which. ↩︎
In the original version of ICE, there was instead something called "aggressive mode" in which the controlling endpoint would send USE-CANDIDATE on multiple pairs and the controlled endpoint would pick the highest priority one, but that was removed in favor of this rule. ↩︎
This idea, documented in XEP-0176, appears to be originally due to Joe Beda. Thanks to Justin Uberti for helping me track this down. ↩︎
You might even be talking to an endpoint that doesn't support ICE, but for all the reasons we've discussed here, that's basically not going to work. ↩︎

Defending against Bluetooth tracker abuse: it’s complicated

2023-05-08T00:00:00Z

Bluetooth-based tracking tags like AirTags and Tiles are fantastically useful for finding lost stuff like your keys, your bike, or your cat. Unfortunately, they are a dual use technology which is also easy to use for surreptitiously tracking other people. This isn't a complicated attack to mount: you get a tracking tag and pair it with your own phone, plant it on your victim, and then use the find my stuff feature to monitor their location. This unpleasant fact isn't news: there have been concerns about misuse of these technologies for years, especially after the release of AirTags (see my earlier post for some initial thoughts).

On Tuesday Google and Apple published a set of guidelines for how trackers should behave to reduce the risk of unwanted tracking. This post takes a look at that document and the bigger problem space.

Background: Bluetooth Trackers #

Because these tracking systems are non-interoperable, they don't necessarily all work the same way. However, Apple provides some detail about how the system works, and back in 2021 Heinrich, Stute, Kornhuber, and Hollick reverse engineered the system and published a paper in PoPETS describing how it works as well as some vulnerabilities.

The obvious design for this kind of system would be to just have each tag have a single fixed identifier which it broadcast periodically over Bluetooth Low Energy (BLE). As a practical matter, the tag doesn't actually broadcast unless it's out of range of one of the devices its owner has paired it with; if it's in range, then the owner device can find it directly. Whenever the tag was within range of a participating device (e.g., a phone), that phone would then upload the device tag and its own position to some central server. When you lost your device, you would then contact that server and request its last known location, as shown in the diagram below:

This system has some obvious security and privacy issues:

The service can track the position of any tag (and in fact all tags) just by looking at the database.
In fact, anyone can track a tag if they know the identifier, so if you see it once, you can just query the database.
Even without access to the database, an attacker can reidentify a given device. For instance, if you had a receiver at the entrance to a store, you could see when the same person came by again (this is a similar set of issues to those with license plates).

The second and third attacks can be addressed by just having a rotating identifier. I.e., each tag $i$ has a secret value $SK_i$ which it shares with its owner at the time of pairing with the device. Instead of broadcasting $SK_i$ directly, it uses it as the seed for a pseudorandom function (PRF) to create a rotating identifier $ID_{i,t}$ where $t$ is the current time and broadcasts that instead. Each identifier will be used for a fixed time (say 15 minutes) and then the tag generates a new identifier and broadcasts that. The device owner knows $SK_i$ and can use it to generate $ID_{i,t}$ so it can still query the central service just by asking for the IDs for recent times, but someone who just observes a single ID can't query the service for the locations of other IDs for the same tag (and of course they already know the location at the time of observation).
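As a concrete illustration, here is a minimal sketch of the rotating identifier. It assumes HMAC-SHA256 as the PRF, a 15 minute rotation period, and a 16-byte truncated output; none of these choices come from any real tracker, and the function names are invented.

```python
import hashlib
import hmac

ROTATION_SECONDS = 15 * 60  # each identifier is used for 15 minutes


def rotating_id(sk, unix_time):
    """ID_{i,t}: PRF keyed by SK_i, evaluated on the current time period."""
    period = unix_time // ROTATION_SECONDS
    digest = hmac.new(sk, period.to_bytes(8, "big"), hashlib.sha256).digest()
    return digest[:16]


def recent_ids(sk, now, periods):
    """What the owner would query the service for: the last few IDs."""
    return [rotating_id(sk, now - k * ROTATION_SECONDS) for k in range(periods)]


sk = bytes(32)  # placeholder secret; a real SK_i would be random
same_period = rotating_id(sk, 0) == rotating_id(sk, 899)
next_period = rotating_id(sk, 899) == rotating_id(sk, 900)
print(same_period, next_period)  # True False
```

Anyone holding $SK_i$ can regenerate the whole sequence with `recent_ids`, but an observer who captures a single 16-byte value has no way to compute the next one.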

Rotating IDs #

This also partly solves the problem of the service tracking the tag, because it also cannot link up multiple identifiers, so all it has is a set of locations. However, if there are comparatively few tags then the service can infer people's behavior just by looking at the unlinked locations. E.g., if I see two IDs on Highway 101 traveling in opposite directions (inferred from the lane they are in) and then some other ID getting off on a Southbound exit, I can infer that there was a single device that was going South and then exited, but it's less information. In addition, when someone queries for the location of their tag, then the service provider gets the IDs for a range of time periods, which it knows all correspond to the same device, and can then link up the motion of the tag during that time range.

Apple's design (the best documented) addresses this by having the locations where the tags are detected encrypted to the device owner. This works similarly to the rotating ID system except that instead of generating a rotating ID, the tag generates a rotating private/public key pair: $(Priv_{i,t}, Pub_{i,t})$. The tag broadcasts $Pub_{i,t}$ just as it would the ID, but then when a device sees the broadcast, it uploads the location encrypted under that public key. When the device owner wants to find the tag, it queries the server using the public key^[1] (just as it would have before with the tag) and gets the encrypted value. Because it shared $SK_i$ with the tag, it can generate $Priv_{i,t}$ and can decrypt the encrypted location, as shown in the figure below:

Privacy Properties #

This system has significantly improved privacy properties. As with a simple rotating identifier, an attacker can't track a tag using multiple observations over an extended period. And because the reports are encrypted, the service provider is not able to directly determine the actual location of the device. However, that doesn't mean that the service provider doesn't learn anything. In particular:

If two owners both query the location of lost tags which are reported by the same device, then it allows the service to infer that the owners were at one point in the same location (this attack is reported in the Heinrich et al. paper).
If two devices both report the location of the same tag then the provider can infer that those devices were in the same location at the time of the report.
If the service provider has an independent way of learning the location of a reporting device—for instance by IP location or because the owner uses some location-based service—and then the owner queries for its location, the service gets to learn information about the owner's movements (because that is where they probably lost the tag). This attack is exacerbated by the fact that you want to query multiple keys (one for each time range), so the service might learn multiple locations for the same tag and be able to link them.

The root cause of all of these issues^[2] is that the service gets to learn the identity of reporting devices when they make reports, as well as potentially of the device owner when they query for location. This part of Apple's design isn't very clearly documented, but presumably the rationale for identifying the endpoints is to prevent abuse (e.g., forged location reports) by requiring that they be genuine Apple devices (see Section 9.4 of Heinrich et al.). It should be possible to address this issue using standard anonymity techniques such as Oblivious HTTP,^[3] though it doesn't appear Apple does that.

Unwanted Tracking #

The privacy mechanisms described above are about preventing other people from learning the location of your tags, but the way you use a system like this to track someone else is to attach one of your tags to something of theirs and then query the system to see where your tag is. This is a much harder problem to solve because the whole point of the system is that the tag isn't attached to you (that's why you're looking for it!) and there's no real technical way to distinguish the case where I accidentally left my keys in your car from the one where I maliciously stuck an AirTag to your car to track you.

Instead, the countermeasures that Apple and others have designed seem to center around making this situation detectable. Specifically:

If AirTags are away from their owners for "an extended period of time" they make a sound when moved.
If your iOS device detects that an AirTag that doesn't belong to you is moving with you, it will notify you on the device and then you can try to find it and figure out what's going on.

Once you have detected a tag that appears to be following you, AirTags also include a feature that lets you partially identify the owner of the tag, as long as you can physically access the tag.

[Source: Apple]

My personal experience is that these features are both fairly hit and miss. In terms of the sound notification, the speaker in AirTags is pretty quiet and the noise is kind of intermittent. We use AirTags to keep track of our cats, but they're paired to my wife's phone, not mine. After she had been out of town for several days, I finally noticed the AirTags making sound and took them off the cats' collars, but the first time this happened I probably heard the sound about three or four times (and who knows how many times I didn't hear it) before I figured out what it was. We're all constantly surrounded by stuff beeping so it's easy to get habituated to it.

Similarly, I've had the "someone is moving with you" trigger a number of times—most recently Saturday—such as when someone accidentally left their AirPods around, but that also takes a while to trigger and is easy to ignore. I imagine both of these features would work a lot better if you were really worried about being tracked, but at least in my experience there are a lot of false positives, which makes the whole system less useful than one might like.

The Apple/Google Draft #

On Tuesday, Apple and Google published a document describing guidelines for how trackers ought to behave in order to make unwanted tracking easier to detect.

Today Apple and Google jointly submitted a proposed industry specification to help combat the misuse of Bluetooth location-tracking devices for unwanted tracking. The first-of-its-kind specification will allow Bluetooth location-tracking devices to be compatible with unauthorized tracking detection and alerts across iOS and Android platforms. Samsung, Tile, Chipolo, eufy Security, and Pebblebee have expressed support for the draft specification, which offers best practices and instructions for manufacturers, should they choose to build these capabilities into their products.

Mostly this document provides detailed specifications of the behaviors I've described informally above. For instance, here's the portion describing how the audible alerts should work:

   After T_(SEPARATED_UT_TIMEOUT) in separated state, the accessory MUST
   enable the motion detector to detect any motion within
   T_(SEPARATED_UT_SAMPLING_RATE1).

   If motion is not detected within the T_(SEPARATED_UT_SAMPLING_RATE1)
   period, the accessory MUST stay in this state until it exits
   separated state.

   If motion is detected within the T_(SEPARATED_UT_SAMPLING_RATE1) the
   accessory MUST play a sound.  After first motion is detected, the
   movement detection period is decreased to
   T_(SEPARATED_UT_SAMPLING_RATE2).  The accessory MUST continue to play
   a sound for every detected motion.  The accessory SHALL disable the
   motion detector for T_(SEPARATED_UT_BACKOFF) under either of the
   following conditions:

   *  Motion has been detected for 20 seconds at
      T_(SEPARATED_UT_SAMPLING_RATE2) periods.

   *  Ten sounds are played.

   If the accessory is still in separated state at the end of
   T_(SEPARATED_UT_BACKOFF), the UT behavior MUST restart.
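
To make the quoted behavior a bit more concrete, here is a minimal sketch of the separated-state sound logic as I read it. The class and method names and the 0.5-second sampling period are my own placeholders rather than values taken from the draft, and the two timers (T_(SEPARATED_UT_TIMEOUT) and T_(SEPARATED_UT_BACKOFF)) are left to the caller.

```python
from dataclasses import dataclass


@dataclass
class SeparatedAlert:
    """Hypothetical model of the motion-triggered sound logic in separated state."""
    rate2_period_s: float = 0.5   # placeholder for T_(SEPARATED_UT_SAMPLING_RATE2)
    max_motion_s: float = 20.0    # "Motion has been detected for 20 seconds"
    max_sounds: int = 10          # "Ten sounds are played"
    sounds: int = 0
    motion_s: float = 0.0
    in_backoff: bool = False

    def sample(self, motion: bool) -> bool:
        """Handle one motion-detector sample; return True if a sound is played."""
        if self.in_backoff or not motion:
            return False
        if self.sounds > 0:
            # After the first motion, samples arrive at the faster RATE2 period.
            self.motion_s += self.rate2_period_s
        self.sounds += 1
        if self.sounds >= self.max_sounds or self.motion_s >= self.max_motion_s:
            # The motion detector is disabled for T_(SEPARATED_UT_BACKOFF).
            self.in_backoff = True
        return True

    def backoff_expired(self) -> None:
        """Restart the behavior if the tag is still separated after the backoff."""
        self.sounds = 0
        self.motion_s = 0.0
        self.in_backoff = False


alert = SeparatedAlert()
played = [alert.sample(True) for _ in range(15)]
print(sum(played), alert.in_backoff)  # 10 True
```

With continuous motion, the ten-sound limit is what stops the noise in this model, which matches my experience of the alert being easy to miss.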

Not a full specification #

What this document is not, however, is a complete specification of a tracking system. In particular, it doesn't cover any of the fancy (well fancy-ish) cryptography I described above. Instead, it describes a Bluetooth container for the messages, with the following contents:

Bytes	Description	Requirement
0-5	MAC address	REQUIRED
6-8	Flags TLV; length = 1 byte, type = 1 byte, value = 1 byte	OPTIONAL
9-12	Service data TLV; length = 1 byte, type = 1 byte, value = 2 bytes (TBD value)	REQUIRED
13	Protocol ID (TBD value)	REQUIRED
14	Near-owner bit (1 bit) + reserved (7 bits)	REQUIRED
15-36	Proprietary company payload data	OPTIONAL
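
As a rough illustration of the layout in the table, here is a sketch of a parser for one of these frames. It assumes the optional Flags TLV is present (so the offsets match the table exactly), that the near-owner bit is the least significant bit of byte 14, and that the MAC address can be printed in the order it appears; none of those details are spelled out in the portion of the table I've quoted, and the names are mine.

```python
from dataclasses import dataclass


@dataclass
class TrackerAdvertisement:
    mac: str
    protocol_id: int
    near_owner: bool
    payload: bytes


def parse_advertisement(frame: bytes) -> TrackerAdvertisement:
    """Split a frame into the fields listed in the table (offsets per the table)."""
    if len(frame) < 15:
        raise ValueError("frame too short to contain the required fields")
    mac = ":".join(f"{b:02x}" for b in frame[0:6])   # bytes 0-5
    # Bytes 6-8 (Flags TLV) and 9-12 (Service data TLV) are skipped here.
    protocol_id = frame[13]                           # byte 13
    near_owner = bool(frame[14] & 0x01)               # assumption: LSB of byte 14
    payload = frame[15:37]                            # bytes 15-36, if present
    return TrackerAdvertisement(mac, protocol_id, near_owner, payload)


# A made-up frame: MAC, 3 flag bytes, 4 service-data bytes, protocol ID,
# near-owner byte, then some stand-in "proprietary" payload.
frame = (bytes.fromhex("aabbccddeeff") + bytes(3) + bytes(4)
         + bytes([0x7F, 0x01]) + b"key-material")
adv = parse_advertisement(frame)
print(adv.mac, adv.near_owner, adv.payload)
```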

As far as I can tell, the cryptographic pieces would go in the "proprietary company payload data" portion, though it's actually not clear to me precisely how this works in the case of AirTags. As Heinrich et al. describe, the BLE payload is quite small (31 bytes for the ADV_NONCONN_IND PDU) but the Bluetooth standard requires a 4-byte header for manufacturer-specific data, so Apple had to do some tricky engineering to get the P-224 public key (28 bytes) into the remaining 27 bytes of the packet (they repurpose part of the MAC address to do this). It's not quite clear to me how Apple plans to stuff the public key into the 21 "proprietary payload" bytes, but presumably they have some plan in mind. Any readers who know how this is supposed to work should reach out. Maybe they plan to send two packets?

The key point here is that this isn't enough of a specification to provide interoperability between systems. For instance, it wouldn't tell you enough to build your own tags which worked with Apple's tracking network; it's just supposed to be enough to tell you how to build your tracking tags so that they are detectable. Note the careful phrasing here: the document doesn't tell you how to detect tracking tags, it just tells you how to build tags which are trackable and you are left to infer how to detect them.

Detecting Tracking Tags #

With that said, this document does help explain something confusing about the description I provided above, namely how devices are to detect that a tag is following them if the identifier it broadcasts changes every 15 minutes. The answer appears to be that the BLE address doesn't change.

An accessory SHALL rotate its resolvable and private address on any transition from near-owner state to separated state as well as any transition from separated state to near-owner state.

When in near-owner state, the accessory SHALL rotate its resolvable and private address every 15 minutes. This is a privacy consideration to deter tracking of the accessory by non-owners when it is in physical proximity to the owner.

When in a separated state, the accessory SHALL rotate its resolvable and private address every 24 hours. This duration allows a platform's unwanted tracking algorithms to detect that the same accessory is in proximity for some period of time, when the owner is not in physical proximity.

The "resolvable" address refers to the BLE network address (MAC address). In other words, when in the separated state, the tag sends out beacon packets where the MAC address is constant for 24 hours even if the public key rotates every 15 minutes (and remember that the public key encryption piece isn't specified here). So presumably what you are supposed to do as a device is look for any tag (identified by MAC address) that has been following you for a while and if so alert the user. But how long a period is "a while"? Who knows? That's up to you.
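
Since the document leaves the detection side to the implementer, here is one guess at what the simplest version might look like. The 30-minute window and three-sighting minimum are numbers I made up, and a real detector would presumably also check that you had actually moved between sightings.

```python
from collections import defaultdict


class FollowDetector:
    """Hypothetical detector: flag a MAC address seen repeatedly over a long span."""

    def __init__(self, min_duration_s: float = 1800, min_sightings: int = 3):
        self.min_duration_s = min_duration_s   # made-up threshold for "a while"
        self.min_sightings = min_sightings     # made-up minimum number of sightings
        self._seen = defaultdict(list)         # MAC address -> sighting timestamps

    def observe(self, mac: str, timestamp_s: float, near_owner: bool = False) -> bool:
        """Record one advertisement; return True if the user should be alerted."""
        if near_owner:
            # The near-owner bit says the tag is with its owner, so ignore it.
            return False
        times = self._seen[mac]
        times.append(timestamp_s)
        long_enough = times[-1] - times[0] >= self.min_duration_s
        return len(times) >= self.min_sightings and long_enough


detector = FollowDetector()
alerts = [detector.observe("aa:bb:cc:dd:ee:ff", t) for t in (0, 900, 1800)]
print(alerts)  # [False, False, True]
```

This only works because the MAC address stays put in the separated state; a tag that rotated its address every 15 minutes would never accumulate enough sightings under one identifier.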

Why not just rotate the address every 24 hours all the time? Two reasons: (1) rotating every 15 minutes while near the owner prevents triggering the detection algorithm, as long as its trigger threshold is more than 15 minutes and (2) it makes the tag less trackable in cases where it is traveling with its owner (see rotating IDs above). There is also a "near-owner" bit in the advertisement that says that the tag is near its owner and that detecting devices shouldn't treat it as tracking them.
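
The rotation rules quoted above are simple enough to write down directly. This is just a restatement of the three SHALL clauses as a function; the function and constant names are mine.

```python
NEAR_OWNER_ROTATION_S = 15 * 60        # every 15 minutes when near the owner
SEPARATED_ROTATION_S = 24 * 60 * 60    # every 24 hours when separated


def should_rotate(was_near_owner: bool, is_near_owner: bool,
                  since_rotation_s: float) -> bool:
    """Return True if the accessory should rotate its address now."""
    if was_near_owner != is_near_owner:
        # Any transition between near-owner and separated forces a rotation.
        return True
    limit = NEAR_OWNER_ROTATION_S if is_near_owner else SEPARATED_ROTATION_S
    return since_rotation_s >= limit


print(should_rotate(True, True, 16 * 60))     # near owner, 16 minutes elapsed
print(should_rotate(False, False, 16 * 60))   # separated, 16 minutes elapsed
```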

Once a tag is detected, it is also possible to connect to it directly and query its information (manufacturer, product type, etc.), as well as to cause it to play a sound. It is also possible to retrieve the device serial number as long as you can demonstrate close proximity, either via an NFC connection or some user action on the device itself (pressing a button, etc.).

The Broader Threat Model #

My bigger concern is that this document seems to be limited to a fairly narrow threat model, which is to say tracking by naive attackers who take an off-the-shelf tag and attach it to their victim. The Apple/Google document describes a set of behaviors that companies ought to build into their trackers to mitigate this threat, but unfortunately, this isn't the only threat.

It's already possible to buy relatively compact GPS trackers that don't depend on using Bluetooth to talk to other devices (see this older post for more on this topic). However, these trackers are expensive (about $300, plus a subscription), have battery lifetimes measured in days or (at best) weeks, and are several centimeters across, so are somewhat hard to conceal. By contrast, tracking tags like Tiles or AirTags have a combination of features that makes them more attractive for surveillance.

They are compact (thus easy to hide)
They are cheap (thus easy to obtain)
They have long battery lifetimes (and thus are suitable for long-term surveillance)

These features are made possible by the existence of a widespread network of devices (phones, etc.) which can report the position of a lost tag. That network allows the use of much cheaper and energy efficient technologies than a tracker like the Garmin inReach, which needs both a GPS receiver and a satellite transmitter. It's that network that creates the risk, not the tracking tags themselves. Specifically, if the attacker can obtain a tag which can successfully be located with the tracking network but which doesn't conform to the behaviors specified in this document, then the detection mechanisms that this document anticipates will be less effective if not completely useless.

There are at least two possible ways for an attacker to obtain such a tag:

Modifying an existing tag: The stock tags made by each manufacturer are cheap and generally reasonably well-engineered, so it's convenient for the attacker if they can just buy them and disable the anti-tracking features. For example, in his thorough AirTag teardown, Adam Catley observes that it's possible to disable the speaker in an AirTag and suggests that the tag be modified to check to see if the speaker is actually making noise. Depending on the design of the tag, it might be possible to rewrite the firmware to violate the requirements in this document, for instance by rotating the MAC address frequently to evade detection (oddly, this document says "The accessory SHOULD have firmware that is updatable by the owner", which is the opposite of what you want here).
Building an entirely new tag: Even if the stock tags are hard to modify, once it's public information how these devices are built it's possible to make your own tags that don't have any anti-tracking features at all. In fact, this already exists in the form of OpenHaystack, built by the same team that published the PoPETS paper I've been relying on for most of this analysis. OpenHaystack is designed to run on commodity hobby hardware like the BBC micro:bit, which is quite a bit bigger than an AirTag, but obviously it would be possible for someone to engineer something compact and cheap, perhaps using the AirTag design as a starting point. Note that it doesn't really help that the specific design of any individual system is secret: there are tens of millions of these devices out there, and it just takes one person to reverse engineer a tag and publish the results.

Either of these attacks requires more sophistication than just buying an AirTag through Amazon, but the would-be stalker doesn't have to have that sophistication themselves; they just need some third party to start making and selling tags that are suitable for surveillance. If such devices become widely available, then the countermeasures Apple and Google are proposing will become much less effective. There's already a market for "stalking apps," so this seems like a real risk.

What you really want here is for it not to be possible to make a tag which participates in the tracking network without implementing the specified anti-tracking behaviors. This is a hard job under any circumstances (though see some handwaving ideas below), but is made much harder by specifying a design in which tracking detection pieces are specified at one level (the BLE layer) and the official "find my device" functionality is implemented in a proprietary layer that sits on top of that. That makes it very easy for an attacker to build their own tag that complies with the (reverse engineered) proprietary pieces but then violates the rules at the BLE layer. I can understand why Apple and Google, who each presumably have some proprietary design, want to avoid standardizing that piece, but the result is that the problem of detecting unwanted tracking is much harder.

Attestation #

The most straightforward approach is if we assume that "official" devices behave correctly and then have some mechanism for detecting official devices. The standard approach here is to have what's called an "attestation" mechanism in which each legitimate device has some secret embedded by the manufacturer which can be used to prove that it's legitimate (e.g., by signing something). (See [here](/posts/verifying-software) for more on this.) Devices would then require tags to prove they were legitimate before reporting their location to the network. Of course, this secret has to be embedded in tamper-resistant hardware to prevent an attacker stealing the secret and making their own fake devices.

Actually building a system like this in such a way that the attestation doesn't itself become a tracking vector (e.g., by having each device have a single attestation key which can then be tracked) is challenging cryptographically (this is also an issue with the WebAuthn public key authentication system), but there are some approaches that sort of work, or at least are somewhat better than the naive design.^[4]

However, even if you know for sure that you are talking to a legitimate device, that doesn't necessarily tell you that it's acting as it's supposed to. As a simple example, you might have a device which sent the right BLE data but whose speaker had been disabled (or which was wrapped in sound-absorbing material). A fancier attacker might take a legitimate tag and proxy its signals to the device by putting it in a radio-absorbing case and then receiving and retransmitting whatever signals it sent, as shown below:

In this example, the tag is in the separated state, so it is supposed to keep a constant MAC address (though presumably still rotate its public key). However, the attacker captures this message and rewrites the MAC address so it looks like a different device, fooling the detection algorithm.

This kind of cut-and-paste attack is possible to address by having the proprietary pieces that the network relies on enforce the correctness of the anti-tracking pieces (e.g., by signing the expected MAC address), but in order for this to work, they need to be aware of each other, which, as I said, isn't specified anywhere in this document. The point here is that successfully designing anti-tracking mechanisms requires analyzing the system as a whole, not just looking at one piece at a time. In particular, it's necessary to understand how the as-designed functionality works in order to build anti-tracking countermeasures which can't be separated from that functionality. And of course, in the case of the audible alerts, in some cases that may not be possible to do.
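
To illustrate what "signing the expected MAC address" might mean, here is a purely hypothetical sketch. Nothing like this appears in the Apple/Google document; I'm using an HMAC under a shared key as a stand-in for a real signature made with a key embedded in the tag, just to show how binding the MAC address to the proprietary payload would let a verifier notice a rewritten address.

```python
import hashlib
import hmac


def bind(key: bytes, mac_addr: bytes, payload: bytes) -> bytes:
    """Compute a tag over the 6-byte MAC address plus the proprietary payload."""
    if len(mac_addr) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return hmac.new(key, mac_addr + payload, hashlib.sha256).digest()


def verify(key: bytes, mac_addr: bytes, payload: bytes, tag: bytes) -> bool:
    """Check that the payload was bound to the MAC address it arrived with."""
    return hmac.compare_digest(bind(key, mac_addr, payload), tag)


key = b"hypothetical-tag-secret"
payload = b"rotating-public-key"
real_mac = bytes.fromhex("aabbccddeeff")
tag = bind(key, real_mac, payload)

forged_mac = bytes.fromhex("112233445566")   # attacker rewrites the address
print(verify(key, real_mac, payload, tag))    # True
print(verify(key, forged_mac, payload, tag))  # False
```

In a real design the check would have to be done by something that the attacker can't bypass, which is the whole problem: the network pieces and the anti-tracking pieces need to know about each other.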

Worse yet, we already have a giant installed base of devices which don't have any kind of attestation, and presumably vendors want them to continue to work. This means that even if we were to deploy a system with this kind of attestation today, attackers could still exploit it by pretending to be one of those old devices.

The Status of this Specification #

This is slightly off topic from the technical content of this post, but I think it's important to observe that this isn't an IETF specification. There has been some confusion on this point, in part due to Apple's misleading PR statement:

The specification has been submitted as an Internet-Draft via the Internet Engineering Task Force (IETF), a leading standards development organization. Interested parties are invited and encouraged to review and comment over the next three months. Following the comment period, Apple and Google will partner to address feedback, and will release a production implementation of the specification for unwanted tracking alerts by the end of 2023 that will then be supported in future versions of iOS and Android.

What's an RFC? #

RFC stands for "Request For Comments", and dates from the prehistory of the Internet when there wasn't a real standards process and people would just publish memos describing protocols. The IETF loves its traditions and "RFC" is now an important brand (so much so that other organizations such as the Rust Project now publish standards "RFCs" even though they have no connection to the IETF process). To make matters worse, there are also RFCs published in the same series as IETF RFCs that aren't standards, including those published in what's called the Independent Stream, which don't have any standards status and are just approved by a single Independent Submissions Editor.

That's pretty carefully worded, but it certainly gives the impression that Apple and Google want to standardize this work. The quote from Erica Olsen from the National Network to End Domestic Violence (NNEDV) is even more explicit, referring to these as "new standards" (and of course this is in Apple's press release, so it's not like they aren't aware of the context). Of course, there are other meanings to "standard" than "document produced by some Standards Development Organization", but in this context, the best you can say about this press release is that it's misleading in a way that is very convenient for Apple and Google, who would no doubt like the protective cover of appearing to standardize something while in fact acting unilaterally to address a problem they created by acting unilaterally.

Needless to say "two big companies submit a specification, take comments for three months, and then do whatever they feel like" is not the way that the IETF standards process works. The IETF lets anyone "submit" a specification by posting an Internet-Draft (ID) which is what Apple and Google have done, but those don't have any formal status. Some IDs will be adopted by the IETF as part of the standards process and some of those will actually be standardized and become RFCs. This process takes much longer than three months and involves achieving "rough consensus" of the IETF Community, not just a few vendors. I know that this sounds like standards inside baseball, but there is an important point here. One of the functions of standards is to ensure that there is widespread review from a variety of stakeholders, who might have a different viewpoint (for instance that actual interoperability is useful, or that you need a different set of tradeoffs between privacy and functionality), but the way that that works is that you need buy-in from those stakeholders before the standards are finished.

One critique you often hear is that the standards process is too slow and that this is why industry actors need to ship first and standardize later. The three month comment period seems to reflect that attitude (it's certainly true that the IETF can't standardize anything in three months). However, the decision by Apple and Google (and others!) to ship these technologies without real public review is one reason why we now are in a situation where they are being actively misused, something people have been expressing concerns about for two years. Apple/Google could have brought this work to IETF—or some other standards body—at any point during that time, but they chose not to do so, so arguments about how the situation is now too urgent to go through a real multistakeholder process don't really move me.

I regularly work with a lot of people from Apple and Google and those companies know how to bring work to IETF when they want to. This isn't it.

Final Thoughts #

As I said two years ago, this is a classic dual-use technology. It's really convenient to be able to find your stuff when you've lost it, but tracking tags just don't know whether they are attached to your stuff or other people's stuff. Trying to make it visible when you are being tracked via this method is probably about the best you can do, but it's also clear that it's a highly imperfect defense. Deploying this kind of defense is made even harder by having a large installed base of devices from multiple mutually incompatible networks, meaning that anything we do has to be backward compatible. It took us years to get into this hole; it will take a lot more than three months to get out.

[2023-05-08: Updated title.]

Actually a hash of the public key. ↩︎
Heinrich et al. also report an issue in which an attacker is able to leverage temporary control of the user's device to steal $SK_i$ and afterwards can track the user. Apple has reportedly solved this by making the keys harder to learn, but this is a generically hard problem in an open system. ↩︎
The way this would work would be that the device would encrypt the report as described above and then encrypt it yet again for the service. It would connect to a proxy, authenticate as a valid device, and then send the doubly-encrypted report. The proxy would then strip off the reporter's identity and send it to the service, which would remove the outer encryption layer and store it, just as before. ↩︎
Note that this would most likely not all fit into a single packet, but you could imagine that the reporting device would ask the tag to attest in a separate message before reporting its position to the service. ↩︎

How NATs Work, Part II: NAT types and STUN

2023-04-17T00:00:00Z

The Internet is a mess, and one of the biggest parts of that mess is Network Address Translation (NAT), a technique which allows multiple devices to share the same network address. This is part II in a series on how NATs work and how to work with them. In Part I I covered NATs and how they work. If you haven't read that post, you'll want to go back and do so before starting this one. This post starts to discuss NAT traversal, covering the different types of NATs and how to build peer-to-peer applications that still work from behind NATs.

As IP addresses became increasingly scarce, more and more of the client devices on the Internet started to move behind NATs. I don't have any real data here, but pretty much every consumer level WiFi router I've ever used is also a NAT, sharing a single externally assigned IP address amongst all the devices behind it. By contrast, servers typically have stable public IP addresses.^[1] This arrangement works reasonably well in client-server situations because the client initiates the connections, and so doesn't need an address/port pair that's stable for more than the life of the connection. However, it doesn't work for peer-to-peer applications.

Peer to Peer Applications #

Although much of the Internet is client-server, there are a number of more or less important peer-to-peer (P2P) applications in which data flows directly between end-user machines rather than via a server (as in e-mail, Web, etc.). Some examples are:

1-1 video calling
File distribution (BitTorrent or IPFS)
Some Web3/blockchain systems
Games

In principle, P2P systems have a number of advantages, including:

Reduced cost: because you don't need to pay for a server somewhere. This is an especially big deal for high-bandwidth applications like video calling or file sharing.
Reduced latency: because you don't need to send traffic up to the server and then from the server to the other side, which will generally be slower than sending it directly.
Censorship resistance/avoiding centralized control: because there's no central server to attack.

In practice, some of these advantages often come with disadvantages, which is why you see a lot of client/server applications and not a decentralized Web, but there is still a fair bit of P2P. The application I'm most familiar with is voice and video over IP: it's moderately expensive to run a centralized system like Meet or Zoom where you have to process all the media, but much cheaper to run one where the endpoints just talk to each other.^[2]

P2P Challenges #

The way that a Web server works is that the server operator knows the IP address of the server and publishes it in the DNS. The port number is just 443 or 80 depending on whether the traffic is encrypted or not. Unfortunately, this won't work for P2P systems for two fairly obvious reasons (and also a number of non-obvious ones, as we'll see below):

Machines behind the NAT don't know their own IP address. If your machine has a public IP address, you can just look at how it's configured and know what to publish in the DNS. In managed systems, the operators have some mechanism for assigning addresses and storing the data in the DNS. But when you connect your laptop to the WiFi, the IP address that the laptop sees is likely in some private range, e.g., 10.0.0.*, which isn't useful for other people to connect to unless they happen to be on your network.
Public IP addresses and ports aren't stable. In general, the NAT will only have a single public IP address, so that's reasonably stable (though see here), but the port is not. As I mentioned previously, the NAT creates a binding in response to outgoing traffic and then deletes it when there isn't any traffic. As a result, even if you knew the mapping of internal to external ports at some time in the past, that mapping may no longer be valid.

For these reasons, clients can't just publish their IP addresses like a server does (there is also the question of where you would publish them, but put that aside for a moment). Instead, you need some kind of server to help them.

Background: Voice over IP Architecture #

Just for convenience, let's focus on voice over IP. The diagram below shows what you might call the "reference architecture" for a voice or video over IP system like you might build with WebRTC:

Video Conferencing Topologies #

Ironically, despite all the work that has gone into NAT traversal, in many video conferencing systems the media doesn't actually go directly but rather goes through the server. The reason for this is that if you have many people in the call, then the sender needs to send a copy of their media to each other person, which means that if there are N people in the call, and their video is M megabits/second, they need to send (N-1) * M megabits/second of media, which can quickly overrun a consumer Internet link. Instead, it's conventional to use a star topology where the user sends their media to a server which relays it to everyone else in the call. This is expensive for the server, of course, but cheaper for the user. Some conferencing systems do send media directly for smaller conferences to minimize costs. Sending media directly also currently works better with end-to-end encryption for video, though that's a problem that's being actively worked on because you'd like to have end-to-end encryption even in large conferences where a mesh design isn't practical.
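
To put numbers on the (N-1) * M figure, here is a quick calculation of per-user upstream bandwidth in the two topologies. The function names and the 2 megabit/second video rate are just example values I picked.

```python
def mesh_uplink_mbps(n_people: int, video_mbps: float) -> float:
    """Upstream bandwidth per user when sending directly to every other person."""
    return (n_people - 1) * video_mbps


def star_uplink_mbps(n_people: int, video_mbps: float) -> float:
    """Upstream bandwidth per user when sending one copy to a central server."""
    return video_mbps if n_people > 1 else 0.0


for n in (2, 5, 10):
    print(n, mesh_uplink_mbps(n, 2.0), star_uplink_mbps(n, 2.0))
```

For a two-person call the two are identical, which is why sending directly makes sense for small calls; at ten people the mesh needs nine times the upstream bandwidth.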

In a system like this, Alice and Bob both connect to a signaling server which is responsible for orchestrating the calls. In the case of a traditional VoIP system like you would design with SIP, Alice and Bob would each have a device or an app (often called a "softphone") that had the actual calling logic, presented the user interface, etc., and would exchange SIP messages via the server. In a WebRTC system such as Google Meet or Microsoft Teams, there is a Web server which hosts the Web app and carries messages back and forth between the Web browsers, even though much of the actual calling logic is built into the browser.

In either case, you would ideally like the media (i.e., the actual voice and video) to go directly between Alice and Bob (though see below). There are two main reasons for this. First, it is cheaper: real-time video involves transmitting a lot of data and if Alice sends all that data to the server and then the server sends it to Bob, then the server operator has to pay for all that data transmission. Second, it generally takes longer for the data to go from Alice to the server and then the server to Bob, than it would for Alice to send the data to Bob directly, especially if, as is relatively common, Alice and Bob are geographically close and the server is not. But now we have to contend with NATs.

STUN #

As noted above, the first problem we have is that the client machine may not know its own IP address. The NAT knows, of course, but there's no universally deployed protocol for it to tell the client. Instead, the client has to measure it directly. The standard protocol for this is called Session Traversal Utilities for NAT (STUN). STUN works by having the client talk to some server on the Internet (unsurprisingly called a STUN server). Typically, this server will be provided by the calling service, and configured into the clients somehow. For instance, WebRTC provides an API to tell the Web client which STUN server to use.

In order to discover its IP address, the client sends the server a STUN Binding Request, and the server responds with the IP address and port that the server saw (technical term: reflexive address) like so:

This is how the original version of STUN, published in 2003, behaved. Unfortunately, it is impossible to make things foolproof because fools are so ingenious. As you may recall from the discussion of Application Layer Gateways (ALGs) in Part I, some NATs will rewrite messages coming in from the Internet, rewriting the external (reflexive) address to the internal (host) address. If you have such a NAT, what you will instead see is a flow like below, where the client gets a response that just contains its own local address rather than the external one.

This is not useful! Unfortunately, we have to traverse the NATs we have, not the NATs we wish we had, so a way around this was needed. The second version of STUN, published in 2008, added a new way to return the reflexive address in what is called the XOR-MAPPED-ADDRESS attribute. This attribute worked by XORing the host and port with other values from the packet. This is pretty weak sauce as encryption goes but it's usually good enough to break up the simple-minded pattern matching that NAT ALGs were using at the time (the idea here isn't to avoid NATs which know about STUN and want to rewrite values, but just to prevent accidental breakage). This is mostly how STUN works today.
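
Here is a small sketch of the wire format for the IPv4 case, following my reading of RFC 5389: a 20-byte Binding Request header, and an XOR-MAPPED-ADDRESS value in which the port is XORed with the top 16 bits of the fixed "magic cookie" and the address with the whole cookie. The function names are mine, and this leaves out the attribute header, IPv6, and all of the actual networking.

```python
import os
import socket
import struct
from typing import Optional, Tuple

MAGIC_COOKIE = 0x2112A442   # fixed value in every STUN message header
BINDING_REQUEST = 0x0001    # STUN message type for a Binding Request


def build_binding_request(transaction_id: Optional[bytes] = None) -> bytes:
    """Build a bare Binding Request: type, length 0, cookie, 12-byte transaction ID."""
    tid = transaction_id if transaction_id is not None else os.urandom(12)
    if len(tid) != 12:
        raise ValueError("transaction ID must be 12 bytes")
    return struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE) + tid


def encode_xor_mapped_address(ip: str, port: int) -> bytes:
    """Encode an IPv4 address and port as an XOR-MAPPED-ADDRESS value."""
    xport = port ^ (MAGIC_COOKIE >> 16)
    xaddr = struct.unpack("!I", socket.inet_aton(ip))[0] ^ MAGIC_COOKIE
    return struct.pack("!BBHI", 0, 0x01, xport, xaddr)   # 0x01 = IPv4 family


def decode_xor_mapped_address(value: bytes) -> Tuple[str, int]:
    """Recover the reflexive (ip, port) from an IPv4 XOR-MAPPED-ADDRESS value."""
    _reserved, family, xport, xaddr = struct.unpack("!BBHI", value[:8])
    if family != 0x01:
        raise ValueError("only IPv4 is handled in this sketch")
    port = xport ^ (MAGIC_COOKIE >> 16)
    ip = socket.inet_ntoa(struct.pack("!I", xaddr ^ MAGIC_COOKIE))
    return ip, port


value = encode_xor_mapped_address("192.0.2.1", 32853)
print(value.hex())                       # 0001a147e112a643
print(decode_xor_mapped_address(value))  # ('192.0.2.1', 32853)
```

Note that the raw address bytes (c0 00 02 01) never appear in the encoded value, which is all that's needed to get past an ALG that is doing a blind search-and-replace.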

One thing that may not be immediately obvious is that you need to do the STUN queries from the same address and port that you want to receive media on. The reason for this is that each port you send from will have a different NAT binding, and so if you know the binding for port A this doesn't tell you about the binding for port B.^[3] This is the same reason why you can't just have the Web server send you your reflexive address and use that: you're contacting the Web server from a different port (and when this stuff was designed, TCP rather than UDP) and so the binding that the Web server sees doesn't help you for your media. Instead, what you do is allocate a port to use for media, discover the reflexive address with STUN, and send that reflexive address to your peer, and then subsequently use that port to send and receive media.

NAT Types #

If only things were that simple. There are in fact NATs for which this will work, but many where it will not. There are two basic problems:

NATs which use different mappings for different remote addresses (and ports). Note: As a convenience, I am going to start saying "address" when I mean "address and port", because the alternative is clunky. For instance, if Alice sends packets to both Bob and Charlie, Bob and Charlie might see different reflexive addresses (or more likely ports, as your typical consumer NAT only has one IP address) even if Alice uses the same local address and port. These NATs are said to have address-dependent mappings or address and port-dependent mappings, depending on which differences trigger variation. The alternative is called endpoint-independent mapping (these terms come from RFC 4787).
NATs which have consistent mappings but filter packets from addresses that the client hasn't sent to. For instance, Alice might send a packet to Bob, creating a mapping, but if Charlie sends a packet to Alice on the same reflexive address, the NAT would drop it. If Alice then sends a packet to Charlie, he will see the expected address, and if he responds to this packet, the NAT will deliver it. These NATs are said to have address-dependent filtering or address and port-dependent filtering. The alternative is endpoint-independent filtering.
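
If it helps to see the two behaviors side by side, here is a toy model of a NAT where mapping and filtering can be chosen independently. The mode labels ("EIM", "ADM", "APM" for mapping and "EIF", "ADF", "APF" for filtering) are just short names for the RFC 4787 behaviors; the class, its methods, and the port numbering are my own invention, and real NATs also have timeouts and other behavior that this ignores.

```python
from itertools import count


class Nat:
    """Toy NAT with independently selectable mapping and filtering behavior."""

    def __init__(self, public_ip: str, mapping: str = "EIM", filtering: str = "EIF"):
        self.public_ip = public_ip
        self.mapping = mapping        # "EIM", "ADM", or "APM"
        self.filtering = filtering    # "EIF", "ADF", or "APF"
        self._ports = count(40000)    # arbitrary starting external port
        self._bindings = {}           # mapping key -> external port
        self._sent_to = {}            # external port -> remotes we have sent to

    def _key(self, local, remote):
        if self.mapping == "EIM":
            return (local,)               # same binding for every remote
        if self.mapping == "ADM":
            return (local, remote[0])     # one binding per remote address
        return (local, remote)            # one binding per remote address and port

    def outbound(self, local, remote):
        """Send a packet out; return the reflexive address the remote sees."""
        key = self._key(local, remote)
        if key not in self._bindings:
            self._bindings[key] = next(self._ports)
        ext_port = self._bindings[key]
        self._sent_to.setdefault(ext_port, set()).add(remote)
        return (self.public_ip, ext_port)

    def inbound_allowed(self, ext_port, source) -> bool:
        """Would a packet from source to our external port be delivered?"""
        peers = self._sent_to.get(ext_port)
        if peers is None:
            return False                  # no binding at all
        if self.filtering == "EIF":
            return True
        if self.filtering == "ADF":
            return any(p[0] == source[0] for p in peers)
        return source in peers


alice = ("10.0.0.2", 5000)
bob = ("198.51.100.1", 6000)
charlie = ("198.51.100.2", 7000)

nat = Nat("203.0.113.1", mapping="EIM", filtering="APF")
_, port = nat.outbound(alice, bob)
print(nat.inbound_allowed(port, charlie))   # False: Alice hasn't sent to Charlie
nat.outbound(alice, charlie)
print(nat.inbound_allowed(port, charlie))   # True: now she has
```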

The bottom line here is that there are a lot of different types of NAT, and depending on what kind of NAT you (and the person on the other side) have, you need to do different things in order to establish a connection.

How to get through a NAT #

As a notational convenience, I'm going to describe NATs using the following abbreviations:

Behavior	Abbreviation
Endpoint-Independent Mapping	EIM
Address-Dependent Mapping	ADM
Address and Port-Dependent Mapping	APM
Endpoint-Independent Filtering	EIF
Address-Dependent Filtering	ADF
Address and Port-Dependent Filtering	APF

A NAT is defined by the pair of mapping and filtering behaviors, so, for instance, EIM:APF is a NAT that has consistent mappings across addresses but filters based on address and port.
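
One way to make this taxonomy concrete is to ask which part of the remote address each behavior distinguishes. The following Python sketch is only an illustration: the `Behavior` enum and the `remote_key` and `nat_name` helpers are hypothetical names, not anything from RFC 4787.

```python
from enum import Enum


class Behavior(Enum):
    ENDPOINT_INDEPENDENT = "EI"
    ADDRESS_DEPENDENT = "AD"
    ADDRESS_AND_PORT_DEPENDENT = "AP"


def remote_key(behavior, remote):
    """Return the part of a remote (ip, port) pair that a behavior looks at.

    Two remotes with the same key share a mapping (or a filter entry).
    """
    ip, port = remote
    if behavior is Behavior.ENDPOINT_INDEPENDENT:
        return None          # every remote is treated the same
    if behavior is Behavior.ADDRESS_DEPENDENT:
        return ip            # only the IP address matters
    return (ip, port)        # both IP address and port matter


def nat_name(mapping, filtering):
    """Build the mapping:filtering abbreviation, e.g. 'EIM:APF'."""
    return f"{mapping.value}M:{filtering.value}F"
```

For example, `nat_name(Behavior.ENDPOINT_INDEPENDENT, Behavior.ADDRESS_AND_PORT_DEPENDENT)` gives `"EIM:APF"`, the NAT described in the paragraph above.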

I'm also going to simplify addresses and ports by writing them as A:a, A:b, etc. where the first letter is the address and the second is the port. Alice's local address will always be A:a and Bob's will be B:b. The STUN server's will be S:s.

EIM:EIF ↔ EIM:EIF #

A NAT which has endpoint-independent behavior for both mapping and filtering is the easiest type to traverse: it's basically like having a public IP address except that the binding may not be stable over long periods of time. You can traverse this kind of NAT by just having each side publish its address and the other side can send directly, as shown in the figure below:

This diagram shows about the simplest possible NAT traversal scenario. It starts with Alice deciding to call Bob. She uses the STUN server to discover her reflexive address by sending a Binding Request from A:a. The STUN server responds with her reflexive address: X:x. Alice then sends a message to the signaling server to initiate the call (the details of this depend on whether you are doing WebRTC, SIP, etc. In SIP this would be an INVITE).

The signaling server notifies Bob of the incoming call. When he decides to accept it, then he will also contact the STUN server to discover his reflexive address (Y:y). His response to the signaling server to answer the call will include this address. At this point, Alice and Bob know each other's addresses and can start sending media to each other's reflexive addresses, as shown in the final block. Because the NATs have endpoint-independent mapping, the same binding will be in effect when the peer sends a message as they did for the STUN server, even though the message from the peer comes from a different IP address. Similarly, because they have endpoint-independent filtering, the NAT will accept an incoming packet directed to the reflexive address from any source.

This is already a pretty complicated process, but it's conceptually fairly simple: each side discovers its address and sends it to the other side. If all NATs had endpoint-independent behavior for both mapping and filtering, then we could just stop here. Unfortunately they do not.

EIM:EIF ↔ EIM:APF #

Now let's look at the next most complicated case, in which Alice has the same NAT as before, but Bob has a NAT with endpoint-independent mapping but address and port-dependent filtering, as shown in the figure below. To keep things simple, I've omitted the opening phases where each side discovers their address and sends it to the other side, just showing the media phase. Note that the early phases look the same for every NAT type, which is part of what makes things difficult.

Unlike in the previous setting, when Alice sends her first packet of media to Bob, his NAT discards it. Because Bob's NAT has address and port-dependent filtering, it has an access control entry only for the STUN server, but not for Alice's address, so when Alice's packet arrives, the NAT just drops it. By contrast, because Alice's NAT has endpoint-independent mapping and filtering (as in the previous example), Bob's packet is delivered correctly to Alice.

You might think at this point that we're just going to have media flowing one way (from Bob to Alice), but that's not what happens: when Bob sends his first media packet to Alice, it creates a new access control entry in his NAT for Alice's address, so that when Alice's second packet (either a retransmit or reflecting a later part of the media stream) arrives, it is delivered correctly:

From this point forward, you have two-way media.

The following table representing Bob's NAT's state might help visualize what's happening here.

Event	Mapping	Access Control List
Start	-	-
Address discovery	B:b ↔ Y:y	S:s
Packet 2 sent	B:b ↔ Y:y	S:s, X:x

Initially, Bob's NAT doesn't contain any mappings. After he sends a Binding Request to the STUN server, the NAT creates a mapping from B:b to Y:y and an access control entry for that mapping associated with just the STUN server. Thus, when Alice's packet 1 comes in, it is associated with a valid mapping, but is rejected because it doesn't match a valid access control entry. When Bob sends his first media packet (number 2), a new access control entry is added to the same mapping (recall that Bob can always send outgoing packets and they just add the appropriate access control entries). Then when Alice's second packet arrives, there is an appropriate access control entry and it can be delivered.
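
The same state changes can be traced with a toy model. This is a minimal sketch, assuming a NAT that keeps a single endpoint-independent mapping plus a set of remotes it has sent to; the `EimApfNat` class and its methods are hypothetical names, and real NATs are considerably messier.

```python
class EimApfNat:
    """Toy NAT: endpoint-independent mapping, address and port-dependent filtering."""

    def __init__(self, external):
        self.external = external
        self.mapping = None   # (local, external) once the first packet goes out
        self.acl = set()      # remote addresses allowed to send inbound

    def send(self, local, remote):
        """Process an outgoing packet; return the source address the remote sees."""
        if self.mapping is None:
            self.mapping = (local, self.external)
        self.acl.add(remote)  # outgoing packets always add an access control entry
        return self.external

    def receive(self, remote):
        """Return the local address the packet is delivered to, or None if dropped."""
        if self.mapping is None or remote not in self.acl:
            return None
        return self.mapping[0]


bob_nat = EimApfNat(external="Y:y")
bob_nat.send("B:b", "S:s")        # address discovery: mapping B:b <-> Y:y, ACL {S:s}
first = bob_nat.receive("X:x")    # Alice's packet 1: valid mapping, no ACL entry
bob_nat.send("B:b", "X:x")        # Bob's packet 2: ACL becomes {S:s, X:x}
second = bob_nat.receive("X:x")   # Alice's next packet now gets through
```

Here `first` is `None` (dropped) and `second` is `"B:b"` (delivered), matching the rows of the table.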

Obviously, this introduces a little latency before media starts flowing, but given that the Internet is already subject to packet loss anyway, this isn't necessarily that big a deal, especially with voice and video applications which will be sending packets every 20 milliseconds or so. It's potentially a slightly bigger issue for reliable transport protocols if they only send one packet at a time and have long retransmit timers, but even then the connection will eventually be established; it just takes a little while.

EIM:APF ↔ EIM:APF #

Now let's look at what happens when both Alice and Bob have NATs with endpoint-independent mapping but address and port-dependent filtering. This actually behaves identically to the previous scenario:

As before, the first packet from Alice to Bob is dropped by Bob's NAT but on its way out it establishes the access control entry in Alice's NAT in the opposite direction, thus allowing the next inbound packet to pass through the NAT. Here's Alice's table:

Event	Mapping	Access Control List
Start	-	-
Address discovery	A:a ↔ X:x	S:s
Packet 1 sent	A:a ↔ X:x	S:s, Y:y

All of this happens before the first packet from Bob arrives, and so even though Alice does have address and port-dependent filtering, the right access control entry is in place before that packet is received, and so the packet is just delivered.

One important feature to notice about all the scenarios we have seen so far is that they don't depend on knowing what kind of NAT the other side has: Alice and Bob just start transmitting and eventually the right access control entries will be established and the packets will flow properly. Now let's look at a scenario where that isn't true: when one side has address and port-dependent mapping as well as filtering.

EIM:EIF ↔ APM:APF #

Suppose we have a situation where Alice has endpoint-independent mapping and filtering, as in the first two scenarios, but Bob has address and port-dependence for both mapping and filtering. This produces the situation shown below:

As with the previous scenario, the first packet from Alice to Bob is dropped by Bob's NAT. This actually happens for a slightly different reason than in the previous examples. In those, there was a valid mapping for Alice's packet, but no corresponding access control entry. However, in this case, because Bob's NAT has address and port-dependent mapping, the packet from Alice (X:x) to Y:y doesn't match any mapping at all.

When Bob sends his first packet (2) to X:x, it creates a mapping for X:x but on a different outgoing port Y:y'. At this point, his NAT has the following mapping table:

Local Address	Remote Address	External (Reflexive) Address
B:b	S:s	Y:y
B:b	X:x	Y:y'

Packet 2 is still deliverable to Alice because her NAT has endpoint-independent mapping and filtering. However, when Alice sends her next packet, it still goes to Y:y and not Y:y', and so just gets dropped. This one-way communication will persist throughout the connection: every packet Bob sends uses the X:x ↔ Y:y' mapping and every packet Alice sends goes to Y:y, so none of Alice's packets will ever be delivered. Most likely, eventually one of Alice or Bob will get tired of not being able to communicate and hang up.
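
Here is a rough Python sketch of how such a mapping table behaves, assuming the NAT simply allocates the next unused external port for each new (local, remote) pair. The `ApmNat` class is a hypothetical illustration, and the port labels y and y' are just the notation used above.

```python
class ApmNat:
    """Toy NAT with address and port-dependent mapping and filtering."""

    def __init__(self, external_ip, port_labels):
        self.external_ip = external_ip
        self.unused_ports = list(port_labels)
        self.table = []  # rows of (local, remote, external)

    def send(self, local, remote):
        """Return the external address used for this (local, remote) pair."""
        for row_local, row_remote, external in self.table:
            if (row_local, row_remote) == (local, remote):
                return external
        external = f"{self.external_ip}:{self.unused_ports.pop(0)}"
        self.table.append((local, remote, external))
        return external

    def receive(self, external, remote):
        """Return the local address for an inbound packet, or None if dropped."""
        for row_local, row_remote, row_external in self.table:
            if (row_external, row_remote) == (external, remote):
                return row_local
        return None


bob_nat = ApmNat("Y", ["y", "y'"])
bob_nat.send("B:b", "S:s")                  # address discovery uses Y:y
dropped_1 = bob_nat.receive("Y:y", "X:x")   # Alice's packet 1: no matching row
source_2 = bob_nat.send("B:b", "X:x")       # Bob's packet 2 leaves from Y:y'
dropped_3 = bob_nat.receive("Y:y", "X:x")   # Alice still targets Y:y: dropped
```

After these calls `bob_nat.table` holds the same two rows as the table above, and both of Alice's packets to Y:y come back as `None`.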

This scenario is actually recoverable, but it requires some cleverness on Alice's part. What Alice has to do is look at the source address of packets that Bob is sending her (the peer reflexive address) and if it differs from the one that Bob sent her over the signaling channel (the server reflexive address), try sending packets to that address instead, as shown below:

The first two packets here are the same as before, but for the third packet, Alice switches from sending to Y:y to sending to Y:y'. This corresponds to a mapping on Bob's NAT (and also an entry in his access control table) and so the packet will be delivered as expected. From here on, things work normally.

This technique works, but it requires care to use correctly. For instance, consider what happens if an attacker sends a bogus packet from a different address:

If Alice is naive, she will notice that Bob seems to have switched his address and just switch to sending to the attacker at Z:z. Importantly, this attack can be mounted by an attacker who cannot read packets en route from Alice to Bob; he just needs to know the address and port Alice expects packets on. In the best case, if encryption is in use, then the attacker won't be able to read the packets but he will have disrupted the connection. In the worst case, if encryption is not in use (and when all this stuff was designed, VoIP encryption was fairly rare), the attacker will be able to listen in on the Alice → Bob side of the call. Depending on Bob's NAT configuration (e.g., if it's actually endpoint-independent), the attacker may even be able to do so without noticeably disrupting the call, by forwarding the packets to Bob. There are a number of defenses against this form of attack, as we'll see in the next post.^[4]

EIM:APF ↔ APM:APF #

Let's look at one more case, in which Bob has the same APM:APF NAT as before but Alice has a NAT that does address and port-dependent filtering. This produces the result shown below:

As in the previous example, Alice's packet gets dropped by Bob's NAT because there is no corresponding mapping for X:x, only one for the STUN server. However, unlike the previous example, Bob's packet is also dropped, because there is no corresponding access control entry. As you'll recall from above, Alice has the following mapping and access control entries:

Mapping	Access Control List
A:a ↔ X:x	S:s, Y:y

However, because the incoming packet is coming from Y:y' and not Y:y, Alice's NAT discards it. And because Alice never gets packet 2, she is unable to change the destination of her packets to Bob's peer reflexive address Y:y' and so just keeps transmitting packets to Y:y which Bob's NAT drops because it does not have a corresponding mapping. Similarly, Bob keeps transmitting packets to Alice, which her NAT drops because it doesn't have the corresponding access control entry.

What we have here is deadlock: Bob can't receive packets from Alice until she adjusts the address she is sending to, and Alice can't receive packets from Bob (thus learning about the new address) until she has sent one to the new address (thus creating the access control entry). The result is both sides transmitting and neither side receiving. This is not a recoverable situation.
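
To check the reasoning across all five pairings, here is a small Python simulation. It is a sketch under simplifying assumptions: each NAT is reduced to a mapping rule and a filtering rule ("EI" or "AP"), ports are just counters, and `follow_source` models the trick of retargeting to the source address of whatever packet arrives. All names are hypothetical.

```python
import itertools

STUN = ("S", 3478)
EIM_EIF, EIM_APF, APM_APF = ("EI", "EI"), ("EI", "AP"), ("AP", "AP")


class Nat:
    def __init__(self, ip, mapping, filtering):
        self.ip, self.mapping, self.filtering = ip, mapping, filtering
        self.ports = itertools.count(1)
        self.external_for = {}  # mapping key -> external address
        self.sent_to = {}       # external address -> remotes we have sent to

    def send(self, remote):
        """Process an outgoing packet; return its external source address."""
        key = remote if self.mapping == "AP" else None
        if key not in self.external_for:
            self.external_for[key] = (self.ip, next(self.ports))
        external = self.external_for[key]
        self.sent_to.setdefault(external, set()).add(remote)
        return external

    def accepts(self, external, remote):
        """Would an inbound packet from remote to external be delivered?"""
        if external not in self.sent_to:
            return False        # no mapping at all
        return self.filtering == "EI" or remote in self.sent_to[external]


def call(alice_type, bob_type, rounds=4, follow_source=True):
    """Return (alice_heard, bob_heard) after a few rounds of media packets."""
    alice, bob = Nat("X", *alice_type), Nat("Y", *bob_type)
    alice_target, bob_target = bob.send(STUN), alice.send(STUN)
    alice_heard = bob_heard = False
    for _ in range(rounds):
        source = alice.send(alice_target)          # Alice -> Bob
        if bob.accepts(alice_target, source):
            bob_heard = True
            if follow_source:
                bob_target = source
        source = bob.send(bob_target)              # Bob -> Alice
        if alice.accepts(bob_target, source):
            alice_heard = True
            if follow_source:
                alice_target = source
    return alice_heard, bob_heard


easy = call(EIM_EIF, EIM_EIF)                            # (True, True)
one_way = call(EIM_EIF, APM_APF, follow_source=False)    # (True, False)
recovered = call(EIM_EIF, APM_APF)                       # (True, True)
deadlock = call(EIM_APF, APM_APF)                        # (False, False)
```

The last line is the deadlock described above: even with Alice willing to follow the source address, she never receives a packet to learn it from.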

Relays #

Getting out of this hole requires the use of a relay server. More on this later, but briefly a relay is some server on the public Internet that Bob can send his traffic through. Because this relay is something that Bob explicitly uses and has a relationship with—unlike his NAT, which just does whatever it does—it can have deterministic properties which facilitate NAT traversal. For instance, the most common relaying protocol, Traversal Using Relays Around NAT (TURN), provides endpoint-independent mappings and so effectively fixes the problem we see in this section.

As with STUN servers, it's conventional for the calling provider to provide a TURN server; indeed, they are typically the same endpoint. However, TURN is much more expensive to provide than STUN, so ideally if it's possible for two endpoints to communicate without using TURN, you want them to do so. Fortunately, most client pairs can communicate without TURN, so it's still cheaper to operate a calling service that tries to send data peer-to-peer than one that sends everything through a central conferencing server, as long as you use non-TURN where possible. We'll discuss how to do this in the next part of this series.

Hairpinning #

There's one more scenario I want to cover here, which is what's called "hairpinning". Consider the case where Alice and Bob are actually on the same network, as shown below:

Recall that the NAT has two addresses, the internal address (10.0.0.1) which Alice and Bob communicate with and the external one (192.0.2.1) that is used to communicate with the outside world. Just as in the scenarios before, Alice and Bob can connect to the STUN server and get their server reflexive addresses. For example, Alice might get 192.0.2.1:1111 and Bob 192.0.2.1:2222 (naturally, these use the NAT's external address). The problem comes when Alice tries to send a packet from inside the network to Bob's external address. If the NAT handles this properly, it will deliver this packet (technical term: hairpinning) but some NATs do not do so, and will just drop the packet. In this case, Alice and Bob will not be able to communicate.

Of course, Alice and Bob can communicate directly using their local addresses in the 10.0.0.* space, but it's hard for them to detect this case because many different networks use those addresses (that's the point of RFC 1918 addresses, after all). They could look to determine if they have the same server-reflexive address, but that might or might not be a reliable indicator, depending on what kind of NATs are in use. For instance, Alice and Bob might have their own NATs but also be behind a carrier grade NAT that causes them to have the same address. In this case, they will probably not be able to communicate directly.

Ideally, NATs would properly support hairpinning (this is what RFC 4787 recommends), but, as we've seen throughout this series, NAT behavior is inconsistent and the endpoints have no good way of asking the NAT what it does.

What a mess #

Back in 2003 when STUN was first being developed, the idea was that you would characterize your NAT. RFC 3489 had a whole algorithm you used that involved multiple STUN queries and tried to determine what your network configuration was (remember that you can't ask it any questions, you have to measure). The RFC described a whole menagerie of different NAT types ("full cone", "restricted cone", "port restricted cone", and "symmetric"),^[5] with the idea that you would classify your NAT according to one of these types. Based on what kind of NAT you had, you could then provide an appropriate address to the other side—or, in the case of the worst type ("symmetric NAT") potentially declare failure. This turned out not to work very well, in part because the ecosystem was just a lot more complicated than people expected. In the words of the revised STUN RFC:^[6]

STUN was originally defined in RFC 3489 [RFC3489]. That specification, sometimes referred to as "classic STUN", represented itself as a complete solution to the NAT traversal problem. In that solution, a client would discover whether it was behind a NAT, determine its NAT type, discover its IP address and port on the public side of the outermost NAT, and then utilize that IP address and port within the body of protocols, such as the Session Initiation Protocol (SIP) [RFC3261]. However, experience since the publication of RFC 3489 has found that classic STUN simply does not work sufficiently well to be a deployable solution. The address and port learned through classic STUN are sometimes usable for communications with a peer, and sometimes not. Classic STUN provided no way to discover whether it would, in fact, work or not, and it provided no remedy in cases where it did not. Furthermore, classic STUN's algorithm for classification of NAT types was found to be faulty, as many NATs did not fit cleanly into the types defined there.

Instead, the IETF devised a solution which was intended to work with any NAT type by the time-honored technique of trying a lot of stuff and seeing what works. That solution is called Interactive Connectivity Establishment (ICE), and I'll be covering it in Part III.

They may also be NATted, but that's an operational convenience, because you still need a stable public IP. ↩︎
There are also reasons why centralized videoconferencing systems are good, but that's another post. ↩︎
Except that it won't be the same as for A, at least for the same server. ↩︎
Note that some of the obvious defenses don't work. For instance, you can't just "latch" to the first packet you see because the attacker might be faster. Similarly, you can't just compare the peer reflexive address to the address Bob sent over the signaling channel because if Bob has address and port-dependent mapping, then the true peer reflexive address will also not match. ↩︎
This was before people came up with this "endpoint-independent", "address-dependent", etc. taxonomy. ↩︎
Which also includes a totally different expansion for STUN. In RFC 3489, STUN stood for "Simple Traversal of User Datagram Protocol (UDP) Through Network Address Translators (NATs)" and now it stands for "Session Traversal Utilities for NAT". The IETF loves its acronyms (and backronyms). ↩︎

Everything you never knew about NATs and wish you hadn't asked

2023-04-03T00:00:00Z

The Internet is a mess, and one of the biggest parts of that mess is Network Address Translation (NAT), a technique which allows multiple devices to share the same network address. In this series of posts, we'll be looking at NATs and NAT traversal. This post is on NATs and the next one will be on NAT traversal techniques.^[1]

Background: IP addresses and IP address exhaustion #

You may recall from previous posts that the Internet is a packet switching network which works by routing self-contained messages (datagrams):

Writing IP Addresses #

IPv4 addresses are 32 bits, hence 4 bytes. It's conventional to write them in what's called "dotted quad" format, which consists of writing each byte value (from 0 to 255) separately, with dots in between. For instance, 10.0.0.1 corresponds to the bytes 0x0a 0x00 0x00 0x01. Because IPv6 addresses are so much longer, writing them is unfortunately kind of a pain, and you end up with goofy stuff like 2607:f8b0:4002:c03::64 (for google.com) where the :: means that everything in between is a 0.
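
If you want to see these encodings for yourself, Python's standard `ipaddress` module will do the conversion. This is just a quick illustration using the two addresses from the text:

```python
import ipaddress

v4 = ipaddress.IPv4Address("10.0.0.1")
v4_bytes = v4.packed          # the four raw bytes: 0x0a 0x00 0x00 0x01
v4_number = int(v4)           # the same address as a single 32-bit integer

v6 = ipaddress.IPv6Address("2607:f8b0:4002:c03::64")
v6_full = v6.exploded         # the :: expanded back into explicit zero groups
v6_bits = len(v6.packed) * 8  # 128
```

`v6_full` comes out as `2607:f8b0:4002:0c03:0000:0000:0000:0064`, which shows how much the `::` shorthand is saving.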

Each packet has a source and destination address, which are just numbers, and each device has its own address, which is how packets get sent (routed) to it and not to other devices. In the original version of the Internet Protocol (IP version 4 or just IPv4), these addresses were 32 bits long, which means that there are a total of 2³² (about 4 billion) possible addresses. There are rather more than 4 billion people on the planet and many of them have more than one device, so it's not actually possible for each device to have a unique address.

This problem has been known about for more than 30 years, and the Internet Engineering Task Force (IETF), which maintains most of the main networking protocols on the Internet, has an official fix, which is for everyone to upgrade to a new version of IP called IP version 6 (IPv6). IPv6 has 128-bit addresses, which, at least theoretically, means that there are plenty of addresses. Unfortunately, for reasons which are far too long (and depressing) to fit into this post, the transition to IPv6 has not gone well, with the result that over 25 years after IPv6 was first specified, significantly less than half of the Internet traffic is IPv6. The graph below shows Google's measurements of the fraction of its traffic that is IPv6, reflecting client-side deployment. Server-side deployment is also fairly bad, with ISOC reporting that about 44% of the top 1000 sites support IPv6.

[Source: Google]

This is, needless to say, not good. As a comparison point, TLS 1.3 shipped in 2018 and at this point ISOC's numbers show 79% support among the top 1000 sites. At some level this is a slightly unfair comparison because transitioning to IPv6 means changing your network connection whereas transitioning to TLS 1.3 just requires updating your software, but in any case, we're nowhere near full IPv6 deployment, even though we no longer have enough IPv4 addresses. Actually, addresses have been scarce for quite some time, as shown in the timeline below:

[Source: Michael Bakni via Wikipedia]

IP addresses are centrally assigned, with the overall pool being managed by the Internet Assigned Numbers Authority (IANA) which provides them to Regional Internet Registries (RIRs), which then hand them out to network providers, on down to hosts.^[2] IANA allocated its last block to the RIRs back in 2010, but addresses were already starting to get scarce before then. As you can see on the chart above, an immediate transition to IPv6 in which we just turn off IPv4 is implausible today but was out of the question in the early 2000s back when deployment was effectively zero. Another technical solution was needed, one that would be incrementally deployable rather than simultaneously replacing big chunks of the Internet (technical term: forklift upgrade). And the Internet delivered in the form of NAT.

Network Address Translation (NAT) #

The basic idea behind NAT is simple: you can have multiple machines share the same address as long as there is a way to demultiplex (i.e., separate out) traffic associated with one machine from traffic associated with another. Fortunately, such a mechanism already existed: ports.

Port Numbers #

Consider the case where you just have two computers, a client and a server, but where there are two simultaneous users on the client. This feels like an odd situation in 2023 when basically all computers are individual, but all of this stuff was designed back in an era when multiple users timesharing on the same computer was the norm. If both users want to connect to the same server, they will have the same IP address, so how does the server tell them apart?

The answer is to have another field, the port number, which is just a 16-bit integer that can be used to distinguish multiple contexts on the same device (IP address). Port numbers have two main uses:

on clients: to distinguish multiple similar processes connecting to the same server.
on servers: to distinguish multiple different services. Conventionally, services will have specific assigned port numbers, such as 80 for HTTP, 443 for HTTPS, etc.

Port numbers don't exist at the IP layer but rather at the TCP or UDP layers, but virtually all the traffic we'll be talking about uses UDP or TCP, so that's usually not an issue.

NAT #

Port numbers allow two users on the same machine to share an IP address. The intuition behind NAT is that you can use the same mechanism to allow two machines to share an IP address, as long as you can ensure that they won't also try to use the same port.^[3] The basic way to do this is by having the network gateway device (e.g., your WiFi router) do the work. The basic scenario is shown below:

In this example, Alice and Bob are both on the same network and have addresses 10.0.0.3 and 10.0.0.2 respectively. The WiFi router has two addresses, one on the inside which it uses to talk to Alice and Bob (10.0.0.1) and one on the outside which it uses to talk to machines on the Internet (192.0.2.1).

When Alice wants to talk to the server, she sends a packet from her IP address and uses local port 1111 (this is usually written 10.0.0.3:1111), as shown above. This packet gets sent to the WiFi router, which rewrites the source address and port to 192.0.2.1:1234 and sends it along to the server. When the server responds, it sends the packet to 192.0.2.1:1234 (this is the only address that it knows), which routes it back to the WiFi router. The router duly rewrites the destination address to 10.0.0.3:1111 and sends it to Alice. The story is the same for Bob (he even uses the same port number!) except that the packets he sends are from 192.0.2.1:5678. In order to make this work, the router needs to maintain a mapping table of which external ports correspond to which internal machines. Each entry in the table is called a "NAT binding" and associates the external address and port to the internal one.

From the server's perspective, this looks exactly the same as if there were a single machine with address 192.0.2.1 talking to it; NAT is just something that happens unilaterally on the client side. This is a very important feature because it enables incremental deployment. A network that can't get enough IP addresses can use NAT without any change on the servers. Perhaps less obviously, it doesn't require changing the clients either: they just use their ordinary IP addresses and the NAT translates them.^[4]

NAT isn't magic, of course, and it can't create IP addresses out of nowhere; what it does is stretch them by using the port number as an extension of the IPv4 address space. In fact, we used to joke about the IPv7 packet header, in which the IPv4 address fields were the "high order" bits of the address and the transport port fields were the "low order" bits:^[5]

It's still possible to run out of ports on the NAT device if it has enough clients behind it, but because the NAT can use the same port to talk to two different servers at the same time (though this turns out to be bad news for reasons we'll get into below) and there are around 65000 possible ports, you need a lot of clients to want to concurrently talk to the same server before this becomes a problem. As a general matter, NATs will reuse ports once they are no longer active, so that NAT bindings aren't stable over time: port 1234 might be Alice now but Bob in 20 minutes.

As a practical matter, you don't usually use NAT for servers, at least not this way, though it's not technically impossible. In particular, HTTP(S) URIs have a port number field, so you can say (for instance) https://example.com:4444 to indicate that the client should use port 4444 but this just isn't common practice, partly because the result is ugly and partly because there are other mechanisms for sharing multiple servers on the same address, such as TLS Server Name Indication (SNI).

RFC 1918 Addresses #

Of course, even if they are behind a NAT, each client still needs its own IP address, so how does this help? The answer is that these addresses don't need to be globally unique but just locally unique within a given network. This means that the local address of a machine on your network might be the same as one on my network, but they get translated to different addresses on the public Internet.

The IETF has reserved a number of address blocks for "Private" usage in RFC 1918. These addresses are never supposed to appear on the public Internet and so it's safe to use them on your network, as long as you translate them to a routable address on the way out to the Internet. The example above uses addresses from one such address block: 10.0.0.0/8, which means "all the addresses with the 8-bit prefix 10", i.e., 10.0.0.0 to 10.255.255.255 inclusive. This block has around 16 million possible addresses in it, so you can have a very large network behind a NAT.
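
The prefix arithmetic is easy to check with Python's standard `ipaddress` module. This is a small illustration using the addresses from the running example:

```python
import ipaddress

private_block = ipaddress.ip_network("10.0.0.0/8")

first, last = private_block[0], private_block[-1]   # ends of the range
size = private_block.num_addresses                  # 2**24 addresses

# Alice's address is inside the block; the router's outside address is not.
alice_inside = ipaddress.ip_address("10.0.0.3") in private_block
router_outside_inside = ipaddress.ip_address("192.0.2.1") in private_block
```

An 8-bit prefix leaves 24 bits free, so `size` is 16,777,216, which is where the "around 16 million" figure comes from.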

Maintaining NAT Bindings #

Internally, a NAT needs to keep a mapping table that stores the bindings between internal and external addresses. In the example above, you would have a table something like:

Internal Address	External Port
10.0.0.3:1111	1234
10.0.0.2:1111	5678

Note that the external address is constant, so we don't need it in the table. Some larger NAT systems (see carrier-grade NAT below) have multiple external IP addresses, but we don't need to worry about that right now.

When the NAT receives a packet on the outgoing interface, it needs to do a table lookup. If a binding already exists for the packet, then the NAT just uses the entry in the table. If no binding exists, it creates a table entry with an unused port and forwards the packet. In this example, I've described what's called an "address-independent" NAT in which you have a single binding for a given local address/port combination, no matter what the remote address is. There are also "address-dependent" NATs, which use a different binding for each remote address. This will become relevant when we talk about NAT traversal in Part II.

When the NAT receives an incoming packet on the external interface, it also does a table lookup. If a table entry exists, it forwards the packet as expected, but if no entry exists then there's no way of knowing which host the packet is intended for; the sensible thing to do in this case is to just drop the packet. The result of this is that most consumer NATs only really support flows in which the machine behind the NAT speaks first to initiate the flow. This is usually conceptualized as an "outgoing-only" set of semantics and corresponds well to TCP connections, in which the client sends the first packet (a SYN). Indeed, some NATs rely on the TCP SYN to create bindings, and will just drop mid-connection TCP packets that correspond to unknown flows. This doesn't work with UDP so you just have to look at the first outgoing packet, ignoring whatever markings it has.
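
The two lookups can be sketched in a few lines of Python. This is a toy model, assuming an address-independent NAT with one external IP address and a fixed pool of external ports; the `SimpleNat` class and its method names are hypothetical.

```python
class SimpleNat:
    """Toy address-independent NAT with a single external IP address."""

    def __init__(self, external_ip, port_pool):
        self.external_ip = external_ip
        self.port_pool = iter(port_pool)
        self.out = {}    # (internal_ip, internal_port) -> external_port
        self.back = {}   # external_port -> (internal_ip, internal_port)

    def outbound(self, source):
        """Rewrite the source of an outgoing packet, creating a binding if needed."""
        if source not in self.out:
            port = next(self.port_pool)     # pick an unused external port
            self.out[source] = port
            self.back[port] = source
        return (self.external_ip, self.out[source])

    def inbound(self, external_port):
        """Return the internal destination, or None if the packet is dropped."""
        return self.back.get(external_port)


nat = SimpleNat("192.0.2.1", [1234, 5678])
alice_seen_as = nat.outbound(("10.0.0.3", 1111))   # ("192.0.2.1", 1234)
bob_seen_as = nat.outbound(("10.0.0.2", 1111))     # ("192.0.2.1", 5678)
reply_goes_to = nat.inbound(1234)                  # ("10.0.0.3", 1111)
unsolicited = nat.inbound(9999)                    # None: no binding, so dropped
```

The `unsolicited` case is the "outgoing-only" behavior: nothing inside the NAT sent from a port bound to 9999, so there is nowhere to deliver the packet.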

This "outbound connections only" semantic is often viewed as a security feature because it means that even if you have devices behind the NAT that have "open TCP ports", meaning that they listen on those ports for connections, external attackers may not be able to connect to them. This kind of device is surprisingly common, especially for things like printers or scanners which you want to be accessible to anyone on the local network, so a NAT is really providing a valuable function here. However, it's important to realize that unlike a firewall, which is explicitly designed to block certain kinds of connections, many NATs just do this as a sort of accidental side effect of their architecture—although others do so explicitly, as we'll see later—so it's not a guaranteed property that you should rely on.

Binding Lifetimes #

This brings us to the obvious question of when the NAT should delete bindings. Cleaning up old bindings is an important function because otherwise the NAT would quickly use up its available port space. There are a number of ways to manage this:

Keep the binding open until the connection is torn down, either by a TCP FIN or a TCP RST. This doesn't work with many UDP-based protocols, which either don't have messages indicating connection closing (such as RTP) or where those messages are encrypted (such as QUIC or DTLS 1.3). This method also isn't sufficient even for TCP, because the client might have shut down without sending a FIN, for instance if it crashed or the user put their laptop to sleep.
Use a timeout and tear down connections which are idle for too long. This guarantees that eventually the resources will be released, because if the client shuts down, it won't be sending packets. However, "too long" is just a heuristic. Network protocols are often designed so that if there is no data flowing they don't send any packet (TCP is this way), in which case you may just be tearing down a connection right as the client was about to send something. More modern protocols incorporate "keepalive" packets to keep the NAT bindings open, but remember that the idea here is that a NAT should work with protocols that were designed before the NAT was deployed, so this is not an ideal solution.
Delete the least-recently-used connections once some maximum number of connections is reached and a new one needs to be allocated. This has many of the same problems as the timeout but is a slight improvement in some respects because it doesn't delete old connections unless the table is full.

It's of course also possible to use more than one of these mechanisms at once. For instance, you might look at the TCP control packets to drop TCP connections but use timers as a backup for client shutdown and for other protocols.
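To make the combination concrete, here is a minimal sketch of a binding table that uses all three mechanisms at once: explicit teardown on FIN/RST, an idle timeout, and least-recently-used eviction when the table is full. The class name, the table size, the timeout value, and the port allocation scheme are all assumptions for illustration and aren't taken from any real NAT.

```python
from collections import OrderedDict


class BindingTable:
    """Toy NAT binding table: idle timeout plus least-recently-used eviction."""

    def __init__(self, max_bindings=4, idle_timeout=120.0):
        self.max_bindings = max_bindings
        self.idle_timeout = idle_timeout
        # Maps (internal_ip, internal_port) -> (external_port, last_used).
        # Kept ordered oldest-used first, so the LRU entry is at the front.
        self.bindings = OrderedDict()
        # Hypothetical allocator: hands out increasing ports, never reuses them.
        self.next_port = 40000

    def expire(self, now):
        """Drop every binding that has been idle longer than the timeout."""
        stale = [key for key, (_, last) in self.bindings.items()
                 if now - last > self.idle_timeout]
        for key in stale:
            del self.bindings[key]

    def outbound(self, key, now):
        """Look up (or create) the binding for an outgoing packet."""
        self.expire(now)
        if key in self.bindings:
            port, _ = self.bindings[key]
        else:
            if len(self.bindings) >= self.max_bindings:
                self.bindings.popitem(last=False)  # evict the LRU binding
            port = self.next_port
            self.next_port += 1
        self.bindings[key] = (port, now)
        self.bindings.move_to_end(key)  # mark as most recently used
        return port

    def close(self, key):
        """Called when a TCP FIN or RST is seen for this flow."""
        self.bindings.pop(key, None)
```

Note that every outgoing packet refreshes the binding's timestamp, which is why keepalive packets are enough to hold a binding open.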

Non-TCP/UDP Protocols #

Of course, TCP and UDP are not the only protocols which it is possible to run on the Internet. The IP datagram's "next protocol" field is an 8-bit value and only about half of these are assigned so in principle it's possible to introduce new protocols that run directly over IP. In practice, however, NATs make this extremely problematic because the port field is not in the IP header but rather in the header of the protocol that sits above IP (e.g., TCP or UDP), which means that the NAT needs protocol-specific logic for each new protocol.

A good example here is SCTP, a TCP-like protocol that introduces a number of new features like multiplexing on the same connection. SCTP was intended to run over IP, just like TCP, and SCTP's header actually has the source and destination ports in the same location as TCP and UDP, as shown below:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     Source Port Number        |     Destination Port Number   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Verification Tag                         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Checksum                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Firewalls #

The situation is actually much worse than I'm making it out here, because network security devices like firewalls are often configured to reject any traffic that they don't understand. Even if a new protocol magically worked with NATs without modification, it would be blocked by many firewalls.

You might think, then, that a NAT which just always rewrote whatever bytes were in the location of the source/destination port fields for UDP or TCP would work fine with SCTP, but that's not correct. It's true that it would rewrite the fields, but that would just create another problem, because the SCTP packet also includes a checksum (the last field in the header shown above) which is computed over the entire packet and is designed to detect any change to the packet, including the port numbers. This means that any NAT which rewrites the source and destination port also needs to rewrite the checksum, otherwise the checksum verification will fail at the receiver and the packet will be discarded.^[6] The SCTP checksum is in a different place than the TCP (or UDP) checksum and is computed using a different algorithm, so even if you just went ahead and used the TCP rewriting code—which isn't a good idea for other reasons—you'd just end up damaging some other part of the packet. The bottom line, then, is that it's not safe for NATs to just rewrite packets they don't understand (even though in some cases it might happen to work), and instead NATs need to be modified in order to support each new protocol, which means that any such protocol starts out broken on a huge fraction of clients, making it very hard to get traction.
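Here is a rough sketch of why the naive rewrite fails. SCTP's checksum is CRC32c, computed over the whole packet with the checksum field zeroed. The helper names are made up, the "payload" is just opaque bytes rather than real SCTP chunks, and the byte order used to store the checksum is a simplification, so treat this as an illustration rather than a conformant implementation.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC32c (Castagnoli polynomial, reflected form)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def _with_checksum(packet: bytes) -> bytes:
    """Recompute the checksum over the packet with the field zeroed."""
    zeroed = packet[:8] + b"\x00" * 4 + packet[12:]
    return packet[:8] + struct.pack("<I", crc32c(zeroed)) + packet[12:]


def build_sctp(src_port, dst_port, tag, payload):
    """Common header: ports, verification tag, checksum (filled in last)."""
    header = struct.pack("!HHII", src_port, dst_port, tag, 0)
    return _with_checksum(header + payload)


def verify_sctp(packet: bytes) -> bool:
    """What the receiver does: recompute and compare."""
    return _with_checksum(packet) == packet


def naive_rewrite_src_port(packet: bytes, new_port: int) -> bytes:
    """A NAT that only knows where the port lives: checksum left stale."""
    return struct.pack("!H", new_port) + packet[2:]


def sctp_aware_rewrite_src_port(packet: bytes, new_port: int) -> bytes:
    """A NAT with SCTP-specific logic: rewrite, then fix the checksum."""
    return _with_checksum(naive_rewrite_src_port(packet, new_port))
```

With these definitions, a packet that has been through `naive_rewrite_src_port()` fails `verify_sctp()`, while one that has been through `sctp_aware_rewrite_src_port()` passes, which is the protocol-specific logic the NAT would need.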

Fortunately there is a well-known solution to this problem, which is to run your new protocol over UDP. The UDP header is comparatively lightweight, consisting of 8 bytes, 4 of which are the source and destination ports, which you'd need anyway. The other two fields are a two-byte length field, which you'd generally want, and a kind of outdated checksum, which only takes up two bytes, so there's not that much overhead.

 0      7 8     15 16    23 24    31  
+--------+--------+--------+--------+ 
|     Source      |   Destination   | 
|      Port       |      Port       | 
+--------+--------+--------+--------+ 
|                 |                 | 
|     Length      |    Checksum     | 
+--------+--------+--------+--------+
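For a sense of just how small this is, the whole header can be described with a single `struct` format string. This is only a sketch: the function names are mine, and I've left the checksum at zero (which in IPv4 means "no checksum computed") rather than calculating it.

```python
import struct

# Four 16-bit big-endian fields: source port, destination port,
# length (header plus payload), checksum.
UDP_HEADER = struct.Struct("!HHHH")


def pack_udp(src_port, dst_port, payload, checksum=0):
    """Build a UDP datagram; the checksum is not computed in this sketch."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def unpack_udp(datagram):
    """Split a UDP datagram into its header fields and payload."""
    src, dst, length, checksum = UDP_HEADER.unpack_from(datagram)
    return src, dst, length, checksum, datagram[UDP_HEADER.size:length]
```

A new protocol running over UDP just puts its own header and data in the payload.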

If you run your protocol over UDP, then NATs will generally work mostly correctly—again with the caveat that the NAT doesn't know when a connection stops and starts—so you start out from a position of things mostly working rather than them mostly failing. (When QUIC was first rolled out, Google found that around 95% of connections succeeded.) Of course, 95% isn't 100%, and experience with new protocols such as QUIC and DTLS (with WebRTC) suggests that any new protocol will experience some blockage; in practice this means that you need to arrange some way to fall back to an older protocol such as HTTPS if your new UDP-based protocol fails. There are a number of possible approaches here, including trying both in parallel (a technique often called Happy Eyeballs), trying the new protocol first and seeing if it fails, or trying the old protocol first and then in the background trying the new protocol.
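As a rough illustration, here is one way the "give the new protocol a head start, then race" variant might look with `asyncio`. Everything here is hypothetical: `connect_new` and `connect_old` stand in for whatever actually establishes a QUIC or HTTPS connection, and the head-start value is arbitrary.

```python
import asyncio


async def connect_with_fallback(connect_new, connect_old, head_start=0.25):
    """Prefer the new protocol, but fall back if it is slow or fails.

    connect_new and connect_old are hypothetical zero-argument coroutine
    functions that return a connection or raise an exception on failure.
    """
    new_task = asyncio.create_task(connect_new())
    # Give the preferred protocol a short head start.
    await asyncio.wait({new_task}, timeout=head_start)
    if new_task.done() and new_task.exception() is None:
        return new_task.result()

    # Either it failed or it is still trying: start the old protocol too.
    old_task = asyncio.create_task(connect_old())
    pending = {old_task} if new_task.done() else {new_task, old_task}
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            if task.exception() is None:
                for loser in pending:
                    loser.cancel()  # whichever protocol lost the race
                return task.result()
    raise ConnectionError("both protocols failed")
```

The important case is the one where the UDP packets are silently dropped: the new protocol never reports an error, so without the timeout the client would just hang.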

For this reason, the only really practical way to deploy new transport protocols on the Internet is over UDP,^[7] and this is what recent protocols such as QUIC (running directly over UDP) or WebRTC data channels (SCTP running over DTLS running over UDP) do.^[8] This principle was forcefully enunciated by voice over IP pioneer Jonathan Rosenberg (JDR) in an IETF session where someone was presenting a mechanism for running SCTP over NATs. JDR's response was something to the effect of:

There are some hard truths in the world and this is one of them. TCP and UDP are the new waist of the IP protocol stack.

In this context, "waist" refers to a famous analogy for the IP protocol suite illustrated by this image from a talk by IPv6 designer Steve Deering:

[Source: Steve Deering]

The idea is that IP can run on any kind of transport (radio, copper, whatever) and that you can run lots of protocols on top of it, but that IP is the common element hence the narrow "waist" of the hourglass. Rosenberg's point (which I agree with) is that this place is now occupied by UDP (and to a lesser extent TCP). Arguably, the situation is worse than this: it's so common to deploy new technologies over HTTP that I've seen arguments that HTTP is the new waist, but we're not there yet!

Application-Layer Gateways #

NAT works quite well for simple protocols which just consist of one connection (e.g., HTTP). However, there are some protocols which have a more complicated pattern. As an example, the File Transfer Protocol (FTP) is part of the original protocol suite and was widely used for downloading data prior to the dominance of the Web and HTTP. FTP had an unusual (to modern eyes) design which used two connections:

A control channel: which the client used to give instructions to the server.
A data channel: which was used to actually transmit data.

A download using FTP [edited from "UDP" — 2023-04-17] looks like the following:

The client would first connect to the FTP server and then issue instructions about what file to download. The server would then connect to the client (by default using the port number one lower than the one the client used, but the client can provide a port number) and send the file.

Of course, this won't necessarily work if you have a NAT, because the port number probably won't be right; even if the client uses the default, the NAT might not have two adjacent ports spare. Instead, the NAT would use what's called an application-layer gateway (ALG) and rewrite the client's request, like so:

An aggressive ALG #

Sometimes ALGs aren't so careful, however. The FTP ALG only works because the NAT knows about FTP, but what about unknown protocols? One possible implementation is to just pattern match by replacing any occurrence of the IP address (e.g., 10.0.0.1) or the IP address and port (e.g., 10.0.0.1:1111) with the NAT's address and port (and maybe even make a new NAT binding to go along with it.) This is a general mechanism but also a brittle one. In one hilarious case, Adam Roach (another VoIP pioneer) was trying to download a Linux disk image and kept getting checksum errors.

He eventually tracked it down by comparing the right image and the one he was getting and found a 4 byte difference, where the right value corresponded to his public IP address and the value he was getting was his internal address. What was happening was that the ALG in the NAT was just rewriting anything that looked like his external IP into his internal IP, regardless of where it was in the data stream. Not good!
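A toy version of this failure mode is easy to reproduce. The function below is my guess at the general shape of the bug, not the actual device's code, and the addresses are documentation/private examples rather than Adam's real ones.

```python
import hashlib
import ipaddress


def aggressive_alg(data: bytes, external_ip: str, internal_ip: str) -> bytes:
    """Replace every 4-byte sequence that matches the external address."""
    external = ipaddress.IPv4Address(external_ip).packed
    internal = ipaddress.IPv4Address(internal_ip).packed
    return data.replace(external, internal)


# A "disk image" that happens to contain the bytes of the public address.
image = b"\x7fELF" + bytes([203, 0, 113, 7]) + b"rest of the image"
received = aggressive_alg(image, "203.0.113.7", "10.0.0.1")

# Same length, different contents, so the published checksum no longer matches.
print(hashlib.sha256(image).hexdigest() == hashlib.sha256(received).hexdigest())
```

Any sufficiently large binary file has a decent chance of containing a given 4-byte sequence somewhere, which is what makes this kind of blind pattern matching so dangerous.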

Note that the NAT mostly doesn't interfere with the client's data: it just knows enough about FTP to know where the port number is, create the appropriate incoming NAT binding, and then replace it on the control channel. This of course won't work as well on unknown protocols and won't work at all on encrypted ones (in fact, any tampering with an encrypted protocol will generally just cause some kind of failure). At this point FTP is mostly gone (due to a combination of being insecure, being superseded by HTTP, at least in the case of Web browsers, and concerns about the quality of the implementations), and newer protocols don't adopt this pattern because they want to work well with NATs. The reason that ALGs of this kind were needed was to avoid breaking existing protocols when NATs were first introduced, but now that NATs are widespread, the opposite dynamic is in play and new protocols have to avoid breaking when run over existing NATs.
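To give a flavor of what the FTP ALG has to do, here is a minimal sketch of the control-channel rewriting. FTP's `PORT` command carries the address and port as six comma-separated decimal bytes, with the port encoded as `p1 * 256 + p2`. The function name and the `allocate_binding` callback are assumptions for illustration; a real ALG would also have to cope with commands split across TCP segments and adjust TCP sequence numbers when the rewritten line changes length.

```python
import re

# "PORT h1,h2,h3,h4,p1,p2" where the port is p1 * 256 + p2.
PORT_RE = re.compile(rb"^PORT (\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\r\n$")


def rewrite_port_command(line: bytes, nat_ip: str, allocate_binding) -> bytes:
    """Rewrite a client's PORT command to use the NAT's address and port.

    allocate_binding(internal_ip, internal_port) is a hypothetical callback
    that creates the incoming NAT binding and returns the external port.
    """
    match = PORT_RE.match(line)
    if match is None:
        return line  # not a PORT command: leave the data alone
    nums = [int(group) for group in match.groups()]
    internal_ip = ".".join(str(n) for n in nums[:4])
    internal_port = nums[4] * 256 + nums[5]
    external_port = allocate_binding(internal_ip, internal_port)
    fields = nat_ip.split(".") + [str(external_port >> 8),
                                  str(external_port & 0xFF)]
    return b"PORT " + ",".join(fields).encode("ascii") + b"\r\n"
```

Unlike the aggressive ALG, this only touches a line that parses as a `PORT` command and passes everything else through untouched.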

Carrier-Grade NAT #

Initially, NATs were largely deployed at the boundary of consumer or enterprise networks (where they are now ubiquitous). However, as IP address space got more and more scarce, ISPs found themselves in the position where they were not able to get enough IP addresses for each customer to have one. The solution, of course, was to have a giant NAT (usually called carrier-grade NAT (CGN)) which multiplexes multiple subscribers onto the same IP address. Of course, the customer may still have their own NAT, so with CGN you can have multiple layers of NATting and address rewriting, which of course couldn't possibly go wrong.

In a CGN scenario, the addresses assigned to subscribers can either be from unroutable address space (either from RFC 1918 or from the new RFC 6598 block), or can be IPv6 addresses. In the latter case, subscribers would just have IPv6 addresses and the NAT would rewrite things to IPv4 on the way out the door, in a technique called NAT64. This scenario isn't as simple as with IPv4 because the network also needs to rewrite IPv4 addresses in DNS A records to IPv6 AAAA records (a technique called DNS64) so that the IPv6-only clients can send to them; this comes with its own problems, but that's a topic for another post.
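The address synthesis step in DNS64 is mechanically simple: the IPv4 address is embedded in the low-order bits of an IPv6 prefix that routes to the NAT64. The sketch below assumes the well-known prefix `64:ff9b::/96` from RFC 6052 (operators can also use their own prefix) and only shows the address mapping, not the DNS handling around it; the function names are mine.

```python
import ipaddress

# Assumption: the RFC 6052 well-known prefix; a network may use its own.
WELL_KNOWN_PREFIX = ipaddress.IPv6Network("64:ff9b::/96")


def synthesize_aaaa(a_record: str, prefix=WELL_KNOWN_PREFIX):
    """DNS64 side: embed an IPv4 address in the low 32 bits of the prefix."""
    v4 = ipaddress.IPv4Address(a_record)
    return ipaddress.IPv6Address(int(prefix.network_address) | int(v4))


def extract_ipv4(v6_address: str):
    """NAT64 side: recover the IPv4 destination from the low 32 bits."""
    return ipaddress.IPv4Address(int(ipaddress.IPv6Address(v6_address)) & 0xFFFFFFFF)


print(synthesize_aaaa("192.0.2.33"))      # 64:ff9b::c000:221
print(extract_ipv4("64:ff9b::c000:221"))  # 192.0.2.33
```

The IPv6-only client sends to the synthesized address, and the NAT64 reverses the mapping to find the real IPv4 destination.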

The IETF and NAT #

For a long time, the IETF was basically in denial about NAT, for two major reasons:

Any packet rewriting (let alone ALGs) violates the end-to-end design of IP in which packets just go untouched from A to B.
It was seen as a technique to extend the lifetime of IPv4 when everyone should just be transitioning to IPv6 (sharpen the contradictions!)

The general attitude at the time was that standardizing NAT behavior would just encourage it and instead one ought to ignore NATs and hope they would go away, when the IPv6 rapture finally arrived. You can see this attitude as late as 2012, when RFC 6598 was published with the following statement:

A number of operators have expressed a need for the special-purpose IPv4 address allocation described by this document. During deliberations, the IETF community demonstrated very rough consensus in favor of the allocation.

While operational expedients, including the special-purpose address allocation described in this document, may help solve a short-term operational problem, the IESG and the IETF remain committed to the deployment of IPv6.

This all worked out about as well as you would think: NATs are everywhere and we still don't have anything like full deployment of IPv6. To make matters worse, in the absence of any guidance, NAT behavior became extremely variable and idiosyncratic, leading to ever more complicated workarounds. Eventually, in 2007, the IETF published RFC 4787, a document describing how NATs ought to behave; by that time there were of course a huge number of NAT deployments which didn't follow these guidelines, though they're hopefully useful for developers of newer devices.

Final Thoughts #

NATs provide a particularly good example of the way the Internet evolves, which is to say workaround upon workaround. The reason for this is what Google engineer Adam Langley calls the "Iron law of the Internet", namely that the last person to touch anything gets blamed. The people who first built and deployed NATs had to avoid breaking existing deployed stuff, forcing them to build hacks like ALGs and unpredictable idle timeouts. Now that NATs are widely deployed, new protocols have to work in that environment, which forces them to run over UDP and to conform to the outgoing-only flow dynamics dictated by the NAT translation algorithms. Of course, there is a whole class of applications that don't fit well into that paradigm, in particular peer-to-peer applications like VoIP and gaming. In the next post we'll look at techniques to make those work anyway, even with existing NATs.

Yes, I know I still have two unfinished series, one on transport protocols and one on Web security. I got a bit distracted, and, in the case of the transport protocol series, a bit carried away with one of the posts, but I do plan to get back to them. I'm already partway through Part II, so I should have that up relatively soon. ↩︎
In theory IANA could just assign numbers directly, but this allows for regional governance. ↩︎
Technically this mechanism is known as "Network Address/Port Translation" (NAPT) but as this is the most common approach, NAT is the common term. ↩︎
I've omitted one detail, which is that you need to give the clients all new addresses from the RFC 1918 space, but in modern networks, the client addresses are centrally assigned by the local network, so this is typically straightforward. ↩︎
We actually had shirts made, with the front saying "32 + 16 > 128", with the joke being that the 32 bit address + 16 bit port of IPv4 was better than the 128-bit IPv6 address. Cafe Press seems to have lost the design though. ↩︎
At one point there was a draft to make SCTP work better with NATs, but it doesn't seem to have ever been standardized. ↩︎
A big reason to have a new transport protocol is to have your own rate limiting and reliability mechanisms, and that doesn't work if you run them over TCP, which has its own mechanisms. ↩︎
NATs aren't the only reason to deploy new protocols over UDP. It's also helpful that you can implement new UDP-based protocols entirely in application space rather than by modifying the operating system. ↩︎

Architectural options for messaging interoperability

2023-03-10T00:00:00Z

As I mentioned in some previous posts, the EU Digital Markets Act (DMA) requires interoperability for number independent interpersonal communications services (NICS), which is to say stuff like messaging (what we used to call "Instant Messaging") as well as real-time media (voice and video calling). Specifically Article 7 says that:

2.   The gatekeeper shall make at least the following basic
     functionalities referred to in paragraph 1 interoperable where
     the gatekeeper itself provides those functionalities to its own
     end users:

(a) following the listing in the designation decision pursuant to Article 3(9):
    (i) end-to-end text messaging between two individual end users;
    (ii) sharing of images, voice messages, videos and other attached
    files in end to end communication between two individual end
    users;

(b) within 2 years from the designation:
    (i) end-to-end text messaging within groups of individual end
    users;
   (ii) sharing of images, voice messages, videos and other attached
   files in end-to-end communication between a group chat and an
   individual end user;

(c) within 4 years from the designation:
   (i) end-to-end voice calls between two individual end users;
   (ii) end-to-end video calls between two individual end users;
   (iii) end-to-end voice calls between a group chat and an individual
        end user;
   (iv) end-to-end video calls between a group chat and an individual
        end user.

The European Commission (specifically the Directorate General for Communication (DG COMM)) has been holding a series of workshops on how to structure the compliance requirements for the DMA. Last week, I attended the workshop on messaging interoperability to serve on a panel along with Paul Rösler (FAU Erlangen-Nürnberg), Stephen Hurley (Meta), Alissa Cooper (Cisco), and Matthew Hodgson (Element/Matrix). (video here; my presentation starts at 13:24:10; slides here). Nominally the panel was about the impact of end-to-end encryption on interoperability (see here for some earlier thoughts on this), but in the event it turned into more of an overall discussion of the broader technical aspects.

The rest of this post expands some on my thinking in this area. Note that while I work for Mozilla, these are my opinions, not theirs.

Overview of Technical Options #

At a high level, there are three main technical options for providing this kind of interoperability, in ascending order of flexibility for the competitor product:

Gatekeeper-provided libraries: The gatekeeper provides a software library (ideally in source code form, but perhaps not) which implements their interfaces. The competitor builds their app using that library and doesn't have to know—and maybe doesn't get to see—any details of those interfaces.^[1]
Gatekeeper-specified interfaces: The gatekeeper publishes a set of interfaces (Web APIs, protocols, etc.) that competitors can use to talk to its system. The competitor implements those APIs themselves—or maybe someone writes an open source library to implement them—and talks to the gatekeeper's system that way.
Common protocols: The gatekeeper implements interfaces based on some common—preferably standardized—protocol. The competitor implements that protocol and uses it to talk to the gatekeeper.

We'll take a look at each of these options below.

Requirements Scope #

The big question lurking in the background of the entire workshop was the scope of the requirement that the EC would levy on gatekeepers. I'm not a legal expert, but the above quoted text seems to require that the gatekeepers make this functionality available but not dictate any particular means of doing so. In particular, they might opt to just publish some libraries or specifications that anyone who wants to interoperate with them must conform to (the EU technical term here is "reference offer"). The Commission would then be responsible for ensuring that this reference offer was compliant, which is to say that:

It provides the required functionality.
It is sufficiently complete to implement from.

For reasons that will occupy most of the rest of this post, this is not really an ideal state of affairs, and it would be easier for competitors (technical term "access seekers") if there were some single set of interfaces (protocols) that every gatekeeper implemented. However, the tone of the workshop was that the Commission is not eager to require a single set of standards at this stage and that there's some question about exactly what the DMA empowers them to require in this area. For the purposes of this post, I'm going to put that question aside and focus on the technical situation as I see it.

Interoperability is Hard #

The first thing to realize is that interoperability is really difficult to achieve, even when people are trying hard. The basic problem is that protocol specifications tend to be fairly complicated and it is difficult to write one that is sufficiently precise and complete that two people (or groups) can independently construct implementations that interoperate. In fact, some standards development organizations require demonstration that every feature has two independent implementations that can interoperate in order to advance to a specific maturity level (in IETF, "Internet Standard", though as a practical matter, even many widely deployed and interoperable protocols never get this far, just because it's a hassle to advance them). Part of the process of refining the protocol is finding places where the specification is ambiguous and modifying the specification to clarify them.

Over the past 10 years or so, I've been heavily involved in the standardization and interop testing of at least three major protocols (and a number of smaller ones):

In each case, we discovered issues with just about every implementation, in many cases leading to interoperability failure.

A full description of how interop testing works is outside the scope of this post, but at a high level each implementor sets up their own endpoint and you try to make them communicate; in most cases this will initially be unsuccessful. If you're lucky, one of the implementations will emit some kind of error, but sometimes it just won't work (e.g., you just get a deadlock with neither side sending anything). What you do next depends on precisely what went wrong. As an example, if a message from implementation A elicited an error from implementation B, then you look at the message and the error it generated and try to determine if the message was correct (in which case it's B's fault), if it was incorrect (in which case it's A's fault), or if the specification is ambiguous (in which case it needs to be updated). Once the implementors have decided on the correct behavior, then one (or sometimes both!) of them change their implementations, and you rerun the test, hopefully getting a little further before things break. This process repeats with increasingly more difficult scenarios until everything works.

There are several important points to remember here:

The specification is frequently unclear; even when there is a best reading, it's often not entirely obvious.
Even when the specification is clear, implementors make mistakes, leading to interoperability problems.
Interop testing is a high-bandwidth process that requires close collaboration between implementors. In particular, it's vital to be able to understand what the other implementation didn't like about what your implementation did, rather than just knowing that it didn't work.

This last point is especially important: if you just send a message to the other side and get an error, then you're left scrutinizing your code over and over to see if you did something wrong, even when it turns out the problem is on the other side.

When I was working on WebRTC and trying to get interoperability between Firefox and Chrome, I spent quite a few days in Google conference rooms with Justin Uberti, the tech lead for Chrome's WebRTC implementation, doing just what I described above. It also helped that both Firefox and Chrome were open source, so we were able to look at each other's code and figure out what must be happening. Getting this to work would have been approximately impossible if all I had had was a copy of Chrome and no insight into what was happening internally, or if we hadn't been right next to each other. This problem is especially acute for cryptographic protocols, where any error tends to lead to some sort of opaque failure such as "couldn't decrypt" or "signature didn't validate". If you can't see the intermediate computational values (e.g., the keys or the inputs to the encryption), you're back to trying to guess what you did wrong (and good luck if it's the other side!).

More recently, the IETF has developed something of a system for this (thanks to Nick Sullivan from Cloudflare for kicking off this process), starting with TLS 1.3 and now with QUIC. Basically, everyone gets in the same room (to the extent possible) and stands up their implementation and other people try to talk to it.^[2] There's a lot of back-and-forth of the form of "Hey, I'm getting error X when I talk to your implementation, can you take a look". The end result of this is an interoperability matrix showing which implementations can talk to each other and with which tests. For instance, here's the interop matrix for one of the later QUIC drafts:

Each cell is a single client/server pair, with the client down the left and the server across the right. The letters indicate which tests worked, with the color indicating how well things are going, the darker the better.
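The bookkeeping behind a matrix like this is straightforward, even if producing the results is not. Here is a small sketch of the data structure, with one cell per client/server pair holding the letters of the tests that passed. The implementation names and test letters used with it are invented for illustration and don't correspond to the real QUIC interop tests.

```python
def interop_matrix(results):
    """Build {(client, server): set of passed test letters}.

    results is an iterable of (client, server, test_letter, passed) tuples.
    """
    matrix = {}
    for client, server, test, passed in results:
        cell = matrix.setdefault((client, server), set())
        if passed:
            cell.add(test)
    return matrix


def render(matrix, clients, servers):
    """One text line per client, listing the passed tests for each server."""
    lines = []
    for client in clients:
        cells = []
        for server in servers:
            letters = "".join(sorted(matrix.get((client, server), ())))
            cells.append(f"{server}={letters or '-'}")
        lines.append(f"{client}: " + " ".join(cells))
    return lines
```

Note that the matrix is not symmetric: implementation A's client working against B's server says nothing about B's client working against A's server, which is why every ordered pair gets its own cell.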

This is all an enormous amount of work, and it's important to remember that this is a best-case scenario in that the people writing the specification are trying very hard to make it as clear as possible and generally the implementors are trying to be helpful to each other. The situation with messaging is quite different: the gatekeepers could have provided interoperable interfaces at any time but chose not to. Instead, they're just being required to provide them by the DMA, so their incentive to make it work is comparatively low. Moreover, they may not even have their own internal documentation; in my experience it's quite common for engineering organizations to just embody their interfaces in code with minimal documentation which is insufficient to implement from.

The first 10% is often easy #

It's often relatively easy to get things to sort of work in simple configurations (what is sometimes called the "happy path"). For instance, the first real public demonstration of WebRTC interop between Chrome and Firefox was in early 2012, at a point where it just barely worked and needed handholding from Justin and myself. Firefox didn't work with Google Meet until 2018, which required changes on both sides. A particular issue was around multiple streams of audio and video (see here for background on the "plan wars of 2013").

In a similar vein, during the workshop Matthew Hodgson showed a demo of Matrix interoperating with WhatsApp via a local gateway,^[3] which serves as a good demonstration that interop is possible, but as he mentioned himself, shouldn't lead anyone to conclude this is a trivial problem. Sending messages back and forth is probably the easiest part of this problem; it's all of the details (group messaging, media, ...) that will be the hard part to get right and are also essential to it being ready for real users.

It gets better #

Note that the scenario I'm talking about here is mostly what happens with early protocol development and deployment. Once there are widespread open source implementations that are fairly conformant, you can just test against those implementations and debug them directly when you have a problem; of course, that's not likely to be the case for messaging interoperability, at least initially, and especially if the gatekeepers just publish their specs without any reference implementation.

The surface area is enormous #

The number of different protocols that need to be implemented in order to build a complete messaging system along with voice and video calling is extremely large. At minimum you need something like the following:

Messaging
- A protocol for end-to-end key establishment (e.g., MLS, OTR, Signal)
- The format for the messages themselves (e.g., MIME)
- A transport protocol for the messages (e.g., XMPP)
Voice and video. Everything above plus
- Media format negotiation (e.g., SDP)
- NAT traversal (e.g., ICE) if you want peer-to-peer media
- Media transport (e.g., RTP/SRTP)
- Voice and video codecs (e.g., Opus, AV1, H.264, etc.)

Every one of the things I've named above is a very significant piece of technology, often running to hundreds if not thousands of pages of specifications.^[4]

As a concrete example, let's look at WebRTC. At the time WebRTC was being designed, there was an existing ecosystem of standards-based voice and video over IP that used Session Initiation Protocol (SIP) for signaling and RTP/SRTP for media. Those protocols were in wide use but often not on an interoperable basis. Although many of the people who designed WebRTC had also been involved in building that ecosystem, there was also a feeling that many of those protocols were due for revision, so there was a fair amount of updating/modification. By the time the overarching protocol specification document JavaScript Session Establishment Protocol (JSEP) was published in 2021, the set of relevant documents had grown to tens of RFCs running to thousands of pages. Moreover, these RFCs themselves depended on other previously published RFCs defining stuff like audio and video codecs (for instance, the specification for the mandatory-to-implement Opus audio codec is 326 pages long, though at least part of that is a reference implementation).

Of course, these are specifications for general purpose systems and so you could almost certainly build a single system that had less complexity. For instance, a lot of the complexity in WebRTC is around media negotiation: suppose that one side wants to send two streams of video and one stream of audio, but the other side only wants to receive one stream of video. An interoperable system needs to specify what happens in this case, but if you have a closed system you can just arrange that your software never gets itself into that state. There are quite a few other cases like this where you can get away with a lot less in a closed environment, but even then there will still be quite a bit of complexity.

At the same time, it's increasingly possible for small teams to quickly build quite functional voice and video calling systems. This apparent contradiction is explained by realizing that there are widely available software libraries (for instance, the somewhat confusingly named WebRTC library) that implement most of these specifications and provide an API that hides most of the details. The result is that as long as you're willing to take whatever that library implements, it's possible to build a functional system, but you're pulling in the transitive closure of all the specifications it depends on. The same thing is true for other protocols such as TLS, XMPP, Matrix, etc.

The key point to take home here is that actually having interoperability between the gatekeepers and competitive products requires nailing down an enormous number of details, even if those details are hidden behind software libraries. To the extent to which this uses the existing interfaces and protocols, then this is a somewhat more straightforward problem, but if a gatekeeper has built a largely proprietary system from the ground up, then the effort of specifying it in enough detail that someone else can build their own interoperable implementation—not to mention the effort of building that implementation—is likely to be very considerable.

It has to be implemented on the client #

If you want to have end-to-end encryption of the communications, then this means that much of the complexity has to be implemented on the client. Specifically:

You need to have end-to-end key establishment so the key establishment protocol needs to run on the client.
Text messages are encrypted and so they need to be constructed on the sender's client software and decrypted on the receiver's.
Similarly, audio and video need to be encoded and decoded on the client.

This is different from (for instance) e-mail, which is typically not end-to-end encrypted, or non-E2E videoconferencing systems, in that you can centrally transcode the media. For instance, if Alice can only send audio with the Opus codec and Bob can only receive with G.711, then the central system can transcode it, but if the data is end-to-end encrypted, that's not possible (the whole point of end-to-end encryption is that the central system can't modify the content). Instead, you need to ensure that every client has the necessary capabilities. It is possible to implement some of the system on the server. For instance, because the transport is generally not end-to-end encrypted (though it should be encrypted between client and server), you might be able to gateway transports between systems, such as if you want to connect an XMPP client to a system that isn't natively XMPP.
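The codec example can be boiled down to a few lines. This is a deliberately simplified model of the decision, not how any real negotiation protocol works: the function name and the return values are made up, and it ignores everything except the codec lists.

```python
def plan_media_path(sender_codecs, receiver_codecs, end_to_end_encrypted):
    """Decide how media can get from sender to receiver.

    Codec lists are in preference order. Returns a (mode, detail) tuple.
    """
    common = [codec for codec in sender_codecs if codec in receiver_codecs]
    if common:
        # Both clients share a codec: no help needed from the middle.
        return ("direct", common[0])
    if not end_to_end_encrypted:
        # The central system can see the media, so it can transcode.
        return ("transcode", (sender_codecs[0], receiver_codecs[0]))
    # With E2E encryption the server can't touch the media at all.
    return ("fail", None)


print(plan_media_path(["opus"], ["g711"], end_to_end_encrypted=False))
print(plan_media_path(["opus"], ["g711"], end_to_end_encrypted=True))
```

The only way to avoid the failure case in an end-to-end encrypted system is to guarantee a common codec up front, which is exactly the kind of detail an interoperability specification has to nail down.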

Gatekeeper Libraries #

Technically, gatekeepers don't need to actually publish their interfaces to achieve interoperability. Instead, they could just build a software library that implements their system. The idea here is that if I build EKRMessage and I want it to talk to WhatsApp, I just download their library and build it into EKRMessage. There will be some set of functions that I need to call (e.g., sendMessageTo() to send a message) to implement the interoperability.

The obvious advantage of this design is that it hides^[5] a lot of complexity from the new implementor: the gatekeeper doesn't need to document the details of any of their protocols in the kind of excruciating detail I mentioned above; they just build that into their software. Of course now they have to document the library, but that's usually a lot easier (after all, this is why people use libraries). This kind of library can be a huge force multiplier: as I mentioned above, the existence of Google's WebRTC library has made it much easier for people to build powerful A/V apps. However, if every gatekeeper has their own library, this is a lot less attractive, as Alissa Cooper from Cisco pointed out in her workshop presentation. To just briefly sketch a few of the problems with this approach:

Code bloat. Each competitor application will need to build in a copy of every gatekeeper's library, which is extremely inefficient. As an example, the WebRTC.org library is over 800K lines of code. Imagine that times 5.
Code architecture. Each library is going to have its own particular API style, which is going to make architecting your app different. As a concrete example, consider what happens if one library is asynchronous and event-driven and another is synchronous and uses threads. Building an app that uses both cleanly is going to be architecturally difficult (typically you end up trying to force fit one control flow discipline into the other).
Portability. If the gatekeeper provides their library in binary (compiled) form, then it will only be usable on the specific platforms the gatekeeper builds it for. Even if they provide source code, that often will not work on one platform or another without work (portable code is hard!). Of course, someone could port the library, but if they aren't able to upstream their changes to the gatekeeper's source, then the porting work needs to be repeated whenever the gatekeeper makes changes. In addition to questions of platform, we also have to think about questions of language: if the library is written in Java and I want to write my application in Rust, I'm going to have a bad day.
Dependency. The competitor's application is dependent on the gatekeeper's engineering team, with little ability to fix defects—and all software has defects—and mostly has to wait for the gatekeeper to do it. This is true even if the library is nominally open source, because it's a huge amount of effort to find and fix problems in other people's code. Additionally, whenever the gatekeeper changes their library, you just need to update, even when it's a big change.
Security. The competitor is taking on the union of all the vulnerabilities in each library they bring in. This creates problems whenever a defect is found because every competitor needs to upgrade (this is always a problem with vulnerabilities in libraries, of course, which is one reason why people try to minimize rather than maximize their dependencies). Effectively, the security of the competitor's app becomes that of the weakest of the gatekeepers they interoperate with, which is obviously bad.
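To make the "Code architecture" point a bit more concrete, here is a minimal sketch of forcing a synchronous library into an asynchronous, event-driven app. Everything here is hypothetical: `blocking_send_message_to()` stands in for a blocking gatekeeper library call, and the adapter just pushes it onto a worker thread.

```python
import asyncio
import time


def blocking_send_message_to(recipient: str, text: str) -> str:
    """Stand-in for a synchronous gatekeeper library call (hypothetical)."""
    time.sleep(0.05)  # pretend to do blocking network I/O
    return f"sent to {recipient}: {text}"


async def send_message_to(recipient: str, text: str) -> str:
    """Async adapter: run the blocking call in a worker thread."""
    return await asyncio.to_thread(blocking_send_message_to, recipient, text)


async def main() -> list[str]:
    # The event loop stays responsive while both blocking calls run in threads.
    return await asyncio.gather(
        send_message_to("alice", "hi"),
        send_message_to("charlie", "hello"),
    )


results = asyncio.run(main())
print(results)
```

This works, but it's exactly the kind of force fit described above: you now own thread-safety, cancellation, and error mapping for code you didn't write, and you get to do it again for each library with a different style.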

All of these problems of course exist any time you take a dependency on another project, which is why engineers are careful about doing it. However, in most of those cases the provider of the dependency wants you to use their code—you may even be paying them—and so is motivated to help. That's not really the case here. The bottom line is that I think this is a pretty bad approach, so I'm not going to spend much more time on it in this post.

Common versus Gatekeeper-Specific Interfaces/Protocols #

The big question here is whether gatekeepers will use (or be required to use) a common set of protocols/interfaces or whether they would be able to just dictate their own interfaces that anyone who wanted to interoperate with them would have to use. Often this was phrased as whether the gatekeepers would be required to implement "standards", but "standards" can mean anything from "this is what everyone does" to "this is ratified by the International Telecommunication Union (ITU)", so I'm not sure how helpful that term really is. The key question is whether everyone is going to do more or less the same thing or whether connecting to each gatekeeper will require doing something new. As I said above, I think the common approach has some obvious technical advantages.

Implementor Complexity #

The biggest advantage of a common set of protocols/interfaces is that it makes life much easier for access seekers/competitors. If you need to build to entirely different interfaces for each gatekeeper, it's going to be a lot of work to add a new gatekeeper, which obviously doesn't promote interoperability on the broad scale, and is likely to lead to lower service quality because you have to spread your effort out across all the implementations. Alissa Cooper had a great slide showing what this looks like, where every competitor has a little—or actually not so little—copy of every gatekeeper's system in their app:

This model, where an app speaks a bunch of different protocols but tries to present a unified user interface (what Matthew Hodgson was calling a "polyglot app") used to be reasonably common back in the early days of instant messaging, where there were multiple open (XMPP, IRC) or semi-open (AIM) messaging systems. This means we know it's possible but we also know it's a lot of work. By contrast, with a single set of protocols/interfaces each app only has to have a single implementation that it can use everywhere and can focus on making it really good.

Clearer Specifications #

As described above, one big barrier to interoperability is lack of clear specifications. It's just incredibly hard to write an unambiguous document, and it's even harder when you're documenting some pre-existing piece of software that never really had a written specification (as is most likely the case for many of the systems in this space): it's just way too easy to implicitly import assumptions about how your system actually behaves without clearly documenting them.

Standards development organizations like IETF and W3C have gotten a lot better at this over the years and have developed a set of practices that contribute to specification clarity. These include:

Early implementation and interop testing.
Automated test harnesses, both for interoperability and for conformance (e.g., WPT).
Widespread review using code collaboration tools (e.g., Github) that make it easy for people to report small issues.
Formal review and analysis for the most security critical pieces (for instance, TLS 1.3 had at least 9 separate papers published on its security before it was published).

These practices aren't a panacea, of course, and many IETF and W3C specs are still impenetrable, but in my experience the ones where the community really agreed they were important (TLS, QUIC, etc.) are reasonably clear. The main idea here is about getting as many eyes on the problem and from as many different perspectives as possible. This is only possible in an environment of collaboration across many organizations and it's hard to see how it will work when gatekeepers just publish specifications and throw them over the wall to competitors.

Developer Experience #

As discussed above, much of the experience of developing a protocol implementation is trying to interpret the specifications and figure out why your implementation is misbehaving. This is obviously much easier if there are open source implementations and a community of other implementors you can work with. If instead we're going to have gatekeeper published interfaces, then the gatekeepers will need to provide quite a bit of support to developers who want to talk to their systems. At minimum, this looks something like:

Detailed specifications: that really are complete enough to implement, ideally including example protocol traces and "test vectors" for the cryptographic pieces.
Public test servers: that developers can talk to. These need to be separate from production servers because testing in production is too dangerous. Moreover, they need to have a much higher level of visibility (at minimum very detailed logs) so that developers can see what is going wrong.
Live support: from engineers who understand how things are implemented and can help with debugging when log inspection fails.
Stable interfaces: that remain live once published, so that developers aren't constantly having to update their code.

Without this level of support it's going to be extremely difficult for competitors to make their code work in any reasonable period of time, let alone update it as the gatekeeper makes changes.

Deployment Issues with Multiple Gatekeepers #

Most of the scenarios people seem to be considering involve one or more competitors interoperating with a single gatekeeper, e.g., Wire and Matrix talking to WhatsApp. This is a good first step and it's of course not straightforward, but it's really playing on easy mode, because the competitors have a real incentive to do whatever it takes to interoperate with any gatekeeper. What happens when someone on gatekeeper A wants to talk to gatekeeper B? If everyone just publishes their own protocols, then one of the gatekeepers has to implement the other side's version. It's not clear to me that this is required by the DMA. And if it is required, who will have to do it?

This issue is particularly acute in group message contexts. As a number of panelists mentioned, group messaging is now the norm and 1-1 is just a special case of a small group. Once you have large groups, you have the possibility of a group which involves more than one gatekeeper. Consider the case shown in the diagram below:

In this example, Alice and Charlie are on iMessage and WhatsApp respectively. Bob is on EKRMessage and is able to individually communicate with them because that client implements those interfaces. As noted above, this is inconvenient, but will work.

Now what happens when Bob wants to create a chat with Alice and Charlie? He can send and receive messages to each of them individually, but if neither WhatsApp nor iMessage implements compatible interfaces, then when Alice or Charlie sends a message, the other side can't receive it. Importantly, unlike the simple 1-1 case between gatekeepers, this looks to Bob like a defect in his messaging system, not like noncooperation by the gatekeepers. There's not much that Bob's client can do about it: it could presumably decrypt the messages and reencrypt them, but this destroys end-to-end identity, which is undesirable.

The point here is that in order to have interoperable messaging work well in group contexts, basically everyone has to implement the protocols of anyone who might be in the group.^[6] This will sort of work if there are a pile of protocols, but obviously it would be a lot easier and cheaper if instead everyone implemented something common.
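As a back-of-the-envelope sketch of how this scales, the function below counts protocol implementations under two assumptions of mine: in the pairwise world every provider implements every other provider's protocol (ignoring the tiebreaker optimization), and in the common world everyone implements one shared protocol. The function name is made up for illustration.

```python
def implementations_needed(providers: int) -> dict[str, int]:
    """Count protocol implementations across `providers` messaging systems."""
    return {
        # each provider implements the protocol of each of the others
        "pairwise": providers * (providers - 1),
        # each provider implements the one shared protocol
        "common": providers,
    }


for n in (2, 3, 5, 10):
    print(n, implementations_needed(n))
```

The pairwise count grows quadratically, while the common count grows linearly, which is the "a lot easier and cheaper" part.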

Having a common protocol is even more important in videoconferencing situations: video tends to take up a lot of bandwidth and sending an individual copy of the media to each receiver can easily overrun consumer Internet links. Instead, large conferences typically use what's called a "star" configuration in which each endpoint sends one copy of the video to a central server (a media conferencing unit (MCU)) which then retransmits it to each receiver. But if a group with N gatekeepers means that I need to send in N different formats, then this will be dramatically less efficient. However, this is even true to some extent for messaging: the new IETF Messaging Layer Security (MLS) protocol was designed to work well in large groups, but won't work as well if you have to do pairwise associations.
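Here is a rough sketch of the bandwidth arithmetic for a single sender. The bitrate and group size are illustrative numbers I picked, not measurements, and the model ignores things like simulcast and congestion control.

```python
def upload_kbps(receivers: int, bitrate_kbps: int, topology: str, formats: int = 1) -> int:
    """Sender's upload bandwidth for one video stream.

    mesh: one copy per receiver.
    star: one copy per distinct format sent to the central server.
    """
    if topology == "mesh":
        return receivers * bitrate_kbps
    if topology == "star":
        return formats * bitrate_kbps
    raise ValueError(f"unknown topology: {topology}")


# A 10-person call (9 receivers) at an assumed 1500 kbps per stream.
print(upload_kbps(9, 1500, "mesh"))             # one copy per receiver
print(upload_kbps(9, 1500, "star"))             # one copy to the server
print(upload_kbps(9, 1500, "star", formats=3))  # three gatekeeper formats
```

Even in the star case, needing one copy per gatekeeper format multiplies the sender's upload cost by the number of formats.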

Identity #

Identity presents a special problem for reasons I've discussed previously. Those posts have more detail, but briefly each endpoint needs to be able to discover and verify the identity of every other endpoint. As with everything else, this can be done either with a common protocol or with pairwise implementations of gatekeeper protocols. However, the situation is more complicated here because many messaging systems use overlapping namespaces.

In particular, it's quite common to use phone numbers (E.164 numbers) as identifiers, as (for instance) both iMessage and WhatsApp do. This raises a number of questions:

When someone from iMessage sends a message to someone from WhatsApp, how does their identity appear?
How do messaging apps know which service to use when given a bare E.164 number?
What happens if someone has an account on two phone-number-using services?

There are several possible approaches to addressing these issues (as discussed in the above-linked posts) but we're going to need to have some kind of answer, and if each system is left to solve it itself, there is likely to be a lot of confusion.
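To see why a bare E.164 number is ambiguous, consider this toy sketch. The `Directory` class and the lookup model are entirely hypothetical (no such shared registry exists today); it only shows that once two services claim the same number, the client needs some policy to pick one.

```python
from collections import defaultdict


class Directory:
    """Toy registry mapping E.164 numbers to the services that claim them."""

    def __init__(self) -> None:
        self._claims: dict[str, set[str]] = defaultdict(set)

    def register(self, number: str, service: str) -> None:
        self._claims[number].add(service)

    def resolve(self, number: str) -> tuple[str, list[str]]:
        services = sorted(self._claims.get(number, set()))
        if not services:
            return ("unknown", [])
        if len(services) == 1:
            return ("unique", services)
        # More than one service asserts this identity: someone has to choose.
        return ("ambiguous", services)


directory = Directory()
directory.register("+15555550100", "imessage")
directory.register("+15555550100", "whatsapp")
directory.register("+15555550101", "whatsapp")

print(directory.resolve("+15555550100"))
print(directory.resolve("+15555550101"))
print(directory.resolve("+15555550102"))
```

Whatever resolves the ambiguous case (user choice, a default, a shared registry) is a design decision that has to be made somewhere.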

I did want to call out one particular risk: it's natural—at least to some—to want this to be as seamless as possible, for instance by using phone number identifiers and automatically identifying the right service to use, but this increases the attack surface area so that multiple providers can assert a given identity. There are potential ways to mitigate this (see previously), but they would actually need to be specified and deployed. This is also an area where it would be advantageous to have a single solution everyone agreed on, both because it's hard to get this right, and because it would make it easier to address questions of who owned which identity.

Timeline #

One of the big concerns that I've seen raised about having a system based on common protocols is that the DMA sets a very ambitious timeline and that standards can take a long time to develop. There certainly is some truth in this, but the good news is that many of the pieces we need already exist (indeed, we often have several alternatives):

Function	Protocols
End-to-end key establishment	MLS, OTR, Signal (and variants)
Identity	X.509, Verifiable Credentials, OIDC
Messaging format	MIME, Matrix
Message transport	XMPP, Matrix
Media format negotiation	SDP
NAT Traversal	ICE
Media Transport	SRTP
Voice encoding	G.711, Opus
Video encoding	VP8, AV1

Some of these pieces are in better shape than others—I'd really prefer not to use SDP if I can avoid it!—and they don't all fit together cleanly, so it's not just a simple matter of mixing and matching, but it's also not like we're starting from scratch either. Moreover, the pieces that are earliest in the timeline are also the ones that are the best understood.

My sense is that the best way to proceed is to have what might be called a hybrid approach: use standardized components where they exist and temporarily fill in the gaps with proprietary interfaces specified by the gatekeepers while working to develop standardized versions of those functions. Once those versions exist, then we can gradually replace the proprietary pieces. The highest priority here should be getting to common formats for the key establishment and everything inside the encryption envelope (messages, voice, video), because those are the pieces where incompatibility causes the biggest deployment problems, as discussed above; fortunately, these are also some of the most baked pieces and—at least in the case of voice and video—where I expect there is a lot of commonality just because there are only a few good codecs.

I do think it's true that it's probably easier to get to some level of interoperability, especially at the demo level, by just having gatekeepers publish interfaces, but it's a long way from something that works in a demo to real reliable interoperability (we learned this the hard way with WebRTC), and there's going to be a long period of refining those interfaces and the corresponding documentation. That's time that could be spent building out common protocols instead, with a much better final result.

Final Thoughts #

Having multiple non-interoperable siloes is clearly far from ideal and it's exciting to see efforts like the DMA to do something about that. We know it's possible to build interoperable messaging systems and we've got multiple worked examples going back as far as the public switched telephone network and e-mail. Even WebRTC is partially interoperable in the sense that multiple browsers can communicate on the same service but not on different services. To a great extent our current situation is due to a particular set of incentives for gatekeepers not to interoperate; the way to get out of that hole is to give them the incentives to build something truly interoperable.

Some current bridging systems actually rely on the user having a copy of the gatekeeper's app on the local system and remote control that app. This doesn't seem like a very good solution for reasons which should be obvious. ↩︎
For client/server protocols, people will often stand up a cloud server endpoint. For extra credit, you can have an endpoint which will publish connection logs so that the other side can see your internal view on what happened. ↩︎
I don't think that local gatewaying is a good technical design because it requires terminating the encryption from the gateway in the local server and then re-encrypting it to the user's client, which destroys a lot of information, such as end-to-end identity. This can be a useful prototyping technique, but I don't think it's a great way to build a production system. ↩︎
Though the IETF's practice of using 72-column monospaced ASCII does make things longer. ↩︎
In the "information hiding" sense of avoiding the consumer having to think about it, rather than keeping it secret, though of course they might also want to keep the details secret. ↩︎
Technically you can get away with less than a full mesh, by having some kind of tiebreaker for each pair, but it's going to be fairly close to a full mesh. ↩︎

Network-based Web blocking techniques (and evading them)

2023-02-09T00:00:00Z

Via Joseph Lorenzo Hall, Patrick Breyer, and EDRI, I see that the EU's Internet Filtering requirements (sometimes called "chat control") are continuing to move forward. The legal language is a bit hard to wade through, but it appears to require Internet Service Providers (ISPs) to block specific content on Web sites, identified by URL.

Article 16 lays out the scope of a blocking order:

The competent authority shall also have the power to issue a blocking order requiring a provider of internet access services under the jurisdiction of that Member State to take reasonable measures to prevent users from accessing known child sexual abuse material indicated by all uniform resource locators on the list of uniform resource locators included in the database of indicators, in accordance with Article 44(2), point (b) and provided by the EU Centre

And then Article 18 lays out requirements for user notification and redress:

Where a provider prevents users from accessing the uniform resource locators pursuant to a blocking order issued in accordance with Article 17, it shall take reasonable measures to inform the users of the following:

(a) the fact that it does so pursuant to a blocking order;

(b) the reasons for doing so, providing, upon request, a copy of the blocking order;

(c) the users’ right of judicial redress referred to in paragraph 1, their rights to submit complaints to the provider through the mechanism referred to in paragraph 3 and to the Coordinating Authority in accordance with Article 34, as well as their right to submit the requests referred to in paragraph 5

Unfortunately, as EDRI observes, this kind of filtering is not really technically practical in today's Web. In this post I talk about the technologies which are used for Web filtering, as well as some of the privacy and security technologies which make that sort of blocking harder. This post is intended to be self-contained, but you might find previous posts on tracking and browser privacy features (tracking and blocking are closely related) and IP concealment useful background.

Threat Model #

In many security situations there's pretty broad consensus on who is the attacker (e.g., the person trying to steal your credit card number), and who is the defender (the person who doesn't want their credit card stolen), and traditionally in the design of security protocols we think of the network as the attacker and the job of the protocol to be to defend you against the network. However, in this situation, the entities trying to block certain content usually think of themselves as the defenders, either because they are trying to block content which is illegal (such as Child Sexual Abuse Material (CSAM)) or because they want to control the use of their own network (e.g., to protect it against malware-infected machines or to stop their employees from exfiltrating company secrets in what's called Data Leak Prevention (DLP)), and the endpoint trying to evade filtering as the attacker.

Debates in this area tend to quickly devolve into questions about the legitimacy of various kinds of blocking and how sympathetic participants are to them. In my experience such debates don't usually get very far and I don't propose to engage with them here;^[1] the purpose of this post is just to lay out the technical situation of what is and is not possible given the current and anticipated future state of the Web.

Note that it's not always the case that the interests of the user and the interests of the blocker are opposed. For instance, consider the case where the network wants to block access to sites which host frauds or malware: the user presumably doesn't want to download malware, and so would potentially benefit from the network preventing access. However, these technologies are value neutral: the same mechanisms that might allow the network to block access to CSAM or malware also allow it to block access to Facebook or to Google search; the same goes for technologies for evading blocking.

Endpoint Status #

The most common and familiar situation is when the endpoint isn't really trying to evade blocking but also isn't actively cooperating with it, as is the case with most consumer devices. The software on the device usually implements some set of default protections (e.g., HTTPS), as discussed below, but they're ones that are suitable for full-time use, rather than fancy ones that would be expensive, inconvenient, or slow. They might also contain some filtering mechanisms, though usually ones that the vendor has judged users will want (e.g., Safe Browsing) and in many cases these can be disabled:

Another quite common case is one in which the device is managed, for instance, one used by employees of a company but which actually belongs to the company and where the company controls the software on the device (e.g., via Mobile Device Management (MDM)). For obvious reasons, it's much easier for the network to control the behavior of managed devices. Most consumer devices are of course unmanaged; this wasn't always true for mobile devices, where it was common for carriers to install various kinds of software before selling them, but Apple's direct sales, their insistence on a standard software load, and the subsequent changes in industry practice mean that in many if not most cases smartphones are not meaningfully under control of the carriers. Many work devices are managed, but not all; of particular concern to many enterprises is what's called bring your own device (BYOD), in which people use their own devices for work purposes; unsurprisingly, employees are often unwilling to allow their employers to control the software on these devices and so in many cases they will be unmanaged.

On the other side of the spectrum, we have endpoints which are deliberately trying to avoid monitoring. This could be something the user wants, for instance because they are in a jurisdiction that restricts Internet access and are using something like a VPN or Tor. It could also be because there is malware on their machines. In many cases, that malware will want to talk to its command-and-control (CNC) servers. However, this software only needs to be able to talk to some prearranged set of servers and thus doesn't need to speak standard protocols (though it might impersonate them!) and might share secret information with those servers. This makes evasion easier.

Blocking Techniques #

The difficult part of blocking traffic isn't really the blocking itself but rather knowing what traffic to block. It's fairly straightforward to just disconnect the Internet, but that makes the network useless. What you want is selective blocking in which you block only the traffic of interest and allow the rest of the traffic to pass through (conversely, many anti-blocking techniques are designed to degrade the visibility necessary for selective blocking, thus forcing the network into a position of blocking all traffic or none of it). There are a number of ways to get the information of what content the endpoint is trying to access.

DNS-Based Blocking #

One very common place to do blocking is at the DNS layer (see my series on DNS for background here). DNS-based blocking is very technically straightforward because the client directly asks the DNS server for the contact information (IP address) of the Web server it's trying to contact, so it's easy to add a filtering step. Moreover, there are a number of DNS providers (e.g., Umbrella/OpenDNS or Cloudflare) which offer filtered DNS servers. Umbrella will even let you configure which sites you want blocked. The DNS server has a number of options if a blocked domain is requested, including returning an error to the client or returning a bogus IP address which can then be blocked; in either case, the client will not be able to contact the ultimate server.
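The filtering step can be sketched in a few lines. This is a toy model, not a real DNS server: the blocklist, the sinkhole address, and the `upstream` callable (standing in for real resolution) are all assumptions made for illustration.

```python
from typing import Callable, Optional

BLOCKED_DOMAINS = {"malware.example"}  # hypothetical blocklist
SINKHOLE_IP = "192.0.2.1"              # documentation-range address used as the bogus answer


def is_blocked(qname: str, blocked: set[str] = BLOCKED_DOMAINS) -> bool:
    """True if qname or any parent domain is on the blocklist."""
    labels = qname.rstrip(".").lower().split(".")
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


def filtered_resolve(
    qname: str,
    upstream: Callable[[str], Optional[str]],
    mode: str = "error",
) -> Optional[str]:
    """Resolve qname, unless it is blocked.

    mode="error": behave as if the name does not exist (return None).
    mode="sinkhole": return a bogus address that the network can then block.
    """
    if is_blocked(qname):
        return None if mode == "error" else SINKHOLE_IP
    return upstream(qname)


# Fake zone data standing in for a real upstream resolver.
ZONE = {"example.com": "198.51.100.7", "cdn.malware.example": "198.51.100.66"}

print(filtered_resolve("example.com", ZONE.get))
print(filtered_resolve("cdn.malware.example", ZONE.get))
print(filtered_resolve("cdn.malware.example", ZONE.get, mode="sinkhole"))
```

Note that the filter only ever sees the queried name, which is why everything here keys on domains.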

Network-imposed DNS-based filtering works because the network typically provides the DNS server used by endpoints (notifying them about it via DHCP). However, it's also possible for users to configure their devices to use a different server or for endpoint software to do its own resolution via a non-network resolver. For instance, it's quite common for people to configure their devices to use Google Public DNS (8.8.8.8) or Cloudflare DNS (1.1.1.1),^[2] and Firefox is increasingly using DNS over HTTPS in a mode which bypasses the local resolver in favor of a "trusted recursive resolver" that has agreed to comply with Mozilla's policy requirements around user security and privacy. Obviously, malware can do the same.^[3]

Historically, if the user just pointed their device at a public resolver, the network could still do DNS filtering by intercepting the communication to the resolver. However, if DNS traffic is encrypted to the server, that prevents this kind of filtering. Ultimately, if networks want to enforce DNS-based filtering in these circumstances, they need to prevent connections to the public DNS resolvers, which, given that they run DNS over HTTPS, brings us back to the same problem of blocking Web traffic, at least for unmanaged endpoints; for managed endpoints, it's generally possible to just disable encrypted DNS; in fact Firefox does this automatically if it thinks the endpoint is managed.

Even where DNS-based blocking is effective, it's a fairly limited mechanism. Specifically:

It can only block on domain name and not URI.^[4] For instance, if you want to block https://example.com/contraband and not https://example.com/totally-cool, that's not possible because the browser just asks for the address of example.com.
It usually can't provide any notification to the user of what happened; the server can just make it look like the name doesn't exist or the server is offline. It's of course possible to provide the address of a server controlled by the network, but if the client is trying to connect via HTTPS, then this will result in a connection failure (more on this later), not a comprehensible message to the user.

On this second point, I've seen proposals for allowing the server to send back a more detailed error message telling the endpoint that a site was blocked, for instance:

  {
    "c": ["tel:+358-555-1234567", "sips:bob@bobphone.example.com",
          "https://ticket.example.com?d=example.org&t=1650560748"],
    "j": "malware present for 23 days",
    "s": 1,
    "o": "example.net Filtering Service"
  }

This is in theory possible, but there are several obstacles that prevent unilateral deployment by ISPs or enterprise networks. First, no existing Web client supports this new message, so at present they will just show a failure as described above. Second, if the browser uses the operating system resolver (as, for instance, Firefox does when it's not using DoH; Chromium uses its own resolver), then it will only be able to get this message once the operating system is updated to support it, which is likely to take a very long time. Finally, the browser would need to figure out some way to present the information so that it's clear what's happening and that it can't be used to fool the user into accepting the error message as coming from the valid site ("please enter your social security number here!"); this problem is presumably soluble if there is enough interest otherwise.
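Returning to the first limitation in the list above (domain-level granularity), here is a small sketch of what the resolver actually gets to see. The helper name is mine; it just extracts the hostname the way a browser would before issuing a DNS query.

```python
from urllib.parse import urlsplit


def dns_query_name(url: str) -> str:
    """Return the hostname a client would look up in DNS for this URL."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"no hostname in URL: {url}")
    return host


blocked_url = "https://example.com/contraband"
allowed_url = "https://example.com/totally-cool"

# Both URLs produce the same DNS question, so a DNS filter can't tell them apart.
print(dns_query_name(blocked_url))
print(dns_query_name(allowed_url))
print(dns_query_name(blocked_url) == dns_query_name(allowed_url))
```

The path never leaves the browser in the DNS query, so the choice is between blocking all of example.com and blocking none of it.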

IP Filtering #

If you're not doing DNS-based filtering, your next opportunity to filter is at the IP layer. IP-layer filtering is exactly what you think it is: the network blocks connections to certain IP addresses. There are a number of possible alternatives here (drop the packets, send a TCP RST, BGP poisoning) but they all amount to the same basic idea, which is to render certain IP addresses inaccessible. Unlike DNS-based filtering, it's not straightforward for clients to just opt out of IP-based filtering: the network has to be able to see the server's IP address to deliver the packets, so if you want to bypass it, you need to get a new network.

Ignoring the Great Firewall #

In general, once the network has identified your connection for blocking, that's it, but in at least one case this was easy to avoid. China famously uses a blocking system often called the Great Firewall, which operated in part by sending TCP RSTs when it detected things it didn't like. This is cheaper technically than blocking all the packets. Some time back, Clayton, Murdoch, and Watson discovered that clients could just ignore the TCP RSTs, in which case the traffic would continue to flow. I don't know if this is still true.

On the other hand, IP-based filtering is even less precise than DNS-based filtering. Obviously, it can't see the specific resource you are connecting to, but it can't even always tell which Website is being accessed: it's very common for multiple Web sites to share the same IP address (for instance, every Github Pages site seems to have the same IP, as does every Substack that has an address ending in substack.com), and so you can't IP block one site without blocking others. Even in situations where there isn't IP sharing, but where many sites share the same hosting provider, the hosting provider can readily change which IP addresses correspond to which sites, making it hard for the blocker to keep up. Thus, IP blocking is good for blocking access to big sites which don't share infrastructure, such as Google or Facebook, but not so good for smaller sites.
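The overblocking problem is easy to see with a toy hosting table. The site names and addresses below are made up (documentation-range IPs), and the function name is mine.

```python
import ipaddress

# Hypothetical hosting layout: several sites behind one shared address.
HOSTING = {
    "blog-a.example": "203.0.113.10",
    "blog-b.example": "203.0.113.10",
    "contraband.example": "203.0.113.10",
    "bigsite.example": "198.51.100.20",
}


def collateral_damage(target: str, hosting: dict[str, str] = HOSTING) -> list[str]:
    """Sites other than `target` that become unreachable if target's IP is blocked."""
    blocked_ip = ipaddress.ip_address(hosting[target])
    return sorted(
        site
        for site, ip in hosting.items()
        if site != target and ipaddress.ip_address(ip) == blocked_ip
    )


print(collateral_damage("contraband.example"))  # everything else on the shared IP
print(collateral_damage("bigsite.example"))     # dedicated IP, nothing else affected
```

And of course the hosting provider can change this table whenever it likes, which the blocker then has to chase.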

Like DNS-based filtering, IP-based filtering isn't able to provide any feedback to the user about what went wrong: it just looks like a network failure. Unlike DNS-based filtering, I haven't even seen credible proposals for how to add such a function and most of the obvious avenues seem fairly unattractive to browser makers.

Content Analysis #

The next major approach is to inspect the application layer traffic (e.g., HTTP or TLS), and filter based on that. This is a very powerful technique when applied to HTTP because it allows you to see all of the data being exchanged, including the URL being requested and all of the content being returned, so you can do some fairly fancy filtering. For instance, you could not only check the URI but scan the returned content for malware or CSAM.

However, this sort of filtering is increasingly impractical because the vast majority of Web traffic is now encrypted, as shown in the figure below:

[Source: Let's Encrypt]

When the traffic is encrypted, the network can't see the content of the HTTP connection, which means it can't see either the URL or the response—this is the point of encryption!—so the amount of filtering possible is quite limited.

The main piece of information that the network can see is the hostname of the Web server. This is carried in two places in the TLS handshake:

In the Server Name Indication (SNI) field of the client's first message (the ClientHello). (This is the field that allows you to have multiple servers on the same IP).
In the server's Certificate message, although this may not be unique, as a server may have a certificate that covers multiple sites. For instance, there is a single "wildcard" certificate for *.github.io that works for any site ending in .github.io.

In TLS 1.3, the server's Certificate message is encrypted, which means that the only information about the server's identity available to the network is in the SNI in the ClientHello. You shouldn't be surprised to hear that there is now work underway to encrypt the ClientHello message to conceal the SNI, using a technology called (surprise!) Encrypted Client Hello (ECH). ECH hasn't been widely deployed yet, but it's under active development by browser vendors and some server operators, such as Cloudflare. If ECH is in use, then the network will not be able to use TLS to distinguish between any of the servers on the same IP address, reducing the filtering granularity to that of IP blocking.

Browsers and MITM Proxies #

MITM proxies are a difficult problem for browsers: generally you want to allow users to add their own trust anchors to permit so-called "Enterprise CAs", in which the enterprise issues certificates for its own private names that it doesn't want to have publicly accessible. This is still somewhat common, though arguably less necessary in the era of free certificates. It would be possible for browsers to detect and prevent the use of these enterprise CAs for any site which also had a public certificate, thus more or less preventing MITM proxies from working. However, the consequence of this would be to break the browser on any network which had such a proxy, which is obviously not a desirable outcome. The result is that we're in a not-great equilibrium that is hard to get out of without causing a lot of breakage.

MITM/Intercepting Proxies #

Many enterprise networks use what's called a "man-in-the-middle" or "intercepting" proxy. This is a network device which sits in between the client and the server, impersonating the server to the client and the client to the server. It decrypts the traffic between client and server, inspects it, and then re-encrypts it. "But wait" I can hear you say. "Isn't the whole point of TLS to prevent this kind of attack?!" Ordinarily yes, but the organizations who deploy these proxies also install their own trust anchors on the client, which allow the proxy to issue certificates which are acceptable to the client.

Obviously, this doesn't work in consumer settings where the network doesn't control the client. Of course, one could imagine a nation requiring users to adopt a new trust anchor, enabling them to intercept any connection, but this obviously has extraordinary risks in terms of surveillance. In the one case where a country went as far as trying it (Kazakhstan), browsers responded by explicitly blocking the trust anchor, so you couldn't install it.

Even in an enterprise, MITM proxies aren't really a great system: they're expensive to operate and because they have access to the plaintext of the connection, present a security and privacy risk to users of the system. There is also evidence that the implementation quality of these proxies is less good than that of browsers, which creates additional risks.

In order to address some of these issues, enterprises will sometimes (often?) configure their proxy to selectively decrypt traffic. The idea here is that the proxy looks at the SNI field and only decrypts traffic to some destinations (e.g., Facebook but not your bank). These enterprises (and the vendors who sell these devices) are worried about ECH because it has the potential to make this sort of selective decryption impossible. I don't believe that this is likely to be a problem in practice, however: if you are able to install your own trust anchor, you should also be able to configure the browser to disable ECH. Moreover, ECH information is delivered over DNS, so as long as you can control DNS (or, in the case of Firefox, disable DoH, which happens automatically when it detects a new trust anchor) you can just suppress the use of ECH.

I did want to flag one point here about this kind of selective decryption, which is that it only works if you are dealing with an endpoint which is standards compliant and sends the correct SNI value. If you are dealing with malware, it can put whatever it wants (e.g., www.bankofamerica.com in the SNI) but then connect to its own CNC server. Selective decryption based on SNI only works with clients which aren't themselves malicious, like Web browsers. This is true whether or not the client is using ECH. Note that this form of evasion only works because of prearrangement between the malware and the CNC server, so it's not deployable as a general mechanism for Web browsers.
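
As a rough illustration of why this is fragile, here's a sketch of the kind of decision a selectively decrypting proxy makes. The function name, the bypass list, and the policy of decrypting when no SNI is visible are all my assumptions for the example; real products will differ.

```python
def proxy_action(sni, bypass_suffixes):
    """Decide what an intercepting proxy does with a connection, using the SNI alone.

    sni is the hostname claimed in the ClientHello (or None if absent);
    bypass_suffixes is a hypothetical list of domains the proxy leaves encrypted.
    """
    if sni is None:
        # Assumed policy for this sketch: no visible name means no exemption.
        return "decrypt"
    host = sni.lower().rstrip(".")
    for suffix in bypass_suffixes:
        if host == suffix or host.endswith("." + suffix):
            return "pass-through"
    return "decrypt"


if __name__ == "__main__":
    bypass = ["bankofamerica.com"]
    print(proxy_action("www.facebook.com", bypass))       # inspected
    print(proxy_action("www.bankofamerica.com", bypass))  # left alone
```

Note that the function only ever sees the name the client chose to send. Malware that writes `www.bankofamerica.com` into the SNI gets the pass-through treatment no matter where the packets are really going.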

Traffic Analysis #

In principle it's possible to learn about the content of encrypted traffic by looking at packet size, timing, etc. For instance, the traffic pattern associated with watching video (a lot of big packets sent continuously to the client) looks very different from that associated with using Webmail (small, relatively intermittent, chunks back and forth). This approach is often called "traffic analysis". Goldberg, Wang, and Wood provide a good overview of the situation for website identification, and Cisco actually sells technology for doing this (academic paper by Anderson and McGrew here) that tries to identify malware.

My understanding of the current state of traffic analysis is that it is somewhat powerful as an attack on privacy (you certainly can learn more about people's browsing behavior than people might want you to learn) and is useful as part of an enterprise threat response system that attempts to detect malicious behavior, but is less useful at distinguishing precise behavior (e.g., which exact images did someone view on a specific site). It's also comparatively expensive to operate technically, doesn't scale that well, and requires seeing more behavior over a longer period than other techniques (e.g., SNI), which can make a decision very early in the connection. Thus, my sense is that it's less useful for making large scale content-based decisions for things like CSAM detection or DLP.

Client-Side Agents #

It's also possible to install a piece of software (an "agent") on the endpoint itself that monitors the behavior of the device. These agents fit into a variety of different and somewhat overlapping categories (anti-virus, DLP, endpoint detection and response (EDR), etc.) but basically they all do the same kind of thing, which is to say spy on other programs and report back or otherwise act on behavior they think is suspicious. These agents typically have elevated privileges and so can in some cases observe the internal details of other programs, for instance, by actually injecting their own code (this is a persistent problem for browser vendors because it can negatively impact the stability of the product). For instance, this would allow the agent to see the plaintext associated with encrypted traffic, including the URI, the content, etc.

In some cases, this software is something that users install themselves (e.g., antivirus), but in others it's something that is required by their employers, schools, etc. In the latter case, it may be deployed in parallel with network monitoring techniques to provide multiple views of the same activity. For instance, you might have a client-side agent but also do MITM interception. This approach provides defense in depth: if you have such an agent on your work computer, the natural way to avoid monitoring is to use an unmonitored personal device. Network-level monitoring can help detect this, even if it can't see precisely what's happening, though it's obviously far less powerful in an age of ubiquitous fast mobile Internet: people can just turn off the WiFi and bypass your monitoring.

In general, if you have a third party monitoring agent installed on your computer, it's safest to assume it can do anything at all on that device (Microsoft's "Immutable Law of Security #1"). In particular, if you have an agent on your computer that is operated by someone else, the safest assumption is that they have complete control of your computer. In some cases, these organizations will have policies about (for instance), what data they look at, but that doesn't mean that there is any technical enforcement mechanism that prevents them from violating those policies. The few times I've actually looked at this I came to the conclusion that there weren't any meaningful technical controls; it's possible someone has built something safer in this space, but given the current state of computer security it's a very difficult problem.^[5]

Client-side agents are a popular technique in enterprise settings, but actually requiring their installation on everyone's non-work devices seems like it would be a major policy change, and I don't think it's likely that the EU would require it (at least I hope not!).

VPNs, Proxies, etc. #

As should be clear from the above, in the absence of cooperation from the endpoint, the network only has fairly limited abilities to selectively block traffic. More or less all it can do is to block specific sites, but not control what content people access on those sites. As technologies like encrypted DNS and ECH become more common, even that level of blocking will start to become more difficult. It will still be possible to block large sites which have their own IP space (e.g., Facebook or Google), but it will be harder to block just one site hosted by a given service, such as one Github pages account or a single site hosted by a CDN.

Encrypted DNS and ECH are designed to be "always on" technologies which people can just use for their regular browsing; this means that the protection they can offer is limited. However, it is also possible to provide a higher level of protection at greater cost by proxying traffic to another network which is not subject to blocking/filtering. This is what technologies like VPNs, Tor, and iCloud Private Relay do (see here for an overview of these techniques). The only really feasible way to prevent people from bypassing blocking using these mechanisms is to block access to the proxy/relay/VPN service entirely, which you would typically do by the same kind of mechanisms I've been discussing above. I've also seen some research designs for making that kind of blocking more difficult (e.g., Telex), but that's out of scope for this post.

What is technically feasible #

With this technical background, we can now look at the EU proposal. Assuming I am reading it correctly (and EDRI reads it the same way), it seems to have two requirements that are technically problematic.

First, as noted above, it's not really possible to block based on a list of specific Uniform Resource Locators, but only on sites. It's not clear to me how useful this really is: if there are specific sites which are just acting as hosts for CSAM, then there are a number of potential avenues for having them shut down directly, rather than filtering at the customer level (this happens fairly often with sites which engage in various kinds of copyright and trademark abuse). The primary reason why URL blocking is useful is that it allows you to selectively block part of a site—though here too it's not quite clear to me why the authorities can't have that content taken down once they are aware of it—but as noted above, that kind of selective blocking is simply not practical to do at the network level once traffic is encrypted.

For similar reasons, it's also not really possible to provide notice to users as required in Article 18 because there's no channel for the provider to do so. In most cases the client will be trying to establish an encrypted channel to the server. The network can instead reroute that connection to its own servers, but those servers cannot properly authenticate as the server, so all they can manage to do is cause the browser to show the user an error, but can't control the error. Depending on exactly what the provider does, it might look like this:

or like this:

But what it definitely will not have is some message from the provider about why the site is being blocked; there's simply no mechanism to communicate that. It's presumably possible to invent something here, but it's not something that the providers can do unilaterally.

Final Thoughts #

This brings us to the broader point, which is that the network providers are simply the wrong place to situate this kind of blocking. A basic assumption of communications security is that the network is under control of the attacker and 30+ years of work has gone into protecting Internet traffic from potentially hostile networks. This work isn't done, but there's been a huge amount of progress and at this point it's really not practical to do effective fine-grained blocking of traffic without the cooperation or coercion of one of the endpoints.

This is also why I am using the term "blocking" instead of the common term "censorship", which while technically accurate in my opinion, tends to just get us into debates about the definition of "censorship" from those who think that certain forms of blocking are good and that the term "censorship" has negative connotations. ↩︎
However, data from Huston and Damas indicates that most of the use of the big public resolvers is due to ISPs pointing their users to them, rather than users configuring it themselves. ↩︎
The public debate about the use of DoH and DoT has sort of conflated use by browsers with use by malware. The problem with malware use of encrypted DNS exists because there are public DNS servers which offer encrypted service, independently of whether browsers use it. To the extent to which browsers make the problem worse it's because their use of those servers makes it less attractive to just block them entirely. ↩︎
I once explored a design where the DNS server would send the client a list of blocklisted URIs on the requested domain, but this of course requires the client to cooperate, so it's more like Safe Browsing than like a unilateral blocking mechanism. ↩︎
The basic problem here is that if you don't trust the system you are monitoring to behave correctly, then you need access to its internals to be sure that it's not lying to you about its behavior. But that access is inherently abusable. ↩︎

Internet Transport Protocols, Part I: Reliable Transports

2023-01-18T00:00:00Z

Most people who use the Internet just have some vague idea that it carries data from point A to point B (famously, through a series of tubes). Even people who regularly work on Internet systems tend to work with it through many layers of abstraction, without a clear understanding of the infrastructure components that make it work. This post is the first of a series about one such piece of infrastructure: the transport protocols such as TCP that are used to transmit between nodes on the Internet.

Background: Network Programming #

If you've done any programming of networked systems, you've probably written code that looks something like this:

socket = connect("example.com", 8080);
write(socket, "Hello");       
response = read(socket);
print(response);

Even if you don't have much experience with networking, this code should be fairly self explanatory:

The first line forms a "connection" to the server named "example.com" (see my series on DNS for how these names work). 8080 is what's called a "port number" and we can ignore it for now. This function returns an object called a "socket" which represents that connection.^[1] Conceptually, this is like dialing the phone and calling "example.com".
The next line writes the string "Hello" to the server. Note that because we already are connected to the server, we can just pass in the socket, rather than the address of the server.
The next two lines read the response from the socket and then print it out. As before, we don't need to specify the server's address because that's encapsulated in the socket.

As another example, here's a simple server that works with this client. This server just takes whatever the client writes to it and sends it back in upper case. The main difference here is that instead of using connect(), the server uses accept() which tells the computer to wait for a client to connect to it on port 8080.

socket = accept(8080);
loop {
  result = read(socket);
  print(result);
  write(socket, result.toUpperCase());
}

If we run this client/server pair, we would expect the server to print:

Hello

And the client to print:

HELLO
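
If you want to actually run something, here's a rough Python version of the same client/server pair. It's a sketch, not the code above translated literally: it runs both ends in one process over the loopback interface, lets the operating system pick a free port rather than using 8080, and the names `serve_once` and `run_demo` are mine.

```python
import socket
import threading


def serve_once(listener):
    """Accept one client and send back whatever it writes, in upper case."""
    conn, _addr = listener.accept()
    with conn:
        while True:
            data = conn.recv(1024)
            if not data:  # an empty read means the client closed the connection
                break
            print(data.decode())
            conn.sendall(data.upper())


def run_demo():
    # Port 0 asks the OS for any free port (an assumption made so the demo
    # doesn't collide with something already listening on 8080).
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()

    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(b"Hello")
        # A read may return fewer bytes than were sent, so keep reading
        # until the whole five-byte reply has arrived.
        response = b""
        while len(response) < len(b"Hello"):
            chunk = sock.recv(1024)
            if not chunk:
                break
            response += chunk

    server.join()
    listener.close()
    return response.decode()


if __name__ == "__main__":
    print(run_demo())
```

The one wrinkle compared to the pseudocode is the read loop: a real socket read hands you whatever bytes have arrived so far, which isn't necessarily a whole message.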

Of course, the client could write multiple messages, like so:

socket = connect("10.0.0.1", 8080);
write(socket, "At midnight all the agents"); 
write(socket, "And the superhuman crew");
write(socket, "Come out and round up everyone");
write(socket, "That knows more than they do");
...

In this case, we would expect all of the messages to be delivered and that they will be delivered in order, so that the server prints:

At midnight all the agents
And the superhuman crew
Come out and round up everyone
That knows more than they do

Rather than something with missing or reordered messages, like either of the following:

And the superhuman crew
That knows more than they do

And the superhuman crew
Come out and round up everyone
At midnight all the agents
That knows more than they do

Again, just like a phone call.

What we're seeing here is just the programming interface, though, which is to say it's a set of abstractions that the operating system and the programming language provide to you to write your programs. They don't tell us anything about what's actually happening on the network. That's the subject of this post.

Background: A Packet Switching Network #

The Internet is what is known as a packet switching network. What this means is that the basic unit of the Internet is a self-contained object called an Internet Protocol (IP) packet or datagram. An IP packet is like a letter in that it has a source address and a destination address. This means that when you send an IP packet on the network, the Internet can automatically route the packet to the destination address by looking at the packet with no other state about either computer. A simplified IP packet looks like this:

The main thing in the packet is the actual data to be delivered from the source to the destination, also called the payload. The payload is variable length with a maximum typically around 1500 bytes. Using IP is very simple: your computer transmits an IP packet and the Internet uses the destination address to figure out where to route it. When someone wants to transmit to you, they do the same thing. Importantly, for reasons we'll see shortly, packet switching is unreliable: when you send a packet to the other end it might or might not get there ("packet loss"). Moreover, packets don't always arrive in the order they were sent ("reordering").

Circuit Switching #

The alternative to packet switching is what's called "circuit switching". In a circuit switched network, the basic unit of operation is a connection between two endpoints called a "circuit". In a circuit switched system, you set up the circuit and then just start sending and everything goes to the entity on the other end of the circuit, like in a telephone call (more on phones later).

In the original telephone network, this was actually a literal electrical circuit:^[2] phone service came into your house on a pair of copper wires and when Alice wanted to call Bob, the central office would connect Alice's wires to Bob's wires (there's some more electronics here, but you can ignore this). Originally this was done by having an actual person at a switchboard, which is just a board with a bunch of jacks corresponding to each outgoing circuit from the central office. When Alice wanted to call Bob, she would ring up the operator and tell them who she wanted to call. The operator would plug a patch cable from Alice's jack into Bob's jack, like so:

When the first automatic switches were invented (an amazing story), they worked much the same way: you'd pick up your phone and dial and the equipment at the exchange would connect your wires to the wires of the person you were trying to call. From then on, signals just went from your microphone to their speaker and then their ears and vice versa. Circuit switching is conceptually convenient but has a number of inconvenient properties. In the simplest version, it doesn't allow you to talk to more than one person at once (the second caller gets a busy signal!) and even if you arrange to connect more than one person as in a conference call, you have no way of distinguishing who is who (ever had to ask who was talking?). But of course in a modern computer network your computer is constantly talking to multiple computers at once ("multiplexing"). This works badly with circuit switching but just fine with packet switching because each packet is self-identifying.

The problem with packets #

Packet switching has a number of nice properties, but small self-contained packets are very limiting for the obvious reason that most things that people want to send are more than 1500 bytes, whether they be videos, phone calls, or large files; even Web pages are almost always more than 1500 bytes. Moreover, because packet switching is unreliable and packets might get lost or delivered in a different order from the order they were transmitted in, if you just break up your file into a set of packets and send them over the network, the other side may not receive exactly what you sent.
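
Here's a small sketch of what goes wrong. The helper names are hypothetical and the 1500-byte limit is just the typical figure from above. The data is chopped into numbered packets, and then a naive receiver glues payloads together in whatever order they show up.

```python
def packetize(data, max_payload=1500):
    """Split data into (packet_number, payload) pairs of at most max_payload bytes."""
    return [
        (index + 1, data[offset:offset + max_payload])
        for index, offset in enumerate(range(0, len(data), max_payload))
    ]


def naive_reassemble(packets):
    """Concatenate payloads in arrival order, ignoring the packet numbers."""
    return b"".join(payload for _number, payload in packets)


page = b"A" * 1500 + b"B" * 1500 + b"C" * 1000  # a 4000-byte stand-in for a Web page
packets = packetize(page)

swapped = [packets[1], packets[0], packets[2]]  # the network reorders two packets
dropped = [packets[0], packets[2]]              # the network loses the middle packet

if __name__ == "__main__":
    print(len(packets))                       # number of packets needed
    print(naive_reassemble(packets) == page)  # in-order, lossless delivery
    print(naive_reassemble(swapped) == page)  # reordering
    print(naive_reassemble(dropped) == page)  # loss
```

Only the first case reproduces the original data. With reordering or loss the naive receiver silently hands the application the wrong bytes.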

In other words, what we really want is circuit switching, but what we have is packet switching. If you know any computer people, you've probably guessed what I'm going to say next because it's the standard thing to do: we're going to emulate circuits on top of packet switching to build what's often called a reliable transport protocol. What a reliable transport protocol does is provide a service that looks like a circuit (usually called a "connection") but built on top of the unreliable substrate of packet switching. Designing these protocols so they work well turns out to be a very substantial undertaking and we've been basically evolving them for the past 40+ years, starting with TCP, which has been used from the early days of the Internet, and more recently with a newer protocol called QUIC, which is built on similar but more modern lines.

The world's simplest reliable transport #

The obvious thing to do here would just be to break up whatever data you want to send into a series of packets and send them to the other side. However, as should be clear from the above, this won't work reliably, because the packets might be lost or reordered, preventing the receiver from reconstructing the data. Thus the minimal set of problems we need to solve is:

Allowing the receiver to reconstruct the order that the data was sent in, even if the network reorders it.
Ensuring that data is eventually delivered from the sender to the receiver.

We'll take these one at a time, reordering first.

Reordering #

The reordering problem is fairly easy to solve: we just add a field to each packet which contains its number. The receiver just sorts the packets as they are received, and delivers them to the application once it has all previous packets. So, for instance, if the receiver receives packets in order 1 3 2 4 then it will behave like so:

Packet	Action
1	Deliver 1
3	Store 3
2	Deliver 2, Deliver 3
4	Deliver 4
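
The receiver logic in the table can be sketched in a few lines. This is a minimal illustration with a class name of my own choosing, assuming packets are numbered from 1 and that duplicates are simply ignored.

```python
class ReorderBuffer:
    """Hold out-of-order packets and release them in sequence."""

    def __init__(self, first=1):
        self.next_expected = first  # lowest packet number not yet delivered
        self.pending = {}           # packet number -> payload, waiting for a gap to fill

    def receive(self, number, payload):
        """Accept one packet; return the payloads that can now be delivered, in order."""
        if number < self.next_expected or number in self.pending:
            return []  # duplicate of something already delivered or already stored
        self.pending[number] = payload
        delivered = []
        while self.next_expected in self.pending:
            delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        return delivered


if __name__ == "__main__":
    buf = ReorderBuffer()
    for number in (1, 3, 2, 4):
        print(number, buf.receive(number, f"packet {number}"))
```

Feeding it packets in the order 1 3 2 4 reproduces the table: packet 3 is stored, and the arrival of packet 2 releases both 2 and 3.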

This is actually not how TCP works, however. Instead, it numbers not packets but bytes. Specifically, each TCP segment (i.e., packet) comes with a sequence number which indicates the first byte in the packet, and a length, which together with the sequence number indicates the last byte of the packet. This allows the sender to re-frame data when it retransmits it. For instance, suppose that a TCP connection is carrying typed characters: you want it to send each character as soon as it is typed, so each will be in its own packet, but if you need to retransmit a set of consecutive characters, it's more efficient to put them in a single packet. This only works if the packets contain an indication of which byte is which. For the purposes of this post, however, we'll think of packets as being fixed size and assume there's no reframing.
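
To illustrate the reframing idea (and only the idea; this is not TCP's actual retransmission logic), here's a sketch in which each segment is a `(sequence_number, data)` pair and a hypothetical `coalesce` helper merges segments that cover adjacent bytes.

```python
def coalesce(segments):
    """Merge (seq, data) segments whose byte ranges are contiguous.

    seq is the number of the first byte in the segment; the segment
    covers bytes seq through seq + len(data) - 1.
    """
    merged = []
    for seq, data in sorted(segments):
        if merged and merged[-1][0] + len(merged[-1][1]) == seq:
            prev_seq, prev_data = merged[-1]
            merged[-1] = (prev_seq, prev_data + data)  # extend the previous segment
        else:
            merged.append((seq, data))
    return merged


if __name__ == "__main__":
    # Three typed characters originally sent one per packet, plus an unrelated byte.
    unacked = [(100, b"h"), (101, b"i"), (102, b"!"), (200, b"x")]
    print(coalesce(unacked))
```

The three one-byte segments come back as a single three-byte segment starting at byte 100, which is only possible because the numbering is per byte rather than per packet.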

Packet Loss #

There are a number of reasons that packets can be lost in transmission. For instance, some network element could malfunction and drop them or damage them, or, as we'll see later, an element could explicitly drop them because they exceed available capacity. In either case, if a packet is dropped, the only thing for the sender to do is retransmit it, but how does it know whether to do so? In other words, how do we detect packet loss? One potential approach would be for the malfunctioning element to send some kind of signal indicating that it dropped or damaged the packet. But of course that signal itself might be dropped or damaged by the network. Additionally, the problem might be in a passive element such as a piece of wire which isn't able to send its own messages. Finally, if the problem is a malfunctioning element, then it might malfunction in such a way that it doesn't correctly send a message. In any case, no mechanism where the sender receives a message informing it of a lost packet will work reliably.

The end-to-end principle #

This is a case of what's called the end-to-end principle. The basic observation is that there are a lot of places for things to go wrong between point A and point B, and so if you want to ensure that a piece of data arrives at point B, then trusting intermediate elements isn't enough; you need A and B to work together. This doesn't mean that you can't have reliability mechanisms between intermediate elements, but merely that they're not sufficient to guarantee delivery all the way to the other end. Rather, they act as an optimization that allows you to detect failures more quickly than they would have been detected by an end-to-end mechanism.

Instead of having a signal that a packet was dropped, we're going to instead have a signal that the packet was received, called an acknowledgment (often abbreviated ACK). When the receiver receives a packet, it sends an acknowledgment of receipt. This tells the sender that the packet got all the way to the receiver. Of course, acknowledgments have a number of obvious drawbacks:

They only tell you when a packet was received, not that it was lost, so the only way you know a packet was lost is by waiting until you expected to see an acknowledgment and then not getting one.
The acknowledgment can get lost in transit, so the packet might have been delivered, but this still looks like packet loss.

The reason to use acknowledgments is that they are robust: no matter what is going on in the middle of the network, if the acknowledgment is received, you know the packet got through. If you just keep sending until you get an acknowledgment, eventually the packet should get through (unless of course, the network is totally broken).

The way this works is that after the sender sends a packet it waits for a period of time (see below for how long) for the corresponding acknowledgment. If the timer expires before the sender receives the acknowledgment, then it retransmits the packet, like so:

In this diagram, the sender sends the first packet, which arrives successfully, and is acknowledged. However, the second packet gets lost in transmission. Eventually, the sender's timer expires and so it retransmits packet 2. This time it gets through and so does the acknowledgment, so everything is good.
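
The same exchange can be written as a toy simulation. Everything here is an assumption made for illustration: there's no real network or clock, and "loss" is just a set of `(packet, attempt)` pairs that the pretend network drops, whether that means the packet or its acknowledgment.

```python
def send_reliably(packets, lost_attempts, max_tries=5):
    """Simulate sending packets one at a time, retransmitting on timeout.

    lost_attempts is a set of (packet_number, attempt) pairs for which no
    acknowledgment ever arrives. Returns a log of what the sender did.
    """
    log = []
    for number in packets:
        for attempt in range(1, max_tries + 1):
            log.append(f"send {number} (try {attempt})")
            if (number, attempt) in lost_attempts:
                # The sender can't tell what was lost; it only sees the timer expire.
                log.append(f"timeout waiting for ACK {number}")
                continue
            log.append(f"ACK {number}")
            break
        else:
            raise RuntimeError(f"gave up on packet {number}")
    return log


if __name__ == "__main__":
    # Packet 2 is lost the first time it is sent, as in the diagram.
    for event in send_reliably([1, 2], {(2, 1)}):
        print(event)
```

From the sender's point of view a lost packet and a lost acknowledgment look identical, which is why the simulation doesn't bother to distinguish them.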

In the simplest version of this protocol, the sender sends one packet at a time. Once that packet is acknowledged (potentially after one or more retransmissions), then the sender sends the next packet. This is what is called a stop-and-wait protocol, because the sender doesn't do anything until it hears from the receiver. The basic problem with this design is that it's slow. The reason for this is round-trip latency: the diagram above shows packets as being sent and received at the same time, but in practice they take some time to get from point A to point B: even on a very fast Internet connection, it can take a few milliseconds for a packet to get delivered, and if the server is around the world, latency can be on the order of 100 milliseconds. If the sender is waiting for the receiver's acknowledgment, then it's just idle during this period, as you can see in the diagram below, where the sender has to wait for a full round trip before it can send the next packet.
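
To get a feel for how slow, here's a back-of-the-envelope calculation. It's a simplification that assumes one 1500-byte packet per round trip and ignores both the time to put the packet on the wire and any retransmissions; the function name is mine.

```python
def stop_and_wait_bytes_per_second(payload_bytes, rtt_ms):
    """Best-case throughput when exactly one packet is sent per round trip."""
    return payload_bytes * 1000 / rtt_ms


if __name__ == "__main__":
    for rtt_ms in (5, 100):
        rate = stop_and_wait_bytes_per_second(1500, rtt_ms)
        print(f"RTT {rtt_ms} ms: {rate:.0f} bytes/s ({rate * 8 / 1000:.0f} kbit/s)")
```

With a 100 millisecond round trip this works out to 15,000 bytes per second, or about 120 kbit/s, no matter how fast the underlying link is.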

The obvious thing for the sender to do is just to send data as soon as it's available, without waiting for acknowledgments, but this has two big problems:

The sender may be able to transmit the data faster than the receiver wants to consume it. Think about the case of streaming video: the sender could send the whole video to the receiver but this would be really inefficient because the receiver would have to store it all until it was ready to play, and the viewer might decide to only watch part of it.^[3] Even in cases where the user wants to receive the whole file, their device might not be able to process the incoming data as fast as the sender can transmit it.
The network might not be able to handle the data at the rate the sender can send it. This happens frequently in cases where the sending device is attached to a very fast local network but the end-to-end connection to the receiver is slower. As we saw before, this eventually will overwhelm the slowest network link in between the two endpoints.

We'll deal with the first problem in this post and the second problem in the next post.

Flow Control #

In order to prevent the sender from over-running the receiver, we need a flow control mechanism. The standard approach is for the receiver to advertise the total amount of data it is willing to receive at once (see buffering). The technical term here is the "receive window". The sender can send as many packets as it wants as long as they fit within the window, as shown in the diagram below.

In this diagram, the sender starts out by assuming the receiver's window is 1, so it sends a single packet. The receiver acknowledges this packet with the message ACK (1, window=4), which means "I have received all packets up to 1 and you can send up to packet 4" (this is called a "cumulative ACK"). The sender responds by sending packets 2 through 4, and then waits for the receiver's ACK. However, in the time that packet 3 is in flight, the receiver has received packet 2 and so it sends an ACK acknowledging it and advancing the window to packet 5. This isn't received until after the sender has sent packet 4, but it is received shortly thereafter, allowing the sender to send packet 5.^[4]

This mechanism is usually called "sliding windows", with the idea being that the window of data the sender can send is continuously sliding forwards as ACKs are received. In this example, the sender still has to wait briefly before it can send packet 5, but if the window had been slightly larger, then it might have been able to send continuously, with the ACK advancing the window being received before the sender was ready to send its next packet. This is especially true if the sender isn't sending as fast as its network will support, for instance if it's sending data that depends on user input.
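To make the sliding window concrete, here is a minimal sketch of the sender's bookkeeping in Python. It's an illustration rather than how any real TCP stack is written: the class and field names are made up, and it assumes the advertised window is the highest packet number the sender may transmit, as in the walkthrough above.

```python
from dataclasses import dataclass


@dataclass
class WindowedSender:
    """Hypothetical sender that only transmits packets inside the window."""
    next_to_send: int = 1    # lowest packet number not yet transmitted
    window_limit: int = 1    # highest packet number the receiver allows
    acked_through: int = 0   # cumulative ACK: everything up to here arrived

    def sendable(self) -> list[int]:
        """Return (and mark as sent) every packet that fits in the window."""
        packets = list(range(self.next_to_send, self.window_limit + 1))
        self.next_to_send = max(self.next_to_send, self.window_limit + 1)
        return packets

    def on_ack(self, acked_through: int, window: int) -> None:
        """Process ACK(acked_through, window=window); the window only slides forward."""
        self.acked_through = max(self.acked_through, acked_through)
        self.window_limit = max(self.window_limit, window)


sender = WindowedSender()
first = sender.sendable()       # initial window of 1
sender.on_ack(1, window=4)
second = sender.sendable()      # window now covers packets 2 through 4
stalled = sender.sendable()     # nothing fits until another ACK arrives
sender.on_ack(2, window=5)
third = sender.sendable()       # the window slid forward by one packet
```

Here `first` is `[1]`, `second` is `[2, 3, 4]`, `stalled` is empty, and `third` is `[5]`, which matches the exchange described above.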

Buffering #

At this point, you may have noticed that there's a lot of waiting here. For instance:

The sender can't transmit until it has room in the window.
Once the sender transmits a packet, it has to wait until it receives an acknowledgment, because the packet might have been lost or damaged.
If the receiver receives packets out of order it has to wait to deliver the packets to the application until it has received the ones before them.
If the sender transmits packets faster than the receiving application, the receiving operating system needs to store the packets until the application is ready.

During these waiting periods, it's necessary to store (technical term: buffer) a copy of the packet. For instance, when the program asks the operating system to write something, but there's no available window, the operating system just buffers the packet until window is available. Moreover, during this period the system may be trying to send more packets, which it may or may not be able to send immediately. For instance, if an application tries to upload a file, it may send 10 or more packets at once, which the sending system needs to slowly meter out as window becomes available. The sender needs a significantly-sized buffer to store these packets. Similarly, when a packet has been received out of order, it needs to be buffered until the earlier packets are available.
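The out-of-order case is the easiest one to show in code. The following is a rough sketch of a receive-side reorder buffer, with hypothetical names and packet-level (rather than byte-level) numbering; a real implementation would also cap how much it is willing to hold.

```python
class ReorderBuffer:
    """Holds out-of-order packets until the gap in front of them is filled."""

    def __init__(self, first_expected: int = 1) -> None:
        self.next_expected = first_expected  # next packet the application needs
        self.pending = {}                    # packet number -> payload

    def receive(self, number: int, payload: bytes) -> list[bytes]:
        """Store one packet and return whatever can now be delivered in order."""
        if number < self.next_expected or number in self.pending:
            return []  # duplicate of something already delivered or stored
        self.pending[number] = payload
        deliverable = []
        # Drain the buffer for as long as the next expected packet is present.
        while self.next_expected in self.pending:
            deliverable.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        return deliverable


buf = ReorderBuffer()
out_first = buf.receive(2, b"world")    # packet 1 is missing, so nothing is delivered
out_second = buf.receive(1, b"hello ")  # gap filled: both packets come out in order
```

The first call returns an empty list because packet 2 has to wait; the second returns both payloads, in order.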

This isn't the only place that buffering happens: not all links on the Internet are the same speed, so it's common to have a situation in which network A wants to send faster than network B can send. In this case, the computer connecting those networks (a router) has to buffer the packets until space becomes available (often this is called a queue). In addition, the user's devices need input buffers where they store packets that have come in but the operating system or application has not yet had time to handle.

In general, all devices on the Internet have some level of buffering to deal with mismatches between the rate at which they receive packets and the rate at which they can handle them, whether that means processing them locally or forwarding them to some other device. Buffering allows these devices to deal with situations where the incoming rate temporarily exceeds the processing rate (which happens all the time), but the longer it goes on, the more packets have to be stored. Most devices maintain a maximum buffer size (if nothing else, limited by the total amount of memory on the device, but typically far less than that), and when that size is reached, they have to drop packets, either by discarding some of the packets already buffered or by discarding the new packets (or both).
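Here is a toy version of that last point: a fixed-size queue that discards new arrivals once it is full. The names, the capacity of 3, and the "two packets in, one packet out per tick" rates are all invented for illustration; real routers use a variety of queueing and drop policies.

```python
from collections import deque


class RouterQueue:
    """Bounded FIFO queue that drops new packets when it is full."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.packets = deque()
        self.dropped = 0

    def enqueue(self, packet: int) -> bool:
        """Try to buffer a packet; return False if it had to be dropped."""
        if len(self.packets) >= self.capacity:
            self.dropped += 1
            return False
        self.packets.append(packet)
        return True

    def dequeue(self):
        """Forward the oldest buffered packet, or None if the queue is empty."""
        return self.packets.popleft() if self.packets else None


queue = RouterQueue(capacity=3)
packet_id = 0
for tick in range(5):
    # Two packets arrive per tick, but the outgoing link only carries one.
    for _ in range(2):
        queue.enqueue(packet_id)
        packet_id += 1
    queue.dequeue()
```

For the first couple of ticks the queue just absorbs the mismatch, but once it fills up, one packet per tick gets dropped for as long as the overload continues.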

Retransmit Timers #

I've been handwaving a bunch about how the sender sets a timer and waits for the acknowledgment, but that doesn't tell us how long the timer should be. In general, we want the timer to be based on the round-trip time (RTT) between the sender and receiver, which is to say the time it takes a packet to go from sender to receiver, the receiver to respond, and the response to make it back. If we set the timer shorter than the RTT, then the ACK won't make it in time and the sender will retransmit even if packets aren't lost; if we set it much longer, then we're waiting too long to declare packets lost, which slows down the connection. In practice, you want the retransmit timer to be somewhat longer than the RTT because there's some variation in network speeds, etc., but not too much longer. There's a long literature on how to set the retransmit timer, which I won't go into here.

There's just one problem: we don't know the round trip time, because it's not a property that the sender can see directly. Instead it's a function of the speed of all the network links in between the sender and receiver. Even if I have a fast network, I might be connecting to someone with a slow network. It also depends on how heavily loaded they are at any given moment, because I'm competing for network capacity with other users, which means that it can change over time. Worse yet, RTTs can vary dramatically: the RTT from my house in Palo Alto to the nearest Cloudflare server is about 10ms. The best-case RTT from Australia to the US is around 150ms (300,000 km/s is not just a good idea, it's the law). If you pick a single value for your retransmit timer, you're going to have seriously suboptimal performance on many networks.

The way that transport protocols handle this is to measure the round trip time during the connection by looking at how long it takes the other side to send an ACK. For instance, if you sent packet 10 at T=2000ms and you get an ACK for it at T=2050ms, then the estimated RTT is 50ms. Each time you get an ACK, you update the RTT estimate. The typical approach is to maintain a smoothed estimate (effectively a weighted moving average) of the recent measurements to average out the noise in each individual measurement while also favoring more recent measurements. Of course, you don't have any measurements at the time you start transmitting, so the typical approach is to use a somewhat conservative starting point (QUIC uses 333 ms), but obviously if the path between you and the other side has a low RTT, you want to update that as soon as possible.
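As a rough sketch of what a smoothed estimate looks like, the code below keeps an exponentially weighted moving average of RTT samples. The class name is made up, and the weight of 1/8 for new samples is borrowed from the standard TCP retransmission timer calculation; treat both as assumptions for illustration, and note that real stacks also track the variance in order to pick the actual timer value.

```python
class RttEstimator:
    """Smoothed RTT: a weighted moving average that favors recent samples."""

    def __init__(self, initial_rtt_ms: float = 333.0, alpha: float = 0.125) -> None:
        self.smoothed_ms = initial_rtt_ms  # conservative guess before any ACKs
        self.alpha = alpha                 # weight given to each new sample
        self.has_sample = False

    def on_ack(self, sent_at_ms: float, acked_at_ms: float) -> float:
        """Fold one measurement into the estimate and return the new value."""
        sample = acked_at_ms - sent_at_ms
        if not self.has_sample:
            # The first real measurement replaces the initial guess outright.
            self.smoothed_ms = sample
            self.has_sample = True
        else:
            self.smoothed_ms = (1 - self.alpha) * self.smoothed_ms + self.alpha * sample
        return self.smoothed_ms


estimator = RttEstimator()
after_first = estimator.on_ack(sent_at_ms=2000, acked_at_ms=2050)   # sample of 50ms
after_second = estimator.on_ack(sent_at_ms=2100, acked_at_ms=2190)  # sample of 90ms
```

The first ACK moves the estimate from the conservative default straight to the measured 50ms; the noisier 90ms sample then only nudges it to 55ms, because it gets one eighth of the weight.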

Set-Up And Tear-down #

So far I've just covered the steady-state case where the sender is already magically communicating with the receiver, but in practice, how do we get into this state, and how do we stop?

Set-up #

In most transport protocols, there's some kind of initial setup handshake before data is transmitted. For instance, here's what TCP's handshake looks like:

As I said above, TCP doesn't number packets, but instead labels each byte with a sequence number. So, what's going on here is that the client sends an empty SYN (for synchronize) packet with sequence number 1234. The server acknowledges it with its own SYN packet (with sequence number 8765), and if it wants can send data to the client at this point (though this isn't the usual thing). Upon receiving the server's SYN, the client can also send traffic, starting with sequence number 1235. In the same packet, it acknowledges the server's SYN. Why is this necessary, though? Why not just start sending? And why not just start the sequence number at 1 (or, as C programmers would expect, at 0)?

The problem here is that it's possible for there to be two separate connections between client and server. Suppose, for instance, that a client initiates a connection and sends some data over it, and then ends it and starts a new connection, as in the diagram below.

If a packet from connection 1 is delayed on the network for a long period of time, it may be received by the server after connection 2 starts and accepted as part of that connection. Disaster! The way TCP handles this is by having the new connection start with a sequence number which is intended not to overlap with valid sequence numbers from the previous connection. Sequence number selection is actually a somewhat complicated topic that I won't get into here. The 3-way handshake is needed to ensure that the client and the server agree on the initial sequence numbers for the connection. Otherwise, you could have a situation where the server was acting based on a delayed SYN from a previous connection, leading to problems (I'll spare you the details of the pathological cases).
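For readers who like to see the numbers line up, here is a small sketch of the sequence and acknowledgment numbers in the handshake, using the example values from above. The `Segment` type and function name are hypothetical and this models only the numbering, not a real TCP implementation; the one TCP fact it relies on is that a SYN consumes one sequence number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Just the header fields needed to follow the handshake."""
    syn: bool
    seq: int
    ack: Optional[int] = None  # next sequence number expected from the peer


def three_way_handshake(client_isn: int, server_isn: int) -> list[Segment]:
    """Return the three segments exchanged to agree on initial sequence numbers."""
    syn = Segment(syn=True, seq=client_isn)
    # The SYN consumes one sequence number, so the server expects client_isn + 1 next.
    syn_ack = Segment(syn=True, seq=server_isn, ack=syn.seq + 1)
    # The client's next segment starts at client_isn + 1 and acknowledges the server's SYN.
    ack = Segment(syn=False, seq=syn.seq + 1, ack=syn_ack.seq + 1)
    return [syn, syn_ack, ack]


segments = three_way_handshake(client_isn=1234, server_isn=8765)
```

After the exchange, both sides have echoed back the other's initial sequence number plus one, which is how each confirms that the other is talking about this connection and not a stale one.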

The problem with the 3-way handshake in TCP^[5] is that the client has to absorb a full round trip before it can send anything, which is a real performance cost. This wasn't really seen as a big deal when TCP was first designed, but as Internet speeds have increased generally and latency has become a big deal, it's become much more important. It's possible to send data on the first packet as long as you can guarantee that the server can distinguish this connection from others. For instance, QUIC does this by having the client choose a long (minimum 64 bits) random connection ID value, which distinguishes this connection from all other connections (I'm simplifying here, as the QUIC connection ID logic is also quite complicated), thus allowing the server to know that this is a new connection and not a replay. There's still a setup handshake, but the client and server are able to send during it, which saves round trips. I plan to cover the complexities of getting this right later, but wanted to mention it here for context.

Tear-Down #

What happens when the endpoints are finished communicating? In principle, they can just stop sending, but then what? The problem here is that both sides have to keep state: in order to be able to process Alice's packets, Bob needs to remember the last packet he processed so that he knows whether a received packet is a replay (to be discarded) or new data (to be processed). This takes up memory and eventually Bob is going to want to clean up. But how does Bob know when Alice is really done, so that it's safe to clean up, versus Alice having just gone quiet for a while?

The obvious thing to do here is to have an in-band signal that says that the connection is closing. This signal would itself be acknowledged, so it would be delivered reliably. This is what TCP does, but experience with newer protocols such as QUIC has shown that this is not always the best approach. This is another topic I plan to cover in a future post.

Common Themes #

Aside from the technical details, you should be noticing a few high level themes.

First, the design of these protocols doesn't really depend on any information about the internals of the network. It could be built out of copper wire, optical fiber, microwave links, two tin cans and a string, or all of the above. Similarly, you don't need to know how fast any individual link is, how big the buffers are in the routers along the path, etc. From the perspective of the transport protocol, the Internet is just this opaque system where you put packets in one side and they come out the other end. So, when we measure the RTT, for instance, we're just measuring the aggregate RTT of the system as a whole.

Second, none of this needs any cooperation from the elements in the middle. This means that (1) it's robust against any technical changes in the network and (2) you can make changes to the transport protocol without first having to change those elements. These are key properties for deployability: the Internet takes a really long time to evolve, and if we needed to change every element between point A and point B before we could use a new transport protocol on that path, we'd be waiting a very long time.

Finally, we constantly have to think about what happens if the network misbehaves in some way, for instance by dropping our packets or delivering them way out of order. A properly designed transport protocol has to be robust to all reasonable kinds of network misbehavior—the bar for what this means has gone up over the years to include active attack—and operate properly, or at least fail safely. This is just the price of trying to build a reliable system out of unreliable components.

Next Up: Congestion Management #

What we have so far is basically a simplified version of what TCP was like in 1986, when the Internet link between Lawrence Berkeley Labs (LBL) and UC Berkeley (about 400 yards apart) abruptly suffered what's come to be known as "congestion collapse", in which flaws in the TCP retransmission algorithms caused the effective throughput of the link to drop by a factor of about 1000. In the next post, I'll be talking about congestion collapse and how to avoid it.

The term sockets goes back to the original BSD sockets programming interface which was commonly used on early Internet systems and is now nearly universal. ↩︎
Ironically, in the modern phone network, it's fairly likely that we're carrying the data over some packet-based transport, very often IP. ↩︎
Yes, I know that in practice it's common to actually download smaller chunks of the video. ↩︎
Note that it's usual not to acknowledge ACKs, otherwise you get into a situation where the sides are just ping-ponging ACKs at each other. ↩︎
TCP does have a new mode called TCP Fast Open which allows sending immediately, but this is comparatively modern and there are a number of deployment challenges. ↩︎

Surprise, blockchains won't fix Internet voting

2023-01-09T00:00:00Z

You'll notice that in my post on end-to-end voting I never mentioned the word "blockchain". However, there's been quite a bit of interest in the "crypto"^[1] community around somehow using the blockchain to "fix" voting. For instance, here's Binance CEO Changpeng Zhao arguing back in 2020 that it will lead to more secure elections with faster results:

If there is a blockchain based mobile voting App (with proper KYC of course), we won't have to wait for results, or have any questions on its validity. Privacy can be protected using a number of encryption mechanisms.
— CZ 🔶 Binance (@cz_binance) November 5, 2020

And here's Ethereum founder Vitalik Buterin endorsing the idea:

The technical challenges with making a secure cryptographic voting system are significant (and often underestimated), but IMO this is directionally 100% correct. https://t.co/J0qHiN2bbk
— vitalik.eth (@VitalikButerin) November 5, 2020

See also Buterin's more extensive defense of this position here, which argues for the blockchain-as-bulletin board design. I address some but not all of his points below.

Spoiler alert: I think this is wrong, in two separate ways.

First, blockchains are not really a useful element in Internet voting: they don't solve the basic security problems in the system, and are worse than the existing technologies they would replace.

Second, the basic premise that we need Internet voting in order to fix our existing voting systems is largely misguided: it's true that we see a lot of problems with those systems in practice, but it's also quite possible to use paper-based systems to run an election that produces quick results which can be independently verified. To a great extent, the operational problems that have gotten so much press are the result of conscious decisions made by policymakers. Moreover, at our current level of technology Internet voting has serious vulnerabilities that we just have no real idea how to overcome.

Blockchains are not the solution to Internet voting #

Let's dispose of the obvious point first: the big problems in the security of Internet voting stem from the need to secure software (and keying material) on voters' devices. A blockchain doesn't really do anything to address this. Moreover, the fact that we fairly routinely see successful attacks on crypto infrastructure as well as theft of crypto currency, including from crypto investors (and maybe even core Bitcoin developers???)—who you would expect to be sophisticated—does not exactly suggest that the cryptocurrency community has discovered the secrets to key management and to building secure cryptographic software. And of course, even if they had, that software has to run on commodity platforms which of course have their own security problems; if end-user devices are compromised, then you can't trust the cryptographic voting software on top of them even if that software is perfect.

The difficulty of getting ordinary people to use cryptography correctly isn't some surprising piece of news. There's decades of papers on how hard cryptographic software is to use (see here and then here). In fact, here's Zhao just last month saying that 99% of people can't adequately manage their own keying material for their crypto:

“For most people, for 99% of people today, asking them to hold crypto on their own, they will end up losing it.”

and:

“Most people are not able to back up their security keys; they will lose the device [...] They will not have the proper encryption for their backup; they will write it on a piece of paper, someone else will see it, and they will steal those funds,” he explained.

But this is precisely what we are asking people to do in order to do any kind of Internet voting (with or without a blockchain). The security of these systems depends critically on the security of the keying material used to authenticate each user. If people can't safely do that for the keys to manage their money, then why should we expect them to do so for a key they only have to use twice a year?

They aren't even a useful element #

OK, so blockchains don't solve the basic security problem with Internet voting, but maybe they are a useful component? Again, I think the answer is "no". The obvious place you might want to use a blockchain is as the "bulletin board" for an E2E system. The bulletin board needs to be (1) publicly accessible and (2) have public consensus on the contents. Given that the point of a blockchain is to provide consensus about which coins have been spent, this seems like a natural fit. The idea here would be that you would submit your ballot as a record on the blockchain (just as you would a record of a spending transaction). Any records which had been included as of the date of the election (or some other deadline, presumably) would then be treated as "on the bulletin board" for the purposes of the rest of the protocol. You'd of course need all the rest of the apparatus of end-to-end verifiable voting like the provable mix, etc., but maybe the blockchain would be useful as the bulletin board.

While possible in theory, this doesn't really get you much in practice. First, the verifiability properties of a blockchain do not map well onto what you need for an election. Second, this use of a blockchain in this context has a number of practical problems, as discussed in a quite thorough report by MIT researchers Park, Specter, Narula, and pioneering cryptographer (and co-inventor of the RSA public key algorithm) Ron Rivest.

Verification #

The distinguishing feature of blockchain type systems is that they are designed to be "zero-trust", in the sense that you don't need to trust a central authority to maintain the integrity of the log. The specific property that the blockchain guarantees is that everyone has consensus on:

Which transactions are in the log
What order they occurred in

The details of how it accomplishes this are out of scope for this post (I've been working on a post about this, but I'm not happy with it yet), but the key insight to have is that the reason you need this kind of system is that the transactions in the log do not themselves provide all the information you need to verify them. Specifically, while they are typically digitally signed and so you can verify they are authentic, you need the blockchain to tell you what order they occurred in and to ensure that people don't conceal transactions.^[2]

E2E voting is similar in that you don't trust the voting authority but different in that all of the information it publishes is self-authenticating, so you don't need some separate mechanism to ensure it was correctly recorded. Specifically:

You can verify that all the input votes are valid by checking their signatures (this is true of cryptocurrency systems too).
You can verify that the mixing was conducted correctly by checking the proofs of shuffling.
You can verify that the votes were decrypted correctly by checking their proofs.

The only thing you can't directly verify from this information is that votes weren't incorrectly excluded from the original input set, but a blockchain doesn't really assist you here, because it's just a record of what people claimed happened. Instead, what you need is for the authority to publish the input set in some way that everyone can see and that allows people to challenge the input set.^[3] Specifically, the authority publishes the set of signed encrypted ballots to the bulletin board and then:

Voters who believe that their votes were improperly excluded can challenge that exclusion.
Observers who believe that a vote was improperly included (e.g., the signature is invalid, or the voter is ineligible) can challenge that vote.

This does require that everyone agree on the contents of each bulletin board, but you don't need the blockchain to provide it because the election officials can just post it on their Web site. Well, mostly.

Partitioning Attacks #

The reason for the "mostly" is that you can't check whether all the votes that are supposed to be present actually are, because you don't know who voted. Rather, you are counting on other people having checked that their votes appear on the bulletin board (or people checking for them). If that bulletin board is just a Web site then it's theoretically possible to mount what's called a partition attack.

Suppose the election officials want to suppress Alice's vote. If they just exclude it from the bulletin board, then Alice might catch them. Instead, they selectively exclude it, by creating two copies of the bulletin board:

The main one they use for the actual count that excludes Alice.
A bogus bulletin board that includes Alice.

When Alice goes to check her vote, the election officials send Alice the bogus version, and so her checks succeed. However, when anyone else checks the bulletin board, they send the real copy.

This is actually a very hard attack to mount in practice because any number of things can go wrong. First, if Alice checks the final totals, she'll see that they don't match. Even if she's lazy, this depends on being able to perfectly detect when Alice is checking as opposed to someone else; as there is no reason to authenticate this transaction, that's difficult. You could use the IP address, but what if Alice votes from her phone and checks from her laptop?

Moreover, this attack is easy to defeat as long as you have any consensus mechanism at all. You certainly don't need anything as fancy as a blockchain, though, because we already have numerous mechanisms for election officials to communicate authoritatively with the public in ways that ensure that everyone gets the same information (e.g., by having that information broadcast on television or published in the newspaper). All they need to do is publish the hash of the bulletin board via one of these mechanisms and then everyone can verify that they have the same bulletin board contents.^[4]

The point is that this is not a situation which needs distributed consensus; it just needs regular consensus. The whole system has to be centrally operated anyway, and that central authority is a natural mechanism for establishing consensus.
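To show how little machinery the hash check needs, here is a sketch in Python. The ballot encoding, the function names, and the choice of SHA-256 are all assumptions for illustration; a real system would specify the serialization of the bulletin board precisely.

```python
import hashlib


def bulletin_board_digest(ballots: list[bytes]) -> str:
    """Hash an ordered list of (signed, encrypted) ballots into one digest."""
    hasher = hashlib.sha256()
    for ballot in ballots:
        # Length-prefix each entry so that entry boundaries are unambiguous.
        hasher.update(len(ballot).to_bytes(8, "big"))
        hasher.update(ballot)
    return hasher.hexdigest()


def matches_published(ballots: list[bytes], published_digest: str) -> bool:
    """Check a downloaded copy against the digest printed in the newspaper."""
    return bulletin_board_digest(ballots) == published_digest


# The copy used for the count leaves out Alice; its digest is what gets published.
main_board = [b"ballot-bob", b"ballot-carol"]
published = bulletin_board_digest(main_board)

# The bogus copy served only to Alice includes her ballot.
alice_copy = [b"ballot-alice", b"ballot-bob", b"ballot-carol"]

honest_check = matches_published(main_board, published)
alice_check = matches_published(alice_copy, published)
```

Everyone else's copy matches the published digest, but Alice's does not, so the partition is detected as soon as she compares her copy against the value everyone saw.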

Practical Problems #

The details of how blockchains work are outside of the scope of this post, but briefly, a blockchain is a public list of transactions, with every transaction appearing in, or at least attested to by, the blockchain. It is maintained by a set of servers who are responsible for checking the validity of transactions and appending them to the public log. In what's called a "permissionless" blockchain, these servers are just operated by ordinary people (or at least in theory; in practice of course it takes a lot of resources to be relevant) and there aren't any special trust relationships with those servers. At a very high level the process looks something like this:

The user (voter) generates a candidate record that it wants incorporated into the blockchain.
The user's software then sends the record to some set of other network nodes.
Those nodes propagate that record to other nodes until all—or at least most—of the other nodes in the network have a copy.
One or more network elements select a set of outstanding records and incorporate them into the blockchain. Note that I've totally omitted how this happens. For our purposes, it's magic.
The extended blockchain is propagated to the rest of the network.

The result is that everyone knows by looking at the blockchain which records are in the consensus and which are not (this part is magic too).

As Park et al. observe, there are a number of things which can go wrong here. For instance:

The nodes that the user submits their record to could decide not to propagate it to other nodes, thus preventing a given user from voting.
The nodes responsible for selecting the set of outstanding records could omit a specific record, either unintentionally (because it gets lost) or maliciously (to suppress a given user's vote).
An attacker could attempt to mount a denial-of-service attack on the network to prevent it from coming to consensus. Park et al. suggest a specific attack scenario which exploits the fact that in some networks the user has to pay to have their transactions included in the blockchain, and the nodes have discretion about which transactions to include (and can favor the higher bidding ones) at times when the incoming transaction rate exceeds the throughput of the network.^[5] If the network is shared with other applications like financial transactions, an attacker could potentially flood the system with transactions in an attempt to starve out legitimate votes.
An attacker might be able to exploit defects in system elements or the associated protocols to globally or selectively mount denial-of-service attacks on an election.

The bigger picture here is that blockchains don't provide a guaranteed level of service and that the actual delivered level of service depends on network elements which are untrustworthy and potentially malicious. This opens up a lot of opportunities for attackers to interfere with election outcomes even if they aren't able to actually forge votes. They don't need to be completely successful, either, they just need to have a big enough impact to swing a close election. Of course, some of these attacks are possible with centrally operated systems, but at least in those systems you know who to blame for outages (and remember, I'm not saying that Internet voting is good, even with centralized systems!).

I could go on here, but if you're really interested, you should read the MIT report. The authors do a valiant job of trying to design a blockchain-based voting system using coins as votes, but honestly it's just a mess, with all the problems I've described here and more (this isn't a critique of the authors; their point is that it's a bad idea, so it's proof by contradiction). The bottom line is that blockchain technology just isn't a good fit for this application.

Solving the wrong problem #

Finally, the whole argument here kind of rests on a misdiagnosis of the situation, namely that the problem with conventional voting systems is that they are inherently (1) slow to get results and (2) open to questions of validity, and hence that we need Internet voting to solve these problems.

Speed #

It's entirely possible for conventional voting systems to produce rapid results (though in all fairness, not as fast as an Internet-only system). It's true that there have been a number of recent elections where it took a number of days to determine the winner, as more votes trickled in. In some cases, candidate A looked like a winner early but was the eventual loser when all the votes were in, which has caused a lot of suspicion among people who didn't understand what was happening. However, many jurisdictions actually are able to resolve elections quickly. For instance, Florida mostly got same-day results in 2022.

To understand what causes delay, it helps to understand the logistics of voting. The consensus best choice in the voting security community is optically scanned (opscan) paper ballots. These can be counted in one of two ways:

Precinct count: The ballots are fed into a machine in the precinct which counts them immediately and then can report the results.
Central count: The ballots are sent back to election central where they are scanned.

Precinct count systems can deliver results immediately upon poll closure, with some potential risk to voter privacy (you have to trust the machine not to record the order of ballots and their contents). With systems like this, you can get a count on election night (pending verification, as below). Central count machines obviously take longer to report values, but modern central count scanners can count hundreds of ballots per minute, so it's not implausible that you could get an election night count with an acceptable cost, as Florida already does.^[6]

There are a number of reasons why elections can be slow to resolve, but one of the main ones is absentee/mail-in ballots. For instance, in California, ballots can be postmarked on election day, so you need to wait days for all of the ballots that were mailed to be delivered. In some jurisdictions, you can't even start counting absentee ballots until election day, which means you need to count a lot of ballots right away. A number of jurisdictions have both of these problems: in Mississippi ballots can be processed up to 5 days after election day if they are postmarked on election day and you're not even allowed to start checking the signatures on them until election day! As noted above, if you have the right policies you can get answers reasonably quickly.

It's certainly true that ballots received over the Internet could be tallied instantly, so in that respect we would expect Internet voting to be faster, but this only works if we require everyone to vote over the Internet, which has the potential to really disenfranchise a lot of people (people who can't afford modern devices, those who aren't comfortable with new technologies, etc.). If a significant number of people still vote mail-in with paper ballots, then you still have the problem. The bottom line here is that if we want to prioritize rapid election results at the cost of making it harder to vote remotely (and while for many people an app would be easier, for some it would be harder), then we know how to do it; it's a choice to have slow election results.

It's also important to note that this is all about preliminary results. Full verification takes time, both with paper-based systems and for end-to-end verifiable systems. For paper-based systems, this is because the risk-limiting audit or hand count is manual. In end-to-end verifiable systems, the cryptographic pieces can be checked immediately, but you need to give time for people to challenge the initial vote input set (and specifically to object that their vote was not included). Until that's happened, you have no way of knowing that the voting system didn't just exclude a lot of voters.^[7]

Disputes about validity #

From a technical perspective, election validity comes down to the ability to demonstrate to a third party—ideally to any third party, but in practice to some set of third parties that are collectively trusted^[8] by the electorate—that each phase of the election was correctly conducted, or at least that the inevitable errors were insufficiently large to affect the final result.

For ordinary elections, verifiability is provided by a combination of observability—at least in principle—for the manual processes and double-checking for the inherently unverifiable electronic processes (if any).^[9] This second feature is typically described using the concept of software independence (SI), defined by Rivest and Wack as follows:

A voting system is software-independent if an undetected change or error in its software cannot cause an undetectable change or error in an election outcome.

The intuitive reason for SI is that we know computers to be very insecure—and multiple reviews of electronic voting systems have found serious vulnerabilities—and that their operations are opaque, so any voting system shouldn't depend on trusting them.

With a hand-marked paper ballot system, you have some set of processes to ensure that only registered voters vote, but you still need to verify that the tabulation is performed correctly. If you count the ballots by hand, we're back to observability, but if you count them by machine, then you need a double check. This can be provided by using a risk-limiting audit (RLA), in which a sample of the ballots is publicly counted. Of course, if there is real doubt or the margin is very close then you can do a full hand count, but in either case the entire counting process can be made verifiable (though in practice, RLAs are nothing like universal). The key point here is that if you follow the right practices, then even a complete compromise of the scanner will not lead to the wrong result. If you use ballot marking devices (BMDs) instead of hand-marking the ballots, then this does not completely provide SI: if the BMD is compromised then the attacker can have it record the wrong result; some voters will check and catch the error, but others won't and for those voters the attack will succeed. The counting process is still verifiable, of course.

Similarly, end-to-end verifiable systems provide SI for tabulation by making it possible, at least in theory, for someone to write their own system from scratch that will verify the election. However, if users are voting on their own devices, then any compromise of those devices can completely compromise the votes cast on them, and there's no plausible way to detect or recover from this form of attack, which is even worse than with BMDs. Imagine what happens in an election where it's discovered that even a small number of user devices had been compromised; how would you have confidence in the result? As noted above, using a blockchain doesn't help with this at all.

Even if we confine our attention to the parts of the system that are independently verifiable, actually convincing yourself that the election was correctly conducted can be a pretty challenging proposition. A full hand count is directly verifiable if you watch the whole thing, and while the idea behind a risk limiting audit is simple, knowing how many ballots to count involves some reasonably complicated math. The situation with any end-to-end verifiable system is dramatically worse in that not only is the math very complicated, even the logic takes thousands of words to explain. It's pretty hard to see how explaining that votes are correct because they are digitally signed and then mixed in a way you can check by verifying a zero-knowledge proof is going to put to rest any questions of validity.
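To give a flavor of the "reasonably complicated math", here is a minimal sketch of the stopping rule used by one family of risk-limiting audits (ballot-polling audits in the style of BRAVO). This is my illustration rather than anything a particular jurisdiction runs: it assumes a two-candidate contest, and the function name and the `"winner"`/`"loser"` labels are made up for the example.

```python
def ballots_until_stop(sample, winner_share, risk_limit):
    """Sequential test in the style of a BRAVO ballot-polling audit.

    sample       -- randomly drawn ballots, each "winner" or "loser" (anything else is skipped)
    winner_share -- reported share for the winner among winner+loser votes (must be > 0.5)
    risk_limit   -- maximum chance of confirming a wrong outcome, e.g. 0.05

    Returns the number of ballots examined when the audit can stop,
    or None if the sample ran out first (keep sampling or do a full hand count).
    """
    if not 0.5 < winner_share < 1:
        raise ValueError("winner_share must be between 0.5 and 1")
    t = 1.0  # likelihood ratio: reported result vs. an exact tie
    for count, ballot in enumerate(sample, start=1):
        if ballot == "winner":
            t *= winner_share / 0.5
        elif ballot == "loser":
            t *= (1 - winner_share) / 0.5
        if t >= 1 / risk_limit:
            return count
    return None
```

The closer the reported margin is to a tie, the less each ballot moves the ratio and the more ballots you need, which is why a fixed sample size can't be quoted up front.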

You'll note that above I said that from a technical perspective validity disputes come down to third party verifiability. The bigger problem here is that many election disputes don't come down to technical questions at all, because most people aren't going to research the details of how elections are run (how many people still think that there was tabulation fraud in Georgia, even after a full hand count?) and end up making decisions on other grounds, using motivated reasoning or based on who they trust more. It's hard to see how any set of technical mechanisms will really convince everyone, though I'm especially skeptical that arguments based on fancy cryptography will do the job.

The Bigger Picture #

As I said in my original post on end-to-end verifiable voting, voting isn't just a technical problem: it's embedded in a system of social practices and it's those social practices which make the problem complicated (again, I encourage anyone interested in voting to actually go serve as an election worker). It's of course possible to improve voting technology, but most proposals for how we could radically improve everything using new technology X fall down when you realize that X doesn't take into account those existing operational realities. This is largely the case with Internet voting. The problem with using blockchains for Internet voting is simpler, though: it doesn't solve any problem that can't be solved with other, simpler technology. Of course, that could also be said of a number of other proposed applications of blockchains, which, to quote Mark Nottingham, are not magical.

The scare quotes are here because there is of course a pre-existing use of the term "crypto" to mean "cryptography". ↩︎
The reason this is important is that you need to prevent "double-spending" attacks where people use the same cryptographic token to pay two people. ↩︎
The analogous check in a blockchain-based cryptocurrency system is that the payee verifies that a transaction is recorded on the blockchain before they believe they have been paid. ↩︎
This is actually how pre-Bitcoin timestamping systems were designed. ↩︎
The Bitcoin maximum transaction rate is famously low, though other networks do better. ↩︎
Handwaving alert: The Interscan Hipro can scan 300 pages per minute and costs under $200,000. Los Angeles is probably the biggest county in the US with almost 6 million registered voters: if you had about 40 scanners you could do all these counts in less than 10 hours at a capital cost of less than $10 million (of course, there are lots of other costs to consider). ↩︎
Remember that many registered voters don't actually vote, so you need some way of distinguishing the case where people didn't vote from the case where their votes were discarded. ↩︎
By which I mean that for the vast majority of voters, there is at least one verifier they trust, even if not all voters trust the same verifier. ↩︎
Outside the US, hand counting is common, but in the US, it's pretty much necessary to use machine counting for logistical reasons. ↩︎

How to securely vote for (or against) Elon Musk

2022-12-24T00:00:00Z

Note: this post contains a bunch of LaTeX math notation rendered in MathJax, but it doesn't show up right in the newsletter version. You may want to instead read the version on the site.

Earlier this week Elon Musk ran a poll on whether he should step down as head of Twitter. As of this writing, the poll stood overwhelmingly (57.5 to 42.5) against Musk.

Should I step down as head of Twitter? I will abide by the results of this poll.
— Elon Musk (@elonmusk) December 18, 2022

Unsurprisingly, there have been claims of voter fraud (via "bots") as well as concerns that Musk would retaliate against people who voted that he should step down. Twitter polls are just some code on a Web site, and so are trivially insecure in any number of ways, including:

There's no way to validate who voted.
There's no way to externally verify that the votes were accurately counted.
It's trivial for anyone in control of Twitter servers to see who voted which way.

These weaknesses follow directly from the way that Web sites are built: your browser is just running a program that is provided by the server, and you vote by sending your vote to the server, so of course the server can see your vote and lie about it. The typical way to address this is with physical countermeasures like having people vote with paper ballots or having some kind of paper trail of how people voted. However, over the past 25 years or so it's become possible to build a voting system that is entirely remote (i.e., that doesn't involve you voting in person or sending any physical object anywhere) and yet provides strong privacy and security guarantees [terms and conditions apply].

These technologies are typically called cryptographic or more recently end-to-end (E2E) voting systems. There has been an enormous amount of work in this area; in this post, I'll be describing a simplified version of a pair of papers by two of the pioneers of the field, Josh Benaloh and Ben Adida (Helios).

Background #

Before we get into end-to-end systems, it's useful to review how a typical paper ballot system works. You can find a more detailed description in previous posts and a description of the requirements here. Typically you have preprinted paper ballots which list the choices for each contest. For instance:

The voter marks the ballot with their selection and then submits it for tabulation. In many (though not all) systems the ballot is then mixed with other ballots and shuffled (or maybe at least shaken around a bit) so that the order in which people voted is not preserved. For instance, they might be put in a cardboard box, shuffled, and then carried to some central place for tabulation. The ballots are then tabulated and the totals reported.

This system has a fairly straightforward verifiability story if you can observe the process: you can observe who voted and that each voter only got a single ballot. As long as the chain of custody for ballots is secure (i.e., the ones that go into the box are the ones that are counted) and you can observe the counting process, then you can verify that the totals are right, and can have confidence in the whole election. The privacy story is similarly straightforward: the ballots are shuffled and as long as people don't mark them in a distinguishing fashion (a big assumption!) then it's not possible to associate a ballot with a given voter (what's called "k-anonymity").

How Not to Build a Cryptographic Voting System #

Building an end-to-end voting system is more or less a matter of replicating these properties without the paper. This is harder than it sounds. The basic problem is that there is a tension between two properties:

Ensuring the integrity of the ballot through the entire process.
Protecting the anonymity of individual votes.

If you're willing to give up either of these properties, then the problem becomes fairly straightforward.^[1]

Secure but Non-Anonymous Ballots #

If you're willing to sacrifice anonymity, then you can just have signed ballots. The way that this works is that each user has a cryptographic key pair (this of course imports all the usual problems with cryptographic identities, but let's assume that those are solved). In order to vote you sign your ballot and submit it to the election administrators.

Knowing Who Voted #

It may come as a surprise to people that we would publish who voted, but this is actually a fairly common feature of real-world systems. For instance, when I worked the polls in Santa Clara County, you would cross people off the voter sheet when they voted and then periodically post a copy of the sheet; this is a useful transparency measure but also allows campaign workers to know where they should focus their get out the vote measures.

The administrators post all the signed ballots on some public bulletin board. This allows anyone to verify who voted and file challenges^[2] in case of irregularities, such as:

Their vote wasn't included
Someone voted who shouldn't have
Someone voted multiple times

Once the challenges are complete, everyone can then verify the tabulation for themselves. Of course, this also lets anyone know exactly how everyone else voted, which is bad.
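Here is a minimal sketch of what auditing such a bulletin board might look like. Everything in it is an assumption made for illustration: I'm using toy Schnorr signatures over a tiny group (completely insecure at this size), and the names `roll`, `board`, and `audit` are invented. Checking that your own vote is present is something only you can do, so the audit covers the other two challenges.

```python
import hashlib
import secrets

# Toy group, far too small for real use: P = 2Q + 1 with P and Q prime,
# and G generates the subgroup of order Q.
P, Q, G = 2039, 1019, 4

def keygen():
    priv = secrets.randbelow(Q - 1) + 1
    return priv, pow(G, priv, P)

def _hash(r, ballot):
    digest = hashlib.sha256(f"{r}|{ballot}".encode()).digest()
    return int.from_bytes(digest, "big") % Q

def sign(priv, ballot):
    k = secrets.randbelow(Q - 1) + 1
    e = _hash(pow(G, k, P), ballot)
    return e, (k + e * priv) % Q

def verify(pub, ballot, sig):
    e, s = sig
    r = pow(G, s, P) * pow(pub, Q - e, P) % P  # g^s * pub^(-e) recovers g^k
    return _hash(r, ballot) == e

def audit(board, roll):
    """board: list of (voter, ballot, signature); roll: voter -> public key."""
    problems, seen = [], set()
    for voter, ballot, sig in board:
        if voter not in roll:
            problems.append((voter, "not on the voter roll"))
        elif not verify(roll[voter], ballot, sig):
            problems.append((voter, "bad signature"))
        elif voter in seen:
            problems.append((voter, "voted more than once"))
        seen.add(voter)
    return problems
```

Because every ballot carries its signer's name, the same board that makes this audit easy is also a complete record of how each person voted.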

Anonymous but Insecure Ballots #

On the other hand, if you don't care about the integrity of the election, you can use standard techniques to mix the ballots. For instance, you can have a series of proxies arranged in what's called a mix network (mixnet), as shown below:

The idea here is that you have a series of independently operated proxies. Each voter recursively encrypts their ballot, first to the tabulator, then to proxy 2, and then to proxy 1, so that proxy 1's layer is outermost. They then send their ballots to proxy 1, which decrypts them, shuffles them, and forwards them to proxy 2. Proxy 2 does the same and forwards them to the tabulator, which finally decrypts them. The end result is that the tabulator gets a list of ballots but is unable to determine which ballots correspond to which voter or what order they were cast in: it receives them in random order and because encryption was removed at each layer, there is no way to match the contents of a given ballot to the encrypted version that was cast. This property holds as long as at least one of the proxies is honest, and you can of course have an arbitrary number of proxies.

Unfortunately, any one of the proxies or the tabulator can tamper with the election results. It's obvious that the tabulator can do this because they have the final ballots, but the proxies can do the same by replacing the genuine ballots with fake ones. You could of course have the voters sign the ballots, but then this obviates the point of shuffling them because the voter's identity will be available to the tabulator. Even if you sign them before they are sent to the first proxy, that proxy has to strip the signatures.

Building a Real Design #

The underlying problem with the mixnet scheme is that it doesn't do anything to ensure that the ballots that come into the mixnet are the ballots that come out of it. In an ordinary paper-based system, this is provided by physical properties: you verify that the box is empty at the start of the election and you can have confidence that paper ballots won't change in transit or create new ballots via spontaneous generation. However, the proxies are much more complicated than cardboard boxes and they can readily create, modify, or delete ballots.

What we need is some way to verify the integrity of the mixing system, or more precisely, a way for a mixer to prove that it has executed the mixing correctly, which is to say that there is a one-to-one relationship between the ballots that were put into the system and those that came out. I describe how to build such a mixer below.

Re-Encryption #

In order to do this we first need a new primitive, which is a way to reencrypt a value encrypted to Alice so that the ciphertext (the encrypted value) is different but the plaintext (what you get when you decrypt) is the same. Importantly, you need to be able to do this without knowing the encryption key or the plaintext (it's trivial to do otherwise by just decrypting and reencrypting). In the simple proxy design above we just solved this problem by using nested encryption, but for reasons that will shortly become apparent, that doesn't work here, so we need a new primitive.

You can find details of how to implement reencryption here but we can just assume that we have some function $R$ that performs this operation. Specifically, given a ciphertext $C$ and a randomizer $r$, we can compute:

$$ R(r, C) \rightarrow C' $$

Such that:

$$ Decrypt(C) = Decrypt(C') $$

We can create a mixer by using re-encryption instead of removing one layer of encryption, as shown below:

Without knowing the $r_i$ randomization values, it's not possible to associate the output ciphertexts with their corresponding inputs.

Unlike nested encryption, re-encryption has the nice property that you can re-encrypt the same ciphertext multiple times without any help from the sender. So, for instance, you can just add another mixer stage without having to add another layer of nesting. More importantly, you can create an arbitrary number of equivalent ciphertexts from the same initial ciphertext. We use this fact below.
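For concreteness, here is a minimal sketch of $R$ using ElGamal, where re-encrypting amounts to multiplying in a fresh encryption of 1. The group parameters are toy-sized and completely insecure, and the function names are mine; treat this as one way to get the property, not as the only one.

```python
import secrets

# Toy parameters (far too small for real use): P = 2Q + 1 is a safe prime
# and G generates the subgroup of prime order Q.
P, Q, G = 2039, 1019, 4

def keygen():
    priv = secrets.randbelow(Q - 1) + 1
    return priv, pow(G, priv, P)

def encrypt(pub, m):
    """ElGamal-encrypt a group element m under the public key pub."""
    k = secrets.randbelow(Q - 1) + 1
    return (pow(G, k, P), m * pow(pub, k, P) % P)

def reencrypt(pub, r, c):
    """R(r, C): needs only the public key, not the private key or the plaintext."""
    c1, c2 = c
    return (c1 * pow(G, r, P) % P, c2 * pow(pub, r, P) % P)

def decrypt(priv, c):
    c1, c2 = c
    return c2 * pow(c1, Q - priv, P) % P  # c1^(Q - priv) is c1^(-priv) in the order-Q subgroup

priv, pub = keygen()
ballot = pow(G, 7, P)                      # a vote encoded as a group element
c = encrypt(pub, ballot)
c_prime = reencrypt(pub, secrets.randbelow(Q - 1) + 1, c)
assert c_prime != c                        # the ciphertext changed...
assert decrypt(priv, c_prime) == decrypt(priv, c) == ballot  # ...the plaintext did not
```

Nothing stops you from calling `reencrypt` on `c_prime` again, which is the property the mixer stages rely on.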

Provable Mixing #

Re-encryption-based mixing is a more flexible design than nested encryption, but this still leaves us trusting the mixer. However, once we base our mix on re-encryption we can prove that the mix was performed correctly.

It's of course trivial to prove that the mix was performed correctly if you're willing to reveal the mapping itself: you just publish which inputs correspond to which outputs as well as the reencryption factors $r_i$ and anyone can verify for themselves that the ostensible inputs result in the right outputs. What we want to do however is prove that the mix was performed correctly without revealing the mapping between inputs and outputs, which is obviously harder. Fortunately, there is a clever trick we can use here, due to Sako and Kilian.

The basic idea is that instead of mixing the ballots once, the mixer instead does so twice, creating two alternative mixes. It publishes both of them, identifying (arbitrarily) one as the output and the other as what's called the "shadow" mix. The diagram below shows the situation, with dashed arrows to indicate that observers are unable to see the mapping:

Now consider the case where the mixer cheated and replaced value $V_1$ in the output with a new value $V_a$. Assuming everything else was correct, the shadow mix can be in one of two states. First, the shadow mix can be correct, which is to say that it contains $V_1$ rather than $V_a$. In this case, there is a 1-1 mapping between the input values and the shadow mix, but no 1-1 mapping between the shadow mix and the outputs.

Alternatively, the shadow mix can be incorrect and contain $V_a$ rather than $V_1$. In this case, there is a 1-1 mapping between the shadow mix and the outputs but not between the shadow mix and the inputs.

Either way, if the mixer cheated, there will either not be a mapping from the input to the shadow or from the shadow to the output (of course, it's possible for both to be wrong). If the verifier then randomly challenges the mixer to reveal either of the mappings, there is a $1/2$ chance that the mixer will be unable to do so (obviously, if the mixer discloses both, then this is the same as disclosing the full mapping to the output, but because the shadow mix is shuffled with respect to the input-output mappings, neither of the mappings to the shadow mix tells you anything about the input-output mappings). On the other hand, if the mixer has behaved honestly, it can reveal either mapping when challenged.

Given this design, if the mixer cheats they have a $1/2$ chance of getting caught, or, to look at it another way, a $1/2$ chance of getting away with it. It's straightforward, however, to make that chance arbitrarily small, just by having the mixer create more than one shadow mix. The verifier then asks them to reveal one half of the mapping for each of the shadows—but never both halves for a given shadow. Each of these challenges has a $1/2$ chance of detection, so if you have $n$ challenges, the chance of successful cheating is $2^{-n}$, which quickly gets very small; somewhere between 80 and 100 shadows is easily sufficient.
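The shadow-mix check can be sketched in a few lines. This assumes ElGamal-style re-encryption over a toy (insecure) group, and all of the function names are invented for the example. The thing to notice is that the shadow-to-output mapping is derived from the two mappings the mixer already knows, by subtracting randomizers.

```python
import secrets

P, Q, G = 2039, 1019, 4  # toy group: P = 2Q + 1, G has order Q (not secure)

def encrypt(pub, m):
    k = secrets.randbelow(Q - 1) + 1
    return (pow(G, k, P), m * pow(pub, k, P) % P)

def reencrypt(pub, r, c):
    return (c[0] * pow(G, r, P) % P, c[1] * pow(pub, r, P) % P)

def mix(pub, inputs):
    """Shuffle and re-encrypt: outputs[j] = R(rs[j], inputs[perm[j]])."""
    perm = list(range(len(inputs)))
    secrets.SystemRandom().shuffle(perm)
    rs = [secrets.randbelow(Q - 1) + 1 for _ in inputs]
    outputs = [reencrypt(pub, rs[j], inputs[perm[j]]) for j in range(len(inputs))]
    return outputs, perm, rs

def shadow_to_output(perm_o, rs_o, perm_s, rs_s):
    """Mixer side: derive the shadow -> output mapping from the two secret mappings."""
    where_in_shadow = {src: k for k, src in enumerate(perm_s)}
    link_perm = [where_in_shadow[src] for src in perm_o]
    link_rs = [(rs_o[j] - rs_s[k]) % Q for j, k in enumerate(link_perm)]
    return link_perm, link_rs

def check_mapping(pub, src, dst, perm, rs):
    """Verifier side: is perm a permutation with dst[j] == R(rs[j], src[perm[j]])?"""
    return (len(src) == len(dst)
            and sorted(perm) == list(range(len(src)))
            and all(reencrypt(pub, rs[j], src[perm[j]]) == dst[j]
                    for j in range(len(dst))))
```

An honest mixer can answer either challenge: `check_mapping(pub, inputs, shadow, perm_s, rs_s)` for the input side, or `check_mapping(pub, shadow, outputs, *shadow_to_output(...))` for the output side. If it swaps out one of the outputs after the fact, the output-side check fails. With $n = 80$ shadows, the chance of cheating undetected is $2^{-80}$, or roughly $8 \times 10^{-25}$.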

Note that this does not prove that the ballots were actually randomly^[3] shuffled, merely that they map 1:1 between input and output. The standard way to ensure shuffling is to have multiple mixers: as long as any one is honest and actually shuffles, the result will be random.

A non-interactive proof #

The obvious problem here is that this proof of correctness is interactive which is to say that it requires someone to actually generate the challenges. If you're not that person you just have to trust them. This is still better than nothing because the verifier could be separate from the mixer, but it's possible to do better still, creating a non-interactive proof that the mix was done correctly.

The intuition to have here is that the interactive proof works by forcing the mixer to commit to the outputs before they get to learn the challenge. This prevents the mixer from creating a specific dishonest mapping that will pass a specific known challenge, which is quite easy (just make the challenged side correct). However, we can achieve the same effect by making it impossible for the dishonest mixer to control the challenge for a given set of mappings. We do this by computing the challenge using a hash of the outputs. I.e., the mixer:

Computes the output and $n$ shadow mixes $S_1, S_2, ... S_n$.
Hashes those to produce a string of at least $n$ bits ($H_i$)
Publishes the output and shadow mixes and also, for each bit of the hash $H_i$, publishes either the mapping from the input to the shadow (if the bit is 0) or from the shadow to the output (if the bit is 1).

The verifier then recomputes the hash over the outputs and checks that the mixer has provided valid mappings for the indicated side.

It's natural to wonder whether the mixer could compute a mapping such that the hash has the right set of challenges. In principle yes, but because the hash is an unpredictable function of the mappings, they have to first compute the mapping and then check whether it works. Each attempt has a random $2^{-n}$ chance of producing a matching mapping, and so they just have to keep trying; it takes about $2^{n}$ attempts on average to find an appropriate input, which is prohibitive if $n$ is large enough.
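Deriving the challenge bits is the easy part. A minimal sketch, assuming SHA-256, ciphertexts represented as pairs of integers, and the convention that bit 0 means "reveal input to shadow" and bit 1 means "reveal shadow to output" (the serialization format here is made up):

```python
import hashlib

def challenge_bits(output_mix, shadow_mixes):
    """Derive one challenge bit per shadow mix from everything the mixer published."""
    if len(shadow_mixes) > 256:
        raise ValueError("one SHA-256 digest only supplies 256 challenge bits")
    h = hashlib.sha256()
    for mix in [output_mix, *shadow_mixes]:
        for c1, c2 in mix:
            h.update(f"{c1},{c2};".encode())
        h.update(b"|")  # separator so mixes can't be re-split differently
    digest = int.from_bytes(h.digest(), "big")
    return [(digest >> i) & 1 for i in range(len(shadow_mixes))]
```

The mixer and the verifier both run this same function on the published data, so the verifier never has to send anything. A real system would probably also feed the input ballots and some election identifier into the hash; I've left that out to mirror the steps above.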

The problem of turning an interactive proof into a non-interactive one occurs all over cryptography, and this hashing technique, called the Fiat-Shamir Heuristic, is the standard solution.

Tabulation #

Once we have a shuffled set of encrypted ballots, we need to count them. At a high level, this is simple: the election officials decrypt them to reveal the original ballots. They publish those ballots and anyone can then tabulate them themselves. However, there are two subtleties we need to consider here.

Verifiable Decryption #

The first problem we have is that the election officials could simply lie about the contents of the ballots. I.e., they could say that a vote for Alice was actually a vote for Bob. This actually turns out to have an easy answer: it's possible for them to create a proof that they correctly decrypted the ballot. The details are a bit complicated but don't really matter: the bottom line is that it's possible to publish the following triplet:

The encrypted ballot $E(V_i)$
The decrypted ballot $V_i$
A proof that $E(V_i)$ decrypts to $V_i$.

Once verifiers have checked that the proofs are correct, they can then tabulate the decrypted ballots.
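If you do want to see what such a proof can look like, one standard construction for ElGamal is a Chaum-Pedersen proof that the same private key underlies both the public key and the decryption factor, made non-interactive with a hash. The sketch below is an assumption about how you might do it, not a description of any particular system, and it uses a toy, insecure group with names of my choosing.

```python
import hashlib
import secrets

P, Q, G = 2039, 1019, 4  # toy group: P = 2Q + 1, G has order Q (not secure)

def _challenge(*values):
    data = ",".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q

def decrypt_with_proof(priv, pub, c):
    """Return the plaintext plus a proof that log_G(pub) == log_c1(d)."""
    c1, c2 = c
    d = pow(c1, priv, P)               # decryption factor c1^priv
    m = c2 * pow(d, P - 2, P) % P      # plaintext = c2 / d (inverse via Fermat)
    w = secrets.randbelow(Q)
    a, b = pow(G, w, P), pow(c1, w, P)
    e = _challenge(G, pub, c1, c2, d, a, b)
    z = (w + e * priv) % Q
    return m, (d, a, b, z)

def verify_decryption(pub, c, m, proof):
    c1, c2 = c
    d, a, b, z = proof
    e = _challenge(G, pub, c1, c2, d, a, b)
    return (pow(G, z, P) == a * pow(pub, e, P) % P      # proves knowledge of priv
            and pow(c1, z, P) == b * pow(d, e, P) % P   # same priv was used on c1
            and m * d % P == c2)                        # claimed plaintext matches
```

The verifier never sees `priv`; they only check the three equations against published values.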

Multiple Decryption Keys #

The second problem is that election officials might decrypt the encrypted ballots when they are initially posted on the bulletin board, thus learning how everyone voted. As noted above, these ballots are signed and so are easy to attribute.

The standard approach to mitigating this threat is to encrypt each vote with multiple keys, so that you need multiple election officials—or even some trusted third party—to do the decryption. This means that they all need to collude in order to violate user privacy by decrypting ballots before they are shuffled. This is cryptographically straightforward (see here for one way to do it with ElGamal Encryption). Note that even if election officials do all collude, this still doesn't threaten the integrity of the election.
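A minimal sketch of the simplest version of this with ElGamal, where every official must participate: the election public key is the product of the officials' public keys, and decryption needs one partial decryption from each of them. The group is toy-sized and insecure, and the function names are mine.

```python
P, Q, G = 2039, 1019, 4  # toy group: P = 2Q + 1, G has order Q (not secure)

def joint_public_key(pubs):
    """Election key = product of the officials' public keys."""
    h = 1
    for pub in pubs:
        h = h * pub % P
    return h

def partial_decrypt(priv, c):
    """Each official publishes c1^priv without revealing priv."""
    return pow(c[0], priv, P)

def combine(c, shares):
    """Divide c2 by the product of all the partial decryptions."""
    d = 1
    for share in shares:
        d = d * share % P
    return c[1] * pow(d, P - 2, P) % P

# Three officials with (hypothetical) private keys; a ballot encrypted to the joint key.
privs = [11, 222, 333]
h = joint_public_key([pow(G, x, P) for x in privs])
ballot, k = pow(G, 5, P), 99
c = (pow(G, k, P), ballot * pow(h, k, P) % P)

assert combine(c, [partial_decrypt(x, c) for x in privs]) == ballot
assert combine(c, [partial_decrypt(x, c) for x in privs[:2]]) != ballot  # two of three isn't enough
```

In a real deployment each official would also attach a proof that their partial decryption is correct, and you would probably want a threshold scheme so that one lost key doesn't sink the election.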

Putting it all Together #

We now have the makings of a complete system, shown in the figure below:

The election process looks like this:

To cast their ballot, each voter encrypts it with the public key(s) of the election officials and then signs it with their own private key. They post it to the bulletin board.
Once the ballots are all cast, the mixer strips the digital signatures (thus anonymizing them), shuffles the ballots, and posts them along with the proof of correct shuffling to the bulletin board. This can be the same bulletin board or a separate one.
The election officials take the shuffled ballots, decrypt them, and post the decrypted ballots along with their proofs of correct decryption.

In order to verify the election, you take the following steps:

Check the signatures on the ballots. This ensures that the right set of voters cast their votes and that they attest to the contents.
Check the proof of shuffling. This ensures that the shuffled ballots correspond to the ballots you verified the signatures on (though you can't tell which ones are which).
Check the proof of decryption. This ensures that the plaintext ballots correctly match the encrypted ballots.

Any observer can take these steps without any help from the voting officials.

Different Tabulation Methods #

In principle you can use any tabulation method you want, but things get a little complicated if you want to do anything fancy, because you want to prevent voters from being able to prove how they voted (a property called "receipt freeness"). The reason is that they might then be able to sell their vote (or be coerced into voting a certain way). If the ballot is complicated, the voter can encode their identity by voting a certain way in "down-ticket" (less important) parts of the ballot (an attack called "pattern voting"). It's easy to address this for less important contests (e.g., you are paid to vote in the Presidential election but encode your identity in your votes on local judges), by just having each contest voted separately, but for voting systems where you have multiple votes in each contest that need to be considered together, such as single transferable vote (STV), the situation becomes more complicated^[4]. There are E2E designs that work for STV, but they are more complicated.

If all the steps complete successfully, you have then verified that the output decrypted ballots have a 1:1 relationship with the signed ballots that you verified and hence with the ballots you expected to be cast, and therefore that the election was conducted correctly. You can then tabulate the ballots in the usual way and verify that the totals match what you expected, thus verifying the entire election.

Against Internet voting #

E2E voting is an amazing technical achievement, but despite that, the broad consensus of people working in voting is that Internet voting is a bad idea, even using E2E systems. Why the incongruity? The reason is that voting isn't just a technology, but rather is embedded in a whole election system, and it's in that context that E2E voting falls short.

Voting Device Security #

The first problem is that unlike (say) hand-marked paper ballots, cryptographic voting systems require users to vote on some sort of computer (you weren't planning to do elliptic curve math in your head, right?). There are two main ways for this to work:

You can use the same types of electronic polling place devices that people use for voting now.
You can vote on your own device (e.g., your phone).

But this means that you're trusting that computer to actually correctly cast your votes, and computers are incredibly hard to secure. We've seen this repeatedly in the elections context, where third party audits have shown that even temporary access to polling place voting devices is sufficient to subvert them (see, for instance, the reports from the California Top-to-Bottom Review back in 2007). The situation isn't much better with personal devices, which need regular updating to address a constant stream of discovered vulnerabilities (just as an example, here's the list of security issues in the latest iOS release; any major system has a similar list).

There has been some good work on allowing users to verify that their votes were correctly recorded (see for instance Section 4 of Benaloh06). This is a more complicated problem than it seems because it's important to avoid providing the voter with a receipt that could be used to prove how they voted (as opposed to that they voted). Otherwise, this receipt can potentially be used to enable vote buying. Of course, these approaches ultimately require the voter to use some other computer to verify whatever proof the voting device spits out.

The Uses of Statistical Evidence #

If you have a voting system that introduces biased errors, it might in principle be detectable. Suppose that the system changes 1% of votes from Smith to Jones but leaves the Jones voters alone. If some fraction of voters check their votes, then we'll see more corrections of Jones votes than Smith votes, though we'll still see some Smith corrections, because some voters will accidentally vote Smith when they meant to vote Jones. You could imagine running some kind of hypothesis test to determine whether you saw an unexpectedly high rate of Smith → Jones errors, but even if that came up significant, it's not clear what you'd do with this information, because voter errors aren't unbiased either. To take a famous example, there's fairly strong evidence that the design of the 2000 Palm Beach, Florida presidential ballot led to systematic erroneous votes for Buchanan when the voters meant to vote for Gore. So, even if you had evidence that there was an unexpected rate of errors, it's not clear what you would do about it (in the case of Florida, nothing).

Even with these systems, you are left with roughly the same situation as with Ballot Marking Devices (BMDs): users can in principle verify their votes but often do not. Specifically, suppose a machine is programmed to change 1/100 votes from Smith to a vote for Jones by acting as if the user had pressed the wrong button (ever make a typo on your phone?). If a user does not verify their vote, the attack succeeds, and if the user does check, the machine allows them to correct it. Because users can and do accidentally vote for the wrong person, this type of attack is very hard to distinguish from voter error. Studies of BMDs by Bernhard et al. found that if left to themselves around 6.5% of voters (in a simulated but realistic setting) will detect ballots being changed. There is some good news here, which is that with appropriate warnings by the "poll workers" the researchers were able to raise the detection rate to 85.7%, though it's not clear how feasible it is to get poll workers to give those warnings. Given that checking a paper ballot is much easier than checking a cryptographic ballot, we should expect a fairly low rate of checking.
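To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes (my simplification) that the detection rates above can be read as the probability that any given altered ballot gets caught and corrected, independently of everything else; the electorate size is made up.

```python
def flipped_votes(voters, flip_rate, check_rate):
    """Return (caught, surviving) counts for an attack that alters flip_rate of ballots."""
    flipped = voters * flip_rate
    caught = flipped * check_rate
    return caught, flipped - caught

# Hypothetical 100,000-voter contest with the 1/100 attack described above.
for label, rate in [("unprompted", 0.065), ("prompted", 0.857)]:
    caught, surviving = flipped_votes(100_000, 0.01, rate)
    print(f"{label}: {caught:.0f} caught, {surviving:.0f} surviving")
```

At the unprompted rate about 935 of the 1,000 altered ballots survive, and each one moves the margin by two votes (one lost by Smith, one gained by Jones), so the attacker nets a swing of roughly 1,870 votes while producing only a few dozen "corrections" that look a lot like ordinary voter error.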

I do want to note that we are starting to see some interest in adding E2E to paper-based election systems, as in STAR-Vote or with Microsoft's ElectionGuard. This seems like a potentially good idea in that it augments the security of the paper-based system. However, in systems with no paper trail, such as voting from users' phones, we're left just depending on the security of the device itself. The threat to be most concerned about here is an attack that compromises a large number of voters' devices, and through them the integrity of the election. Even if you subsequently managed to gather evidence that this had happened on a large scale, figuring out what to do after would be a political nightmare.

Operational Challenges #

Even if we assume that the voting devices are uncompromised, the actual logistics of building an operational E2E system are extremely challenging. Elections themselves are complicated to run—if you want to get a real sense of this, I recommend serving as a poll worker—and there are a lot of things that can go wrong; adding a bunch of complicated critical-path technology creates a lot of new opportunities for failure.

Server Infrastructure #

For example, if you want to have an Internet voting system you need some servers which accept the votes. What happens if those servers go down on election night or—worse yet—are attacked? These protocols are designed to be resistant to misbehavior by the voting servers in the sense that they can't tamper with the results, but this doesn't address attacks designed to prevent users from voting. Typical paper-based elections have mechanisms for addressing this kind of failure: if the voting machines fail, you can fall back to paper; if the electronic poll books fail, you may have paper records; if those paper records are unavailable, people can file provisional ballots and you can sort it out later. However, these mechanisms all depend on people already being in the polling place; if they're at home and things fail, the election can completely fail.

It's also possible to selectively mount attacks, for instance by having a compromised server reject only certain people's votes or by mounting a denial-of-service attack on certain precincts; this is a particularly powerful form of attack in the United States, where voting is managed locally, and so you could attack the infrastructure of a county that leans to one political party but ignore the infrastructure of a county that leans the other way.

Client Infrastructure #

As discussed above, in order to successfully vote on the Internet, the voting software needs to run on a device controlled by the voter. Even if we ignore attack, this is a prime opportunity for things to go wrong: here we have a piece of software which needs to be developed at low cost, run on more or less every device that anyone might have, and operate essentially perfectly, yet only gets used once or twice a year. This is a tall order for any software shop.

Real-World Voter Authentication #

In the real world, voter authentication is actually quite lax. In many jurisdictions you can vote just by giving your name. While some jurisdictions require showing photographic ID, detecting fake IDs is not really that straightforward, especially for people who (again) have to do it on one day a year. In vote-by-mail systems, authentication is performed by mailing you a ballot (thus trusting the USPS) and then (hopefully) checking your signature on the ballot. Despite all this, the rate of voter fraud is very low. Probably a lot of the reason here is that it's hard to conduct this kind of in-person fraud at scale. But of course, this is not the case for Internet-based attacks.

To make matters worse, we have the problem of voter authentication. For obvious reasons, we need each voter to prove that they are authorized to vote, which means giving them some kind of credential. There are a lot of options here (give them a digital certificate, mail them a code, etc.) but whatever you do, they are all subject to voters losing their credentials. In ordinary Website authentication, we usually allow users to reset their passwords via e-mail or SMS, but for obvious reasons that's not OK here (allowing Gmail and T-Mobile to have the ability to impersonate a huge fraction of voters really undercuts the value of E2E voting). Here too, we're stuck with a situation where the failure happens at the worst possible time, and recovery entails actually going somewhere.

Implementation Complexity #

Next, we have to contend with the problem of implementation complexity. Even the best E2E voting systems are fairly complex, and the systems they need to be embedded in are even more complex. This means that even if we have a system design which is secure, we still have to worry about implementation errors, both of the protocol itself and of the rest of the infrastructure. So far, there hasn't been that much Internet voting, but serious errors have been found in several early systems: see, for instance, the analysis of the Scytl system by Haines, Lewis, Pereira, and Teague^[5] and of the Voatz mobile voting system by Trail of Bits. The situation is not encouraging.

Voter Comprehension #

Finally, we have the problem of voter understanding. It's not enough for the election just to produce the right result, it must also do so in a verifiable fashion. As voting researcher Dan Wallach is fond of saying, the purpose of elections is to convince the loser that they actually lost. With paper-based ballots, the chain of reasoning for how the election was decided is relatively straightforward: ballots go into the box and you count them. Despite this, we've still seen extensive attempts to question the resulting count, as in the 2020 US Presidential Election.

By contrast, the security of E2E voting depends on some fairly complicated cryptography that practically nobody understands. I've just spent 4000-odd words on this topic and it's only that short because I didn't explain how any of the actual cryptographic pieces work and just focused on the system logic; if you want to convince yourself that ballots were cast correctly you have to not just have a surface understanding of the cryptography but also have confidence that the mathematical problems it's based on are really hard. We don't even know for sure that that's true in the classical setting and we know that they're not if someone ever builds a big enough quantum computer. Try explaining that to your average voter.

Final Thoughts #

My point here is not to criticize E2E voting, which is an amazingly cool technology. The problem is that it's necessary but not sufficient for Internet voting, which requires correct operation of systems which are not covered by the cryptography, all under very challenging conditions. However, E2E does have two important use cases: first, it's potentially useful as an additional measure of security for in-person paper-based systems, such as Ballot Marking Devices. Second, there are lots of low to medium-stakes situations where people are already voting over the Internet using systems which are tragically insecure. These elections would be much safer and more private if they used E2E systems, even if those systems were still imperfect. So when do we get our E2E secure Twitter polls?

I owe this observation to Hovav Shacham. ↩︎
Note that you can have the contents of the ballots encrypted to prevent selective challenges against people who voted a certain way. ↩︎
Benaloh observes that you actually don't need to randomly shuffle them: you can just sort the output values, which will destroy any order. ↩︎
By contrast, Approval Voting can be implemented by having each candidate be treated as a separate ballot. ↩︎
This analysis also includes an interesting example of an attack resulting from misuse of the Fiat-Shamir heuristic. ↩︎

One does not simply destroy a nuclear weapon

2022-12-05T00:00:00Z

In a recent article the NYT reports that in the US when nuclear weapons are retired they aren't destroyed but just stored:

Typically, nuclear arms retired from the U.S. arsenal are not melted down, pulverized, crushed, buried or otherwise destroyed. Instead, they are painstakingly disassembled, and their parts, including their deadly plutonium cores, are kept in a maze of bunkers and warehouses across the United States. Any individual facility within this gargantuan complex can act as a kind of used-parts superstore from which new weapons can — and do — emerge.

...

“It’s important to keep these parts around,” said Franklin C. Miller, a nuclear expert who held federal posts for three decades before leaving government service in 2005. “If we had the manufacturing complex we once did, we wouldn’t have to rely on the old parts.” He added that other nuclear powers can and do make new atomic parts.

I'm not really surprised that the weapons aren't being destroyed because it's incredibly hard to do so in a meaningful fashion; it's not like guns where you just melt them down or something. However, seeing why requires an understanding of the physics of the situation, so let's start there.

Thanks to Wikipedia, which was indispensable in gathering the background detail for all this. I also can't recommend enough Richard Rhodes's The Making of the Atomic Bomb, which provides a very clear account of the physics of nuclear weapons, as well as the history of the Manhattan Project.

Backgrounder: Atoms, Elements, and Isotopes #

This section is elementary but important material on the structure of matter. If you know what an "element" and an "isotope" are, you can skip this.

Essentially all ordinary matter—the stuff you are made of and encounter on a daily basis—is composed of atoms. An atom is composed of three basic subatomic (i.e., smaller than atoms) particles:

Positively charged protons
Negatively charged electrons
Non-charged neutrons

At a super-simplified level, an atom is like a miniature solar system, with a nucleus at the center, consisting of protons and neutrons, and the electrons orbiting around it.^[1] Atoms have the same number of electrons and protons, which renders them neutrally charged. An atom can also gain or lose an electron to become an ion, which is something we'll need to know later.

The chemical properties of an atom are dictated by the number of electrons, and because the number of electrons is the same as the number of protons in the nucleus, the number of protons also dictates those properties. Every atom with a given number of protons in the nucleus (the atomic number) thus has the same chemical properties (the technical term here is element). Each element has a name and a one or two letter symbol. For instance, hydrogen's symbol is "H", oxygen's is "O", etc. There are 100 or so elements, but of course many more chemicals because you can combine elements in a lot of different ways.

Finally, this brings us to neutrons. It's possible to have different numbers of neutrons in the nucleus of an atom, even with the same number of protons. For instance, you can have three different flavors of hydrogen atoms:

Name	Number of Neutrons
Hydrogen	0
Deuterium	1
Tritium	2

Because the neutrons have no impact on the charge of the nucleus, they also have no influence on the number of electrons, which means that all three types of hydrogen have basically the same chemical properties; they just have different masses. The term for different flavors of the same element is isotope, as in "deuterium and tritium are two different isotopes of hydrogen". It's standard to refer to isotopes by the total combined number of neutrons and protons in the nucleus, so, for instance, deuterium is H-2 (H for hydrogen).^[2] Many elements exist in multiple isotopes in nature, though in many cases one isotope is common and the others are rare.

Brief Overview of the Physics of Nuclear Weapons #

I said above that chemical reactions don't create or destroy atoms, but it's possible to have nuclear reactions which do exactly that. There are several such processes.

Atomic Decay #

Many atomic isotopes are unstable, which means that they will spontaneously decay into other isotopes by emitting some other particle. For instance, the isotope uranium-238 decays by emitting an alpha particle (another name for a helium nucleus, containing two protons and two neutrons), reducing the atomic number by two (the two protons) and the atomic weight by four (the two protons plus the two neutrons) and giving you thorium-234. Thorium-234 is itself unstable and decays by emitting a beta particle (another name for an electron, see radiation) to give you protactinium-234m.^[3]
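The bookkeeping in these two decays can be sketched in a few lines of Python. This is only an illustration: the `Nuclide` type, the helper names, and the tiny symbol table are all hypothetical, and the model tracks only proton and nucleon counts, so it can't tell the metastable state (the "m" in protactinium-234m) from the ground state.

```python
from typing import NamedTuple


class Nuclide(NamedTuple):
    protons: int   # atomic number
    nucleons: int  # protons + neutrons (the number after the dash)


# Only the three elements needed for this example.
SYMBOLS = {90: "Th", 91: "Pa", 92: "U"}


def alpha_decay(n: Nuclide) -> Nuclide:
    """Emit a helium nucleus: lose two protons and two neutrons."""
    return Nuclide(n.protons - 2, n.nucleons - 4)


def beta_decay(n: Nuclide) -> Nuclide:
    """Emit an electron: a neutron becomes a proton, nucleon count unchanged."""
    return Nuclide(n.protons + 1, n.nucleons)


def label(n: Nuclide) -> str:
    return f"{SYMBOLS[n.protons]}-{n.nucleons}"


u238 = Nuclide(protons=92, nucleons=238)
th234 = alpha_decay(u238)
pa234 = beta_decay(th234)
print(" -> ".join(label(n) for n in (u238, th234, pa234)))
# prints: U-238 -> Th-234 -> Pa-234
```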

Different isotopes decay at different rates. The standard way to define this is in terms of what's called a "half-life", which is to say the amount of time it takes half of the atoms in a given sample of an isotope to decay (alternatively, the time after which there is a 50% chance that a single atom has decayed). Shorter half-lives mean that an isotope is more radioactive (because there are more decays per second); longer half-lives mean that it is more stable. It's possible to have isotopes with very long half-lives, on the order of thousands of years. Note that atomic decay is effectively a memory-less process, which is to say that if you start from X units of an unstable isotope, it takes the same amount of time to get from X to 1/2 X as it does to get from 1/2 X to 1/4 X.
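As a rough sketch of what the half-life definition implies, the fraction of a sample left after some elapsed time is one half raised to the number of half-lives that have passed. The function name and the half-life of 10 (in arbitrary time units) are made up for illustration and don't correspond to any particular isotope.

```python
def fraction_remaining(elapsed: float, half_life: float) -> float:
    """Fraction of the original atoms that have not yet decayed."""
    return 0.5 ** (elapsed / half_life)


HALF_LIFE = 10.0  # arbitrary time units, purely illustrative

for elapsed in (0.0, 10.0, 20.0, 30.0):
    left = fraction_remaining(elapsed, HALF_LIFE)
    print(f"t={elapsed:>4}: {left:.3f} of the sample remains")

# Memory-less behaviour: waiting one more half-life always halves
# whatever is left, no matter how long the sample has already existed.
start = fraction_remaining(17.0, HALF_LIFE)
later = fraction_remaining(17.0 + HALF_LIFE, HALF_LIFE)
print(later / start)  # very close to 0.5
```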

In addition to releasing particles, this process releases energy, in various forms, including:

Kinetic energy from the new atom and the emitted particle moving faster than they were before. These particles then interact with the surrounding material, producing heat.
Radiation in the form of x-rays, neutrons, etc.

This means that radioactive isotopes tend to be warm or even hot. In fact, it's possible to exploit this effect to power devices for long periods of time in what's called a radioisotope thermoelectric generator (RTG). RTGs are a common way to power spacecraft, for the obvious reason that you can't easily get out there and change the batteries.

One thing to notice here is that this is a one-way process, with unstable elements decaying to produce other lighter elements and energy. Eventually, the process terminates when some relatively stable isotope is produced, at which point you have a stable system and a bunch of heat: see also the second law of thermodynamics. It's also possible to go from lighter to heavier products, as we'll see below in the discussion of fusion.

Radiation #

You'll often hear that various isotopes are radioactive and that they emit radiation. In this context, radiation is more or less the generic term for "stuff emitted by various kinds of atomic processes that you probably don't want to come into contact with".

Unfortunately, the names of various types of radiation are incredibly confusing, dating from a time period when the physics of nuclear energy was poorly understood. When some new form of radiation was discovered, physicists would tend to give it a name that just reflected that it was something new, hence "X-rays" (with the "X" indicating unknown) and alpha, beta, and gamma radiation (names which, according to Wikipedia, were based on the degree to which they penetrated matter). Now, of course, we understand the actual physics a lot better, but the old names persist. As a practical matter, you'll hear about the following:

Name	What it actually is
Alpha	Helium nuclei (two protons and two neutrons)
Beta	Electrons
Gamma	High energy photons (i.e., light, but outside the visible range)
X-rays	High energy photons, but typically lower energy than Gamma
Neutrons	Neutrons

These are all bad for you, but different levels of bad. None of them will turn you into The Hulk.

Fission #

Atoms can also undergo fission, in which the nucleus splits into two smaller nuclei plus some other particles such as neutrons, x-rays, etc. Most relevant to us are the following fission reactions, which we'll discuss shortly:

Uranium-235 can break up into (typically) krypton-92 and barium-141
Plutonium-239 can break up into (typically) zirconium-103 and xenon-134

I say "typically" because fission is kind of a non-deterministic process: the new nuclei need to have a mass that adds up to the original mass (minus whatever other particles were emitted) but there's some variation in which elements are produced. The following figure shows the distribution of fission products for some common fissile isotopes:

[From nuclear-power.com]
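The requirement that the pieces add up can be checked with simple counting for the two typical splits listed above. This sketch assumes that the fission is triggered by one absorbed neutron (so the splitting nucleus is one unit heavier than the starting isotope) and that whatever isn't in the two fragments leaves as free neutrons; the function name and data layout are hypothetical.

```python
# Each entry: (atomic number, mass number) of the parent, then its two fragments.
REACTIONS = {
    "U-235": ((92, 235), [(36, 92), (56, 141)]),     # krypton-92 + barium-141
    "Pu-239": ((94, 239), [(40, 103), (54, 134)]),   # zirconium-103 + xenon-134
}


def free_neutrons(parent, fragments):
    """Neutrons left over after one absorbed neutron induces the split."""
    parent_protons, parent_mass = parent
    if sum(z for z, _ in fragments) != parent_protons:
        raise ValueError("fragments do not conserve the number of protons")
    compound_mass = parent_mass + 1  # parent plus the absorbed neutron
    return compound_mass - sum(a for _, a in fragments)


for name, (parent, fragments) in REACTIONS.items():
    print(f"{name}: {free_neutrons(parent, fragments)} free neutrons")
# prints:
# U-235: 3 free neutrons
# Pu-239: 3 free neutrons
```

Other splits from the distribution release different numbers of neutrons; these two are just the examples named in the text.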

It's possible for atoms to spontaneously undergo fission (more on this later), but more commonly it's the result of external forces. Specifically, if a neutron impacts the nucleus of an atom it can attach itself to the nucleus, creating a new isotope that is one unit heavier. If this isotope is unstable (as is reasonably likely, because you're perturbing an isotope which is currently stable) it can undergo fission.

Chain Reactions #

Here's what we know so far:

When an atom undergoes fission, it can emit neutrons
When a neutron hits an atom, it can cause it to undergo fission

When you put these two facts together, you can have what's called a chain reaction in which one atom undergoes fission and produces enough neutrons to cause two atoms to undergo fission; in turn those atoms emit more neutrons, and we have an exponential growth process which results in the rapid release of very large amounts of energy, in other words, an atomic bomb.

The figure below shows the process, also helpfully showing that Uranium doesn't always decay into the same pieces.

[Image by MikeRun from Wikimedia]

Note that it's not necessary for every neutron to impact a nucleus in order to get a chain reaction as long as on average the fission of one atom results in the fission of more than one other atom. Nuclear reactors work by modulating the number of neutrons that effectively impact other atoms, thus keeping a stable reaction rate rather than one that is explosively exponential. Describing how that works is outside the scope of this post, however.
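A toy model makes the role of the average concrete. Suppose each fission causes, on average, `k` further fissions in the next generation; `k` is a name introduced here for illustration, and the model ignores timing, geometry, and fuel depletion entirely.

```python
def fissions_by_generation(k: float, generations: int, initial: float = 1.0) -> list[float]:
    """Expected number of fissions in each generation, starting from `initial`."""
    counts = [initial]
    for _ in range(generations):
        counts.append(counts[-1] * k)
    return counts


for k in (0.9, 1.0, 2.0):
    final = fissions_by_generation(k, generations=20)[-1]
    print(f"k={k}: {final:,.2f} expected fissions in generation 20")
```

With `k` below one the reaction dies away, at exactly one it holds steady (the regime a reactor aims for), and above one it grows exponentially: at `k = 2`, ten generations already give 1,024 fissions per generation.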

Fusion #

It's also possible for two light atoms to come together to form one heavier atom, in a process called fusion. The most relevant case for us is that two hydrogen atoms (atomic number 1) can fuse to form one helium atom (atomic number 2). This is what happens in the sun, but can also be exploited to build a much bigger bomb than a pure fission bomb. More on this later.

Making an Atomic Bomb #

Once you have the insight from the chain reaction, it's a pretty straight shot to the idea of an atomic bomb, and physicist Leo Szilard famously invented it while waiting at a traffic light:

"In London, where Southampton Row passes Russell Square, across from the British Museum in Bloomsbury, Leo Szilard waited irritably one gray Depression morning for the stoplight to change. A trace of rain had fallen during the night; Tuesday, September 12, 1933, dawned cool, humid and dull. Drizzling rain would begin again in early afternoon. When Szilard told the story later he never mentioned his destination that morning. He may have had none; he often walked to think. In any case another destination intervened. The stoplight changed to green. Szilard stepped off the curb. As he crossed the street time cracked open before him and he saw a way to the future, death into the world and all our woes, the shape of things to come"...

[Quote from Richard Rhodes's "Making of the Atomic Bomb"]

It was almost 12 years from that moment until the first atomic bomb was tested at Alamogordo, New Mexico. This test was the result of three years of work by over 100,000 people and an investment of over $23 billion in current dollars, in the form of the US Manhattan Project. This raises the question: if it's so straightforward, what took so long?

In order to get exponential growth you need to have on average more than one neutron emitted from the first fission event to create fission in some other atom. Otherwise, you get an exponential decay process where the chain reaction goes toward zero and nothing much happens. If you just have a small number of atoms, then the most likely thing is that a neutron will just be emitted outside your fissile material and not contribute to the chain reaction. You need a certain minimum amount of material in order to get the probability of subsequent fission high enough that you get exponential growth. This amount is called the critical mass and depends on the specific properties of the element you are using, and in particular (1) how many neutrons it emits when it undergoes fission and (2) how likely it is that when a given neutron hits an atom it will result in a new fission event. The critical mass also depends on the shape (geometry) of the fissile material, with a sphere being the ideal shape because it has the maximum volume to surface ratio, which minimizes the chance that the neutrons will just be uselessly expelled from the surface.
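The geometric point can be illustrated on its own, without any nuclear physics: for the same volume of material, a sphere exposes less surface than other shapes. The sketch below compares a sphere with a cube of equal volume; it is not a critical-mass calculation, and the function names are made up.

```python
import math


def sphere_surface(volume: float) -> float:
    """Surface area of a sphere enclosing the given volume."""
    radius = (3 * volume / (4 * math.pi)) ** (1 / 3)
    return 4 * math.pi * radius ** 2


def cube_surface(volume: float) -> float:
    """Surface area of a cube enclosing the given volume."""
    side = volume ** (1 / 3)
    return 6 * side ** 2


volume = 1.0  # arbitrary units
print(f"sphere: {sphere_surface(volume):.3f}")  # about 4.836
print(f"cube:   {cube_surface(volume):.3f}")    # 6.000
```

Less surface per unit volume means fewer places for a neutron to escape before it hits another nucleus.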

OK, so we just need to collect enough material and presto, we have a bomb. Unfortunately, it's not so simple:

Getting enough of the right material is hard.
As soon as you start to assemble the material into a critical mass, it starts reacting, and so if you do it wrong, the energy emission will cause it to explosively disassemble, which isn't fun if you're nearby, but produces a much smaller bang than you were looking for (a "fizzle").

Let's look at each of these in turn.

A Materials Problem #

First, we have the problem of the right material. It quickly became apparent that there was only one suitable natural element: uranium.

Uranium #

Recall that I said above that the uranium-235 nucleus (U-235) can easily undergo fission. Fortunately for us, but unfortunately for the purposes of making an atomic bomb, over 99% of the uranium in the world is not uranium-235 but rather uranium-238 (U-238), which does not readily undergo fission when bombarded by neutrons (instead, it tends to form U-239, which eventually decays but doesn't undergo fission; we'll want this information later). This presents a problem because it means that most of the neutrons emitted by U-235 fission don't lead to more fission events and you don't get exponential growth, hence no bomb. Or, more precisely, the critical mass of natural uranium was improbably large, between 10 and 44 tons (calculation by Rudolf Peierls, as cited by Rhodes). Not something you could drop from a plane.^[4]

However, what people eventually realized was that if you had just U-235, or even mostly U-235, then it was possible to sustain a fission explosion with much less mass (the first uranium bomb, Little Boy, used 64kg of uranium). So, now the problem just becomes enriching the uranium so that you have a higher fraction of U-235 than in natural uranium (Little Boy used 80% U-235).^[5] This is where things start to get hard.

Traditionally, there are two main ways to separate out a mixture of two substances:

Via chemical processes. For instance, this lab experiment describes how to separate out the components of a common headache medicine into acetylsalicylic acid (aspirin), salicylamide, and caffeine, by taking advantage of the fact that each component reacts differently with different reagents.
Via physical processes. For instance, given a mixture of alcohol and water (e.g., wine) you can increase the alcohol concentration in the mixture by heating it and collecting the vapor to produce brandy; this takes advantage of the fact that alcohol has a lower boiling point than water and therefore the vapor has more alcohol than the original liquid.

Unfortunately, because U-235 and U-238 are both isotopes of uranium, they behave chemically identically, so chemical processes are more or less impractical.^[6] This leaves us with physical processes, but because the masses of the two isotopes differ by only about 1.3%, the physical behavioral differences are very small as well, which makes any physical separation process very inefficient. Eventually, the physicists on the Manhattan Project settled on two main approaches.

Gaseous Diffusion #

In this approach, you create a gaseous form of uranium hexafluoride and allow it to slowly diffuse across a nickel barrier with very small perforations. Because the U-235 molecules are slightly lighter than the U-238 molecules, they move across the membrane slightly faster, with the result that if you stop partway the resulting mixture on the far side has slightly more U-235 than the starting mixture. Because this process is so inefficient, you need multiple stages in which the output of one stage is fed into another. To make matters worse, the uranium hexafluoride is fiendishly reactive and toxic, so very hard to work with. The result is difficult industrial chemistry on a giant scale.
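To get a feel for "so inefficient", here is a back-of-the-envelope sketch using Graham's law, under which the best-case enrichment per stage is the square root of the ratio of the two molecular masses. Everything here is an idealization and an assumption on my part: masses are approximated by mass numbers, natural uranium is taken as about 0.72% U-235, the 80% target is the Little Boy figure, and every stage is assumed to achieve the full theoretical factor, which real barriers don't.

```python
import math

FLUORINE_MASS = 19  # mass number of fluorine-19


def uf6_mass(uranium_mass_number: int) -> int:
    """Approximate molecular mass of uranium hexafluoride (UF6)."""
    return uranium_mass_number + 6 * FLUORINE_MASS


def ideal_separation_factor(light: float, heavy: float) -> float:
    """Graham's law: diffusion rate scales with 1 / sqrt(molecular mass)."""
    return math.sqrt(heavy / light)


def ideal_stage_count(feed: float, product: float, alpha: float) -> float:
    """Stages needed if each one multiplies the U-235:U-238 ratio by alpha."""
    def ratio(fraction: float) -> float:
        return fraction / (1 - fraction)
    return math.log(ratio(product) / ratio(feed)) / math.log(alpha)


alpha = ideal_separation_factor(uf6_mass(235), uf6_mass(238))
stages = ideal_stage_count(feed=0.0072, product=0.80, alpha=alpha)
print(f"per-stage factor: {alpha:.4f}")   # about 1.0043
print(f"ideal stages:     {stages:.0f}")  # on the order of 1,500
```

Even in this best case each stage improves the ratio by less than half a percent, so you need well over a thousand stages in series; a real plant needs more.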

Multiple Lines of Attack #

One thing that Rhodes does a great job of bringing out is the extent to which the Manhattan Project involved pursuing multiple lines of attack on the problem of building an atomic bomb, with the hope that at least some of them would work. Some failed, of course, but at the end of the day, they had two entirely different routes that succeeded, with the result that the two bombs that were eventually dropped used totally different technologies: uranium "gun-type" devices and plutonium implosion devices. Similarly, they pursued three independent technologies for uranium enrichment, of which two turned out to be really useful.

Electromagnetic Separation (mass spectrometry) #

The intuition here is that if you ionize the uranium atoms so that they have an electrical charge (I told you we'd come back to ions) you can then accelerate them with an electric field. If you then apply a transverse (perpendicular) magnetic field, then the ions will follow a curved trajectory, as shown below. Because the U-235 ions are slightly lighter they will follow a slightly tighter trajectory; you can then effectively set up a bucket and collect them. Of course, it will be a very small bucket because you are literally separating one atom at a time.

[The original diagram for electromagnetic separation.]
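The size of the effect can be estimated from two textbook relations: an ion of charge q accelerated through a voltage V gains kinetic energy qV, and in a magnetic field B it then moves on a circle of radius r = mv / (qB). The sketch below assumes singly charged ions and approximates each mass by its mass number; the voltage and field strength are arbitrary illustrative values, not the parameters of the actual Manhattan Project equipment.

```python
import math

ATOMIC_MASS_UNIT = 1.66053906660e-27  # kilograms
ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs


def orbit_radius(mass_number: int, voltage: float, field: float,
                 charge_state: int = 1) -> float:
    """Radius in metres of an ion's circular path in the magnetic field."""
    mass = mass_number * ATOMIC_MASS_UNIT
    charge = charge_state * ELEMENTARY_CHARGE
    speed = math.sqrt(2 * charge * voltage / mass)  # from q*V = m*v^2 / 2
    return mass * speed / (charge * field)          # from q*v*B = m*v^2 / r


VOLTAGE = 35_000.0  # volts, illustrative only
FIELD = 0.5         # tesla, illustrative only

r235 = orbit_radius(235, VOLTAGE, FIELD)
r238 = orbit_radius(238, VOLTAGE, FIELD)
print(f"U-235 radius: {r235:.4f} m")
print(f"U-238 radius: {r238:.4f} m")
print(f"ratio: {r235 / r238:.5f}")  # equals sqrt(235 / 238), whatever V and B are
```

The ratio of the two radii depends only on the masses, and it differs from one by well under 1%, which is why the collecting "bucket" has to be positioned so precisely.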

The scale of both of these processes was truly enormous. Rhodes again:

The United States was critically short of copper, the best common metal for winding the coils of electromagnets. For recoverable use, the Treasury offered to make silver bullion available in copper's stead. The Manhattan District put the offer to the test, Nichols negotiating the loan with Treasury Undersecretary Daniel Bell. "At one point in the negotiations," writes Groves, "Nichols ... said that they would need between five and ten thousand tons of silver. This led to the icy reply: 'Colonel, in the Treasury we do not speak of tons of silver; our unit is the Troy ounce.'"

The Manhattan Project eventually ended up using both of these processes, gaseous diffusion first, and then electromagnetic separation.^[7]

Even the more modern process involving high-speed centrifuges involves a fairly significant investment. However, there is an easier way to get the fissile material you need to make an atomic bomb.

Plutonium #

Uranium is the only natural material suitable for making a bomb, but element 94 (plutonium) works fine as well. Plutonium has two very convenient properties:

It's relatively easy to make with nuclear reactors because it's the result of U-238 reacting with a neutron (see above). So all you need is a nuclear reactor and some U-238 and you've got plutonium. In practice, reactors never run on pure U-235, so they always produce plutonium, even if it's treated as a waste product. Of course, you can design your reactor to optimize the production of plutonium.
Because plutonium isn't just an isotope of uranium it's relatively easy to chemically separate from the U-238 it was created in. I say relatively because both plutonium and uranium are highly toxic and the whole mess is intensely radioactive, but fundamentally it's just chemistry; no need for gaseous diffusion or mass spectrometers. Plutonium itself comes in several isotopes, but the isotope you get the most of, Pu-239, is the one you want for making bombs.

For these two reasons, modern atomic bombs generally use plutonium rather than uranium.^[8]

Assembly #

Once you have your fissile material you need to assemble it into a critical mass. This is a challenging process, because, as noted above, once you start to bring the material together it starts reacting even before the critical mass is assembled. If you do it wrong, the energy emission will cause it to explosively disassemble, but with a much smaller bang than you were looking for (a "fizzle"). In order to make a bomb, you need to bring the material together very fast so that you get a lot of fission before the critical mass disassembles itself (i.e., explodes). Even so, you typically only get a fairly small proportion of the material reacting, but the reaction is so energetic that you still get a big explosion.

Gun-Type Devices #

Uranium bombs are comparatively simple to build, using what's called a "gun-type" assembly mechanism, as shown below:

[Diagram by Dake, Papa Lima Whiskey, and Mfield from Wikipedia ]

This diagram shows a full weapon, but just focus on the gray area in the center that represents the "physics package", i.e., the atomic bomb itself, not the stuff needed to deliver it. Basically, a gun-type bomb is what it sounds like: you have a hollow "bullet" made of uranium and you shoot it down a long barrel (originally literally made from a cannon) at a cylindrical "target" also made of uranium. When the bullet reaches the target and surrounds it, the result is a critical mass, resulting in an explosion. This all happens very quickly: you don't even need to have something to stop the bullet because the brief period when the target is passing through the bullet is enough. And of course, once the reaction starts, the whole thing will explosively dismantle itself anyway.

Implosion Devices #

You cannot, however, build a plutonium-based bomb using a gun-type mechanism. Reactor-manufactured plutonium is mostly Pu-239 but contains a small fraction of Pu-240, which has a relatively high rate of spontaneous fission. This rate is sufficiently high that as the bullet and cylinder start to assemble a critical mass, the reaction will start and the mass will prematurely disassemble, with the result that you get a "fizzle" rather than a successful explosion.

Instead, plutonium bombs are built using what's called an implosion system, as shown in the diagram below:

[Diagram by Ausis via Wikipedia]

In an implosion device you have a spherical core (sometimes hollow and sometimes solid)^[9] called the "pit". It's surrounded by explosives which compress the pit in a spherically symmetrical pattern, thus forming a critical mass which holds together long enough to produce an explosion.

An implosion device is much less straightforward to build than a gun-type device, in large part because it's hard to get the explosives in the form of shaped charges to actually symmetrically compress the pit. As a comparison point, the world's first nuclear explosion was a test of an implosion-type bomb. The physicists at the Manhattan Project were so confident that the gun-type bomb would work that the first one ever detonated was the bomb dropped on Hiroshima, without any live testing at all.

Once you know how to do it, however, plutonium is much more convenient as a material to use for weapons because, as noted above, it's so much easier to obtain. Moreover, at this point it's fairly well understood how to build implosion devices, to the point where non-experts have famously designed plausible weapons without recourse to classified information. And of course, at this point nine countries have successfully built nuclear weapons (the US, Russia, the UK, France, China, India, Pakistan, North Korea, and Israel). In other words, the really hard part of building a nuclear weapon is getting the plutonium in the first place.

Thermonuclear Weapons #

Everything I've written so far is about fission type weapons, which are the original atomic bombs. However, modern weapons are frequently what's called "thermonuclear" devices which are based on both nuclear fission and nuclear fusion (aka "hydrogen bombs"). The details are of course complicated, but briefly, fusion takes place under conditions of very high heat and so you use a fission explosion (the "primary") to initiate the fusion reaction (the "secondary"). For reasons that are out of scope of this post, fusion bombs can be made much more powerful than fission-only bombs.

They're also substantially more complicated to design, because, like implosion devices, you have to ensure that they have time to fuse before they disassemble themselves. Wikipedia has a good primer on the design of thermonuclear devices. Richard Rhodes's Dark Sun contains a much more in-depth treatment of the history and design of thermonuclear weapons.

Disposal #

After 4500+ words, we're finally ready to address the question we started with, which is to say, how one disposes of unwanted nuclear weapons. As described in the aforementioned NYT article, the current practice in the US is mostly to disassemble them and to store the parts, making it possible to reassemble them later into similar weapons. The article leans kind of heavily on the fact that this is surprising (true!) but does eventually list three reasons why it might not be a good idea:

The parts themselves (principally the pits) are a safety hazard.
There are "security" issues, presumably that someone might steal the parts and make their own weapon.
That this doesn't really put them beyond use and so isn't a real reduction in the number of weapons because the US could readily make new weapons if it chose to.

There are two primary assets that we might need to concern ourselves with:

The plutonium pit itself
The rest of the weapon

The situation with the rest of the weapon is simpler so let's look at that first.

The Rest of the Weapon #

The parts of the weapon other than the pit give you a head start on building a new weapon in two ways. First, if you just disassemble the weapon into pieces then it's (presumably) comparatively straightforward to reassemble them back into a functional weapon. You might also be able to reassemble them into a similar weapon though based on what I know, you would want it to be reasonably similar to the original. In either case, this is almost certainly easier than manufacturing all new parts and the necessary associated supply chain.

[A lightly modified version of Midjourney's output for "ikea instructions for assembling a nuclear weapon, diagram, black and white, detailed, realistic --v 4"]

Second, the parts embody the knowledge about how to build a new weapon. As noted above, while it's helpful to have this for building a fission device, at this point this is something that can be reproduced fairly readily. However, thermonuclear bombs are significantly more complicated to design and quite easy to get wrong, so it would definitely be helpful to have a reference design to start from. The fusion component also seems to involve some isotopes of hydrogen (tritium and deuterium), so it would be modestly helpful to have that but my understanding is that it's not that hard to get your hands on these isotopes. Deuterium in the form of "heavy water" (i.e., heavy hydrogen and oxygen) is readily available from chemical supply houses. So, while the article says "the nuclear warhead is the bullet-like cylinder at the back. It holds the plutonium pit and the hydrogen fuel, which gives the bomb its vast powers of destruction", my sense is that the hydrogen fuel part is pretty easy to obtain.

But of course none of this is very useful if you don't have the pit, which is necessary to start the whole thing off. It's also fairly straightforward to destroy these components, as they're fundamentally just hardware. Not so, for the pit.

The Pit #

The pit presents two problems. First, even without the rest of the components, the plutonium pits can be reused to make new weapons, either with a similar geometry to the current weapon, or melted down and formed into the pit of a new weapon with a new geometry. We know from experience that once state-level actors get access to enough plutonium to build a bomb they generally succeed. Of course, non-state-level actors might have a much harder time building a bomb from raw plutonium.

Second, it's extremely difficult to destroy plutonium effectively (some weapons are built out of highly enriched uranium and that can just be diluted in U-238 and used for reactors). Obviously, you can melt it down, but that just leaves you with a chunk of subcritical plutonium which someone can re-form into a new weapon. The plutonium is highly toxic, so you can't just grind it up and scatter it around without causing huge environmental impacts (watch Chernobyl if you want to get a sense of what I'm talking about here). You can't burn it because then you're going to have oxidized plutonium in the air, which you don't want people inhaling, and while you can of course use chemicals to dissolve it, vitrify it, etc. you're still left with an equivalent amount of plutonium, just bonded to some other stuff, and so it's just a matter of (potentially highly unpleasant) chemistry to get it back out again. In other words, it's precisely the properties of plutonium that make it attractive to build nuclear weapons out of that make it so hard to dispose of.

It's also very difficult to store because while an individual weapon may not be a critical mass, if you have tens or hundreds of weapons you have to worry about them getting close enough to accidentally assemble a critical mass just from proximity, which would, of course, be bad.

Of course, this isn't news to policymakers. As the NYT article says:

The Clinton, Bush and Obama administrations all made plans — with costs in the billions of dollars — to get rid of excess plutonium stocks, which grew rapidly after the Cold War because of arms disassembly. But no strategy has so far succeeded.

The best available option appears to be to turn the plutonium into what's called "mixed-oxide fuel" (MOX) and then use it to fuel nuclear reactors. Unfortunately, this doesn't work super well for a number of logistical reasons, for instance that many reactors can only use MOX for some of their fuel; and of course we have an unbelievable amount of plutonium lying around, not just from existing nuclear weapons but also from the operations of normal nuclear reactors, which, as noted above, create plutonium. The FAS report I linked above is from 1993 and states that "There is almost 1000 MT of reactor Pu (R-Pu) in existence now, with the amount growing by about 100 MT per year." (disposal of plutonium waste is one of the big problems with nuclear reactors). So, the situation is really quite difficult even if we ignore disassembled weapons, which actually tend not to be that big (recall that the pit weighs on the order of a few kg).

Final Thoughts #

I don't want to spend too much time playing media critic here, but I don't feel like this article did that great a job of putting things in context. The implication of this article is that the US isn't really serious about disarmament and so it's storing all the nukes in pieces but not really destroying them in order to have ready access later, and that this creates all sorts of hazards. I'm sure that's true to some extent, but I think it's also necessary to realize that actually destroying them is a lot harder than it sounds and even if you were to do about the best we know how to do and totally destroy all of the hardware other than the pits, you'd still be left with a large amount of fantastically dangerous stuff which has to be guarded for the next 100,000 years or so. The critique that this material isn't being guarded does seem like a reasonable one, but it seems like guarding it better is the solution that we're left with.

In reality this whole orbiting thing is kind of nonsense because actually they occupy this probability space of locations, but we don't need to get into quantum mechanics here and for our purposes we can just live with a classical-type picture. ↩︎
Technically, this is called the atomic mass. Protons and neutrons have approximately the same mass, but electrons are much lighter, so the mass of the atom is basically the mass of the protons and neutrons. ↩︎
This "m" isn't an error. I didn't know about this, but apparently this is actually a higher energy state of protactinium-234, which decays more quickly. Thanks, Wikipedia! ↩︎
Famously, the Oklo Mine had a self-sustaining reaction in natural uranium, though with the help of water as a "moderator" (out of scope again, I'm afraid). ↩︎
The uranium with lower than normal U-235 is known as depleted uranium and is used in various military applications because it is very dense. ↩︎
There actually is now a chemical process called Chemex that takes advantage of some slight differences in chemical properties due to the change in atomic mass. ↩︎
There were actually three separate processes, with thermal diffusion being used to make slightly enriched uranium which was then enriched much more with gaseous diffusion. Thermal diffusion isn't very efficient and was eventually abandoned. ↩︎
Note that you still need the ability to enrich uranium to reactor grade levels so that you can run the reactor to make the plutonium. ↩︎
The original pits were hollow, but as I understand it more modern designs just use a solid pit and rely on the explosives to compress the plutonium enough to make a subcritical mass critical. ↩︎

Can we agree on the facts about QWACs?

2022-11-25T00:00:00Z

Disclaimer: Like the rest of the material on EG, these are my opinions and not those of my employer.

Over at the day job I've been spending quite a bit of time dealing with the proposed eIDAS Article 45.2, which would require browsers to accept Qualified Website Authentication Certificates (QWACs) issued by certificate authorities approved by European Union member states. A lot of the discussion here has either been in private or by press release, neither of which is very helpful in understanding the issues at play here. I'm a strong believer that we should be able to agree on facts, even if we can't agree on the best way forward, so in that spirit, this post attempts to lay out the technical situation.

Apologies in advance that some of this material is a bit basic and repetitive, but I wanted to have something self-contained.

Background: HTTPS and the WebPKI #

In order to have a secure connection to a Web site via HTTPS (e.g., https://educatedguesswork.org) it is necessary to both encrypt the traffic and authenticate the site. The encryption happens via TLS,^[1] but TLS depends on the server having a public key which is authenticated via a certificate. The certificate in turn binds that key to the server's identity. The server uses the private key associated with that public key to complete the TLS handshake, thus proving that it is the correct owner of that identity. Without the certificate, your browser could just be forming a secure connection to an attacker.

These certificates are issued by certificate authorities (CAs) (also, "certification authorities"), who are responsible for validating the server's identity, issuing the certificate, and revoking it if something goes wrong (e.g., the server's key is compromised). But of course, we can't have just anyone stand up a CA: because a CA is responsible for attesting to server identities, a malicious (or just badly operated) CA could misissue certificates (i.e., issue them to the wrong people), allowing attackers to impersonate servers to clients and steal whatever information the user is sending to the server. It's important to realize that every CA is trusted to attest to any server's identity, and so the entire system depends on all the CAs behaving correctly.

The way things work in practice is that the client has a list of CAs that it trusts to issue certificates: if a certificate isn't approved by one of those CAs,^[2] then the client will reject it. For instance, here's what happens when Firefox encounters a certificate from an unknown CA:

In principle it's possible for the user to ignore this warning, but in practice it's a really bad idea and browsers have gotten increasingly aggressive about discouraging users from doing so. As a practical matter, you can't really run a secure Web site without a valid certificate, by which I mean one which is issued by a CA that is trusted by every major browser.

There isn't just one list of valid CAs: each of the major browser vendors has its own "root program", in which they evaluate CAs and determine which they trust (Mozilla, Chrome, Apple, Microsoft). As a practical matter, a CA needs to be accepted by all four of these programs in order to issue certificates; otherwise its certificates won't be accepted by a major browser, which is pretty bad news. Unsurprisingly, then, there is a fair amount of coordination between the root programs. Specifically:

The CA/Browser Forum (CABF) sets a common floor of requirements (the Baseline Requirements) that all CAs have to conform to.
The Common CA Database (CCADB) maintains a common set of records for CAs.

And, of course, the root program operators talk to each other informally, especially in cases where some CA appears to be misbehaving and it is necessary to determine how best to handle it. For instance, due to a large set of issues with the Symantec CA, the root programs worked together between 2016 and 2018 to gradually distrust Symantec.

The server's identity #

I've said that the certificate contains the server's identity, but not what that identity consists of. The most common scenario is that the certificate just contains the domain name of the server. So, the certificate for https://educatedguesswork.org would contain the name educatedguesswork.org. When the browser connects to the site, it verifies that the domain name in the certificate matches the domain name it is trying to connect to. In practice, certificates often contain other information, such as the organization to which the certificate was issued, but the browser does not use it to establish the connection.

As an example, here's the "subject" information from Twitter's certificate:

Field	Value
Common Name	twitter.com
Organization	Twitter, Inc.
Location	San Francisco
State	California
Country	US

Distinguished Names and SubjectAltName #

The reason for this goofy name structure is that when certificates were originally designed, the idea was that they would identify people, not computers, and that everyone would have a distinct name (technical term: "distinguished name") and so this geographic and organization information was useful to distinguish people with the same personal name. When this structure was adapted for use with SSL/TLS, the "Common Name" field was repurposed to contain the domain name. In modern certificates, another field, Subject Alternative Name (SAN) is preferred. The SAN field can contain an arbitrary number of names. For instance, Twitter's cert contains twitter.com and www.twitter.com.

The only part of this that the browser cares about is the "Common Name" field, which contains Twitter's domain name. It just ignores the rest of the fields, and you actually have to work a bit to see them at all: in Firefox you can get to them from the "lock" icon, but in Chrome you have to go into the developer tools.

Not only is the domain name the only thing that matters, but clients will accept any certificate with that domain name. This is not a design defect but rather a critical element of building an operational system. For instance, it's very common to operate a site using multiple servers and use some load balancing mechanism to direct clients to specific servers (this is the only realistic way to scale to very large numbers of users). If you have a lot of these machines, then there are significant operational challenges in maintaining them, and it's common to have more than one certificate (e.g., one certificate per machine or per data center). These machines might even be operated by different entities, for instance you could contract with multiple content distribution networks. From the user's perspective these are all one service and you want things to operate smoothly, which means that the user doesn't notice if one Web request goes to server A and one to server B. This requires the browser to treat multiple certificates as if they were the same site: all that matters is the domain name (see my post for more on this concept).
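To make the "any certificate with that domain name" point concrete, here is a minimal sketch of the name check. The `Cert` class and both function names are made up for illustration. Falling back to the Common Name when there is no SAN list is a simplifying assumption of this sketch, and wildcard names are deliberately left out. It also assumes the certificate chain has already been validated.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cert:
    """Toy certificate holding only the fields discussed in the text."""
    common_name: str
    organization: str = ""
    san_dns_names: list[str] = field(default_factory=list)


def names_for(cert: Cert) -> list[str]:
    # Prefer the SAN list; fall back to Common Name only if SAN is absent
    # (an assumption of this sketch, not a statement about any browser).
    return cert.san_dns_names or [cert.common_name]


def matches_host(cert: Cert, host: str) -> bool:
    # Case-insensitive exact match; organization is never consulted.
    host = host.lower().rstrip(".")
    return any(host == name.lower().rstrip(".") for name in names_for(cert))


# Two different certificates, held by different operators, for one site.
origin_cert = Cert("twitter.com", "Twitter, Inc.", ["twitter.com", "www.twitter.com"])
cdn_cert = Cert("twitter.com", "Hypothetical CDN Ltd.")

print(matches_host(origin_cert, "www.twitter.com"))
print(matches_host(cdn_cert, "twitter.com"))
```

Both certificates pass for the same site even though the organization fields differ, which is exactly the property that lets multiple servers or CDNs stand in for each other.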

Domain Validation #

Because the only thing that matters is the domain name, that's all that most CAs check. Moreover, it's very difficult (i.e., expensive) to verify that a specific person is entitled to use a specific domain name, so instead what CAs do is check that you have control of the domain. A certificate issued on this basis is called a Domain Validation (DV) certificate. The most common thing to do is shown in the figure below.

The way that this works is that the operator connects to the CA and asserts that they control a given domain. The CA then asks them to prove that they control it by placing a random challenge somewhere on their Web site. The CA then goes to the site directly and checks that the file exists. The reasoning here is that because the challenge is under the CA's control and is random, the only way it could get onto the site is if the operator put it there. One nice feature of this design is that it is easy to implement for the CA and even more importantly for the site operator, who presumably controls what goes on the site. It is also easily automated, with protocols such as ACME.
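Here is a toy, in-memory simulation of that exchange: the CA picks a random challenge, the applicant publishes it, and the CA fetches it back. This is not ACME. `ToyCA`, `Site`, and the `/dv-challenge/` path are all invented for the sketch, and a dictionary stands in for the actual HTTP fetch.

```python
import secrets

CHALLENGE_DIR = "/dv-challenge/"  # hypothetical path, not a real protocol's


class Site:
    """Stands in for a Web site: a mapping from URL path to file body."""

    def __init__(self):
        self.files = {}

    def fetch(self, path):
        return self.files.get(path)


class ToyCA:
    def __init__(self):
        self.pending = {}  # domain -> token the CA expects to find

    def new_challenge(self, domain):
        # The token is random and chosen by the CA, so the applicant cannot
        # predict it or point at content that already exists on the site.
        token = secrets.token_urlsafe(32)
        self.pending[domain] = token
        return token

    def verify(self, domain, site):
        # The CA fetches the file itself rather than trusting the applicant.
        # Each challenge is single-use: it is removed whether or not it passes.
        token = self.pending.pop(domain, None)
        if token is None:
            return False
        return site.fetch(CHALLENGE_DIR + token) == token


ca = ToyCA()
my_site = Site()
token = ca.new_challenge("example.com")
my_site.files[CHALLENGE_DIR + token] = token  # the operator publishes it
print(ca.verify("example.com", my_site))
```

Someone who cannot write to the site has no way to make `verify` succeed, because they never get the token onto it. That is the whole argument for why this demonstrates control.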

Domain Control and Web Site Structure #

Note that I haven't said where the file should be. Some Web sites allow unprivileged users to create files on the site (e.g., your pictures on Instagram). If the CA allowed the user to put the file anywhere, then it would be possible to attack such sites. The verification protocol needs to be designed to use a location that essentially no site uses for user-controlled content. One possibility (used by the ACME protocol) is to use the /.well-known path, which is supposed to be only available to site operators.

If you're thinking that this design is weirdly circular, you're right: the purpose of HTTPS is to protect you against an attacker who controls the network, but this type of domain verification is completely at the mercy of an attacker who controls the network. And in fact, there have been attacks based on control of the network, specifically by controlling the BGP routing protocol to deliver traffic to the attacker's server. The main countermeasure to this is for the CA to verify the challenge from multiple locations on the network (a technique called multi-perspective validation), which works because it's harder to hijack BGP across the entire Internet than just against one location (and of course the CA's network is probably better secured than your average Starbucks network). In addition, because certificates are recorded in Certificate Transparency logs, it is possible to detect misissuance and revoke the certificates or even distrust the CA if necessary. There are other designs for domain validation (e.g., using the DNS), but they aren't really much more secure unless DNSSEC is used.
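The multi-perspective idea can be sketched as a simple quorum check over what each vantage point saw when it fetched the challenge. The function name, the vantage-point labels, and the "at most one failure" threshold are all assumptions for illustration; real CAs set their own policies.

```python
def multi_perspective_ok(observations, expected, max_failures=1):
    """Decide whether validation passes across several network vantage points.

    observations maps a vantage-point name to the challenge body fetched
    from there (None if the fetch failed or returned nothing).
    """
    if not observations:
        return False
    failures = sum(1 for body in observations.values() if body != expected)
    return failures <= max_failures


# A legitimate site: every vantage point sees the challenge.
print(multi_perspective_ok({"us": "tok", "eu": "tok", "asia": "tok"}, "tok"))

# A localized BGP hijack: only one vantage point is routed to the attacker's
# server, which is the only place the attacker's copy of the token lives.
print(multi_perspective_ok({"us": "tok", "eu": None, "asia": None}, "tok"))
```

The second case fails because the attacker would need to redirect traffic from most vantage points at once, which is the harder attack the text describes.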

In any case, DV certificates are by far the most common type of certificate on the Internet, because they are cheap to issue, and, as mentioned before, work just fine. They are so cheap to issue, in fact, that the Let's Encrypt^[3] Certificate Authority gives them away for free.

Extended Validation #

Because DV certificates just validate the domain name, they don't actually tell you what organization you are talking to. From the perspective of the Web browser, this is just fine, because its job is to ensure that the site you are going to matches up with the link you clicked on or the site name you typed in, but from the user's perspective it's less than ideal. There are two basic problems here:

It's not necessarily obvious which real-world organization is associated with a given domain name. For example, the official site of the United States White House is whitehouse.gov but whitehouse.com is a porn site.
Even if you do know what domain to expect, humans are notoriously bad at comparing two strings. For instance, educatedguesswork.org is this site, but would you really notice if you went to educated-guesswork.org? Similarly, microsoft.com and micros0ft.com are different sites.

The result of these weaknesses is that users are susceptible to "phishing" attacks, in which an attacker sends you a message (e-mail, SMS, etc.), allegedly from your bank, PayPal, etc. asking you to log in and do something, but with a link to their site that has a similar name to the entity they are impersonating. Then when you log in and enter your password, they can steal it and log on to your account on the real site.

In response to phishing attacks and concerns about the general weakness of domain validations, a new kind of certificate called an Extended Validation (EV) certificate was created in 2007.^[4] Unlike with DV certificates, before issuing an EV certificate the CA validates the actual organizational name of the applicant, e.g., by checking business records. That name goes into the certificate and then can be displayed to the user, for instance like this:

[Original image from Bleeping Computer]

The idea here is that the user knows they want to go to (say) Stripe, and so they check for "Stripe" in the URL bar.

EV certificates were one of those plausible ideas that were worth a try but turn out not to work, for two distinct reasons.

Users Don't Check. The basic premise of EV is that users will look at the UI and behave differently when the EV indicator (the company name) is displayed. Unfortunately, this seems not to be the case. Chrome's Security team does a good job of summarizing the research in this area, but the TL;DR is that if you remove the EV indicator for sites, most people don't seem to notice or behave differently.
Names Aren't Unique. Organizational names are generally scoped by jurisdiction, which allows an attacker to register a company with the same name as the company they are impersonating and then get an EV certificate. In one famous incident, security researcher Ian Carroll got an EV certificate for "Stripe Inc." by registering a legal entity in a different state and then applying for an EV cert.

Browser vendors don't like unnecessary UI clutter, especially in the area of security, and between 2018 and 2019, browsers removed the EV indicators in the main UI. This of course dramatically reduces the incentive that sites have to get EV certificates because users have to go to a lot of trouble to find out that a certificate is EV, which it seems very likely they won't do. Understandably, this didn't make the CAs very happy, especially because EV certificates are quite a bit more expensive than DV certificates, which can be obtained for free from Let's Encrypt.^[5] By contrast, EV certificates can cost upward of $100/year. At present, only a very small percentage (well less than 1%) of the certificates in use on the Web are EV.

Arguments for EV Security #

I want to briefly address two arguments you will sometimes hear for why EV certificates are more secure than DV. I don't think either of these really hold up.

Phishing is Mostly DV #

Back in 2018, researchers from Entrust Datacard and Comodo published an analysis of the certificates used for phishing sites. They report that the vast majority of sites used for phishing are DV (unsurprising because most certificates are DV) but also that a lower fraction of EV certs are used for phishing than of DV certs:

Type	Percent of Phishing Sites	Overall Percent
EV	.05	.7
OV	.13	5.0
DV	99.82	94.3

The authors conclude that "EV sites are safer than OV and DV", which is likely true, but this shouldn't lead you to conclude that EV prevents phishing. Phishers need to register a lot of domains and have an incentive to use the cheapest certificates they can get. Because DV certificates are cheap (free) and work fine they naturally use them. If response rates for EV were much better than DV, however, we would expect to see more use of EV for phishing. In other words, yes, EV sites are less likely to be phishing sites, but because users largely don't notice the EV indicators (note that this research was published before they were removed, so this is not an argument for their reinstatement), then we shouldn't conclude that EV actually reduces phishing.

It's important to recognize that just getting an EV certificate doesn't reduce phishing at all. What you need is for users to know that you have an EV certificate and refuse to go to sites they think are yours if they don't have an EV cert. That's the part that's breaking down here.
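To put a number on "a lower fraction", you can divide each certificate type's share of phishing sites by its share of all certificates, using only the percentages from the table above. The `relative_rate` helper is just for this sketch.

```python
# Percentages copied from the table above.
phishing_share = {"EV": 0.05, "OV": 0.13, "DV": 99.82}
overall_share = {"EV": 0.7, "OV": 5.0, "DV": 94.3}


def relative_rate(kind):
    # Above 1 means over-represented among phishing sites, below 1 means
    # under-represented, relative to how common that certificate type is.
    return phishing_share[kind] / overall_share[kind]


for kind in ("EV", "OV", "DV"):
    print(f"{kind}: {relative_rate(kind):.2f}")
```

This works out to roughly 0.07 for EV, 0.03 for OV, and 1.06 for DV. So EV is indeed heavily under-represented, but by this measure OV is even more so, which fits the cost explanation better than it fits any story about the EV indicator deterring phishers.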

DV Misissuance #

The other argument I sometimes hear is that because EV certificates have a more stringent issuance process it's harder to get a fake one for a domain you don't control. This is no doubt true, but unfortunately it doesn't meaningfully increase security as long as DV certificates still exist. The reason for this is, as I mentioned above, that the browser will accept any certificate with a domain name in it as valid for a given site, so if an attacker can get a misissued DV certificate for example.com then they can impersonate example.com (including stealing passwords, cookies, etc.) even if example.com has an EV certificate. Even worse, they can most likely do so while preserving the EV indicator.

Consider a simple Web page which consists of one HTML file and one JavaScript file. The way this page loads is shown below:

The client first loads the HTML page, which contains a reference to the JavaScript, and the client then contacts the server again to load the JS.

Now consider what happens when you have an attacker with a valid DV certificate, as shown below:

They allow the client to contact the real server, which authenticates with the EV certificate. Then when the client goes to load the JS from the server, the attacker gets in the way and impersonates the server with its misissued DV certificate and sends its own JS. Because JS can do anything on the page, this is the same as if the attacker had served the entire page, but because whether the EV indicator is shown depends only on where top-level HTML was loaded from, the client still displays the EV UI.

It's important to realize that this isn't just a bug in the browser UI, it's a reflection of the basic way the Web works, which depends on the origin as the basic unit of identity and these two certificates reflect the same origin. Note that even if for some reason browsers radically changed the Web security model, you'd still have a problem because most sites load scripts from totally different origins (e.g., Google analytics) and the browser has no way of knowing if they should be EV or not.
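The attack is easy to model. In this sketch (the names and the `holder` field are invented, and chain validation is assumed to have already succeeded for both certificates), the browser accepts each connection purely on the domain name and decides the EV indicator from the top-level document only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cert:
    domain: str
    kind: str    # "DV" or "EV"
    holder: str  # who actually controls the private key (unknown to the browser)


def load_page(html_cert, js_cert, host):
    """Return (ev_indicator_shown, who_controls_the_script)."""
    # Each connection is accepted as long as the domain name matches;
    # the certificate type plays no part in that decision.
    for cert in (html_cert, js_cert):
        if cert.domain != host:
            raise ValueError("certificate name mismatch")
    # Model assumption: the indicator depends only on the top-level HTML load.
    ev_shown = html_cert.kind == "EV"
    return ev_shown, js_cert.holder


real = Cert("example.com", "EV", "site")
misissued = Cert("example.com", "DV", "attacker")

# HTML from the real server, JS intercepted by the attacker.
print(load_page(real, misissued, "example.com"))
```

The result is an EV indicator on a page whose script is attacker-controlled, because both certificates map to the same origin.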

eIDAS and QWACs #

This brings us to the EU's eIDAS regulation. eIDAS stands for "electronic IDentification, Authentication and trust Services", though I've only ever heard it called eIDAS. eIDAS is generally concerned with establishing stronger online identity structures, but one specific provision is directed towards something called a Qualified Website Authentication Certificate (QWAC). A QWAC is more or less the same as an EV certificate, except that they are issued by what's called a Qualified Trust Service Provider (QTSP), [Updated Trusted -> Trust. Also changed TSP to QTSP throughout. It's conventional to call them TSPs, but this is clearer.] which is a CA that is authorized by EU member states [Updated: member states, not the EU.] to issue certificates defining legal identity.

The original version of the eIDAS regulation was published in 2014, and contained language defining QWACs, but did not require support for them in browsers. Browsers mostly chose to ignore this language and while quite a few of the QTSPs in the EU list are also trusted by browsers, no major browser has special EV-style UI for QWACs. This was perceived by proponents of QWACs as not meeting their objective of having QWACs be used (unsurprisingly many of the proponents of QWACs work for QTSPs). eIDAS is currently being revised and the current proposal contains language that would mandate that browsers support them.

While I'm not a lawyer it's generally understood that the revision would require browsers to:

Display the QWAC identity data.
Support certificates issued by authorized [Updated: EU-authorized to authorized] QTSPs regardless of whether those QTSPs were accepted into the browser root program.

From the perspective of a browser, the first of these requirements is bad, but the second is much worse.

Mandatory UI #

As discussed above, browsers removed the EV indicators because there was good evidence that they didn't work, and QWACs are basically the same as EV certs, so a requirement to support them isn't great. The text of the regulation itself is a little vague on this point (as I understand it, it will then be fleshed out in "implementing acts"), but at least one possibility is that browsers would be required to support some common QWAC UI (presumably designed by the EU in cooperation with CAs). For instance, here's a 2021 presentation by Chris Bailey from Entrust on this topic that includes the suggestion that not only should browsers have common UI, but that they should be required to warn users whenever they submitted a form on a site with a DV cert!

Obviously, this precise proposal would have a very negative impact on any site which used DV certificates, which is good if you are a company that sells EV certificates [Updated], but not so good for the Web as a whole. More generally, though, designing a good browser user interface is very difficult: you need to pack a lot of information into a very small amount of screen real estate, leaving room for the site itself. This is a difficult problem at the best of times (look how upset people got when Firefox removed the ability to make the browser navigation UI take up slightly less vertical space), and it will not be improved by having to implement a UI designed to create as sharp a distinction as possible between QWAC and non-QWAC certificates.

QTSP Inclusion #

As described above, browsers have a well-established set of mechanisms for determining whether a CA should be accepted for the purpose of authenticating Web sites. These mechanisms include ensuring audits and over the past decade have gradually improved the quality of the WebPKI ecosystem, for instance by transitioning away from SHA-1 certificates, adding requirements for Certificate Transparency and functional revocation mechanisms, and limiting certificate lifetime so that it's possible to evolve the ecosystem in a reasonable time. Mozilla, in particular, operates an open root program where decisions are discussed on a public mailing list allowing all stakeholders to weigh in.

If browsers were required to accept any QTSP that was approved by the EU, this would of course allow those QTSPs to bypass the browser's requirements, with two major impacts:

Browsers would be required to accept new QTSPs that did not currently meet their requirements.
Browsers would be prevented or delayed in distrusting QTSPs when evidence of misbehavior was found.

Note that this is different from EV certificates, where the CAs were managed in the same way as DV certs and had to meet the browser root program requirements.

A mismatch between the browsers and the EU need not necessarily result from the EU doing anything wrong: governments have their own incentives, including considering the interests of companies in their jurisdictions, and their judgments about what's best might not match those made by browser vendors. For example, the Certinomis CA was removed from Firefox but is still on the EU QTSP list.

Of course, a mandatory CA could also be used by a state-level actor for surveillance. We have already seen attempts by Kazakhstan and Mauritius to require users to install their own trust anchors. Mauritius eventually dropped their plans, but Kazakhstan actually deployed their trust anchor and browsers had to eventually blocklist their trust anchor to protect users. This was actually a much easier case to handle because users had to install the trust anchor themselves and so the damage was limited: if browsers could be required to trust specific trust anchors that were controlled by state-level attackers, then they might not be able to protect users against state-level surveillance.

Alternative Designs #

From the browser's perspective, the central security problem with the design of QWACs is that (like EV certs), they are attesting to two separate pieces of server identity:

The domain name, which is consumed by the browser and used to determine the origin of the site.
The legal identity of the server, which is consumed by the user (though of course parsed by the browser so that it can display it to the user).

It's the ability of the QTSP to attest to the domain name that creates the possibility for QTSP misbehavior to allow for interception of user traffic.

Multiple Lists #

One possibility for addressing this threat is to separate out those functions. The simplest way to do that is by having two lists:

The browser's existing CA trust anchor list.
A separate QTSP list managed by the EU.

When a browser encountered a certificate, it would first check that it was valid according to its normal procedures against the standard trust anchor list, just as with DV certificates. If those checks passed, then the browser would allow the connections. The browser would also check to see if the certificate was a QWAC and if it had been issued by a valid QTSP and if so it would show the QWAC UI with the appropriate identity information. The impact of this design is that the browser can ensure that the QTSP is correctly attesting to the server's domain name—and remove it if it misbehaves—but does not have to assess whether the QTSP is adequately verifying the server operator's legal identity; even if it completely fails at that, attackers will not be able to intercept connections.
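In code, the two-list check looks something like the following. This is a sketch of the logic only: the `Cert` fields, the `evaluate` function, and representing each list as a set of issuer names are all assumptions, and real chain building, expiry, and revocation checks are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cert:
    domain: str
    issuer: str
    is_qwac: bool = False
    legal_name: Optional[str] = None


def evaluate(cert, host, browser_roots, qtsp_list):
    """Return (allow_connection, legal_name_to_display_or_None)."""
    # Step 1: ordinary validation against the browser's own trust anchors.
    # A QTSP that is not also a browser root cannot authorize a connection.
    if cert.issuer not in browser_roots or cert.domain != host:
        return (False, None)
    # Step 2: the QWAC UI additionally requires the issuer to be a listed QTSP.
    if cert.is_qwac and cert.issuer in qtsp_list:
        return (True, cert.legal_name)
    return (True, None)


browser_roots = {"CA-A", "CA-B"}  # hypothetical issuer names
qtsp_list = {"CA-B", "QTSP-C"}

qwac = Cert("example.com", "CA-B", is_qwac=True, legal_name="Example GmbH")
untrusted = Cert("example.com", "QTSP-C", is_qwac=True, legal_name="Example GmbH")

print(evaluate(qwac, "example.com", browser_roots, qtsp_list))
print(evaluate(untrusted, "example.com", browser_roots, qtsp_list))
```

The second certificate is rejected outright, so a QTSP that the browser has removed (or never accepted) cannot be used to intercept traffic, whatever the EU list says about its identity vetting.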

Multiple Certificates #

Having multiple lists mostly addresses the security problems with QWACs, but leaves some operational problems. Specifically, because QWACs require validation of real-world identity, they cannot be automatically issued, whereas DV certificates can. This means that DV certificates are comparatively cheap and easy to deploy and can be integrated with server automation. But if you already have a DV deployment, then switching over to QWACs/EV can be a big lift. If you want QWACs to succeed, then this is likely to be a real drag on deployment.

Once you've decided to have two lists, it's natural to have two certificates as well: an ordinary DV certificate which attests to the domain name and a QWAC which attests to the legal identity. As noted above, this has relatively similar security properties to a single certificate but superior operational properties because you can layer a QWAC on top of the DV cert; this gives you increased flexibility and also means that if something goes wrong with the QWAC your site still works.

There are a number of different designs for two certificate systems, but the big design question is whether it's necessary for the server to prove that it has the private key for the QWAC during connection establishment (it already has to prove it has the private key for the DV connection). Intuitively, it would seem like this was necessary, but it turns out not to be because of the "mixed content" properties mentioned above. Basically, even if you require the server to prove that it has the QWAC key on a given connection, an attacker with a valid DV certificate for the domain can just intercept a subsequent connection and thus impersonate the server. Usually, the site will consist of a combination of HTML and JavaScript, so if the attacker allows the HTML to be served by the legitimate site and then intercepts the connection for the JS, the QWAC UI will even be displayed.^[6]

Once you have this insight, the obvious design is to have a mechanism for binding the QWAC to the domain name that is in the DV certificate. This binding can be either direct, with the domain name in the QWAC, or indirect, with the QWAC just having a key that is used to sign an endorsement document that contains the domain name. The site then presents the DV certificate and the QWAC and the browser validates the DV certificate and checks that the domain name in the DV cert matches that binding. This is a familiar concept outside of the Web: when you go to the airport you present your ticket which has your name but not your picture and your photo ID which has your name and your picture, but no information from the airline. The security person verifies that the names match and uses the photo ID to verify that it's really you.

The diagram below shows how this might work in practice, in Mozilla's two-certificate proposal, called "portable QWACs":^[7]

With two certificates, the server obtains a DV certificate as usual, which it can use to serve TLS connections without doing anything else. Subsequently, it can obtain a QWAC, which it uses to sign the endorsement document binding the company name (from the QWAC) to the domain name (in the DV cert). When a client subsequently connects, it uses a TLS extension to indicate that it supports QWACs and the server provides the QWAC and the endorsement document in its handshake (in the EncryptedExtensions message). The client verifies the DV cert, the endorsement document, and the QWAC, and if everything checks out it completes the connection and shows the right UI.^[8] Of course, this is just one way of building a two certificate design; for instance the QWAC and endorsement document could be sent in an HTTP header instead.
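As a rough sketch of the client-side checks just described (including the failure handling from the footnote), something like the following would work. The types and the `validateHandshake` function are hypothetical, and the boolean `valid` / `signatureValid` fields stand in for the real cryptographic verification; this is not the wire format or code from the actual proposal.

```typescript
// Hypothetical stand-ins: each "valid" flag represents the outcome of a
// cryptographic check that a real client would perform.
interface DvCert { domain: string; valid: boolean; }
interface Qwac { legalName: string; keyId: string; valid: boolean; }
interface Endorsement { domain: string; signedByKeyId: string; signatureValid: boolean; }

type Outcome = "terminate" | "connect-plain" | "connect-with-qwac-ui";

function validateHandshake(
  requestedDomain: string,
  dv: DvCert,
  qwac?: Qwac,
  endorsement?: Endorsement
): Outcome {
  // The DV cert is mandatory: if it fails, the connection fails.
  if (!dv.valid || dv.domain !== requestedDomain) {
    return "terminate";
  }
  // No QWAC material offered: an ordinary TLS connection.
  if (qwac === undefined || endorsement === undefined) {
    return "connect-plain";
  }
  // The endorsement must be signed by the QWAC key and name the DV domain.
  const bound =
    qwac.valid &&
    endorsement.signatureValid &&
    endorsement.signedByKeyId === qwac.keyId &&
    endorsement.domain === dv.domain;
  // On QWAC failure, take the more robust option: connect without the UI.
  return bound ? "connect-with-qwac-ui" : "connect-plain";
}

const goodDv: DvCert = { domain: "example.com", valid: true };
const goodQwac: Qwac = { legalName: "Example GmbH", keyId: "k1", valid: true };

const fullOutcome = validateHandshake("example.com", goodDv, goodQwac,
  { domain: "example.com", signedByKeyId: "k1", signatureValid: true });
const mismatchOutcome = validateHandshake("example.com", goodDv, goodQwac,
  { domain: "other.example", signedByKeyId: "k1", signatureValid: true });
```

Here `mismatchOutcome` falls back to a plain connection rather than terminating, which is the "complete it but without the QWAC UI" choice; a stricter client could return "terminate" at that point instead.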

Final Thoughts #

At the end of the day, the main impact of the proposed regulation is to dictate how browsers build their UI and maintain their root stores, including preventing them from enforcing their existing rules for CAs. The major rationale for this is to pave the way for QWACs, which are basically the same as the EV certificates that we've tried and discarded. However, it's worth noting that at least some of the CAs seem to want to restrict the ability of browsers to impose their own standards on certificates at all, even for DV certificates. For instance, a recent presentation by Chris Bailey from Entrust suggests that:

Browsers bring all extra browser rules for consensus and approval under the CA/Browser Forum for industry standards which are audited under ETSI and WebTrust

Similarly, in a recent white paper European Signature Dialog writes:

Today, all certificate issuers must not only provide annual conformance audits to Mozilla, but they also meet additional browser rules. But the additional browser rules are entirely subjective and may exist to promote the browser’s proprietary commercial interests — another example of US big tech setting the rules for Europe.

Also, additional browser rules are not reviewed and approved by the internet ecosystem (e.g., the Certification Authority/Browser Forum (CABF), where all other certificate issuer rules are reviewed and approved by ballot of all the members, not just one browser).

The browsers have been asked to bring their additional rules to the CABF for approval by the internet ecosystem, but the browsers have refused and are holding on to exclusive power by themselves. This should stop, and certificate issuers, including QWAC issuers, as well as the EU should have a say in all the certificate rules.

This reflects longstanding tensions between the CAs and the browsers over who should determine the rules for certificates, with the browsers viewing themselves as stewards of their users' privacy and security and the CAs wanting more of a voice in governance.^[9] It's certainly understandable why CAs would want more control of how browsers run their root programs; it's less clear why it's in the interest of users for them to have it.

Or now sometimes QUIC, which uses a lot of the TLS infrastructure. ↩︎
Either directly or transitively, for instance by having a CA sign a certificate for another CA. ↩︎
Full Disclosure: I was part of the originating team of Let's Encrypt and Mozilla is currently a "Platinum Sponsor". ↩︎
There is also something called an Organization Validation (OV) certificate, which is partway between DV and EV. As far as I can tell, there's never been any OV-specific UI in the main UI, so it's not clear to me what the point is. ↩︎
Let's Encrypt does not offer EV certificates because they aren't able to automate issuance and the whole premise of LE is to make certificate issuance so cheap that it can be done for free. ↩︎
I've heard suggestions that sites ought to be able to send back an HTTP header that told the client that it ought to expect that all resources on a site be associated with a QWAC. This is technically possible but a big deployment hassle if you have multiple servers or if you include resources from other sites, such as ads or Google analytics. ↩︎
I am one of the authors of this proposal. ↩︎
If the DV cert doesn't check out, the client has to terminate the connection, but if the QWAC or the endorsement document are invalid, it can either terminate the connection or complete it but without the QWAC UI. The latter choice is obviously more robust to failure. ↩︎
A recent report by the German Bundeskartellamt provides some background on these tensions with respect to Chrome in particular, and helps give a sense of how the CAs view the situation. ↩︎

First impressions of Bluesky's AT Protocol

2022-11-06T00:00:00Z

The first generation of Internet communications was dominated by largely decentralized—and barely managed—communications systems like USENET and IRC, built on documented, interoperable protocols. By contrast, the current generation is highly centralized, built on a small number of disconnected siloes like Twitter, Facebook, TikTok, etc. In light of recent events, it should be clear that this is not an optimal state of affairs, if only because what information people have available to them shouldn't depend on which billionaires own Facebook and Twitter.

Over the years there has been a lot of interest in building social networks with a more decentralized architecture, such as Mastodon and Diaspora. These don't have no users, but I think it's fair to say that they haven't really displaced Twitter in the public conversation. A few years ago Twitter's Jack Dorsey announced a project called Bluesky, which was intended to design and build such a system.

Twitter is funding a small independent team of up to five open source architects, engineers, and designers to develop an open and decentralized standard for social media. The goal is for Twitter to ultimately be a client of this standard. 🧵
— jack (@jack) December 11, 2019

Mastodon, ActivityPub, and the Fediverse #

I mention Mastodon here and that's what people seem to be using but technically Mastodon is a piece of software that implements Twitter-like functionality. Unlike Twitter, however, Mastodon can talk to other servers using the W3C ActivityPub protocol, including to servers running different software than Mastodon. The collection of servers that federate (or at least can federate) via ActivityPub is called the Fediverse, but realistically you're likely to be using Mastodon.

While there wasn't any technology at the point Dorsey made this announcement, it got a lot of interest anyway because Twitter using such a standard actually would be a big deal and make it a lot more likely to succeed. A few weeks ago, almost three years later, Bluesky published the initial draft of what they are calling ATProtocol (as in @-sign), or ATP, which is described as "Social networking technology created by Bluesky". Let's take a look!

Overview #

Unsurprisingly, ATP seems principally designed to emulate Twitter, though presumably you could adapt it to be more like Facebook or Instagram. The basic idea behind ATP is that each user has an account with what's called a personal data server (PDS), which is where they post stuff, read other people's posts, etc. These PDSes communicate with each other ("federate"), with the idea that this provides the experience of a single unified network, as shown below:

This is basically the obvious design and it's more or less what's been envisioned by previous systems, such as those based on ActivityPub. You can run your own PDS, but it seems more likely that most people will use some pre-existing PDS service, so most PDSes will have a lot of users.

Polling Versus Notifications #

There are two basic designs for the situation where node A is waiting for something to happen on node B:

Polling, in which A contacts B repeatedly and asks "anything new?"
Notifications, in which A tells B what it is waiting for and B sends it a message when it actually happens.

Polling systems aren't very efficient when events are infrequent, because A faces a tradeoff between timeliness and load: if it checks infrequently, then it won't learn about new events until long after they happen. If it checks frequently, then most of those checks are wasted and there is a lot of unnecessary load on both machines. In these cases, notifications are a lot more efficient because messages only need to be sent when something happens. On the other hand, when the time between events is very low compared to the acceptable latency for detecting them, then polling can work reasonably well.

For instance, in order to have an average detection latency of 1 second A needs to poll every 2 seconds (assuming events happen randomly). If events happen about every 100 seconds, then 98% of those checks are wasted. On the other hand, if events happen on average every .1 second, then almost every check will retrieve one or more events, and polling can be efficient.
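The arithmetic here is easy to check with a few lines of code. This sketch assumes that "randomly" means a Poisson process, so the probability that a polling interval contains no event is exp(-interval / mean gap); the function names are my own.

```typescript
// Average time between an event occurring and the next poll noticing it,
// assuming events land uniformly at random within a polling interval.
function averageDetectionLatency(pollIntervalSec: number): number {
  return pollIntervalSec / 2;
}

// Fraction of polls that find nothing new, assuming events arrive as a
// Poisson process with the given mean gap between events.
function wastedFraction(pollIntervalSec: number, meanEventGapSec: number): number {
  return Math.exp(-pollIntervalSec / meanEventGapSec);
}

const latency = averageDetectionLatency(2); // 1 second
const wastedWhenRare = wastedFraction(2, 100); // about 0.98
const wastedWhenFrequent = wastedFraction(2, 0.1); // about 2e-9
```

With a 2 second interval, rare events (every 100 seconds) leave about 98% of polls empty, while frequent events (every 0.1 second) leave essentially none empty.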

The way this seems to work in practice is that when Alice wants to post a microblog entry (a "blue"? a "sky"?), she posts it to her own PDS. If Bob is following Alice, his PDS somehow gets it from Alice's PDS. It's not clear to me from the specs whether this is done by having Alice's PDS notify Bob's PDS or by having Bob's PDS poll. You probably want some kind of notification system, especially if there are going to be small PDSes, but the documents don't seem to specify that in enough detail to make it work. Similarly, when Bob decides to like one of Alice's posts, he notifies his PDS, and other PDSes, including Alice's, pick that up. It appears that when he wants to follow Alice, he notifies his PDS, which notifies Alice's PDS, which (I think) only succeeds if Alice's PDS agrees.

As I said above, this is mostly kind of the natural design, but there are two somewhat less obvious features.

Portable Identity #

In most distributed systems that I've seen, identity is tied to the server that you use. For example, if you use Gmail and your address is example@gmail.com, then you can't just pick up your email account and move it to Hotmail. With some work you can move the emails themselves but your address will be example@hotmail.com. The situation is a little more complicated than this because it's possible to use Gmail to host your own domain, in which case you could transfer it to another service, but all the addresses in the same domain share the same service; you can't have example@example.com be on Gmail and doesnotexist@example.com be on Fastmail.

The existing federated social networking systems I've seen seem to share this property. For instance, if you have an account on mastodon.social then your identity is effectively example@mastodon.social; this allows a user on (say) mastodon.online to refer to you as https://mastodon.online/@example@mastodon.social, which admittedly looks kind of awkward. Note that this is hidden a bit by the UI because you can just refer to people on your own server by unqualified names. For instance, https://mastodon.social/@example is shorthand for https://mastodon.social/@example@mastodon.social.

ATP allows you to have a persistent identity that is portable between PDSes. It does so by introducing the computer scientist's favorite tool, another layer of indirection. The basic idea is that your identity is used to look up which PDS your data is actually stored on; that way you can move from PDS to PDS without changing your identity. The stated value proposition here is that if a PDS decides to block you then you just move to a different PDS and you can take all of your posts and followers with you.

Account portability is the major reason why we chose to build a separate protocol. We consider portability to be crucial because it protects users from sudden bans, server shutdowns, and policy disagreements. Our solution for portability requires both signed data repositories and DIDs, neither of which are easy to retrofit into ActivityPub. The migration tools for ActivityPub are comparatively limited; they require the original server to provide a redirect and cannot migrate the user's previous data.

In order to make this work, each user's identity is associated with an asymmetric (public/private) key pair which is then used to sign their data (posts, likes, etc.). That way when they move their data from PDS A to PDS B, you can tell it's them by verifying the digital signature over the data.^[1] In fact, at some level the PDS is just a convenience, though an important one: if you got their data by any mechanism at all, you could always tell it was correct by verifying the signature.

Scaling #

The messaging fan-out of a system like Twitter is quite different from those of other federated messaging systems like instant messaging (and to some extent e-mail). Although there are groups, IM is mostly a person to person activity, with any given message being sent to a relatively small number of people. The situation with e-mail is somewhat more complicated, with most messages sent by individuals going to a small number of people (more below on marketing communications). Tweets, by contrast, tend to be sent to large groups.

As an example, I'm a relatively small-scale Twitter user, but I have over 1000 followers, which means that every time I push the tweet button I'm notifying all of those people. It's not unknown to have over 100 million Twitter followers like Elon Musk or Barack Obama. By contrast, even Gmail workplace users can't send to more than 2000 users in a single message, and only 500 of those can be outside of Gmail. So, the dynamics here are totally different. If you want to send to a large number of people, e.g., for marketing or mailing lists, then you would typically use a specialized e-mail sender like Sendgrid or Mailgun.

This level of fan-out already presents a bit of a challenge for a federated system: if I have 1000 followers on 500 different PDSs, then my PDS needs to contact each of them every time I tweet. This isn't necessarily infeasible, but if I have a million followers spread over 10,000 PDSes, the situation starts to get somewhat worse in terms of scale. We should of course expect that there will be significant concentration in the PDS market, just like with e-mail, with a few large PDSes having most of the users and then a long tail of small PDSes.
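To make the fan-out concrete: the number of outbound deliveries per post is the number of distinct PDSes hosting at least one follower, not the number of followers. Here is a tiny sketch, assuming (hypothetically) that my PDS knows each follower's PDS hostname and sends one message per PDS:

```typescript
// Count the distinct PDSes that must be contacted for one post, given the
// PDS hostname of each follower (hostnames here are made up).
function pdsesToContact(followerPdsHosts: string[]): number {
  const seen: Record<string, boolean> = {};
  let count = 0;
  for (let i = 0; i < followerPdsHosts.length; i++) {
    const host = followerPdsHosts[i];
    // hasOwnProperty avoids false hits on inherited keys like "constructor".
    if (!Object.prototype.hasOwnProperty.call(seen, host)) {
      seen[host] = true;
      count++;
    }
  }
  return count;
}

// 1000 followers spread evenly over 500 PDSes.
const followerHosts: string[] = [];
for (let i = 0; i < 1000; i++) {
  followerHosts.push("pds-" + (i % 500) + ".example");
}
const deliveries = pdsesToContact(followerHosts); // 500
```

Concentration helps a lot here: if those same 1000 followers were all on three large PDSes, the same post would need only three deliveries.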

In addition to the high level of fan-out, Twitter provides functionality that covers a large number of messages. In particular, it's possible to search for messages by content, hashtag, etc., and Twitter promotes "trending" tweets to you. These functions require access to the entire database (or at least the public database) of tweets. Obviously, receiving the entire feed of tweets (6000+ per second) is prohibitive for a small device, so it won't be possible for every PDS to offer this service.

ATP proposes to address this by having a two-level system, with a second layer of "crawling indexers" who have access to all the data and can offer a personalized view, as shown below:

[Source: ATP docs]

As above, the documentation is pretty vague on how this is supposed to work. Indeed, the diagram above and somewhere around 100 words in the docs are about all there is, so I can't tell you how it's supposed to work. With that said, the reference to "crawling" is surprising: for efficiency reasons you don't really want this kind of service to act like an ordinary PDS but rather to have special APIs that allow it to get a full feed of what's happening, and even better some directory-type mechanism for identifying all the PDSes in the world, but I don't see anything like this in the API docs (please point me at this if I'm missing it).

A Bit More Detail #

I don't want to get too deep into the details of ATP, but it's worth taking a closer look at a few of the pieces of the system.

Identity System #

As noted above, the way that the handle system works is that you start with a "handle" that's expressed as a hierarchical name rooted in the DNS, e.g., @alice.example.com. In a conventional system like e-mail or Jabber, this would actually be expressed as alice@example.com but because this is supposed to be like Twitter and Twitter already uses the @-sign to indicate usernames (e.g., to distinguish them from hashtags), you have to either have names with two @-signs, like @alice@example.com as in Mastodon, or use two different separators, or omit the separator between the actual username and the domain it lives in. To parse these names, you just remove the first label and treat it as the user name (note that this means you can't have a . in your user name).

This creates some ambiguity about whether an identifier is a domain name or a user name (e.g., what's web.example.com). In principle, if it has an @-sign in front of it, it's a user name, but of course people aren't consistent about that kind of thing, and the name is perfectly legible without it. Moreover, because domain names are hierarchical, it's possible to have a situation where the same identifier is both a username and a domain name, e.g., if there is a user alice on the domain example.com but there is also a subdomain alice.example.com. This can't happen with e-mail addresses because the interior @-sign provides a boundary, but that's not true here. In general, this just doesn't seem like that great a design choice, though it's not a disaster.
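The parsing rule, and the ambiguity that comes with it, can be shown in a few lines. This is my own illustrative sketch of "remove the first label", not code from the ATP implementation; `parseHandle` is a hypothetical name.

```typescript
interface ParsedHandle {
  user: string;
  domain: string;
}

// Split a handle like "@alice.example.com" into user "alice" and domain
// "example.com" by treating the first DNS label as the user name.
function parseHandle(handle: string): ParsedHandle | null {
  // The leading @ is optional in practice, so strip it if present.
  const bare = handle.charAt(0) === "@" ? handle.slice(1) : handle;
  const dot = bare.indexOf(".");
  // Need a non-empty first label and something after the dot.
  if (dot <= 0 || dot === bare.length - 1) {
    return null;
  }
  return { user: bare.slice(0, dot), domain: bare.slice(dot + 1) };
}

const alice = parseHandle("@alice.example.com");
// The parser cannot tell that this one was meant as a host name:
const web = parseHandle("web.example.com");
```

Both inputs parse successfully, so `web.example.com` comes out as user `web` on `example.com`. Nothing in the syntax distinguishes a handle from an ordinary subdomain.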

In order to resolve a handle, you do an RPC query to the endpoint associated with the domain name of the handle. This returns a DID. That DID can then be resolved to obtain the public key associated with the user. As described above, that key is used to sign the user's data.^[2]

ATP supports two flavors of DID—out of the 50+ variants currently specified (this kind of profiling is necessary if you want to have DID interoperability):

did:web, which just means that you do an HTTPS fetch to a Web site to retrieve the DID document (i.e., the public key).
A new DID form called DID placeholder (did:plc), which consists of a hash of a public key which can then be used directly or to sign new public keys to allow rollover (see my long post for more on this topic). As an aside, it's not clear to me how you actually obtain the DID document associated with a did:plc DID, as the public key isn't sufficient to retrieve it. There's apparently a PLC server, but is there only one? If not, how do you find the right one? This all seems unclear.

Obviously, the security of the did:web resolution process depends on DNS security, but even if you use did:plc, the handle resolution process depends on the DNS. This means that an attacker who controls the DNS or the handle server for a given DNS name can provide any DID of their choice, thus bypassing the cryptographic controls that did:plc or any similar mechanism use to provide verified rollover. Suppose that Alice's handle is @alice.example.com and this maps to did:plc:1234: because an attacker doesn't know the private key associated with this DID, they can't get it to authorize their public key, but if they can gain control of example.com then they can just remap @alice.example.com to did:plc:5678, and relying parties won't even get to the rollover checks.
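Here is a toy model of that attack. Everything in it is a hypothetical stand-in: the two tables represent the handle server and DID resolution, and the "keys" are just labels. The point is only to show where the lookup chain can be redirected.

```typescript
type HandleTable = Record<string, string>; // handle -> DID (the handle server)
type DidTable = Record<string, string>; // DID -> current public key (label only)

function lookup(table: Record<string, string>, key: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Two-step resolution: handle -> DID -> public key.
function resolveKey(handle: string, handles: HandleTable, dids: DidTable): string | undefined {
  const did = lookup(handles, handle);
  return did === undefined ? undefined : lookup(dids, did);
}

// The DID layer is untouched by the attacker; did:plc:1234 still maps to
// Alice's key and its rollover rules are intact.
const didTable: DidTable = {
  "did:plc:1234": "alice-public-key",
  "did:plc:5678": "attacker-public-key",
};

const honestHandles: HandleTable = { "alice.example.com": "did:plc:1234" };
// After taking over example.com, the attacker just points the handle elsewhere.
const hijackedHandles: HandleTable = { "alice.example.com": "did:plc:5678" };

const keyBeforeHijack = resolveKey("alice.example.com", honestHandles, didTable);
const keyAfterHijack = resolveKey("alice.example.com", hijackedHandles, didTable);
```

After the remap, a fresh resolver ends up with the attacker's key without ever consulting did:plc:1234, so none of the cryptographic rollover checks get a chance to run.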

There seems to be some implicit assumption that clients (or other PDSes) will retrieve the DID associated with a handle and then remember it indefinitely, though it's not quite explicitly stated:

The DNS handle is a user-facing identifier — it should be shown in UIs and promoted as a way to find users. Applications resolve handles to DIDs and then use the DID as the stable canonical identifier. The DID can then be securely resolved to a DID document which includes public keys and user services.

I'm not sure how realistic this is: retaining this kind of state is a pain and so it will be natural to treat it as soft state by caching it but not worry too hard if it gets lost because you can always retrieve it. In any case, a basic assumption of a system like this is that new PDSes (and users) will be constantly joining the system, and if the handle domain is compromised they will get the wrong answer, in which case you'll have a network partition in which some users and PDSes have the right key and some have the wrong key.

More generally, it's not clear what the overall model is. Specifically, is the handle → DID mapping invariant once it's established or is it expected to change? If the former, then it won't be possible to transition from did:web to did:plc, or—as the name "placeholder" suggests—to transition from did:plc to some new DID type, because there will always be some clients who have permanently stored the old DID and thus you will never be able to abandon it. On the other hand, if it's not invariant, then you need some mechanism to allow clients/PDSes to get updates, such as having a time-to-live associated with the handle resolution process (potentially based on HTTP caching). In either case, ATP should either build in some certificate transparency-type mechanism to protect against compromise of the handle servers or just admit that the security of ATP identity depends on the DNS, in which case you don't need something like did:plc and could presumably skip the DID step entirely and just store the public key and associated data right on the handle server. Either way, this is the kind of topic that I would ordinarily expect to be clearly defined in a specification.
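For the "not invariant" option, a time-to-live cache might look something like this sketch. ATP doesn't specify any of this; the `HandleCache` class and its parameters are hypothetical, and the clock is passed in explicitly so the behavior is easy to follow.

```typescript
interface CacheEntry {
  did: string;
  expiresAtMs: number;
}

// Caches handle -> DID results and re-resolves once the TTL has expired.
class HandleCache {
  private entries: Record<string, CacheEntry> = {};
  private ttlMs: number;
  private resolveRemote: (handle: string) => string;

  constructor(ttlMs: number, resolveRemote: (handle: string) => string) {
    this.ttlMs = ttlMs;
    this.resolveRemote = resolveRemote;
  }

  resolve(handle: string, nowMs: number): string {
    const hit = Object.prototype.hasOwnProperty.call(this.entries, handle)
      ? this.entries[handle]
      : undefined;
    if (hit !== undefined && nowMs < hit.expiresAtMs) {
      return hit.did; // still fresh, no network round trip
    }
    const did = this.resolveRemote(handle);
    this.entries[handle] = { did: did, expiresAtMs: nowMs + this.ttlMs };
    return did;
  }
}

// Simulated handle server whose answer changes (e.g., a DID type migration).
let currentDid = "did:web:example.com";
let remoteCalls = 0;
const cache = new HandleCache(1000, () => {
  remoteCalls++;
  return currentDid;
});

const first = cache.resolve("alice.example.com", 0); // fetched
currentDid = "did:plc:1234"; // mapping changes on the server
const stale = cache.resolve("alice.example.com", 500); // still the old DID
const refreshed = cache.resolve("alice.example.com", 1500); // picks up the change
```

This allows migration between DID types, but it also means that anyone who controls the handle server can change the answer for every client within one TTL, which is the DNS-dependence problem again.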

In any case, I don't think that this mechanism completely delivers on the censorship-resistance aspect of portability: it's true that you can move your data from one PDS to another, but because your handle is still tied to some server you're vulnerable to having that server cut you off. Even if some servers have cached your handle mapping, many won't have and so the result will be a partial outage. It's true that it's probably cheaper to run a handle mapping server than a PDS, so you might be able to run that but outsource the PDS piece, but it also seems likely that most people will just run them in the same place, so I'm not sure how much good this does in practice.

RPC Protocol #

At heart, ATP is a fairly conventional HTTP request/response protocol with a schema-based RPC layer on top of it. The idea is that new protocol endpoints are specified by JSON schemas which define the messages to be sent and received and can then be compiled down to code which can be called by the user. The docs give the following example of a schema:

{
  "lexicon": 1,
  "id": "com.example.getProfile",
  "type": "query",
  "parameters": {
    "user": {"type": "string", "required": true}
  },
  "output": {
    "encoding": "application/json",
    "schema": {
      "type": "object",
      "required": ["did", "name"],
      "properties": {
        "did": {"type": "string"},
        "name": {"type": "string"},
        "displayName": {"type": "string", "maxLength": 64},
        "description": {"type": "string", "maxLength": 256}
      }
    }
  }
}

This generates an API which can be used like so:

await client.com.example.getProfile({user: 'bob.com'})
// => {name: 'bob.com', did: 'did:plc:1234', displayName: '...', ...}

This is all pretty conventional stuff. I know that there are a lot of opinions in the Web API community over whether it's better to have this kind of RPC-style interface or a REST-style interface in which every resource has its own URL, but I don't think anyone would say it's a make-or-break issue; it's not like you can't make this kind of API work.

I'm more concerned by the fact that the API documentation is so thin. As a concrete example, here's the entire definition of the data structure "feed":

export interface Record {
  subject: Subject;
  createdAt: string;
}
export interface Subject {
  uri: string;
  cid: string;
}

What do these values mean? We might infer that createdAt is a date, but maybe not? What are the semantics of Subject.uri? Who knows?

I'll have more to say about this later, but for the moment I would observe that this is a pretty common pattern in systems that were built by writing software and then documenting its interfaces, rather than writing a protocol specification first and then implementing (though of course I don't know if that's what happened here). The result is that the specification just becomes "whatever the software does", and often the documentation is insufficient and you're reduced to reading the source code to reverse engineer the protocol. It's not awesome.

Access Control #

One thing that isn't clear to me is how access control is supposed to work. For instance, if I want to have a post that is only readable by some people how does this work? The situation is not at all clarified by the fact that the section on Authentication consists entirely of the word "TODO". However, ignoring the technical details, it seems like there are two major approaches, neither of which is really optimal.

A post is separately encrypted to each authorized reader.
The PDSs enforce access based on who is following a given user.

The first of these is straightforward technically, but operationally clunky as it requires not only knowing the public keys of all of your followers at the time you post, but also being able to go back and retroactively encrypt posts to new followers or when existing followers change their keys.

The alternative is less clunky, but requires a lot of trust in PDSes. To see why, consider the case when Alice is on PDS A and Bob and Charlie are on PDS B. Alice restricts her posts and Bob follows Alice but Charlie does not, so Charlie should not be able to see Alice's posts. However, when Alice posts something, it gets sent to PDS B, which then has to show it only to Bob but not Charlie. The obvious problem here is that Alice (hopefully) trusts her PDS but has no real relationship with PDS B; she just has to trust that it does the right thing (in Twitter, this is handled by just trusting Twitter). This is basically a generalized version of the problem that Alice has to trust Bob not to reveal her tweets, but it's obviously quite a bit worse in a system like this where there are a lot of PDSes, where we end up with a distributed single point of failure in the form of exposure to vulnerabilities and misbehavior by every PDS where she has a follower.

Actually, the situation is potentially worse than this: what about PDS C which doesn't have any of Alice's followers? What stops it from getting Alice's posts? The documents don't say how this works, but at a high level, I think what has to happen is that PDS A has to verify that each PDS requesting a copy of Alice's posts has at least one user that follows Alice (presumably by working forward from the DIDs on Alice's follower list), which seems kind of clunky.
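The check I'm guessing at would look something like the sketch below. To be clear, this is my reconstruction of what PDS A would have to do, not anything in the ATP documents; `mayFetchPosts` and the DID-to-PDS lookup are hypothetical.

```typescript
// Decide whether a requesting PDS should receive a user's restricted posts:
// only if at least one of the user's followers is hosted there.
function mayFetchPosts(
  requestingPds: string,
  followerDids: string[],
  pdsForDid: (did: string) => string | undefined // assumed DID -> PDS resolution
): boolean {
  for (let i = 0; i < followerDids.length; i++) {
    if (pdsForDid(followerDids[i]) === requestingPds) {
      return true;
    }
  }
  return false;
}

// Toy directory: Bob follows Alice from PDS B; Charlie is also on PDS B but
// does not follow her; nobody on PDS C follows her.
const hosting: Record<string, string> = {
  "did:plc:bob": "pds-b.example",
  "did:plc:charlie": "pds-b.example",
  "did:plc:dave": "pds-c.example",
};
const whereIs = (did: string): string | undefined =>
  Object.prototype.hasOwnProperty.call(hosting, did) ? hosting[did] : undefined;

const aliceFollowers = ["did:plc:bob"];
const pdsBAllowed = mayFetchPosts("pds-b.example", aliceFollowers, whereIs);
const pdsCAllowed = mayFetchPosts("pds-c.example", aliceFollowers, whereIs);
```

Note that this only gates which PDSes get the data. Once PDS B has Alice's posts, keeping them away from Charlie is entirely up to PDS B.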

Thoughts on System Architecture #

When looking at a system like this, I usually try to ignore most of the details and instead ask "what is the overall system architecture"? The idea is to understand at a high level what the various pieces are and how they fit together to try to accomplish various tasks. In RFC 4101 I phrased this as being at the "boxes and arrows" level:

Our experience indicates that it is easiest to grasp protocol models when they are presented in visual form. We recommend a presentation format centered around a few key diagrams, with explanatory text for each. These diagrams should be simple and typically consist of "boxes and arrows" -- boxes representing the major components, arrows representing their relationships, and labels indicating important features.

For instance, it doesn't really matter whether the communications between client and server use RPC, REST, or something else, but what does matter is who talks to who, and when. Given this kind of architectural description, an experienced protocol designer can generally design something that will work, even if two designers wouldn't build exactly the same thing. It's much harder to go the other way, from the detailed description to the architecture, and worse yet, it tends to obscure important questions.

I think that this high level description is what the Overview is trying to provide, but it's really more of an introduction and leaves a lot of big picture questions unclear that would be easier to understand if it were a more complete description of how stuff worked. For instance:

How does a PDS learn about new activity on another PDS?
How do the "crawlers" learn about new PDSes and the content in them?
How does access control work, for instance, if a post is private?
What are the scaling properties of the system?
What are the security guarantees around identities and integrity of the data?
How do you handle various kinds of abuse? For example, suppose that someone sends abusive messages to others: does each PDS (or user!) have to block them separately or is there some kind of centralized reputation system?

As an aside, these questions would all be a lot simpler in a centralized system.

This isn't just a matter of presentation, but also of design. In my experience, the right way to design a system is to start from this kind of top-level question and try to build (and document) an architecture that answers this kind of question and only then design the specific pieces, in part because the details often obscure issues that are visible at higher layers of abstraction (see, for instance, the discussion of DNS-based names and did:plc above). However, it also makes it easier for people to understand what you're talking about rather than forcing them to reverse engineer the structure of the system from the details, as is the case here. As noted above, I suspect this is a result of having a single implementation and then a spec which documents that implementation.

The Even Bigger Picture #

As the ATP authors acknowledge in their FAQ, there already is an existing federated social networking system based on ActivityPub, though in practice mostly centered around Mastodon. Mastodon seems to be having a bit of a moment now in the wake of the chaos surrounding Elon Musk's acquisition of Twitter:

For anyone wondering, Mastodon got over 70K sign-ups yesterday alone. Let's keep the momentum going! The "public square" of the web must not belong to any one person or corporation!
— Mastodon (@joinmastodon) October 30, 2022

Even so, it has a tiny fraction of Twitter's user base.

In general, experience suggests that it's pretty hard to start a competitive social network (Google Plus, I'm looking at you), but not primarily because it's technically hard. There are some real challenges in building a federated network, but building a non-federated system like Twitter is conceptually pretty easy, though of course operating at Twitter's scale is challenging. Rather, the issue is that social networks are network effect products (it's right there in the name) and so the initial value of the network when it has few users is very low. This is especially true with something like Twitter where so much of the value is in the feed of new content, as opposed to YouTube (or arguably TikTok), where someone can send you a link and you can just watch that one video.

Far more than any of the technical details, what made Bluesky interesting when compared to (say) Mastodon was that it was designed under the auspices of Twitter with the stated objective of being used by Twitter. If Twitter actually adopted ATP, then suddenly ATP would have a huge number of users, getting you past the entry barrier that other new social networks have to surmount. However, I was always pretty skeptical that this was going to happen, for two reasons.

First, Twitter, like most other free services, makes money by selling ads. If there were some easy way to stand up a service which interoperated with Twitter, including seeing everyone's tweets, but without showing Twitter's ads, that seems pretty straightforwardly bad for Twitter, which would then have to compete on user experience rather than on its user base moat.

Second, Twitter didn't need a fancy new protocol to allow for service interoperability; they could have just implemented ActivityPub. I recognize that there were technical objectives that ActivityPub didn't meet, but something is better than nothing and they could have used the time to develop something more to their liking and gradually migrated over. Obviously, this isn't ideal from an engineering perspective, but if what you wanted to do was get rid of Twitter's monopoly, then it would get you a lot further than taking three years to develop something new; when put together with Twitter's lack of an explicit commitment to use the Bluesky work this suggests that actually making Twitter interoperate was not a priority, even with the old Twitter management.

Of course, now Elon Musk owns Twitter, so whatever Jack Dorsey's intentions were seems a lot less relevant, and we'll just have to wait and see what, if anything, Musk decides to do. Perhaps it will involve a blockchain.

Actually over a Merkle Search Tree over the data, but the details don't much matter here. ↩︎
The documentation gestures at using the DID for "end-to-end encryption", but doesn't specify how that would happen. Building a system like this in practice is fairly complicated, so more work would be needed here. ↩︎

How to hide your IP address

2022-10-17T00:00:00Z

As I mentioned previously in my posts on private browsing and public WiFi, if you really want to keep your activity on the Internet private, you need some way to protect your IP address (i.e., the address that machines on the Internet use to talk to your computer) and the IP addresses of the servers you are going to. There are a variety of different technologies you can use for this purpose, with somewhat different properties. This post provides a perhaps over-long description of the various options.

The Basics #

As usual, with any security problem, we need to start with the threat model. We are concerned with three primary modes of attack:

The server learning the user's IP address and using it to identify them or correlate their activity.
The local network learning which servers the user is going to.
The server using your apparent geolocation as determined from your IP address to restrict access to certain kinds of content (soccer, BBC, whatever).

Of course, whether you think this last item is actually a form of attack that should be defended against depends on your perspective and maybe how big a Doctor Who fan you are.

The basic technique for defending against threats (1) and (3) is to push the traffic through some kind of anonymizing relay:

As shown in the diagram above, the client connects to the relay and tells it where to connect. It then sends traffic to the relay, which forwards it to the server. The relay replaces the client's IP address with its own, so the server just sees the relay's address. In general, the relay will be serving quite a few clients, so the server will find it hard to distinguish which one is which (k-anonymity).^[1] This simple version clearly addresses threat (1), and, if the relay operator lets you select an IP address outside your own geographic region, threat (3). In order to defend against threat (2) you also need to encrypt the traffic to the relay so that an attacker on your network can't see which server you are connecting to and the traffic you are sending to it (see here for more on this form of data leakage). Ideally, you would also encrypt the traffic end-to-end to the server (using TLS or QUIC), but that's just generally good practice, not required for the privacy provided by the relay.

Relaying Options #

This basic design is at the heart of every relaying system, but the details vary in important ways. There are three major axes of variation:

The network layer at which relaying happens
The number of hops in the network
Business model

We cover each of these below.

Network Layer #

The first major point of variation is the layer at which the relaying happens. Understanding this requires a bit of background on how the Internet networking protocols work.

IP #

The most basic protocol on the Internet is what's called, somewhat unsurprisingly, Internet Protocol (IP). IP is what's called a "packet switching" protocol, which means that the basic unit is a self-contained message called a packet. A packet is like a letter in that it has a source address and a destination address. This means that when you send an IP packet on the network, the Internet can automatically route the packet to the destination address by looking at the packet with no other state about either computer. A simplified IP packet looks like this:

The main thing in the packet is the actual data to be delivered from the source to the destination, also called the payload. The payload is variable length with a maximum typically around 1500 bytes. The packet also has a next protocol field which tells the receiver how to interpret the payload (more on this later) and a length field so that it is possible to tell how long the entire packet is, including the variable length payload.
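
To make those fields concrete, here is a minimal Python sketch of the simplified packet described above. The layout and the names (`SimplePacket`, `HEADER`, `encode`, `decode`) are my own assumptions for illustration; the real IPv4 header has more fields and a different layout.

```python
import ipaddress
import struct
from dataclasses import dataclass

# Toy header (NOT the real IPv4 layout): source address (4 bytes),
# destination address (4 bytes), next protocol (1 byte), total length (2 bytes).
HEADER = struct.Struct("!4s4sBH")


@dataclass
class SimplePacket:
    src: str
    dst: str
    next_protocol: int  # e.g. 6 for TCP, 17 for UDP
    payload: bytes

    def encode(self) -> bytes:
        # The length field covers the header plus the variable-length payload.
        total_length = HEADER.size + len(self.payload)
        header = HEADER.pack(
            ipaddress.IPv4Address(self.src).packed,
            ipaddress.IPv4Address(self.dst).packed,
            self.next_protocol,
            total_length,
        )
        return header + self.payload

    @classmethod
    def decode(cls, data: bytes) -> "SimplePacket":
        src, dst, proto, total_length = HEADER.unpack_from(data)
        # The length field tells us where this packet's payload ends.
        payload = data[HEADER.size:total_length]
        return cls(
            str(ipaddress.IPv4Address(src)),
            str(ipaddress.IPv4Address(dst)),
            proto,
            payload,
        )
```

Everything a router needs is in the header, which is why no per-connection state is required to forward the packet.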

Using IP is very simple: your computer transmits an IP packet on the wire and the Internet uses the destination address to figure out where to route it. When someone wants to transmit to you, they do the same thing.

TCP #

If all you want to do is send a thousand or so bytes from one machine to the other, a single IP packet might be OK, but in practice this is almost never what you want to do. In particular, it's very common to want to send a stream of data (e.g., a file) which is much longer than 1500 bytes. At a high level, this is done by breaking up the data into a series of smaller chunks and sending each one in a single packet. But of course, life isn't so simple. For instance:

Packets might be lost, and must be retransmitted so that the receiver gets them.
Packets might be reordered, and the receiver must know which order to put them in.
In general, the network will not be able to handle an entire large file at once, so the data must be gradually transmitted over time. The sender must have some way to determine the appropriate sending rate.

The Transmission Control Protocol (TCP) is responsible for taking care of these issues. The details of TCP are far too complicated to fit in this blog post, but at a high level, the data stream is broken up into segments, each of which has a length and a sequence number, which tells you where it goes in the stream. Each segment is sent in an IP packet. When the receiver gets a segment it can look at the sequence number to reconstruct the stream and is able to detect gaps where packets are missing. TCP also includes an acknowledgment mechanism in which the receiver tells the sender which segments it has received; this allows the sender to retransmit packets which were lost as well as to adjust its sending rate appropriately.^[2] TCP requires setting up state between the two endpoints; this state is termed a "TCP connection."
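
As a rough illustration of the sequence number idea (and nothing more: there is no rate control, no timers, and no real header here), the following Python sketch splits a stream into segments and reassembles whatever arrives. The function names `segment` and `reassemble` are hypothetical.

```python
import random


def segment(data: bytes, mss: int) -> list[tuple[int, bytes]]:
    """Split a byte stream into (sequence_number, chunk) pairs.

    The sequence number is the offset of the chunk within the stream.
    """
    return [(offset, data[offset:offset + mss]) for offset in range(0, len(data), mss)]


def reassemble(segments: list[tuple[int, bytes]]) -> tuple[bytes, int]:
    """Rebuild the contiguous prefix of the stream from segments in any order.

    Returns the bytes recovered so far and the next sequence number the
    receiver is waiting for, which is what it would acknowledge.
    """
    received = dict(segments)  # duplicates collapse, arrival order is irrelevant
    stream = bytearray()
    next_seq = 0
    while next_seq in received:
        chunk = received[next_seq]
        stream += chunk
        next_seq += len(chunk)
    return bytes(stream), next_seq


data = b"Four score and seven years ago"
segs = segment(data, 8)
random.shuffle(segs)  # simulate reordering in the network
recovered, ack = reassemble(segs)
```

If a segment is missing, `reassemble` stops at the gap and the returned acknowledgment number tells the sender where to resume.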

There are of course other protocols besides TCP which can run over IP (for instance, UDP, mentioned later). This is why you need the "next protocol" field in IP: to tell the receiver what protocol is in the IP payload.

TLS #

TCP is a very old protocol and like most of the older Internet protocols, it was designed before widespread use of encryption was practical. This is obviously bad news from a security perspective, and eventually people got around to fixing it. The standard solution is to carry the data over Transport Layer Security (TLS). TLS basically provides the abstraction of an encrypted and authenticated stream of data on top of a TCP connection. As with TCP, you need to set up some state to use TLS, and that's called a "TLS connection". I can talk endlessly about TLS but I won't do so here.

UDP and QUIC #

Applications do not implement TCP themselves. Instead it's built into the operating system, specifically in what's called the operating system kernel, i.e., the piece of the OS that's always running and is responsible for managing the computer as a whole. The client application tells the operating system to create a TCP connection to the server, which creates what's called a "socket" on the client side. The client writes data to the socket and the kernel automatically packages it up into TCP segments and transmits it to the other side, taking care of retransmission, rate control, etc. The kernel also reads TCP segments from the other side and makes them available to the application to read. Typically, the application implements TLS itself or more likely, uses some existing TLS library.
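
Here is roughly what that looks like from the application's side, as a small Python sketch using the standard `socket` module. The helper names `echo_once` and `roundtrip` are mine, and the "server" is just a throwaway echo loop on the loopback interface so the example is self-contained.

```python
import socket
import threading


def echo_once(listener: socket.socket) -> None:
    """Accept one connection and echo back whatever arrives until the peer closes."""
    conn, _addr = listener.accept()
    with conn:
        while chunk := conn.recv(4096):
            conn.sendall(chunk)


def roundtrip(message: bytes) -> bytes:
    # Port 0 asks the kernel to pick any free port on the loopback interface.
    with socket.create_server(("127.0.0.1", 0)) as listener:
        port = listener.getsockname()[1]
        server = threading.Thread(target=echo_once, args=(listener,))
        server.start()
        # The kernel does the TCP handshake; the application just gets a socket.
        with socket.create_connection(("127.0.0.1", port)) as sock:
            sock.sendall(message)            # kernel splits this into segments
            sock.shutdown(socket.SHUT_WR)    # tell the peer we are done writing
            reply = bytearray()
            while chunk := sock.recv(4096):  # kernel hands back in-order data
                reply += chunk
        server.join()
    return bytes(reply)
```

Note that nothing in this code mentions sequence numbers, retransmission, or sending rates: that all happens in the kernel.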

Why can't you write your own TCP stack? #

Obviously, you can write your own TCP stack (it's just software, after all) but the problem is that you can't install it, because on most operating systems, ordinary applications aren't allowed to write or receive raw IP datagrams. This is one of a number of restrictions on networking behavior that used to be used for security enforcement in a pre-cryptographic era. For instance, at one time it was assumed that if a packet came from a given machine address with a given "port number" (a field in the UDP/TCP header) it came from a privileged process (one that had operating system privileges). There was even a whole system for remote login based on this where you could be on machine A and execute commands on machine B without authenticating. I know this sounds absurd now, but this was the situation from the early 80s to the late 90s, when we finally got proper cryptographic authentication (at least some of the time).

This is convenient in that the application doesn't need to carry around its own TCP implementation, but inconvenient in that it's inflexible: suppose the application wants to make some change to TCP to make it more efficient? There's no way to do this without changing the operating system. By contrast, it's easy to change TLS behavior just by shipping a new version of the application. This became particularly salient in the late 2010s when people wanted to make performance enhancements to TCP but were unable to because the operating system didn't move fast enough. The solution was to invent a new protocol that could be implemented entirely in the application: QUIC.

QUIC is sort of like a combination of a fancier version of TCP and the cryptography of TLS (in fact, it uses many pieces of TLS internally). However, because it can be implemented entirely in the application, it can be changed very rapidly. Unfortunately, in most operating systems, applications are not allowed to write IP packets directly, and so QUIC runs over a protocol called the User Datagram Protocol (UDP). UDP is a very simple protocol which just lets applications send single units of data (datagrams) over IP. So, QUIC runs over UDP and UDP runs over IP.

The protocol stack #

It's conventional to talk about this as a "stack" of protocols and visualize it in a picture called a "layer diagram", like so:

I've also drawn on this diagram which pieces are implemented in the application and which are typically part of the operating system. When the application wants to write data, it starts at the top of the stack and data moves down to the network. As data comes in from the network, it moves up the stack towards the application.

In terms of the way the data appears on the network, each layer adds its own encapsulation, typically either before or after the data. The diagram below shows two examples. The first is data being sent over TCP, in this case the string "Four score and seven years ago". TCP adds its own header with the sequence number, etc. and then passes it to the IP layer, which adds the IP header with the source and destination addresses. The second example is the same data being sent over TLS. The TLS layer encrypts the data (shown by the crosshatching) and adds its own header. It then passes it to TCP, which adds its own header, etc. The receiving process reverses these operations.
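
The layering can be sketched in a few lines of Python. This is a toy: the "headers" are readable placeholder strings rather than real TCP or IP headers, there is no encryption step for the TLS case, and `encapsulate` and `decapsulate` are hypothetical names.

```python
def encapsulate(payload: bytes, layers: list[tuple[str, bytes]]) -> bytes:
    """Walk down the stack; each layer prepends its own header."""
    for _name, header in layers:
        payload = header + payload
    return payload


def decapsulate(wire: bytes, layers: list[tuple[str, bytes]]) -> bytes:
    """Walk back up the stack, stripping headers in the reverse order."""
    for name, header in reversed(layers):
        if not wire.startswith(header):
            raise ValueError(f"missing {name} header")
        wire = wire[len(header):]
    return wire


# Layers are listed from the top of the stack down to the network.
stack = [
    ("TCP", b"[TCP seq=0]"),
    ("IP", b"[IP 192.0.2.1>198.51.100.7]"),
]
wire = encapsulate(b"Four score and seven years ago", stack)
```

The lowest layer's header ends up outermost on the wire, which is why the receiver has to process IP first and TCP second.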

Naming chunks of data #

You'll probably notice that I've been using the terms "packet", "record", etc. These are not interchangeable. One of the most annoying problems in networking is how to name a single unit of data like a packet (sometimes called generically a protocol data unit (PDU)). Each protocol tends to have its own term for this, partly just due to being defined by different people and partly because when you are working at multiple layers of the protocol stack it's a pain to talk about "IP datagrams", "UDP datagrams", etc. Here's my incomplete table of names for PDUs in different protocols:

Protocol	Name
Ethernet	Frame
IP	Packet (datagram)
UDP	Datagram
TCP	Segment
TLS	Record
QUIC	Packet (but it has things inside it called frames)
HTTP	Message
RTP	Packet (but they carry media frames)
OpenPGP	Packet
XMPP	Stanza

One thing that's important to know is that TCP and TLS provide the abstraction of a stream of data, not a set of records. What this means is that the application just writes data and the TLS stack or the TCP stack coalesces those chunks into one record (packet) or breaks them up at its convenience. The TCP stack might even send the same data twice with two different framings. For instance, suppose that the application writes "Hello" and then the kernel sends it in a single packet. While the packet is in flight, the application writes "Again". If both packets get lost, and the kernel has to retransmit them, it might write them as a single TCP segment ("HelloAgain").
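
The "HelloAgain" case is easy to reproduce with a toy send buffer. This Python sketch is my own simplification (the class `ToySendBuffer` and its methods are hypothetical, and a real TCP stack is far more involved), but it shows why the framing of a retransmission need not match the original.

```python
class ToySendBuffer:
    """Holds bytes the application has written but the peer has not acknowledged."""

    def __init__(self) -> None:
        self.unacked = bytearray()
        self.base_seq = 0  # sequence number of the first unacknowledged byte
        self.sent = 0      # how many bytes of `unacked` have been sent at least once

    def write(self, data: bytes) -> None:
        self.unacked += data

    def send_new(self) -> tuple[int, bytes]:
        """Send everything not yet transmitted as one segment."""
        seg = (self.base_seq + self.sent, bytes(self.unacked[self.sent:]))
        self.sent = len(self.unacked)
        return seg

    def retransmit(self) -> tuple[int, bytes]:
        """Resend all unacknowledged bytes, coalesced into a single segment."""
        return (self.base_seq, bytes(self.unacked))

    def ack(self, ack_seq: int) -> None:
        """The peer has received every byte before ack_seq."""
        acked = ack_seq - self.base_seq
        del self.unacked[:acked]
        self.base_seq = ack_seq
        self.sent = max(0, self.sent - acked)


buf = ToySendBuffer()
buf.write(b"Hello")
first = buf.send_new()    # segment carrying "Hello" at sequence 0
buf.write(b"Again")
second = buf.send_new()   # segment carrying "Again" at sequence 5
again = buf.retransmit()  # one segment carrying "HelloAgain" at sequence 0
```

The receiver doesn't care, because it reassembles by sequence number rather than by segment boundaries.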

Which Layer #

With this as background, we are ready to talk about one of the big points of diversity: what layer are we relaying the traffic at? There are two main options, at least for relaying encrypted traffic.

Relay the IP-layer traffic
Relay the application layer traffic (i.e., the data that would go over UDP or TCP)

I cover both of these below.

Relaying IP Traffic #

Encrypting traffic at the network layer (IP) is one of the obvious ways to address network security issues, as it has the important advantage that once you have set it up, it secures all communications between two endpoints. Work on this goes all the way back to the 1970s, but the IETF started standardizing technology for this purpose in 1992 under the name IPsec. The original idea was actually not so much the kind of relaying system that I discussed above but rather that you would encrypt traffic between the two machines that were communicating with each other. So, for instance, if my client wanted to communicate with your server, we would take the IP packets we wanted to send, encrypt them, and send them directly.

Like the protocols we discussed above, IPsec is an encapsulation protocol, which means that to encrypt an IP packet from A to B we take the entire original packet, encrypt it,^[3] and then stuff it in another IP packet, like so:

In the scenario I was discussing above, the inner (encrypted) IP header and the outer (plaintext) IP header will have the same addressing information, but it's of course possible to have them have different addressing information, which is useful for creating what's called a Virtual Private Network (VPN). The motivating idea here is that you have two networks (say two offices from the same company) and you want to connect them as if they were in the same location. Inside the office, you trust that the wires haven't been tampered with (this is before WiFi) and so you don't encrypt all your data (I know, this sounds naive now), and so what you really want is just a wire connecting office 1 and office 2. This kind of private connection—what used to be called a "leased line"—is very expensive to buy and what you actually have is an Internet connection which lets you connect to everyone. But if you encrypt the traffic between office 1 and office 2, then you can simulate having your own private wire. Hence virtual private network. The typical topology looks like this:

In this scenario, you have two offices, each of which has a "VPN gateway" which detects traffic that is destined from office 1 to office 2 and encrypts it before sending it along. Other traffic, say to Facebook, is left untouched. When the packets are received at the far VPN gateway, it just removes the encapsulation and drops them on the network. The effect is as if there were a single network rather than two networks.
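
The gateway's decision is essentially a prefix match on the destination address. A minimal Python sketch, assuming a made-up addressing plan (the prefix, the gateway address, and the function name `forward` are all hypothetical):

```python
import ipaddress

# Hypothetical addressing plan, for illustration only.
OFFICE_2_NETWORK = ipaddress.ip_network("10.2.0.0/16")
OFFICE_2_GATEWAY = "198.51.100.2"


def forward(dst: str) -> str:
    """Decide what the office 1 gateway does with a packet bound for `dst`."""
    if ipaddress.ip_address(dst) in OFFICE_2_NETWORK:
        # Destined for the other office: wrap it and send it to the far gateway.
        return f"encrypt and tunnel to {OFFICE_2_GATEWAY}"
    # Anything else (say, Facebook) goes out untouched.
    return "send unmodified"
```

For example, `forward("10.2.3.4")` selects the tunnel, while `forward("203.0.113.80")` leaves the packet alone.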

It's also possible to deploy this kind of thing in a simpler scenario where a single user VPNs into their office network, for instance if you are in a hotel working remotely, as shown in the diagram below:

The effect here is that it's like you were in the office, but you're actually not. But this brings up a real problem, which is that the remote user's machine doesn't have the right IP address: it has an IP address associated with the user's home or hotel (192.0.2.1 in the diagram above) but you want it to appear to be in the office, which means it has to have an office IP address (something starting with 203.0.112).

There are two major ways to make this work. In the first, the VPN gateway tells the user's device what IP address it wants it to have, and then the user's device puts that in the inner IP header, while the outer IP header has the device's actual address. For instance, the inner (encrypted) IP header would have 203.0.112.50 and the outer (plaintext) IP header would have 192.0.2.1. The alternative is to have both headers have the user's actual IP address and to have the VPN gateway translate that address into an appropriate local address for the office network (and translate in the other way on the return trip). Note that in both cases, the gateway needs to do some work, in the first case to keep track of what addresses were assigned and to enforce that the client uses the right one, and in the second case to do the translation.
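
Here is a sketch of the bookkeeping for the second (translation) approach, in Python. The class name `ToyVpnGateway`, the address pool, and the representation of a packet as a bare `(source, destination)` pair are all assumptions made for illustration; a real gateway would also have to deal with ports, timeouts, and pool exhaustion.

```python
import ipaddress


class ToyVpnGateway:
    """Maps each remote client's real address to an address on the office network."""

    def __init__(self, pool_cidr: str) -> None:
        self._free = iter(ipaddress.ip_network(pool_cidr).hosts())
        self._office_by_client: dict[str, str] = {}
        self._client_by_office: dict[str, str] = {}

    def office_address(self, client_addr: str) -> str:
        # Allocate an office address the first time we see this client.
        if client_addr not in self._office_by_client:
            office = str(next(self._free))
            self._office_by_client[client_addr] = office
            self._client_by_office[office] = client_addr
        return self._office_by_client[client_addr]

    def outbound(self, src: str, dst: str) -> tuple[str, str]:
        """Client to office: rewrite the source to the client's office address."""
        return self.office_address(src), dst

    def inbound(self, src: str, dst: str) -> tuple[str, str]:
        """Office to client: rewrite the destination back to the real address."""
        return src, self._client_by_office[dst]


gw = ToyVpnGateway("10.0.5.0/29")  # hypothetical pool of office addresses
to_office = gw.outbound("192.0.2.1", "10.0.9.9")
to_client = gw.inbound("10.0.9.9", to_office[0])
```

The first approach needs the same table; the difference is that the client is told its assigned address and the gateway checks the inner header against it instead of rewriting it.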

With that background, we can finally get to the problem statement that we started with, namely concealing user behavior. Unsurprisingly you can use the same technology as you use for remote access, with the difference that the VPN gateway is on the Internet directly rather than on some enterprise network, as shown below:

To the server, this just looks like the user is connecting from the VPN gateway, with whatever the IP address of the VPN gateway is. The client's local network just sees a connection to the VPN server, but doesn't know where the data is eventually going.

Here I've focused on IPsec, but it doesn't really matter which encryption layer protocol you use to carry the IP packets: they're just being encapsulated and transported end-to-end. In practice, one sees VPNs deployed with a variety of transport protocols, including DTLS, OpenVPN, WireGuard and QUIC. From the user's perspective, the properties of these protocols are largely the same. Most products that are labeled "VPN" protect traffic at the IP layer using one or more of these protocols.

Relaying Application Layer Traffic #

As mentioned above, the nice thing about protecting traffic at the IP layer is that it protects all the traffic on the system. However, the bad thing is that protecting IP layer traffic requires cooperation from the operating system. This has several undesirable consequences:

Your code isn't portable between operating systems.
Many operating systems require some kind of administrator access in order to install or configure something that acts at the IP layer.
You are often limited to whatever affordances the OS offers you. For instance, you may not easily be able to protect some traffic and not other types of traffic.

These issues can be addressed by relaying at the application layer rather than the IP layer. This can be implemented entirely in the application without touching the operating system; the application just connects to the relay (e.g., over TCP) and sends the traffic to the relay (hopefully encrypted to the server). The relay makes its own transport-level connection to the server and sends the application level traffic to the server, as shown below.

Note that in this diagram there are two TCP connections, one between the client and the relay and one between the relay and the server. The client connects to the relay over TLS and then on top of that creates an end-to-end TLS connection to the server (you could of course not encrypt your data to the server, but don't do that).

One of the big advantages of this design is that it makes it easy to relay some kinds of traffic and not others. As a concrete example, consider Safe Browsing, which leaks information about the user's browsing history to the Safe Browsing server. You might want to proxy Safe Browsing checks (which can be done very cheaply because there isn't much traffic) but not generic browsing traffic (which is much higher volume and hence more expensive). This is easy for the browser to do because it knows which traffic is which but is more difficult for an IP-layer system, which has to somehow distinguish different types of traffic. It's not necessarily impossible but it's significantly more work. For instance, if Safe Browsing uses a separate IP address from the rest of Google, then you could just relay that traffic, but if it shares the same IP address, then you will be encrypting people's search traffic as well.
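
The difference between the two vantage points can be sketched as follows. All hostnames, addresses, and function names here are hypothetical, and the `DNS` table is a stand-in for real name resolution.

```python
from urllib.parse import urlsplit

# Hypothetical hostnames and addresses, for illustration only.
RELAYED_HOSTS = {"safebrowsing.example"}
DNS = {
    "safebrowsing.example": "198.51.100.10",
    "search.example": "198.51.100.10",  # shares an address with Safe Browsing
    "news.example": "198.51.100.20",
}
RELAYED_ADDRESSES = {DNS[host] for host in RELAYED_HOSTS}


def app_layer_route(url: str) -> str:
    """The application knows the hostname, so it can decide per request."""
    return "relay" if urlsplit(url).hostname in RELAYED_HOSTS else "direct"


def ip_layer_route(url: str) -> str:
    """An IP-layer system only sees the destination address of each packet."""
    return "relay" if DNS[urlsplit(url).hostname] in RELAYED_ADDRESSES else "direct"
```

With this table, `app_layer_route("https://search.example/q")` goes direct, but `ip_layer_route` sends it through the relay because the address is shared with the Safe Browsing host.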

A number of IP concealment systems relay at the application layer, including Tor, Apple's iCloud Private Relay, and Firefox Private Network. Typically, systems like this are referred to as "proxies". Apple's system is interesting in that it's implemented in the operating system mostly by hooking Apple's higher level networking APIs. Even so, it only works in Safari, not in other applications.^[4]

How many hops? #

Whatever the relaying technology, at the end of the day the relay needs to send traffic to the server, which means it has to know what server you're connecting to. But this creates a new privacy problem: you're connecting to the relay and then telling it which server to connect to. This means that while you've prevented the server from learning your identity, you still have a privacy problem with respect to the relay itself. The relay will have some privacy policy about how it handles this information (ideally, not keeping logs at all), but that's just something you have to trust them on. Even better would be to have some form of a technical protection.

The standard approach to providing technical protection here is to have multiple layers of relaying, as shown in the diagram below:

The way this works is that the client connects to Relay 1. It then tells Relay 1 to connect it to Relay 2. As with our single-hop system, that data is sent over the encrypted channel to Relay 1 and is itself encrypted to Relay 2. The client then tells Relay 2 to connect it to the server. The data to the server is thus encrypted three times by the client, in a nested fashion: once to the server, then to Relay 2, and then to Relay 1.^[5] Each hop strips off one layer of encryption and passes it to the next hop.
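
The nesting is easier to see in code. The Python sketch below uses a toy XOR "cipher" built from SHA-256 purely so that it runs with the standard library; it is NOT secure, and real systems negotiate keys with each hop and use authenticated encryption. The keys, hostnames, and function names (`toy_cipher`, `wrap`, `unwrap`) are all hypothetical.

```python
import hashlib


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream. Illustration only, NOT secure.

    Because XOR is its own inverse, the same call encrypts and decrypts.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(key: bytes, next_hop: str, data: bytes) -> bytes:
    """Encrypt (next hop, data) so only the holder of `key` learns the next hop."""
    return toy_cipher(key, next_hop.encode() + b"\n" + data)


def unwrap(key: bytes, blob: bytes) -> tuple[str, bytes]:
    """Strip one layer: recover the next hop and the still-encrypted remainder."""
    next_hop, _, rest = toy_cipher(key, blob).partition(b"\n")
    return next_hop.decode(), rest


K_RELAY1, K_RELAY2, K_SERVER = b"key-1", b"key-2", b"key-s"

# Client: encrypt to the server first, then to Relay 2, then to Relay 1.
to_server = toy_cipher(K_SERVER, b"GET /")
for_relay2 = wrap(K_RELAY2, "server.example", to_server)
for_relay1 = wrap(K_RELAY1, "relay2.example", for_relay2)

# Each hop strips exactly one layer.
hop1, inner1 = unwrap(K_RELAY1, for_relay1)  # Relay 1 learns only "relay2.example"
hop2, inner2 = unwrap(K_RELAY2, inner1)      # Relay 2 learns only "server.example"
request = toy_cipher(K_SERVER, inner2)       # Server recovers the request
```

Relay 1 never sees `server.example` in the clear, and Relay 2 only ever talks to Relay 1, which is the separation of knowledge the design is after.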

The result is that no single entity (other than the client) gets to see both the user's identity and the identity of the server it's connecting to. Here's what each sees:

Entity	Knowledge
Relay 1	Client address, Relay 2 address
Relay 2	Relay 1 address, Server address
Server	Relay 2 address, Server address

Note that if the two relays collude, they can together uncover the client's address and the server's address. However, if either is honest, then the client's privacy should be protected, as neither can easily collude with the server to learn this information: relay 2 because it does not know the client's address and relay 1 because (hopefully) the client's connection to relay 2 is one of many connections it has made to relay 2 during this time period. How well this last part works depends on the scale of operation of the system, how long the client leaves the connection up, whether it reuses the connection to relay 2 for connections to multiple servers, etc.

Of course, in order for this to work, the relays need to be operated by different entities. Otherwise there's no meaningful guarantee of non-collusion. This includes not being run on the same cloud service provider (e.g., AWS). Sometimes you'll hear about multi-hop VPNs but if the same company is providing both VPN servers, then this doesn't really help. One nice feature of iCloud Private Relay is that your account is with Apple but they arrange for multiple hops with different providers, so you don't need to worry about the details.

One important limitation of multiple hops is that it can have a negative impact on performance. In general, the routing algorithms that run the Internet try to find a reasonably efficient route between two locations^[6] and so you should expect that if instead of routing between point A and point B you route from A to C to B, then this will be somewhat slower (you'll often hear people use the term triangle inequality as shorthand for this). The more hops you do, the more likely it is you will have some kind of performance impact. This isn't a precise effect, but in general, you should expect to have some impact.

iCloud Private Relay is a two hop network, with the first hop being operated by Apple and the second hop being a large provider that Apple has contracted with (mostly Content Delivery Networks (CDN) like Cloudflare or Akamai). Both Apple and these CDNs have fast connectivity and good geographic distribution, which is intended to ensure high performance. Tor uses three hops, a "guard node", a "middle relay" and an "exit node". As discussed below, Tor relays are effectively volunteer services, so performance varies in practice.

Business Model #

Your typical VPN has a simple business model: you pay the VPN provider and then authenticate to them (e.g., with a password) when you connect. This isn't ideal for privacy because they know your name, contact information, and credit card number.^[7] On the other hand, as described above, they already know your IP address and which sites you're going to, so it's not clear how much worse this makes things.

With Private Relay, however, this would create a real problem: it's not so bad with the first hop relay because that gets your IP address anyway, but if you authenticate to the second relay with your identity, then you've ruined everything and you might as well be back with a single hop system. In order to address this problem, Apple uses anonymous credentials generated using blind signatures to authenticate to the proxy, as shown below:

Briefly, the way this works is that the client connects to Apple and authenticates to it using its iCloud account. Apple then issues an anonymous credential that doesn't contain the user's identity. This credential can be provided to the relay to authorize use of the service. In order to prevent Apple from linking up these two activities the credential is blinded (essentially encrypted) when Apple generates it, and then the client unblinds it before sending it to the relay (see here for more detail on how this kind of credential works). This design allows the proxy to know that you are authorized to use the service but not to see who you are.
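
To give a feel for blinding, here is a textbook RSA blind signature in Python with a deliberately tiny key. I'm assuming RSA purely for illustration; the text above doesn't say which scheme is used, and a real deployment would use large keys, hashing, and padding. The function names and the token value are hypothetical.

```python
import math
import secrets

# Toy RSA key for the issuer. Far too small for real use.
P, Q = 61, 53
N = P * Q                           # public modulus
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent, known only to the issuer


def random_blinding_factor() -> int:
    """Pick r with gcd(r, N) == 1 so that r has an inverse mod N."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return r


def blind(token: int, r: int) -> int:
    """Client: hide the token from the issuer."""
    return (token * pow(r, E, N)) % N


def sign_blinded(blinded: int) -> int:
    """Issuer: sign without learning the underlying token."""
    return pow(blinded, D, N)


def unblind(blinded_sig: int, r: int) -> int:
    """Client: remove the blinding to get an ordinary signature on the token."""
    return (blinded_sig * pow(r, -1, N)) % N


def verify(token: int, sig: int) -> bool:
    """Relay: check the issuer's signature using only the public key."""
    return pow(sig, E, N) == token % N


token = 1234                        # stands in for a random credential value
r = random_blinding_factor()
sig = unblind(sign_blinded(blind(token, r)), r)
```

The issuer only ever sees `blind(token, r)`, so when the relay later sees `(token, sig)` the issuer cannot match it to the signing request, yet `verify(token, sig)` still succeeds.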

Tor is different from either of these because it's a free service, operated by members of the community (you can donate to people who run relays). This creates some unpredictable performance consequences because there really isn't much in the way of a Service Level Agreement (SLA). It also makes it somewhat hard to assess the actual privacy guarantees, because some of the Tor nodes might be run by people you don't trust or who are actively malicious. Obviously, with iCloud Private Relay you have to judge for yourself how much you trust Apple and its partners, but at least you have some idea who they are.

Summary and Final Thoughts #

IP addresses are an important and highly effective tracking vector and if you want to browse privately you need to do something to conceal your IP, and this mostly means relaying. Any relaying system will conceal your identity from the server, as long as your provider isn't colluding with the server. Any one hop system necessarily means that you are trusting the provider not to track your behavior and not to collude with the server. Depending on how you feel about your local network and its privacy policies, a single hop system might or might not be an improvement (see Yael Grauer's article in Consumer Reports for more on this). A multi-hop system has a much better privacy story because misbehavior by a single relay is not sufficient to compromise your privacy.

The technical details of how the system works (IP versus application layer, mostly) don't matter that much for privacy but do matter for functionality, with application layer systems being more flexible but providing less complete coverage for other applications on your device. In addition, all of the multi-hop systems that I know are at the application layer, so as a practical matter if you want a multi-hop system you probably will be using an application layer system.

Finally, it's important to know that even the best system provides only limited protection. An attacker who has a complete view of the network can often do enough traffic analysis to determine who is on each end of the traffic. Fortunately, most of us do not need to worry about this powerful an attacker.

Some people run their own relays, in which case they might successfully conceal their identity, but because they will be the only user, they'll be trackable by the IP of that relay. ↩︎
The way this works is that when you are sending too quickly, packets get dropped by the network, so the sender can use the rate of loss as a signal that its sending rate is too high. ↩︎
Yes, I'm ignoring "transport mode", in which you just carry the UDP or TCP datagram. ↩︎
This includes other browsers on iOS even though those browsers are required to use Apple's WebKit engine. As far as I can tell, this is just a policy choice on Apple's side, not any kind of technical limitation. ↩︎
You'll sometimes hear the term "onion routing" applied to this, especially with Tor. ↩︎
This is a hideously complicated topic all on its own. ↩︎
Yes, you could pay with Bitcoin but don't think that's private. ↩︎

Self-Driving Vehicles, Monoculture, and You

2022-10-10T00:00:00Z

Warning: this post didn't come out quite as tight as I was hoping. I think there are a bunch of interesting ideas and connections to be drawn, but they don't hang together as well as I wanted. That said, I'm not quite sure how to improve things, and so I'm just going to post it as-is. The Internet has plenty of bits, after all.

Max Chafkin's article arguing that self-driving cars are failing is making the rounds, especially this amazing opening bit:

The first car woke Jennifer King at 2 a.m. with a loud, high‑pitched hum. “It sounded like a hovercraft,” she says, and that wasn’t the weird part. King lives on a dead-end street at the edge of the Presidio, a 1,500-acre park in San Francisco where through traffic isn’t a thing. Outside she saw a white Jaguar SUV backing out of her driveway. It had what looked like a giant fan on its roof—a laser sensor—and bore the logo of Google’s driverless car division, Waymo.

She was observing what looked like a glitch in the self-driving software: The car seemed to be using her property to execute a three-point turn. This would’ve been no biggie, she says, if it had happened once. But dozens of Google cars began doing the exact thing, many times, every single day.

King complained to Google that the cars were driving her nuts, but the K-turns kept coming. Sometimes a few of the SUVs would show up at the same time and form a little line, like an army of zombie driver’s-ed students. The whole thing went on for weeks until last October, when King called the local CBS affiliate and a news crew broadcast the scene. “It is kind of funny when you watch it,” the report began. “And the neighbors are certainly noticing.” Soon after, King’s driveway was hers again.

Waymo disputes that its tech failed and said in a statement that its vehicles had been “obeying the same road rules that any car is required to follow.”

Here's the thing, though: Waymo is right. It wouldn't be a big deal if just the occasional person did a K-turn in King's driveway (who among us hasn't turned around in someone's driveway?), but when everyone does it, then it's a disaster, at least for King. However, it's a little harder to pinpoint exactly what's wrong here.

There's an obvious account of this situation, which is that this is a case of AI risk, incentive alignment, and the famous paperclip optimizer. In this version of the story, Google's system for training their cars is only interested in saving time (or wear on the cars, or whatever) and doesn't take into account the externalities of their behavior, so it's perfectly happy to keep people up all night with car noise if it saves a few seconds or minutes.

There certainly is some kind of alignment problem here, but I think this analysis doesn't quite capture it. As I said above, the problem isn't that any particular car does a K-turn in King's driveway, but that all of them do. Even if we ignore externalities, it's not clear that this is an optimal solution: according to the story there were cars lining up to make this turn, at which point you should be wondering if this really is the fastest way for them to accomplish their objective. This suggests another analysis, which is that this is a locally optimal approach which isn't globally optimal, even if we ignore externalities.

This shouldn't be an unfamiliar concept: there are lots of things which work at a small scale but not at a large scale. There are at least two possible failure modes that one can encounter:

This just isn't scalable at all
You need some diversity

Unsustainable Scaling #

Most people are used to systems that have unsustainable scaling. Sometimes this is due to externalities, such as with air pollution. Back when only a few people had cars, it didn't really matter that a typical internal combustion engine emitted way too much NOₓ, but put enough cars on the road and you get acid rain, hence catalytic converters. The situation with CO₂ and climate change is similar: we can only dump so much into the atmosphere before whatever homeostasis there is starts to break down.

Other cases of unsustainable scaling aren't so much due to externalities as due to resource constraints. We saw that early in the COVID pandemic, where we had really effective COVID tests based on PCR but there were only a few labs that could do them. Those tests have become more standardized, but we also now have cheap lateral flow tests that scale. I understand that this is also a problem in educational interventions, which often seem to work in pilot projects with teachers who are committed to the idea but don't scale well when you need every teacher to do it.

The need for diversity #

Another possibility is that you actually do have something scalable, as long as not everyone tries to do exactly the same thing. It might be the case that there are hundreds of little hacks like this, and if only a few cars used each of them, it would be fine, so you just need diversity rather than uniformity. The common example of this is of course monoculture in crops: you actually can get very high yields this way, but you end up with a brittle system. However, there are also situations in which the whole system falls apart if you don't have some diversity.

This is a familiar concept in networking, where, like above, you often have some resource that needs to be shared between multiple agents and if they don't share nicely, everything collapses.

Avalanche Restart #

One well-known case is what's called "avalanche restart". Suppose that you have a server that is under heavy load (i.e., has a lot of clients) and then for some reason it reboots.

Of course, this is experienced by clients as a failure, and they try to reconnect. The obvious thing to do is to try to reconnect immediately and if that fails try again (i.e., in what's called a "tight loop"). This is locally optimal, because it lets you reconnect quickly, but globally bad: if everyone does this, you can overload the server or the network that the server is on, which leads to bad service for everyone as the server tries to switch between every client, and might even cause it to reboot again (this shouldn't happen, but all software has bugs).

There are two standard techniques to address this problem:

Instead of having the clients retry immediately, have them wait a random time (e.g., between 1 and 10 seconds).
If the client fails to connect, then it increases (typically, doubles) the amount of time it waits before the next retry. This is called "exponential backoff".

Typically, these are used together, so you randomly start and then exponentially back off. The net effect is that you don't have every client trying at the same time, and the rate of clients attempting automatically adjusts until the server isn't overloaded.
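The combination of the two techniques can be sketched in a few lines of Python. This is a minimal illustration rather than any particular library's retry logic: the function names, the 1 to 10 second initial window, and the 300 second cap are all assumptions made for the example.

```python
import random
import time


def retry_delays(max_attempts, first_min=1.0, first_max=10.0, cap=300.0, rng=random):
    """Yield the wait (in seconds) to use before each reconnection attempt."""
    # Random start: pick the first wait at random so clients don't all retry at once.
    delay = rng.uniform(first_min, first_max)
    for _ in range(max_attempts):
        yield delay
        # Exponential backoff: double the wait after every failure, up to a cap
        # (the cap is an assumption of this sketch, not part of the description above).
        delay = min(delay * 2, cap)


def reconnect(connect, max_attempts=8, sleep=time.sleep, rng=random):
    """Call connect() until it returns True; return the attempt number, or None."""
    for attempt, delay in enumerate(retry_delays(max_attempts, rng=rng), start=1):
        sleep(delay)
        if connect():
            return attempt
    return None
```

Passing `sleep` and `rng` as parameters is just a convenience so the schedule can be exercised without actually waiting; a client that "defects" would correspond to calling `connect()` in a loop with no `sleep` at all.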

Obviously, this isn't locally optimal: if the server has very few clients it would be better if the clients just reconnected immediately. Moreover, if everyone else is following the random start + exponential backoff approach, then it's obviously advantageous for a single client to just try to reconnect aggressively (to "defect" in the game theory jargon). But if everyone defects, then the result is that the server is over capacity and most people get terrible service. The point here is that it's better for everyone to do something slightly suboptimal but different than it is for everyone to do the same thing, even if it's locally optimal.

NiCad Battery Memory #

I had originally been intending to write about the famous Nickel Cadmium battery "memory" phenomenon. The way the story is usually told is that there was a satellite that was powered by solar panels and used NiCad batteries to store energy during periods when the panels weren't illuminated (due to the Earth being in the way of the sun). Because the orbit is very regular (and there's no weather in space), the battery was charged and discharged on a repeating regular schedule. Eventually, it started exhibiting decreased storage at the point where it would usually start being charged. However, attempts to reproduce this phenomenon seem to have had mixed results.

Network Transmission Rate Control #

A similar situation occurs with network rate control. A good example is the classic Ethernet local area network. In original Ethernet, every computer was on the same wire and so whatever you send is received by every other computer and vice versa, just like a radio network. But two computers can't transmit at the same time because they will step on each other. The question then becomes how to divide up the time.

One way to address this problem is to have defined time slices during which each node can transmit, but this requires tightly coordinated clocks and doesn't adapt well if one node wants to transmit a lot and the others want to listen. Instead, Ethernet solves this problem by having each node transmit as soon as it has something to send and no other node is transmitting, but it also detects if another node also chooses the same time to start (a "collision"). If there is a collision, each node picks a random amount of time to wait before it tries to start transmitting again.^[1] This way, the chance of a repeated collision is relatively low. Obviously, it would be better for each node to retransmit right away, but if everyone does that you will just get collisions again.

Here too, you get a more globally optimal result if everyone does something that's locally suboptimal.
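A toy simulation makes the difference visible. This is a deliberately simplified model, not the real Ethernet algorithm: after a collision, each node picks one of `num_slots` waiting slots, and the retry is counted as a success if exactly one node picked the earliest slot (the others would see the wire in use and hold off). The function names and slot counts are assumptions for illustration.

```python
import random


def retry_succeeds(num_nodes, num_slots, rng):
    """One retry round: True if exactly one node picked the earliest slot."""
    slots = [rng.randrange(num_slots) for _ in range(num_nodes)]
    return slots.count(min(slots)) == 1


def success_rate(num_nodes, num_slots, trials=10_000, seed=0):
    """Fraction of retry rounds that avoid a repeated collision."""
    rng = random.Random(seed)
    wins = sum(retry_succeeds(num_nodes, num_slots, rng) for _ in range(trials))
    return wins / trials


# num_slots=1 models "everyone retransmits right away": every retry collides.
print(success_rate(num_nodes=2, num_slots=1))
# With 16 slots, two nodes only collide again if they pick the same slot.
print(success_rate(num_nodes=2, num_slots=16))
```

For two nodes and 16 slots, the chance of picking different slots is 15/16, so the second figure should come out a little under 94%.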

Some other potential cases #

I'm not trying to suggest that this is some brilliant insight, but nevertheless it's an effect we see surprisingly often. Some other examples of similar phenomena:

Complaints that because of Instagram everyone goes to the same places for vacation.
Heavy congestion on popular hiking and running trails because everyone wants to do Rae Lakes, JMT, etc. and they've had to institute a quota system, even though there are lots of great trails that are basically empty. Pro Tip: quotas only apply to camping, so if you can trail run it in one day you can do anything.
Congestion on "alternate" routes that avoid rush hour traffic on the major arterials. This is a similar case because it would be fine if just a few people did it, but we can't have everyone driving through downtown Palo Alto to get from 101 to 280. We see this some organically but I've often wondered if traffic sensitive navigation systems like Waze and Google Maps that reroute you to alternate routes make efforts not to send everyone there.

There's also a whole game theory literature on what's called mixed strategies which is in part about how it's often better to play a mix of multiple strategies rather than a single uniform one. There's a connection here to the tragedy of the commons (and of course to Prisoner's dilemma) as well.

Coordination #

As I said, this is a pretty common problem, but it can be pretty hard to address when you have a bunch of individual agents all making their own decisions. Above, I've mostly talked about how each agent has an incentive to defect and get a locally optimal solution, even if it's not globally optimal, but even if every agent plays by the rules, it can still be very hard to design a system that produces the right result.

As a concrete example, early implementations of the TCP network protocol implemented an algorithm for controlling the transmission rate that could fail catastrophically, resulting in what's called "congestion collapse", in which the network was entirely full of traffic, but it was mostly retransmitted data and almost no real progress was being made (Van Jacobson and Karels have an approachable account of what happened and the fix). The problem of designing rate control algorithms that perform well but don't result in congestion collapse has occupied network engineers ever since. The fundamental problem here is the lack of a centralized point of view and control: instead, each agent has to make its own decision independently, and designing an efficient algorithm is hard.

This is actually the part I find a bit puzzling about the whole Waymo thing: surely the Waymo engineers know about this general phenomenon and they do have an overall view of what's happening, so it would be natural to put in some sort of throttling system so that not every car tries the same hack at once, or even to detect congestion in real time. Do they not have a system like this? Is this still the optimal algorithm in terms of car time, even though it's annoying for homeowners? Something else? Waymo people, my DMs are open!

There is also an exponential backoff component here in case of another collision. ↩︎

On the Security and Privacy Properties of Public WiFi

2022-09-25T00:00:00Z

One of the most common security and privacy questions I get is whether it's safe to use public WiFi networks (and whether you should use a VPN). The answer is "it depends", for the reasons I lay out below. If you want to skip the rest of this, I'll tell you that I mostly just use airport and hotel WiFi but am more hesitant about it if I have to log in with my own identity.

"Safe" is a difficult word that covers a lot of territory. At a high level, there are three main threats one might be concerned about in this context:

Compromise of your device (information security)
Compromise of the data you are transmitting over the network (communications security)
Monitoring of your use of the network (privacy)

Let's take these in turn.

Compromise of your device #

Often the first thing people worry about is that the network will be malicious and will subvert your device via some vulnerability in the browser, the operating system, etc. I'm certainly not going to tell you that this isn't possible (all software has defects, and some of them will be vulnerabilities) but vendors go to a lot of effort to find and fix these vulnerabilities, so it's also not a trivial matter to find them and they're quite valuable. As a concrete example, at this year's Pwn2Own competition, a full compromise of an iPhone 13 or a Pixel 6 was worth $200,000 USD, and an extra $50K if you got kernel access.

This is not to say that modern devices are somehow impregnable, but rather that it's relatively unlikely that an attacker is going to use a zero-day (i.e., undiscovered) vulnerability to attack random people at an airport Starbucks. Major OS vendors (both desktop and mobile) and major browser vendors are pretty good about quickly fixing vulnerabilities, so if you are running an up to date browser and an up to date OS, you should be relatively safe.

Moreover, even if your local network is safe,^[1] you still have to worry about compromise by other network actors, such as the Web sites you visit. Generally, if your browser and device aren't secure against network attack, you should be pretty concerned about your safety whatever the status of your local network.

Note: this advice does not apply if you are someone who is especially likely to be attacked by a powerful attacker, such as a state-level actor. If you are an activist or a dissident, you need a totally different level of operational security that probably involves having several machines.

Compromise of your communications #

HTTP, HTTPS, TLS, and QUIC #

Historically, the Web used the HTTP protocol, which ran over a channel provided by TCP. When run securely, it was layered over TLS, which sits between HTTP and TCP and provides a secure channel, with the result being called "HTTPS" (for HTTP Secure). The server indicates to the client that a given URL is to be retrieved via HTTPS by giving it a URL starting with https: rather than http:. Recently, the IETF has standardized a new version of HTTP (called HTTP/3) which runs over a network protocol called QUIC rather than TCP. QUIC uses the TLS 1.3 cryptographic handshake and TLS-like encryption, so HTTP/3 provides a similar set of security properties to earlier versions of HTTP over TLS. It still uses https: URLs, and so it's convenient to just call it all HTTPS, even though the protocol is different.

The second potential area of concern is compromise of your communications. The basic situation here is quite simple: The operator of the WiFi network can inspect and/or modify every packet you send, so they get to see anything that's not encrypted. This actually applies to any network you use, not just WiFi networks.

When it comes to Web traffic, the news is generally pretty good: a very large fraction of Web sites are encrypted using either TLS or QUIC. These protocols were designed under the assumption that the attacker has full control of the network, and so provide security even if you are on a malicious WiFi network. In general, as long as you are on an encrypted Web site, you should not need to worry about your passwords, credit card numbers, etc. And if you're not on an encrypted Web site, then you probably shouldn't do anything even if you are on a trusted WiFi network because you have to worry about attackers elsewhere on the Internet between you and the site.

It's a little hard to get a precise estimate of the fraction of traffic that is HTTPS;^[2] below I show measurements from Chrome and Firefox respectively, with Chrome showing rather more use of HTTPS than Firefox does. It's still not clear what the source of the difference is, but in any case the pattern is the same, which is that most traffic is encrypted, especially in the US, and it's gradually increasing.

[Chrome HTTPS data]

[Firefox HTTPS data]

The situation is somewhat worse for mobile apps. In a Web site, the client-side implementation of encryption is located in the browser, so the site only needs to configure their own server correctly—which is fairly standardized, especially if you use a hosting provider which has built in HTTPS support—and then send the client https: URLs. By contrast, mobile apps have to arrange for their own transport security. Historically this has led to a lot of apps not doing encryption at all or doing it in an insecure fashion. The latest work on this appears to be from Oltrogge, Huaman, Amft, Acar, and Backes in 2021, which reports a significant number of vulnerable Android apps, despite attempts from Google to prevent this.

Obviously, it's dangerous to use an app that doesn't implement encryption securely on an untrusted network. A VPN can sort of help here in that it prevents you from attack by the local network. However, this is only a partial solution: even if the last mile is secure there are hundreds to thousands of miles of network between you and the server; if the app doesn't implement encryption correctly, then you are vulnerable to attack anywhere along that path. In general, what you want is for your apps—and web sites—to encrypt their traffic.

Monitoring of your use of the network #

The really serious problem here is privacy. While HTTPS does a good job of protecting your actual Web traffic, such as passwords, credit card numbers, etc., it does not effectively conceal the sites you are going to.

Routes for Browsing Behavior Leakage #

There are four main avenues for this leakage (collectively called "metadata"). In order of when they are available to the attacker, they are:

The DNS resolution of the server
The IP address of the server
The TLS server name indication (SNI) field.
Traffic analysis from the pattern of data (message sizes, timing, etc.) sent and received

Taking these in turn...

DNS Resolution #

Typically the URL that the client starts with has a domain name in it, such as https://www.example.com/. Before the client can connect to the server it needs to know the server's IP address (the numeric address of the server). The client uses the Domain Name System (DNS) to resolve the name into an IP address. Historically, the local network has provided the DNS server that the client uses to resolve the name. The result is that the local network learns the name of every server you are going to, with obviously negative implications for privacy. Note that it does not learn which pages on the site you are visiting, just the site names themselves.
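To make the "site names, not pages" point concrete, here is a small Python sketch of which part of a URL ends up in a DNS query. The helper names are made up for this example; `resolve` uses the standard library's `socket.getaddrinfo`, which hands the name to whatever resolver the system is configured with (often the local network's).

```python
import socket
from urllib.parse import urlsplit


def name_sent_to_resolver(url):
    """Return the host name that must be resolved before the client can connect."""
    return urlsplit(url).hostname


def resolve(url):
    """Look up the server's IP addresses; only the host name goes into the query."""
    host = name_sent_to_resolver(url)
    results = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    return sorted({sockaddr[0] for *_, sockaddr in results})


for url in ("https://www.example.com/",
            "https://www.example.com/private/page?id=42"):
    # Both URLs produce the same query name; the path and query string
    # never appear in the DNS lookup.
    print(url, "->", name_sent_to_resolver(url))
```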

In the United States and some other countries, Firefox has deployed a feature called DNS over HTTPS Trusted Recursive Resolver (DoH TRR), which encrypts the DNS traffic and sends it to a separate server with defined privacy policies; this prevents the local network from learning the sites you are going to via your DNS queries. On other browsers, however, you generally are leaking your DNS traffic to the network.

IPv4 and IPv6 #

The original version of IP, IPv4, had 32-bit addresses, for a maximum of about 4 billion total addresses. For obvious reasons, this isn't enough for every device on the Internet. In 1995, the IETF standardized IPv6, which has 128-bit addresses. However, IPv6 deployment has been extremely slow. For example, over 25 years later, less than half of Google usage is over IPv6. In the meantime, people have developed a number of mechanisms for sharing IPv4 addresses, including NAT on the client side and virtual hosting on the server side. While these may not be the cleanest designs from an architectural perspective, they actually act to improve privacy by grouping together traffic that would otherwise be separable by IP.
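The arithmetic behind "about 4 billion" is easy to check with Python's standard `ipaddress` module; the variable names are just for illustration.

```python
import ipaddress

# Every possible address in each protocol version.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses  # 2**32
ipv6_total = ipaddress.ip_network("::/0").num_addresses       # 2**128

print(f"IPv4: {ipv4_total:,} addresses")
print(f"IPv6: {ipv6_total:.3e} addresses")
# IPv6 has 96 more address bits, so 2**96 times as many addresses.
print(ipv6_total // ipv4_total == 2**96)
```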

IP Address #

The second major mechanism by which your browsing history leaks to the local network is via the server's IP address. This is a signal of variable quality. Big sites like Amazon or Google run their own servers and so they also have distinct IP addresses: in these cases it's easy to tell which site you are visiting, just by looking to see who operates the IP address in question.

Smaller sites, however, often operate on shared infrastructure, whether via shared hosting, or behind content distribution networks (CDNs), with more than one site on a single IP address. In this case, the IP address only allows you to narrow down the site to the set of all sites on the same IP address, which can be quite a large number of sites, especially with a big CDN.
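Here is a minimal sketch of what the observer's side of this looks like. The site names and the mapping are entirely hypothetical (the addresses come from the ranges reserved for documentation); the point is only that a dedicated IP identifies one site while a shared IP identifies a set.

```python
from collections import defaultdict

# Hypothetical hosting data, invented for this example.
SITE_TO_IP = {
    "bigshop.example": "192.0.2.10",      # runs its own servers
    "smallblog.example": "198.51.100.7",  # shared hosting / CDN
    "clinic.example": "198.51.100.7",
    "forum.example": "198.51.100.7",
}


def candidates_by_ip(site_to_ip):
    """Group sites by IP: what an observer can infer from the address alone."""
    by_ip = defaultdict(set)
    for site, ip in site_to_ip.items():
        by_ip[ip].add(site)
    return dict(by_ip)


for ip, sites in candidates_by_ip(SITE_TO_IP).items():
    print(ip, "->", sorted(sites))
```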

Server Name Indication #

This kind of shared hosting is convenient operationally but presents a problem for TLS. When a TLS client connects to a server, the server needs to provide a certificate proving that it owns the site (the domain name) that the client is trying to connect to. If there is just one site on a single IP, then the server can provide the corresponding certificate, but if there are many such sites, then the server needs to know which certificate to present.

When TLS was originally deployed (back when it was called "SSL"), this was a real problem and each server needed its own IP address; this was eventually addressed by adding a TLS extension called Server Name Indication (SNI), in which the client provides the name of the server it is trying to connect to. The SNI is not encrypted and so a network observer can just read it off the wire and learn which site the client is trying to connect to.^[3] As with DNS or the IP address, this just leaks the server's name, not the pages on the site you are going to.
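You can see the unencrypted SNI for yourself without touching the network. The sketch below uses Python's standard `ssl` module with in-memory buffers to capture the first message a client would send (the ClientHello) and then checks whether the server name appears in it as plain bytes. The function name is made up for this example, and it assumes a Python build without Encrypted Client Hello support, which is the usual case.

```python
import ssl


def first_flight(server_name):
    """Return the bytes a TLS client would send first when connecting to server_name."""
    context = ssl.create_default_context()
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = context.wrap_bio(incoming, outgoing, server_hostname=server_name)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        # Expected: the ClientHello has been written to `outgoing` and the
        # client is now waiting for a reply that never comes.
        pass
    return outgoing.read()


hello = first_flight("www.example.com")
# If this prints True, the name is readable by anyone who can see the packet.
print(b"www.example.com" in hello)
```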

The TLS community has of course known more or less since the beginning that SNI was a privacy problem. In versions of TLS prior to TLS 1.3 the handshake—including the server's certificate—was largely unencrypted, so this didn't seem like as big a deal, because the certificate also leaked this information, but TLS 1.3 encrypts most of the handshake, and so SNI became the last major privacy leak in TLS proper. In the beginning of the TLS 1.3 design process, a number of attempts were made to design a solution for encrypting the SNI, but it turned out to be a really hard problem and ultimately it didn't make it into the final specification. However, the TLS working group is now working on a specification for Encrypted Client Hello (ECH), which will protect the SNI under some circumstances. ECH is not yet widely deployed, but hopefully we'll start to see more deployment relatively soon.

Traffic Analysis #

The final privacy leak is via traffic analysis, which is the generic term for measuring the traffic patterns of the connection, such as the size of the messages being sent, their timing, etc. This turns out to reveal quite a bit about the sites people are going to. Goldberg, Wang, and Wood provide a good overview of the research in this area. There has been some work on adding countermeasures to TLS or HTTP to prevent this kind of traffic analysis, but the problem isn't that well understood and so far at least, there aren't any agreed upon defenses.

The good news is that traffic analysis is a lot harder than it looks—though Cisco actually sells a product that does some of this. If we were to close the other routes, it would be a pretty substantial privacy improvement.
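To give a flavor of how this kind of analysis works, here is a toy matcher that guesses a site from the sizes of the first few encrypted responses. Everything in it is invented for illustration: the site names, the fingerprints, and the simple distance measure. Real attacks use far richer features and classifiers.

```python
# Hypothetical fingerprints: typical sizes (in bytes) of the first few responses.
FINGERPRINTS = {
    "news.example": [5200, 48000, 1300],
    "mail.example": [900, 2100, 15000],
}


def distance(a, b):
    """Sum of absolute differences between two size sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def closest_site(observed_sizes, fingerprints):
    """Return the site whose fingerprint is nearest to the observed sizes."""
    return min(fingerprints, key=lambda site: distance(observed_sizes, fingerprints[site]))


# The observer sees only sizes, never content, and still gets a good guess.
print(closest_site([5000, 47000, 1500], FINGERPRINTS))
```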

Privacy Implications #

The upshot of all this is that whoever operates the local network gets to learn quite a bit about the behavior of people on the network. This is true whether they are a public WiFi network or your internet service or mobile provider. [Clarified — 2022-09-24]. Specifically, they get to learn:

The identities of the Web sites you visit (just by looking at the connections).
Many of the apps on your device, because they "phone home" to some server.

That's a lot and is quite likely to include information that many people would consider sensitive. The example I usually give is that you might be visiting some medical site, but there is plenty of other sensitive behavior that people engage in that they don't want others to know about, such as visiting dating sites or watching porn.

The actual privacy impact of this depends a lot on the nature of the network, however. Specifically:

Are you identifiable?
Are the network operator or the people on the network actually bothering to record your behavior?

The answer to the first of these questions is something you can mostly figure out for yourself: how many other people are on the network? Did you have to log in? Was there a shared password? For example, if you are in an airport with shared WiFi and either no password or a simple captive portal where you don't identify yourself, then it's going to be fairly hard to attribute your behavior to you (though the network can generally create a profile corresponding to all the sites you visit).^[4] Note that the apps on your phone may provide a somewhat unique fingerprint that could at least in theory be shared between operators, though I don't know if this really happens.

Encryption and Wireless Networks #

It's very common for wireless networks to be encrypted, but this provides surprisingly weak security. The basic problem is that the encryption only prevents people who are not on the network from seeing the traffic. For consumer and public access points, WiFi Protected Access (WPA-2) is usually operated in a pre-shared key mode where the encryption keys for the network are derived from the password via a handshake performed when each device joins. This means that anyone who (1) has the password and (2) is able to observe you joining is able to see all of your traffic. Moreover, because the passwords are usually quite weak, it is often possible to just brute force them. There has been work on a public-key based system (labeled "forward secrecy") that would prevent these forms of attack, but it is not widely deployed and appears to have other flaws.
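The brute force point can be sketched with the first step of the WPA2 pre-shared key derivation, in which the master key is computed from the passphrase and the network name using PBKDF2-HMAC-SHA1 with 4096 iterations. This is a simplification: a real attacker checks guesses against a captured handshake rather than against the master key directly, and the network name, passphrase, and word list below are made up for the example.

```python
import hashlib


def wpa2_master_key(passphrase, ssid):
    """Derive the 32-byte WPA2 pre-shared master key from passphrase and SSID."""
    return hashlib.pbkdf2_hmac("sha1", passphrase.encode(), ssid.encode(), 4096, 32)


def guess_passphrase(target_key, ssid, candidates):
    """Try each candidate passphrase; return the one that reproduces target_key."""
    for candidate in candidates:
        if wpa2_master_key(candidate, ssid) == target_key:
            return candidate
    return None


# Hypothetical cafe network with a weak, widely shared password.
key = wpa2_master_key("coffee123", "CafeGuest")
print(guess_passphrase(key, "CafeGuest", ["password", "letmein", "coffee123"]))
```

Because the only secret input is the passphrase, everyone who knows it (or can guess it) starts from the same master key, which is why a shared password offers so little protection against other people on the network.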

On the other hand, if you had to provide your identity (pro tip: a lot of captive portals don't check the e-mail address you provide them), then the network operator can link the history of your sites to you. So this means that situations where you have a user-specific password or need to actually log in have a much worse privacy situation. Note that in an environment like a hotel where the operator knows where you are, then this is probably enough to identify your traffic even if there is no password or a shared password.

The actual impact of course depends on whether the network operator or other people on the network are actually spying on you. Of course, there's no real way to tell whether they are or not; even if the network has a privacy policy which says that they don't monitor your behavior you can't really tell if they are doing so or not. Moreover, on wireless networks it's generally the case that other users of the same network can observe your behavior—though they probably won't know the information you used to log in—so even if the network operator has a good privacy policy you still have to worry about other people.

This brings us to the topic of VPNs: if you use a VPN then this will (mostly) prevent local attackers from seeing the sites you are connecting to, which is good, but it's a tradeoff because it also provides a neatly labeled traffic set to the VPN operator of which sites you are going to, together with your identity (because you logged in to the VPN), so you're really trusting the VPN operator to protect your privacy. On balance, if you use a reputable VPN service (see Consumer Reports' VPN report), then this likely provides better privacy than just using an untrusted local network, but it's important to remember that ultimately there are only policy and not technical controls on what the VPN operator can do. Note that a multi-hop system like Tor or iCloud Private Relay doesn't have this property, because there is no single entity who can de-anonymize both you and your traffic.

Closing Thoughts #

People who work in communications security like to talk about the Internet threat model in which the network is maximally malicious. This is often phrased as "you give the packets to the attacker to deliver".^[5] The idea is that protocols need to be designed to be secure even in this very difficult setting. If you succeed, then it doesn't matter what network conditions you're running in, and questions like "is public WiFi safe" would be irrelevant. Unfortunately, while there has been a lot of progress in designing and deploying security protocols such as TLS (and, to a lesser extent, in building secure software), the privacy properties of these protocols leave a lot to be desired. The result is that it actually is important to ask whether you can trust the network to handle your data in the way you would like. The idea behind privacy enhancing technologies like DoH, ECH, and proxying/VPNs is that they replace this trust with technical mechanisms that prevent attack even if the network is malicious, but we're not there yet, and in the meantime, you still need to ask how much you trust the network with knowledge of your activity.

This is actually a lot less likely than you think because consumer networking gear is famously insecure, so it's reasonably likely that your average home network has actually been compromised. ↩︎
It's actually even slightly hard to define what you mean. For instance, should you measure page loads or HTTP transactions, or... ↩︎
SNI is not a perfect signal from the attacker's perspective because HTTP allows the client to coalesce traffic to multiple servers on the same connection as long as they share a certificate. For instance, if the server has a certificate for mail.example.com and calendar.example.com and the client connects to mail.example.com, it can then send traffic destined for calendar.example.com without creating a new TLS connection. This makes the problem of learning which site the client is connecting to slightly harder, but as a practical matter, there are plenty of non-coalesced connections and even when they are coalesced they may be associated with the same server operator, so SNI is a pretty good signal. ↩︎
There is research by Bird, Segall, and Lopatka indicating that browsing history can be used for reidentification, so this is not a perfect case of hiding in the crowd but it would require a fair amount of work to identify you. ↩︎
I've heard Steve Bellovin say this, but I think he may have been quoting. ↩︎

ELI15: PCR and PCR Testing

2022-09-14T00:00:00Z

As pretty much everyone is now aware, there are two main kinds of COVID test:

At-home based antigen tests (often called "lateral flow")
Lab-based molecular tests (often called "PCR" [though not all molecular tests are PCR—2022-09-14])

Lateral flow and PCR are both descriptions of the technology used in the test, but unless you already know what they are, they're just tech jargon. The purpose of this post is to explain how PCR works at the "explain it like I'm fifteen" level. To that end, I'll be omitting most of the chemistry and focusing on the main clever ideas, with some external cites for those who want to learn more.

Background: DNA #

As you no doubt know, Deoxyribonucleic acid, or DNA (this is the last time you will need to read "deoxyribo..."), carries the genetic code that directs the development of humans and most (but not all, as we'll see) living things. The basic structure of DNA is a sequence of small molecular subunits (nucleotides). Nucleotides generally share the same basic structure, a common "backbone" made of a sugar and a phosphate group, plus another chemical group called a nucleobase:

[Modified version of a diagram from Wikipedia]

Each type of nucleotide has a different nucleobase.

You can build a chain of nucleotides by attaching the sugar group of nucleotide 1 to the phosphate group of nucleotide 2 and then the sugar group of nucleotide 2 to the phosphate group of nucleotide 3, and so on. This is true no matter what nucleobases are attached to the backbone. There are four main bases, adenine, cytosine, guanine, and thymine, which gives us a 4-ary code (unlike the binary code used by computers). It's canonical to refer to these by their leading letters: A, C, G, T.

Instead of being a single chain, normally DNA exists as a pair of chains (with each chain often being called a strand). The key thing to know is that not any two strands can hook up. Instead, the base pairs are complementary: with adenine pairing with thymine and guanine pairing with cytosine, like this:

[Image from Wikipedia by Madeleine Price Ball.]

Thus, the sequence of bases on the first strand determines the sequence of bases on its paired strand. This means that most of what you need to know about a given DNA molecule is encoded in the sequence of bases. It's this sequence which determines the DNA code for an organism.^[1]
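To make the pairing rule concrete, here is a minimal Python sketch. The `PAIRS` table and the `complement` function are names I made up for illustration, and the example sequence is arbitrary. It only looks at which base sits opposite which, and ignores all of the chemistry.

```python
# Pairing rule from the text: adenine <-> thymine, guanine <-> cytosine.
# PAIRS and complement are illustrative names, not from any library.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base that pairs with each base of `strand`, position by position."""
    return "".join(PAIRS[base] for base in strand)


example = "CTGTAC"  # arbitrary example sequence
paired = complement(example)
print(example, "pairs with", paired)

# Complementing twice gets you back where you started, which is why
# either strand is enough to reconstruct the other.
print(complement(paired) == example)
```

Because the mapping is one-to-one, either strand carries the full information, which is what the replication process described next relies on.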

If you take two pieces of DNA with complementary sequences, they will tend to hybridize with each other to form a pair of strands. This will also work—though not quite as well—if the sequences are close but not identical, which can be used to measure "closeness" of two sequences; this was a more useful technique before sequencing became fast and cheap.

You don't need to know this to understand PCR, but the actual DNA molecule is arranged in a characteristic double helix structure, which looks like this:

[Image from Zephyris via Wikipedia]

DNA Replication #

The way that DNA replicates (for instance when a cell divides into two, requiring two copies of the DNA, one for each cell) is that the paired strands unzip to form two single strands (this is called "melting").

Note that in this diagram I'm only showing short strands with a small number of bases. The dashed regions are intended to indicate that the DNA strand just goes on indefinitely, but I'm not going to show it.

Next, an enzyme called DNA polymerase builds another paired strand onto each strand using the complementary bases in the ambient cellular environment. The result is now two DNA double helices, each of which is (hopefully)^[2] an exact copy of the first one, and each of which contains one of the original strands and one newly created strand, which used the original strand as the template.

The diagram below shows the process of replicating the unzipped DNA.

There are several more important things to notice here. First, the two strands are built in opposite directions. I've colored things red and blue to help keep track, with the blue chain polymerizing left to right (built off the red template which goes the other way) and the red right to left (built off the blue template, which goes the other way). Note that in reality both strands have the same chemistry, it's just that the backbones are facing opposite directions, and of course the bases are complementary.

Second, the process gets kicked off by having a primer, which is a short piece of DNA that is (of course) complementary to the strand being replicated. Because polymerization only happens in one direction, however, the primer ends up being one end of the replicated DNA strand, with everything on the other side of the primer just not being replicated. You can see this in the diagram above, where the replicated blue strand has nothing on the left of the CTGT (losing the T that was there in the original, as well as everything else to the left which I didn't show) and the replicated red strand has nothing on the right of the TATA, losing the C (as well as everything else to the right). Note that the top red and bottom blue strands are actually the originals, and so extend in both directions.

In normal DNA replication in the body, you'd want to replicate the entire strand and so the primer would be attached to the end of the chain (there's some special biochemistry for this that we don't need to go into), but in PCR, the primers are just little snippets of single-stranded DNA that get attached to the DNA in the complementary direction. PCR takes advantage of this in order to focus on a particular portion of the DNA sequence.

PCR #

Suppose you find yourself in a situation where you want to examine a relatively small amount of DNA. This comes up fairly frequently, for instance in cases where you have an environmental sample or when you are looking for something—like the COVID virus—in a larger sample. In these cases, it's useful to amplify the DNA of interest so you have a larger amount for analysis. This is where the Polymerase Chain Reaction (PCR) comes in. PCR takes advantage of the same biological DNA replication mechanism I described above to amplify (make a lot of copies of) a DNA sequence of interest.

The basic idea is simple if you know the sequence of interest (as with COVID, where we have the full sequence). You just synthesize primers that match both ends of the DNA sequence you want to replicate, with one matching the end in one direction and one matching the end in the other. When you mix them up with single-stranded DNA, the primers naturally hybridize (attach themselves) to the right places on the DNA strands. You then run the replication process with these primers.

The first time you run the replication process, things are just as shown above: the strands separate and then the polymerase builds a partial replica of each original strand (on top of the original complementary strand) starting with the primer. At the end of this process you now have two paired DNA strands, as shown in the top part of the diagram below (which is just the same as the previous diagram.)

However, if you run the process again something interesting happens. As expected, each of the pairs unzips, leaving you with two original strands and then two partial replicas. The original strands replicate just as before and produce the same replicas as in the original process. However, when you build the complementary strand using the replicas from the first phase as the template they are built in the opposite direction from how that template strand was built. The result is that they start at the primer and stop when the strand ends, but because the strand already ended where the other primer was, the result is you get a strand that just consists of the region between the primers (inclusive). I've circled these strands in green so you can see them.

If you run the process over and over, what happens is that the original strands continue to make copies of partial strands and the partial (1st generation) and short (2nd and later generation) strands just make short strands. Each time you run the process you double the number of copies, so quite quickly you end up with a large number of copies of just the region of interest and only a few copies of the rest of the DNA sample.
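The bookkeeping here is easy to check with a few lines of Python. This is an idealized sketch, not a model of a real reaction: it assumes a single starting double-stranded molecule (two original strands) and that every strand gets copied perfectly in every cycle. The function name `pcr_strand_counts` is just something I picked for illustration.

```python
def pcr_strand_counts(cycles: int, originals: int = 2):
    """Count (original, partial, short) strands after `cycles` ideal PCR cycles.

    Assumes perfect doubling: each original strand templates a partial strand,
    and each partial or short strand templates a short strand.
    """
    orig, partial, short = originals, 0, 0
    for _ in range(cycles):
        new_partial = orig            # copies made off the original strands
        new_short = partial + short   # copies made off partial and short strands
        partial += new_partial
        short += new_short
    return orig, partial, short


for n in (1, 2, 3, 10, 30):
    orig, partial, short = pcr_strand_counts(n)
    print(f"cycle {n:2d}: original={orig} partial={partial} short={short}")
```

The partial strands only grow linearly (two more per cycle), while the short strands take up essentially all of the doubling, which is why the region between the primers ends up dominating the sample.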

Partially Unknown Sequences #

Above, I assumed that you know the DNA sequence that you are interested in. This is certainly helpful, but it's not required. Actually, all you need to know is the sequence of the endpoints of the sequence of interest so you can make the primers. The replication process just depends on the primers binding to the relevant sections of DNA.^[3] Once that happens, the polymerization process will work just fine with anything in between (in nature it obviously needs to work with basically any sequence). This is useful for a number of scenarios, such as:

When you want to sequence a specific piece of DNA, for instance to look for mutations or defects.
When you want to test for a DNA sequence that is subject to a lot of mutation, so you don't know exactly what's there (SARS-CoV-2, for instance, mutates quite rapidly).

In both situations, as long as you can find some surrounding regions that are highly conserved, you can make primers and replicate the region of interest.
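Here is a toy Python sketch of the "you only need to know the ends" idea. It treats the template as a string written in one direction and does exact string matching, which is a big simplification of real hybridization. The sequences, the primer choices, and the function names are all made up for illustration.

```python
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Complement of `strand`, read in the opposite direction."""
    return "".join(PAIRS[base] for base in reversed(strand))


def amplified_region(template: str, forward_primer: str, reverse_primer: str):
    """Return the stretch of `template` bounded by the two primers (inclusive).

    Toy model: the forward primer matches the template directly, and the
    reverse primer matches the opposite strand, so we look for its reverse
    complement. Returns None if either primer has nowhere to bind.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    site = reverse_complement(reverse_primer)
    end = template.find(site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(site)]


template = "GGATCCTGTAAGCGTTCAGATTCCGG"           # hypothetical sample
mutated = template[:12] + "A" + template[13:]    # one base changed in the middle

for sample in (template, mutated):
    print(amplified_region(sample, "CCTGT", "GAATC"))
```

Both samples produce a product even though the middle differs, because the function never looks at what is between the primer sites. If a mutation landed inside one of the primer sites instead, this exact-match model would return nothing at all, which is why you want the primers in conserved regions.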

PCR in Practice: Taq Polymerase #

Conceptually, then, PCR is simple: make the right primers, dump them into your sample, and then repeatedly run the replication cycle. But what does "run the replication cycle" actually mean? We need to unzip (melt) the DNA, then let polymerase make copies, and then repeat. But if we just dump some DNA, polymerase, primers, and bases into a tube, not much is going to happen because the DNA is already paired up, so we need something to kick off the process.

If you heat up the DNA to about 90°C, then it will melt, so you can heat it up, and then let it cool down a bit and it will replicate.^[4] Unfortunately most polymerase enzymes are inactivated by being heated up, so if you just do this, you need to re-add polymerase every cycle, which is obviously a pain. However, the good news is that there is a polymerase enzyme (Taq polymerase) from a bacterium which lives in hot springs which can survive being heated to 90°C. This makes the problem much easier. You just need to mix up your DNA, primers, bases, and Taq polymerase in a tube and repeatedly heat it up and cool it down. You can buy a special machine called a thermal cycler that will do this automatically, and this is now standard lab equipment.^[5]

[From Rror via Wikipedia]

RNA #

OK, so this is all very useful, but what about if you want to amplify RNA? This is a particularly relevant application right now because SARS-CoV-2 (the virus that causes COVID-19) is an RNA virus (as is HIV). I'm not going to burden you with the details of RNA, except to say that (1) it's (usually) single-stranded rather than double-stranded and (2) it uses one different base^[6] but is otherwise more or less isomorphic to DNA. You can still PCR-amplify RNA by using an enzyme called reverse transcriptase that transcribes RNA into DNA, at which point PCR works as usual. This technique is called Reverse Transcriptase PCR (RT-PCR). As far as I can tell, you basically can do RT-PCR by dumping reverse transcriptase into your sample and running things as usual.

PCR Testing #

OK, so I've told you how to amplify DNA sequences, which might be fine if you wanted to sequence them, but how do you use this to make a COVID test? The basic idea here is that you take a sample from the patient and look for COVID RNA in it (the same idea applies to HIV testing). But there's a missing step here because what I've described so far just replicates DNA; it doesn't measure it. I guess you could try to replicate things for a while and then maybe filter out the replicated strands and weigh them or something, but that would be a really tricky bit of analytical chemistry. Fortunately, there's a much cleverer approach, called Real-Time PCR or quantitative PCR (qPCR), which actually measures the replication process.

There are a number of ways to do this, but the basic trick is to measure replication via fluorescence. One version of this is to make a probe which is basically a DNA sequence that matches some sequence in the region of interest but also has some special chemistry that makes it fluoresce (glow) when it detaches from a strand of DNA. You then add that probe to the rest of the PCR mixture, where it adheres to the single strands much as the primers do.

When the PCR reaction runs, the polymerization process evicts the probe from the template strand, replacing it with the newly built complementary strand, causing the probe to fluoresce. You can then measure the light coming out as the reaction runs: the more replication that's happening (and hence the more DNA there is that matches the primers), the more light is emitted. Of course, because the PCR test inherently amplifies DNA sequences, if there is any significant amount of the target sequence, you'll eventually get some fluorescence, so what matters is the amount you see after a given number of PCR cycles. You'll sometimes hear the term cycle threshold (Ct) in connection with COVID tests. This is just the number of cycles you had to run before you detected the virus. The more cycles, the less there was in the initial sample.
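To see why more cycles means less starting material, here is a rough Python sketch. It assumes perfect doubling every cycle and a fixed number of copies at which the fluorescence becomes detectable. The threshold of 10^10 copies and the 45-cycle cutoff are numbers I made up for illustration, not the parameters of any real assay, and `cycle_threshold` is a hypothetical helper.

```python
def cycle_threshold(initial_copies: int,
                    detection_threshold: int = 10**10,  # assumed, illustrative
                    max_cycles: int = 45):               # assumed, illustrative
    """Return the first cycle at which copies reach the detection threshold.

    Idealized: the number of target copies exactly doubles each cycle.
    Returns None if the threshold is never reached within max_cycles.
    """
    copies = initial_copies
    for cycle in range(1, max_cycles + 1):
        copies *= 2
        if copies >= detection_threshold:
            return cycle
    return None


for start in (1_000_000, 1_000, 1, 0):
    print(f"{start:>9} starting copies -> Ct = {cycle_threshold(start)}")
```

In this model a sample with 1000 times less target needs about 10 more cycles, because 2^10 is roughly 1000, and a sample with no target never crosses the threshold at all.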

Final Thoughts #

If you're coming to this fresh, it's really hard to appreciate how revolutionary all this is, and how much it's come as the result of decades of hard work by thousands of talented scientists. Just what I've described here reflects at least three Nobel prizes (Crick, Watson, and Wilkins in 1962 for the discovery of the structure of DNA^[7], Holley, Khorana, and Nirenberg for protein synthesis, Mullis for PCR), plus countless other contributions that didn't win the Nobel.

The result is an incredibly powerful set of analytic techniques—not just PCR, but fast sequencing, which I hope to talk about later—that have turned what used to be the effectively impossible problem of learning a given DNA sequence (the first viral DNA sequence was only performed back in 1984!) into what is today a routine task.

I'm really oversimplifying here. For instance, DNA can be methylated which affects how the DNA is interpreted without changing the sequence. ↩︎
In reality, of course, this process is messy and you get errors. There are also mechanisms to try to fix the errors (see proofreading). ↩︎
I think it's also possible to make multiple primers if there are variants, but I'm not a PCR expert. ↩︎
Your body, of course, does not heat up to 90°C, at least if you want to stay alive, but there are enzymes which will unzip the DNA at lower temperatures—2022-09-14. ↩︎
PCR was originally patented by Cetus and when I first saw PCR, the urban legend was that you could sell these machines without paying Cetus as long as you were careful to call them "thermal cyclers" rather than "PCR machines", even though as far as I know they were only used for PCR. ↩︎
Uracil rather than thymine ↩︎
See also Rosalind Franklin who did fundamental work here, but was widely overlooked for years. ↩︎

Ultra-Trail du Mont-Blanc (UTMB) Race Report

2022-09-05T00:00:00Z

Probably the two most prestigious events in trail ultrarunning are the Western States Endurance Run (Western States), held in June in California, and the Ultra-Trail du Mont-Blanc (UTMB), held in August in Chamonix, France. Both are 100-mile events (UTMB is actually 171 km/107 mi) and draw the top ultradistance runners. Americans tend to know about Western States because it's older, but UTMB is much larger and fancier, with a field of over 2000 (Western is <400) and just much higher production values.

Unlike prestige events in other sports (e.g., the Hawaii Ironman or the Boston Marathon), ultras tend to rely on lotteries for admission, so ordinary runners can find themselves running on the same course with the best in the world (who get in via other mechanisms). I was lucky enough to get in through the UTMB lottery this year and knew I had to give it a shot.

[Map and profile from Runalyze]

UTMB Races #

The naming here is incredibly confusing. First, there are actually a number of races happening the same weekend as UTMB under the UTMB umbrella, including:

Race	Distance (km)	Height Meters (m)
Ultra-Trail du Mont-Blanc (UTMB)	171	10,000
Courmayeur-Champex-Chamonix (CCC)	100	6,100
Sur les Traces des Ducs de Savoie (TDS)	145	9,100
Orsières-Champex-Chamonix (OCC)	55	3,500
Petite Trotte à Léon (PTL)	300	25,000!

Historically there have also been a number of ultra races named "Ultra-Trail ", such as Ultra-Trail Mount Fuji (UTMF) or Ultra-Trail Australia (UTA). Some, but not all, of these are now owned by UTMB in what's called the UTMB World Series, which also includes a number of races that don't have the words "ultra trail" in the name, such as Speedgoat. These are of course branded under the UTMB name, producing the confusing situation in which the races that happen in Chamonix are collectively referred to as "UTMB Mont-Blanc" and the 170 km flagship race is referred to as "UTMB Mont-Blanc - UTMB", which is to say "Ultra-Trail du Mont-Blanc Mont-Blanc - Ultra-Trail du Mont-Blanc". That's the event that I ran and what most people mean when they say "UTMB".

Finally, UTMB acquired Western States last year and is re-branding UTMB Mont-Blanc as the series "finals", with Western States as a subordinate event, though potentially one of the continental "Majors".

Qualification/Entry #

Up to and including this year, UTMB had a two-phase qualifying system. First, you had to collect enough qualifying points by doing other events. The standard this year was 10 points over 2 races, with a medium-hard hundred being 5 points and a hard hundred being 6. The interesting thing about this structure is that you actually don't need to be that good to get in: it's of course hard to run 100 miles, but in order to get the points you generally only need to finish the event, which isn't that hard if you are going to have any chance to finish UTMB, which is quite a bit harder than your average 100.

My qualification came from San Diego 100, 2019 and Pine to Palm 100, 2019. Ordinarily, you would need to qualify within two years, but because of COVID, UTMB allowed people to continue their qualification through this year. I applied to UTMB back in 2019 and didn't get in, and they double your chances each time you didn't get in, so I had something like a 20% chance of admission (as an aside, I'm not sure what it says about runners that so many of us want to run 170km in the Alps that they have to run a lottery to control entry). To be honest, I hadn't expected to get in and had the rest of my season planned out, but when you get the chance to do UTMB, you do it, or at least I do.

Race Overview #

UTMB starts in Chamonix Mont-Blanc (Chamonix) and does a big loop around Mont-Blanc, mostly following the Tour du Mont Blanc course. The total distance is listed as 171.5 km (106.6 miles) and 10000 height meters (32800 ft) (note that this is 10000 meters of climbing, so also 10000 meters of descending). The general pattern is that you climb up to some mountain pass (col), then descend back down to one of the towns in the area, then climb back out and repeat.

Although Mont-Blanc is of course quite tall (4808 meters), UTMB isn't really at altitude: Chamonix Mont-Blanc is at around 1000m (3400 ft), and you never go much above 2500m (8300 ft), which is enough to feel some effects of altitude but nowhere near as bad as (say) Tushar's Mountain, which starts at over 3000m (10000 ft). And of course, you don't usually stay at that altitude for long. With this much vert, though, you're basically always climbing or descending, and there's very little flat running, with what there is mostly in the towns along the way, and much of that on asphalt or flat dirt track.

For those of you used to US ultramarathons, UTMB has a number of big differences. First, because of the giant starting field you're almost never alone unless you're way out front or way off the back. I don't think I spent more than 5 minutes without seeing anyone during the whole event. This also means that the trails can get super congested, especially at the beginning, where it's almost impossible to pass people. Second, it's not out in the middle of nowhere, but keeps going through these small towns and refuges. Every time you run through a town—at least during the day—there are people out in the streets lining the route cheering and high fiving you. This is especially true at the start/finish in Chamonix and the early towns like Saint Gervais when people would still naturally be awake.

Next, it's a nighttime start. Most US hundreds start in the morning, so if you're a non-elite you'll run through the night after running all day. UTMB starts at 6 PM, and because it gets dark around 8:30 you're running through the night. I think this is done so that the elites will finish in the daytime: the elite men finish around 20 hrs (2 PM) and the elite women finish around 22-23 (4-5 PM). The consequence is that the non-elites are going to run through two nights and if you're reasonably fast you're going to spend more than half the race in the dark.

Next, there's crewing but no pacing. Most US hundreds will allow you to have someone run the latter part of the race with you; for instance, I'm pacing my friend and training partner Chris at Stagecoach 100 in a few weeks. UTMB doesn't allow pacing outside of short zones near the aid stations—and somewhat informally in the last 200 meters of the race or so—though, as I said above, you're never really alone anyway. It does, however, allow a single crew member, which is super-helpful, and Chris came over to crew me.

Finally, there is just an unbelievable amount of climbing, more than all but a few US races such as Ouray or Hardrock, and much of it is fairly technical by US standards, by which I mean that there are a lot of big rocks and the like that you need to navigate, and sections where it's not really runnable at all, as in I would hike it even if I were running 10 miles rather than 100. By contrast, in most US ultras you could basically run any section individually, even though you end up hiking in order to conserve energy. As I understand it, UTMB is actually considered pretty non-technical by European standards, and, for instance, the companion TDS race is rather more technical. In any case, it's hard, and as discussed later, this threw me off a bit.

Based on my previous races Liverun estimated my finish time as 34:29, so I had my pace targets based on that and their projections (largely so my crew could meet me), but in the event I was pretty far off.

Pre-Race Logistics #

With the race start on Friday, I arranged to fly out Sunday, arriving on Monday. I took the overnight flight from San Francisco (using miles to buy business class so I could sleep) through London Heathrow, and then on to Geneva. From there, you can get a car to Chamonix, which takes about 60-90 minutes. This all got me to Chamonix around 11:30 PM, which wasn't too bad. I opted to get a private car (via Mountain Dropoffs) on this leg because this meant I didn't have to wait for other people and I was assured of being able to find my driver as soon as I was ready. This all went reasonably smoothly, though wearing an N95 mask for the whole trip was fairly unpleasant, as I didn't want to get COVID right before my race.

I'd arranged to stay at the Pointe Isabelle in the center of Chamonix maybe 200m from the race start. This was very convenient because it means you can just hike over to the expo or the race start, as well as being within easy walking distance of the store for every major sporting good brand (Salomon, Arc'Teryx, Patagonia, etc.). This was actually a pretty nice hotel and I'd stay there again. I got a "4 person" room which had a double bed and a bunk bed in separate rooms, which was good for when Chris arrived.

Chris arrived Thursday morning, so I had a couple days to myself and mostly just didn't do anything. I had a few easy morning runs which gave me an opportunity to check out the last few miles of the course (not easy!) and other than that I mostly just stayed in my room and read or tried to sleep. I was jet lagged of course, but as I didn't really plan to time adapt, I didn't think it was worth keeping the kind of rigid schedule that usually helps adaptation, and so I slept a bit fitfully and took a lot of naps.

A note on poles and loops #

Standard hiking poles have both a hand-grip and loops you put your hands through. You're supposed to not really grip the grips too hard and instead use the loops for leverage, which stops your hands from getting tired. The problem is that when you're running, especially downhill, you don't want your hands to be in the straps: either you hold the poles by the middle or you hold the grips but you want to be able to let go if you crash. This means you need to put your hands back through the straps every time you go back to using the poles and, because the straps are asymmetrical, if you have been holding the poles together you also need to figure out which is which. LEKI has a different engagement system in which you wear a strap on your hand permanently and there's a little piece of cord set in the loop that clips into an engagement with the pole that you can get into and out of with a button. This means you can get in and out quickly and also that the poles themselves are symmetrical, so going from carrying them to using them is faster.

Black Diamond Handles [from the BD site]

Leki Handles [from the LEKI site]

As Thursday rolled around, I started to get worried about whether I had everything I needed and ended up scrambling to get a few more things. In particular, UTMB requires you to have a long sleeve shirt and a rain jacket, but I decided I needed another layer, and ended up buying a Patagonia Houdini (last year's model, on sale for € 70). This turned out to be a great choice because it was comfortable when things were a little chilly but my long sleeve layer (Patagonia Capilene) would have been too hot. This last-minute panic buying was actually on top of some pre-trip panic buying when I replaced my rain jacket (going to the Inov-8 Raceshell HZ), hiking poles (going to the LEKI Ultratrail FX.One Superlight), headlamp (Lupine Neo), and pole storage (Salomon Pulse Belt). By Thursday night I had pretty much everything, so I laid out all of my stuff, including packing my pack and the stuff Chris would need for crewing. All that was left was to pick up my packet, actually put on my race stuff, drop my drop bag, and head over to the start. Once this was done, Chris and I headed over to Annapurna II for an early dinner at 6 and then early bedtime.

Unfortunately, I ended up not sleeping well, only getting about 5 hrs. Given the 6:00 PM start, I figured I'd just spend as much as possible of Friday sleeping, so Chris and I went and got breakfast and then Chris headed out for his run and I went back to bed. The way that race check-in works at UTMB is that you actually have a reserved window to pick up your packet. Mine was at 1200-1400 on Friday, so I only had to be up for that and then I could go back to sleep. Chris agreed to drop off my drop bag (only after 1400!), though it probably wasn't worth bothering with, as it only shows up at the 80 km mark in Courmayeur and Chris was able to meet me there, so it was just in case he got hung up and didn't make it for some reason.

Race start is 1800 but they ask you to show up at 1730. This is where being close really paid off, as we were able to just walk over quickly around 1715. Even so, the start line was just totally packed (2000+ people, remember). And you're just packed in there with everyone else. From where I was standing I could just barely see the start line and the monitors. The next 30 minutes were a bunch of announcements and videos of the pros. It started to rain sometime in here, so I ended up putting my jacket on, though it came off not too long after. I was wearing a KN95 mask for this time: being near so many people felt risky even outside.

The Race #

Finally, the gun. Well, the start at least. Of course, at this point you're still like 100m away from the actual start and packed in super close with everyone else, so everyone's trying to run but really you're just walking almost the whole time, so you run a few steps and then have to walk again.

Once you get past the line, you're running through the streets of Chamonix, which are absolutely packed with people cheering you on, high fiving, etc. This goes on for a few kilometers until you open up onto some rolling fire roads. At this point, there are still a huge number of people on the trail with you, so if you've managed to get yourself at the wrong part of the field you're either stuck behind people or people are pushing around you. I just tried to chill out and not worry too much about position. Apparently it can get super dusty in this area in which case you want to be in front, but with the light rain this mostly wasn't an issue.

Start to Les Contamines [31.2 km, +1581/-1347, 4:08:44, -:08] #

The first leg to Les Contamines Montjoie is pretty straightforward. Initially it's pretty smooth road and fire road that's gradually downhill. As I said above, it's pretty hard to go fast for the first 5 KM or so, but eventually it opens up enough that you can kind of find your position. By this point it had stopped raining, so I just had my jacket back in my pack.

Things start to trend upwards after about 8K, heading through Les Houches and Col de Voza. This is all pretty good even trail, so you're just comfortably hiking and I spent a bunch of it chatting with YouTuber Jeff Pelletier. Eventually you come over the pass and then it's down through Saint-Gervais and on to Les Contamines. At Saint-Gervais I got a slightly unpleasant surprise, which was that the race food wasn't what I expected.

Some background: you don't carry all your food for an ultra; instead there are aid stations which have food and drinks, which is usually a combination of "real food" like pretzels, cookies, etc. and engineered foods like sports drinks, energy bars, carbohydrate gels, etc. There are a few major companies who manufacture this stuff, so an American ultra will typically have Tailwind or Gu Roctane for the drink and Gu Roctane, Spring Energy, or something similar for the gels. I'm pretty familiar with these, and I know I can tolerate them well, but I knew that UTMB would be serving something different, specifically Overstim.

I'd ordered a variety of different Overstim products and tried them out and seemed to tolerate them OK, but I neglected to order the specific flavor of sports drink (Mojito, which turned out not to be that bad), and then the energy bar was something like a granola bar, which honestly wasn't that good. Fortunately, I brought a bunch of my own stuff—mostly Spring Energy gels and Tailwind drink, so I wasn't entirely dependent on them, but it would have been a lot more convenient to just graze at the aid stations. They did, however, have mini Mars bars (in Europe that basically means a Milky Way), and I could foresee a lot of them in my future.

The run into Saint-Gervais is pretty great: it's late evening so everyone is out on the streets cheering you on, plus you're only a few hours in so you're still feeling good, which is not going to be the situation later. At this point, though, it's like you're a pro.

There's a relatively long gradual climb out to Les Contamines, which is still pretty easy. Contamines is the first aid station where you're allowed to have crew, so Chris was there. The whole setup was a little confusing, but eventually we met up and I switched my bottles, grabbed some more gels and Tailwind powder, and headed back out.

Contamines to Courmayeur [49.6 km, +3019/-3094, 10:35:31, +:41] #

This is a super long stretch consisting of two big climbs, first up to the Refuge de la Croix du Bonhomme and then to the Col de la Seigne, followed by a smaller 500m climb to the Arête du Mont-Favre before finally heading down into Courmayeur.

These are some serious climbs. First, they're long and steep, with 1200m of climbing to the Refuge followed by almost 1000m to the Col de la Seigne. Worse yet, they're rocky and it's not just a matter of your ability to just put out raw power, like in an Ironman or a marathon, because you're constantly having to adjust your stride or step higher than you naturally want to. Of course you're hiking all the uphills—and even that is hard work—but then the downhills are rocky and technical—and of course it's dark—so you're not able to move that fast on them either, even if you have a good headlamp.

I actually had a couple of small slips on the downhills, including one where my foot slipped off the shoulder of the trail and I twisted my knee a bit. I was a bit worried that was going to end my race right there, but actually I was able to run it off, so that was OK.

In addition, even though it was dark it was still pretty humid for a lot of this, which itself slows you down. Bottom line, by the time I rolled into Courmayeur I was a lot more tired than I wanted to be at this point in the race. Usually you really want to take the first half of a hundred quite easy because the second half is going to be hard no matter what, but I'd already had to work quite a bit more than I had planned.

Chris met me at Courmayeur and we did the usual bottle swap and extra food thing. I also cleaned my feet, re-lubed them, and changed my socks. Things weren't actually bad here, but it's good not to take any chances. Chris had brought some Tailwind Recovery drink (higher calories, more protein), and I was able to drink that while I was doing this stuff. Taste-wise, this was a nice change, but at this point I was starting to feel the first hints of nausea and it sat a little heavy in my stomach. All of this took quite a bit longer than I was hoping for, especially as I had to do some waiting around, so I was out in 27 minutes, 1:08 behind. Ironically, I actually seem to have spent less time in this aid station than others, as I seem to have come in in 892nd and left in 772nd.

Courmayeur to Champex-Lac [45.9 km, +2720/-2558, 10:17, +1:30] #

This section starts with a very steep climb of 805m/4km out of Courmayeur to Refuge Bertone. This climb wasn't so bad—just the usual "are we there yet" stuff— but right after I hit the aid station at the top I just started to feel incredibly wiped out. It was starting to get hot and I just sat in the shade for a while and tried to pull myself together, with only modest success.

I spent a lot of the next few kilometers just hiking and trying to run a bit. This is unfortunate because this section (through to Arnouvaz) is some of the most runnable of the course, just rolling and smooth, so I was losing a lot of time. Eventually I just sat at the side of the trail and tried to recover. At this point I realized I probably needed to start on caffeine (I had been hoping to wait until night), so I took some caffeine and (I think) some salt. This picked me up some and I was able to keep going.

I don't really remember the climb out of Arnouvaz to Grand Col Ferret (745 D+) and mostly remember just kind of suffering through it and then the descent to La Fouly. By this point I was pretty nauseated: I never really vomited but none of my food felt appetizing and every time I started to run I would notice that I felt worse. At this point I hooked up with another American runner and we did about 5K together (Steve, IIRC), just taking it really casual. We were both way off our pace targets (him 30ish and me 34ish) and our stomachs had turned, so we just tried to take it easy. Steve mentioned that he'd been at a talk the day before about how it was worth trying to take a short nap if you were going to be much over 30 hrs, so I decided to try to do that at Champex.

The La Fouly to Champex-Lac section is deceptively long: there's a long slow downhill from La Fouly which is mostly on dirt and then road, so theoretically runnable (here again, I wasn't running as much and so losing time), followed by the climb to Champex, where Chris was waiting. The climb isn't that technical and I started to feel better once I got on it and was able to push some without it jostling my stomach (also it was starting to get later in the afternoon). It helps that the climb isn't really that long and kind of shaded.

Met up with Chris again, more Tailwind Recovery, and we refilled everything. Sure enough, there was a tent with mattresses and I did try for 15 min, but I wasn't able to sleep at all. Eventually, I gave up, but did take advantage of the mostly quiet tent to change my shirt and hat, as everything was all sweaty and I wanted it dry for the evening.

Champex-Lac to Trient [16.2 km, +914/-1088, 4:35, +2:09] #

The rest of the course is three big climbs and descents, with crew at the end of each (well, the last one is the finish), so it was just a matter of getting through each.

The first of these is Champex-Lac to Trient. This was probably objectively the hardest of the three, not so much because of the vert (over 800M+) but because it's really rocky and steep, so it's hard to find your pace because you're constantly having to high step, etc. The downhill doesn't get much better either because it's just rocky and rooty, so I (at least) couldn't go that fast and there was a lot of intermittent hike/running when I should have been running.

Arrived Trient in the dark and it's rinse repeat from here: Tailwind recovery, new bottles, and go. Still nauseated here, but it was kind of under control.

Trient to Vallorcine [8.9km, +836/-875, 3:10, +2:39] #

This was probably the best of the last three climbs: it was the least technical and so you could mostly just motor up it, and I felt pretty OK until partway up when I started to have some serious bathroom issues. The trail was pretty wide, with an uphill face on one side and a drop-off on the other, but I was finally able to find a section where I could go downslope a bit, hang onto the hillside, and go. Thanks to whoever gave me a hand getting back up after I was done. Took a couple of Imodium here, which seemed to help, at least as far as Vallorcine.

Not too much to report about this section. Just something I had to get through to make the final climb. Was relieved when I got to Vallorcine and met Chris for the final time. Did the usual aid station thing and was also able to bum some salt tablets off a fellow runner as I was running out, as well as some napkins to use as toilet paper off of the med workers.

At this point I decided to swap drinks: I'd been drinking Tailwind or Overstim the whole time but I'd brought some Maurten powder and poured that into my bottles for the last push. Maurten is a hydrogel formula designed to reduce GI distress and also comes in a 320cal/500ml formulation (Tailwind is 200), which means that you don't really need to eat anything, which I was looking forward to at this point. It's got a bit of an off-putting slimy texture, which is part of why I didn't want to use it the whole time, but at this point that seemed pretty good.

Vallorcine to Finish [18.5km, +972/-1200, 4:36, +3:20] #

This last section was a real mixed bag. You start out on some flat segments and then there's a longish gradual uphill. I felt great on this: it was very smooth terrain and just a few percent grade so I pulled out the poles, put in the headphones, and power hiked at a nice hard pace, with the result that I was passing people left and right, in part because some of them were clearly dying but also because I was moving fast. This continued until partway through the main climb, even as it turned into a series of rock steps.

Then about 1/3 of the way up, my bathroom issues returned. I was eventually able to get a little bit off the trail and go but a lot of people passed me during this section and I somehow never regained my momentum. In theory these were people who were behind me, and so I should have re-passed them, but in practice, it just wasn't that easy.

This section felt unbelievably long, mostly because you're just not moving at all fast due to the terrain. Once you finally get near the top you have to pick your way through a boulder field, which is really slow going, at least for me. Eventually, I made it to La Tête aux Vents, and from there it's mostly flattish to La Flégère, albeit quite rocky. Here too, it was tough to run, though some people with better footwork than I tore past me. I actually fell once here and landed hard on my arm. After that I put my poles away. Here's Jim Walmsley on this section, not looking very fast. There's a final tiny climb to La Flégère. Not worth taking out my poles and I just did it hands on knees to the aid station. It's all downhill from here and I had plenty of food and fluid so I didn't bother to stop and just headed back down.

At this point, my focus was on a sub 38 finish (well short of my target, but oh well). It's supposedly 8K down to the finish and I left La Flégère at 36:48, so I needed to run 9:00/km to get there on time. This doesn't sound very fast, but of course at this point you're tired. The initial descent out of La Flégère is on fire road and so I was able to hit it pretty fast. Even so, when my 37:00 timer fired, I was 1.2km out, and 10:00/km wasn't going to do it. We fairly quickly got off fire road and into some very switchbacked single track, which slowed things down further.

I'd reconned the last 4K or so of the course, so I knew that it turned into runnable fire road and then smooth trail and road about 3K out, so I mostly just needed to survive the single track (without crashing!) and get to the part where I could work. I figured I needed to hit 37:30 with <4K to go in order to be on target, so I pushed as much as I dared. The pace wasn't bad, and I passed a few people, including some PTL finishers (the real heroes) but I definitely had a few braver—or more agile—people tear by me. I hit 37:30 at 3.8K or so, which was behind schedule, but I knew that I was getting close to the really runnable section so I just held on. I had plenty of gas here, I just wasn't able to run faster safely.

Finally I hit the fire road and was able to really open up some, even though it was kind of steep and rocky. Then there's a final switchbacky portion that I'd run before—though the course actually just goes straight downhill across a bunch of the switchbacks—and then out onto the road, or rather onto this terrifyingly rickety and slippery metal bridge that the race had erected over the road, and then finally onto dirt trail.

I looked at my watch and it was 37:40 and I figured it was less than 2K of flat running to go (actually more like a K, as it turned out) so things were probably OK if I didn't dawdle. As I said above, I had plenty left because I'd been running the equivalent of recovery pace for the last hour or so, so I felt comfortable pouring on the gas, or whatever gas I had left, and I ran the last half mile in 8:49 (which felt like 7:00), passing maybe 3 or 4 people in that last section. Once I realized I was so close and actually might go sub 37:50, I started pushing even harder. This is helped by the fact that in the last quarter mile you're running through Chamonix proper and everyone's cheering you on, even before 8 in the morning.

This time I learned my lesson about the kind of photos you get when you stop right at the finish line (like you're just standing there fiddling with your watch) and ran all the way through to finish in 37:49:49.

The best race picture ever taken of me [official race picture]

Event Review #

UTMB is an odd mix of very high production values—probably the best I've ever seen—and confusing organization.

On the good side, the support is fantastic and they have done a good job with a bunch of small things. For instance, there is solid live tracking of where runners are that helps your crew and then after the fact you can get really good data about your performance, including your pace and position at every checkpoint as well as links to videos of when you went through. For instance, here's me coming through the finish. Pretty glad to see I'm still running at this point. Another nice touch is that they give you a number for the back of your pack with your name and nationality on it so that people can talk to you when coming up from behind. And, of course, just having it be such a giant event with so many spectators is a great energy.

On the bad side, the communications and logistics can be pretty confusing. For example, the rules require you to carry "Minimum water supply: at least 1 liter". Does this mean you need to carry 1l at all times, in which case you would actually need to carry 2l out of the aid station so you had something to drink, or that you just need 1l worth of bottles? Who knows. Everyone seems to think it's the second. Another example is that they require "ID – passport/ID card". I'm not an EU citizen so do I need a passport? Apparently not; at least I didn't bring one. Similarly, it was hard to learn about the schedule for crew to be bused around. These are small points, of course, but it's just friction that adds up and is a bit surprising with an event that is otherwise well run.

Overall, though, this is a great event and I would definitely encourage anyone who was interested in European ultrarunning to check it out. Of course, it's also really hard to get into, so I can also recommend the Innsbruck Alpine Trail Festival, which is similar-ish terrain (though easier) and you can just sign up.

Retrospective #

This feels like one of those races that could have gone better. It's possible that 34:29 was optimistic and certainly I ran some of it with other people with strong track records who were nowhere near their target times. On the other hand, I feel like there was a fair amount of room for improvement.

Nutrition #

What went well here was that I stayed on my nutrition plan. I planned to drink 250ml of sports drink every 30 minutes and then eat 100cal of something else every hour, for a total of 300 cal/hr, plus whatever I consumed in aid stations. I didn't hit that perfectly, but I was fairly close. Same with the plan to drink Tailwind Recovery in aid stations, which worked well, as by that point that chocolate taste was about all I wanted. The end result was that I never really bonked at all.

On the other hand, I was nauseated for a lot of the event and that stopped me from running when I should have. I probably needed to do some adaptation here and try to figure out how to debug it rather than just slow down and walk through it. This is always a tough one, but probably I needed to switch out what I was eating earlier. I had brought mostly Spring Cannaberry, which I usually like, but about halfway through I didn't want any more. I had brought some Powergel Strawberry/Banana, which usually I lose the taste for half-way (ironically, preferring Spring), but this time I liked it at the halfway-type point and wished I had more; unfortunately due to a snafu with my drop bag, I only had about two of these as opposed to the 5 or so I actually brought.

I think part of the problem here was running low on salt: Hydrixir long distance has about half as much sodium as Tailwind (333mg/500ml as opposed to 620mg/500ml). I found myself wanting salt and I did have salt tablets, but I don't think I got on this early enough, and the SaltStick caps I am using only have 215mg of sodium, so you need to take a lot of them to catch up for that deficit. I did have some soup early and that tasted good so it probably should have been a hint that I needed to be more aggressive about sodium. Next time, probably I need a schedule for salt intake.

In retrospect, I wish I had just assumed I wasn't going to eat any of the race food and just use the race drink, and then I could have just planned all my eating and not had to think about it. That would have reduced cognitive load.

Aid Stations #

I spent too much time in aid stations. Enough said. I was tired, but that's when you have to just get in and out. The data says like 1:40, and I should be able to get that down to less than an hour.

Pacing #

I think my pacing was pretty OK here. I felt like I did a pretty good job of not pushing the first half too hard most of the time. If you look at the graph of my position in the race, I was mostly flat through the first half, and then got gradually better after Courmayeur and especially Arnouvaz:

[From the UTMB site]

This looks like pretty good pacing, although it also has me falling further behind LiveRun's projections as time goes on rather than, as I sort of assumed, losing time initially and then holding on.

There are several places where I'm not happy here. First, I felt like I was working too hard on the early climbs. Basically, they were just too steep to take it easy on. Second, I should probably have pushed harder in the flat runnable section after Refuge Bertone and then down from La Fouly. There were reasons, but I think it would have been better to push. These are kind of opposites, but I think that's right: you want to take the uphills easier early and then the downhills faster to take advantage of it.

Finally, as a result of my inability to run the technical downhills hard, I actually wasn't as tired at the end as I otherwise would have been, which is why I was able to dig for the last kilometer or two. I could have done that for quite a bit longer and would have started earlier if I hadn't been worried about my footing. Maybe this is a signal I should have pushed more of the uphills on the theory that I could recover on the downhills, but if you're really tired at the top, your chance of tripping goes up.

Training #

Some of this goes back to my training. I feel like my fitness was good as evidenced by various workouts, but there are two places where I think more specificity would have paid off.

First, I wish I'd done more hiking on difficult courses. I did a lot of training at similar grade ratios (~60m/km) but it was mostly on smooth courses where I could run the whole thing. Even when I hiked it was mostly on courses where I could have run. The few times I did something really hard (Yosemite, Mount Diablo), it slowed me down a lot. The result was that I wasn't able to initially take those difficult climbs as easily as I wanted while still making progress and then later to really push them without getting exhausted.

Second, I need to spend more time running technical downhills. I've gotten good enough to go fast when it's non-technical but as soon as it got rocky or rooty, a lot of people were going past me. This was really noticeable in descents that had a mix of single track and fire road because I'd get passed on the former and pass on the latter.

On the other hand, UTMB is probably one of the few races like this I'm really going to run—just say no to TDS—so this may not be a piece of specificity I need so much in the future.

Overall #

Overall, this wasn't a big success but it could have been a lot worse. I finished in good order and while I had some bad spots I never had anything where I really cratered. I didn't get injured, I'm back to running a bit already, and I've built up a lot of fitness that I can use later in the season. Plus, I've got some cool UTMB gear to wear to other races.

Finally, I want to really thank Chris for flying over and crewing me. It made an enormous difference.

[These photos helpfully taken by strangers with Chris Wood's phone.]

Overall: 37:49:49, 623/1789 finishers (838 DNF),

ELI15: Private Information Retrieval

2022-08-30T00:00:00Z

In my post on Safe Browsing I mentioned that one possible solution to the problem of querying the Safe Browsing database is Private Information Retrieval (PIR) and then waved my hands vigorously about it being crypto magic. In this post, I'm going to attempt to explain how PIR works with as simple math as possible. You will, however, want to read the Web version of this post because there is a fair bit of math and I use LaTeX to render it with MathJax, which looks bad in the newsletter version.

The PIR Problem #

The basic version of the PIR problem looks like this:

You have a server with some database $\mathbb{D}$ consisting of a set of $d$ elements $D_1, D_2, D_3, ... D_d.$
The client wants to retrieve the $i$th element $D_i$ but doesn't want the server to know which element it retrieved.

There is an obvious trivial solution^[1] in which the server sends the client the entire database and the client just looks up the value it wants to know.^[2] This provides privacy but at the expense of communication cost because you have to send the entire database. The challenge, then, is to build a system which involves sending less data, has comparable privacy, and doesn't chew up too much computational power.

There are two main flavors of PIR:

Single server schemes
Multiple server schemes

The single-server schemes are designed under the assumption that the server is malicious and use cryptographic mechanisms to protect against it. The multiple server schemes are designed under the assumption that some subset of the servers is non-malicious and are insecure if all the servers misbehave. In this post, I'll be talking solely about single-server PIR schemes; at some point in the future I might talk about multi-server.^[3]

A Simple Insecure Solution #

The first observation to make (due to Beimel, Ishai, and Malkin) is that any single-server system must involve the server computing some function over every element in the database. Otherwise, the server could simply look at which elements were touched and learn something about which were retrieved. This tells us something about how things need to be constructed.

Let's start with a solution that's insecure but can serve as the basis for a secure solution. Take the database and arrange it in a square arrangement (a "matrix") like so:

$$ \begin{bmatrix} D_1 & D_2 & D_3 \\ D_4 & D_5 & D_6 \\ D_7 & D_8 & D_9 \\ \end{bmatrix} $$

In order to make a query, the client creates a list of numbers that consists of all 0s except for a 1 in the position corresponding to the column in the matrix that it wants to read. For instance, if it wants to read value $D_6$, it would send the list below (I'm writing this vertically for reasons which will become apparent shortly).

$$ \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} $$

The server constructs its response as follows. For each row in the matrix, it goes column by column, multiplying each value in its database by the value in the corresponding position of the client's list, and adds up the products for that row. This produces a list that is the same length as the client's input, with one value per row of the matrix. In this case, we would then get:

$$ \begin{bmatrix} D_1 \cdot 0 + D_2 \cdot 0 + {\color{red}D_3 \cdot 1} \\ D_4 \cdot 0 + D_5 \cdot 0 + {\color{red}D_6 \cdot 1} \\ D_7 \cdot 0 + D_8 \cdot 0 + {\color{red}D_9 \cdot 1} \\ \end{bmatrix} = \begin{bmatrix} D_3 \\ D_6 \\ D_9 \\ \end{bmatrix}
$$

As you can see, what's happened here is that the 0s erase the columns we're not interested in and we're just left with a list of the column of interest (in this case the rightmost one), shown in red. The client can then just read out the value of interest by looking at the right row.

Those of you who have taken linear algebra will recognize this as conventional matrix multiplication, where we multiply the database times the selection vector. However, you don't need to know that in order to understand what's going on.
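Here is a minimal sketch of the insecure scheme in Python, just to make the bookkeeping concrete. The function names `make_query` and `answer` are mine, and the integers 1 through 9 are stand-ins for $D_1$ through $D_9$.

```python
def make_query(column, width):
    """Client: a selection list that is all 0s except for a 1 at `column`."""
    return [1 if j == column else 0 for j in range(width)]


def answer(matrix, query):
    """Server: multiply each row by the query and add up the products."""
    return [sum(value * q for value, q in zip(row, query)) for row in matrix]


# Stand-in database: the numbers 1..9 play the role of D_1..D_9.
database = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

query = make_query(column=2, width=3)  # select the rightmost column
response = answer(database, query)     # the server touches every entry
wanted = response[1]                   # D_6 sits in the second row

print(query, response, wanted)
```

Note that the server returns the entire column; it is the client that picks out the row it cares about, so the row index never leaves the client.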

It's worthwhile to stop and look at the properties of this design. In the trivial solution, the server had to send $d$ values to the client, whereas in this design the client has to send $\sqrt d$ values and the server sends $\sqrt d$. With a small database like this one, this is a trivial improvement, but for a large database $2\sqrt d$ is going to be much smaller than $d$. The server has to perform $d$ computations, one for each value in the database; as noted above, this is expected. Unfortunately, this scheme is also trivially insecure in that the server learns the column (though not the row) that the client is interested in so we need something fancier. The solution lies in a technology called "homomorphic encryption".
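To get a feel for how quickly $2\sqrt d$ pulls away from $d$, here is a small back-of-the-envelope calculation. It only counts values on the wire and ignores how big each value is; the helper name `values_sent` is my own, and for databases that aren't a perfect square I'm assuming you round the side of the matrix up.

```python
import math


def values_sent(d):
    """Values on the wire for a database of d entries (d >= 1)."""
    side = math.isqrt(d - 1) + 1  # ceil(sqrt(d)): side length of the matrix
    return {"trivial": d, "matrix": 2 * side}


for d in (9, 10_000, 1_000_000):
    print(d, values_sent(d))
```

For the 9-entry example this is 6 values versus 9, but for a million entries it is 2,000 versus 1,000,000.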

A More Secure Solution: Homomorphic Encryption #

Partially Homomorphic Encryption #

It's been known for a very long time how to do partially homomorphic encryption. As a concrete example, consider the case where you encrypt some data by XORing it with a key, i.e.,

$$Ciphertext = Plaintext \oplus Key$$

With this system, you can have the server compute the XOR of two plaintexts $P_1$ and $P_2$, given only the encrypted form. The client sends:

$$ (C_1, C_2) = (P_1 \oplus K_1, P_2 \oplus K_2)$$

The server returns:

$$ C_1 \oplus C_2 $$

Which the client XORs with $K_1 \oplus K_2$, i.e.,

$$P_1 \oplus K_1 \oplus P_2 \oplus K_2 \oplus K_1 \oplus K_2 $$

When you cancel out the keys ($A \oplus A = 0$) you get:

$$ P_1 \oplus P_2$$
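The XOR exchange above can be sketched in a few lines of Python. The plaintext bytes are arbitrary values I made up, and `xor_bytes` is a helper of my own; the point is just that the "server" step only ever touches ciphertexts.

```python
import secrets


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# Client side: two plaintexts, each encrypted under its own random key.
p1 = b"\x0f\xf0\xaa"
p2 = b"\x01\x02\x03"
k1 = secrets.token_bytes(len(p1))
k2 = secrets.token_bytes(len(p2))
c1 = xor_bytes(p1, k1)
c2 = xor_bytes(p2, k2)

# Server side: sees only c1 and c2, and XORs them together.
server_result = xor_bytes(c1, c2)

# Client side: strip off K_1 xor K_2 to recover P_1 xor P_2.
recovered = xor_bytes(server_result, xor_bytes(k1, k2))
print(recovered.hex())
```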

The difference between partially and fully homomorphic encryption is that with a partially homomorphic system you can compute some functions on encrypted data but not others. With a fully homomorphic system you can compute any function, whereas this system is homomorphic with respect to XOR but not (say) to multiplication. The problem of fully homomorphic encryption had been open for a long time until Craig Gentry finally showed how to do it in 2009.

The reason that our simple approach was insecure is that the server has to know which values in the client's list are 0 and which are 1, and so can easily determine which column the client wants. But what if the server could perform this computation without determining which of the client's values was 1? Kushilevitz and Ostrovsky figured out how to do this in 1997, using a technique called homomorphic encryption. A homomorphic encryption system is one in which you can operate on encrypted data without seeing the content of the data (see the sidebar for some intuition on this).

Specifically, we want a homomorphic encryption scheme which is homomorphic with respect to addition. I.e., if we have two ciphertexts $E(A)$ and $E(B)$, there is some way to compute $E(A + B)$ without knowing $A$ or $B$. All we have to do is have the client encrypt its 1s and 0s under a homomorphic system to which it knows the key, then send the encrypted versions to the server. The server can then perform the same computations as before, except with the encrypted data.

The way this works is that the client sends:

$$ \begin{bmatrix} E(0) \\ E(0) \\ E(1) \end{bmatrix} $$

The server would then compute: $$ \begin{bmatrix} D_1 \cdot E(0) + D_2 \cdot E(0) + {\color{red}D_3 \cdot E(1)} \\ D_4 \cdot E(0) + D_5 \cdot E(0) + {\color{red}D_6 \cdot E(1)} \\ D_7 \cdot E(0) + D_8 \cdot E(0) + {\color{red}D_9 \cdot E(1)} \\ \end{bmatrix} = \begin{bmatrix} E(0) + E(0) + {\color{red}E(D_3)} \\ E(0) + E(0) + {\color{red}E(D_6) }\\ E(0) + E(0) + {\color{red}E(D_9) }\\ \end{bmatrix} = \begin{bmatrix} E(D_3) \\ E(D_6) \\ E(D_9) \\ \end{bmatrix}
$$

The client receives this value, decrypts, and it's got the result. One thing that might be sort of confusing here is that I'm showing the server both adding and multiplying, as in:

$$ D_1 \cdot E(0) + D_2 \cdot E(0) + D_3 \cdot E(1) $$

However, because the server is multiplying the encrypted value by a known value, it can do this just by addition, as in:

$$ E(2A) = E(A) + E(A) $$ $$ E(3A) = E(2A) + E(A) $$

So, all you need is an addition operation. There are, of course, tricks to make this faster. For instance, you can compute powers of two (1, 2, 4, 8, etc.) and then just build up the final value from those. If you want to multiply two encrypted values, e.g., $E(A) * E(B) = E(AB)$ then you need a fancier system, but that's not required here.
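The powers-of-two trick is the usual double-and-add construction. Below is a sketch of it written against an abstract `add` operation; `add` and `zero` (standing for some encryption of 0) are hypothetical parameters of mine rather than part of any particular library. For the demo, plain integers stand in for ciphertexts so you can check the answer by eye.

```python
def scalar_multiply(ciphertext, k, add, zero):
    """Build E(k*A) from E(A) using only the homomorphic `add` operation."""
    result = zero        # running total, starts as E(0)
    power = ciphertext   # E(A), then E(2A), E(4A), ...
    while k > 0:
        if k & 1:        # this power of two appears in k
            result = add(result, power)
        power = add(power, power)
        k >>= 1
    return result


# Demo with plain integers standing in for ciphertexts:
# "E(5)" multiplied by the known value 13.
product = scalar_multiply(5, 13, add=lambda a, b: a + b, zero=0)
print(product)
```

The number of additions grows with the number of bits in the known multiplier rather than with the multiplier itself, which is what makes this worthwhile for large database values.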

Of course, finding a suitable homomorphic encryption scheme is tricky because you want something that is cheap to compute and has a small ciphertext. The original K-O scheme used a fairly inefficient homomorphic encryption system and much of the work here has been in finding better systems.

Detail: Homomorphic Encryption using ElGamal #

You don't need to understand how to build a homomorphic encryption algorithm in order to understand PIR, but it's sometimes helpful to see things written out. In this section, I describe a simple well-known scheme based on the ElGamal encryption algorithm.

In the ElGamal encryption system client and server share a known value $g$. In order to receive a message, an entity (say the client) creates a random value $y$ and publishes $g^y$. In order to encrypt a message $m$ to someone, you generate your own random value $x$ and then send the pair of values:

$$g^x, g^{xy} \cdot m$$

The recipient (who, recall, has $y$) can then do the following computation:

Take $g^x$ from the message and raise it to $y$ to get $(g^x)^y = g^{xy}$
Divide the second part of the message by $g^{xy}$ to recover $m$

Note that in ordinary integer math, given $g^a$ and $g$ it's easy to compute $a$, but we're going to be doing this in a setting where that computation is hard, namely modulo some prime $p$.^[4] This is called the discrete logarithm problem or just "discrete log". The intuition is that you can compute $g^{xy}$ either by knowing $g^y$ and $x$ (which the sender does) or $g^x$ and $y$ (which the receiver does), but if you only know $g^y$ and $g^x$ you're stuck. Everything else is pretty much the same as normal math but just remember that part.

However, it turns out that this system is homomorphic with respect to multiplication, not addition. Consider the pair of ciphertexts:

$$E(m_1) = (g^{x_1}, g^{x_1y}m_1)$$ $$E(m_2) = (g^{x_2}, g^{x_2y}m_2)$$

If we multiply the first parts and the second parts together, we get:

$$E(m_1 m_2) = E(m_1) \cdot E(m_2) = (g^{x_1 + x_2}, g^{y(x_1 + x_2)}m_1m_2)$$

You can decrypt this exactly as before to get $m_1m_2$.
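If you want to see the multiplicative property in action, here is a toy version of ElGamal in Python. The modulus `P`, base `G`, and all the function names are my own choices for illustration; these are nowhere near vetted parameters, so treat this as a sketch of the algebra and not as usable cryptography.

```python
import secrets

P = 2**127 - 1  # a prime, chosen only because it is easy to write down
G = 3           # toy base; a real system picks this much more carefully


def keygen():
    """Recipient: pick a secret y and publish g^y."""
    y = secrets.randbelow(P - 2) + 1
    return y, pow(G, y, P)


def encrypt(public, m):
    """Sender: pick a fresh x and send (g^x, g^(xy) * m)."""
    x = secrets.randbelow(P - 2) + 1
    return pow(G, x, P), (pow(public, x, P) * m) % P


def decrypt(y, ciphertext):
    """Recipient: raise g^x to y, then divide it out of the second part."""
    first, second = ciphertext
    shared = pow(first, y, P)
    return (second * pow(shared, -1, P)) % P


def multiply(ct1, ct2):
    """Multiply first parts and second parts together: E(m1) * E(m2)."""
    return (ct1[0] * ct2[0]) % P, (ct1[1] * ct2[1]) % P


y, public = keygen()
c1 = encrypt(public, 6)
c2 = encrypt(public, 7)
product = decrypt(y, multiply(c1, c2))
print(product)
```

The three-argument `pow` with an exponent of `-1` computes a modular inverse (Python 3.8 and later), which is how "divide" is implemented here.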

However, I said above that we wanted something that was homomorphic with respect to addition, not multiplication. The trick here is that instead of encrypting message $m$ you instead encrypt $g^m$. Thus, the result becomes:

$$g^{m_1} \cdot g^{m_2} = g^{m_1 + m_2}$$

And you just need to take the discrete log to recover $m_1 + m_2$. But didn't I just say that taking discrete logs is hard? Basically, this works fine as long as the value to be retrieved is relatively short. So, for instance, if we restrict ourselves to retrieving a single bit, then you just need to compare against $g^0$ or $g^1$. The limit depends a bit on computational power, but it's fairly practical to retrieve 32-bit values with the right algorithms (for smaller values like 8 bits you can just build a table).

To use this system in practice, the client is just going to encrypt to itself by generating a key that it knows, but otherwise we just use this system as-is.
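Putting the pieces together, here is a sketch of the whole PIR query using the "encrypt $g^m$" variant. As before, `P`, `G`, the function names, and the database contents are all assumptions of mine, the parameters are toy-sized, and the discrete log is done by brute force over a deliberately tiny range (values up to 64).

```python
import secrets

P = 2**127 - 1  # toy prime modulus, not a vetted parameter choice
G = 3           # toy base


def keygen():
    y = secrets.randbelow(P - 2) + 1
    return y, pow(G, y, P)


def encrypt(public, m):
    """Encrypt g^m instead of m, so that ciphertexts add."""
    x = secrets.randbelow(P - 2) + 1
    return pow(G, x, P), (pow(public, x, P) * pow(G, m, P)) % P


def add(ct1, ct2):
    """E(m1) 'plus' E(m2): multiply the parts to get E(m1 + m2)."""
    return (ct1[0] * ct2[0]) % P, (ct1[1] * ct2[1]) % P


def scale(ct, k):
    """Known value k times E(m): repeated addition, done as a power."""
    return pow(ct[0], k, P), pow(ct[1], k, P)


def decrypt(y, ct, max_value=64):
    """Recover g^m, then brute-force the (small) discrete log."""
    g_m = (ct[1] * pow(pow(ct[0], y, P), -1, P)) % P
    candidate = 1  # g^0
    for m in range(max_value + 1):
        if candidate == g_m:
            return m
        candidate = (candidate * G) % P
    raise ValueError("plaintext outside the searchable range")


def server_answer(matrix, enc_query):
    """Server: the same row-by-row sum as before, but on ciphertexts."""
    answers = []
    for row in matrix:
        total = scale(enc_query[0], row[0])
        for value, ct in zip(row[1:], enc_query[1:]):
            total = add(total, scale(ct, value))
        answers.append(total)
    return answers


database = [[12, 7, 33], [5, 41, 60], [9, 28, 17]]

y, public = keygen()                                     # client encrypts to itself
enc_query = [encrypt(public, bit) for bit in (0, 0, 1)]  # rightmost column
enc_column = server_answer(database, enc_query)
column = [decrypt(y, ct) for ct in enc_column]
wanted = column[1]                                       # second row
print(column, wanted)
```

Each of the three query ciphertexts uses fresh randomness, so the server cannot tell the encrypted 1 from the encrypted 0s just by looking at them.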

Complexity #

This is all pretty cool and it's better than nothing, but it's also not very efficient: in order to retrieve a single value you need to send $2\sqrt d$ values ($\sqrt d$ values in each direction) and the values themselves are relatively large (in basic ElGamal, from 512 bits to 8192 bits^[5]). On the other hand, if the database is large, it's still more efficient than sending the whole database. The database size is $d$ values, so if each value is a single bit (as in the original K-O scheme), the breakeven point where you send less data than you would just by sending the database is around $512^2$ (about 260,000) entries if you are using an efficient version of ElGamal. The situation with the original K-O system was even worse.

In terms of computational complexity, the server has to compute over each database entry, so that's $d$ units of work—recall that you have to compute over each value in order to have a PIR system. The client only has to compute the $\sqrt d$ input values, then decrypt the relevant returned values and take the discrete log, so that's fairly cheap.

Improvements #

As described in the original K-O paper, it's possible to significantly improve the basic scheme by being a little clever.^[6]

Reusing the Client's Vector #

The original K-O scheme was even worse than what I have presented above in that you could only extract single-bit values. This meant that if you wanted to extract multiple-bit values, you naively just repeat the protocol for each bit, so both the database size and the PIR protocol scale linearly with entry size.

This leads to an obvious improvement: say that I want to read values from a database where each entry is 8 bits rather than 1. As noted above, the client could just send 8 input vectors, but why? The client's vectors all pick out the same column in the database and they're not specific to anything on the server side.

Instead, the server can just compute its results for each of the 8 bits of the database using the same client input. The server then sends back $8 \sqrt d$ values, with the first $\sqrt d$ being for the first bit, the next $\sqrt d$ for the next bit, etc. but all computed over the same client input. This gives you a total communications complexity of:

$$ C (1 + \sqrt d + b \sqrt d) $$

Where $b$ is the number of bits to be extracted and $C$ is the size of the homomorphic encryption ciphertext. It also means you don't need multiple round trips. Of course, if you have a fancier scheme that lets you extract values that are greater than one bit, then this trick becomes less interesting. However, if you need to extract big values that make discrete log impractical (say 100 bits) then it becomes useful, because you can extract the value in pieces, each of which is easy to compute discrete log on.
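As a rough sketch of the reuse trick, here is the server computing one response vector per bit position from a single client input. To keep it short I use plain, insecure integer arithmetic in place of the homomorphic operations, and the function names and the least-significant-bit-first ordering are my own assumptions.

```python
def answer_bit_planes(db, query, bits):
    """Server side: one result vector per bit position, all from the same query.

    db is a square matrix of small integers, query is the client's 0/1 vector.
    """
    return [
        [sum(((value >> k) & 1) * q for value, q in zip(row, query)) for row in db]
        for k in range(bits)
    ]

def reassemble(planes, row_index):
    """Client side: rebuild the entry from the bits in the row it cares about."""
    return sum(plane[row_index] << k for k, plane in enumerate(planes))

db = [[200, 17],
      [5, 255]]
planes = answer_bit_planes(db, [0, 1], bits=8)  # select column 1
value = reassemble(planes, row_index=0)          # entry at row 0, column 1
```

In the real protocol each element of `query` and each element of the response would be a ciphertext, so the server never sees which column was selected.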

Recursion #

The next optimization requires a little more cleverness. Recall that the server sends the client values corresponding to each row in the database but that the client only cares about one of the rows. Say we have a database that consists of $d$ values and so our matrix is $\sqrt d$ on each side. The client sends $\sqrt d$ values and the server replies with $\sqrt d$ values. The client only cares about the $i$th value in the server's response, but it can't tell the server that because that would tell the server which row it was interested in.

The key insight here is that this itself is a PIR problem, with the database consisting of $\sqrt d$ values of length $C$. In the naive protocol described above, the server sends the entire database to us, but we only care about $\frac {1}{\sqrt d}$th of it. We can use the same PIR scheme to request just the pieces we care about, one at a time.

But why stop there? We can keep using the same trick! Imagine we have a really big database of $2^{48}$ entries. Then even the second level database representing $\sqrt d$ entries in the server's response is going to be quite large, which means that the PIR problem of extracting one element out of that vector is also expensive. But we can do the same thing again. It's turtles all the way down!

Further Improvements #

Even with the optimizations above, we're still left with a system which isn't very efficient, especially for smaller data sets, where it's quite a bit worse than just transferring the entire database (the advantage goes as a factor of $\sqrt d$). In the 25 years since the original Kushilevitz and Ostrovsky paper, there has been quite a bit of work in this area.

This seems to fall into a small number of buckets.

Improving the Inner Loop #

If you step back and look at the basic design of the K-O protocol, it looks like this (I'm using the linear algebra matrix multiplication notation here, but really the $\times$ just denotes that we are doing whatever our core operation is in criss-cross fashion, as before):

$$ \begin{bmatrix} V_1 & V_2 & \color{red}{\mathbf{V_3}} \\ V_4 & V_5 & \color{red}{\mathbf{V_6}} \\ V_7 & V_8 & \color{red}{\mathbf{V_9}} \\ \end{bmatrix} \times \begin{bmatrix} 0 \\ 0 \\ \color{red}{\mathbf{1}} \\ \end{bmatrix} \rightarrow \begin{bmatrix} \color{red}{\mathbf{V_3}} \\ \color{red}{\mathbf{V_6}} \\ \color{red}{\mathbf{V_9}} \\ \end{bmatrix} $$

In other words, the input vector supplied by the client operates on each row of the database, picking out the column of interest to the client and ignoring the other values (remember that rows in the input vector correspond to columns we want to select). The server sends back each resulting row and the client reads the row of interest, ignoring the others. This basic structure holds whether the operation being performed is simple multiplication (as in our insecure example) or homomorphic encryption.
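The structure above can be sketched in a few lines using the insecure "simple multiplication" version. The function name and the sample values are mine; a real system would replace the multiply-and-add with homomorphic operations on encrypted query values.

```python
def select_column(db, query):
    """Apply the client's 0/1 vector to every row, yielding one value per row."""
    return [sum(value * q for value, q in zip(row, query)) for row in db]

db = [[1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]]
response = select_column(db, [0, 0, 1])  # pick out the third column
wanted = response[1]                     # the client only reads row 1
```

Note that the server touches every entry of `db` no matter which column is requested, which is where the $d$ computational cost comes from.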

This means that the cost of the system is determined by the basic scaling properties of $2 \sqrt d$ communications cost and $d$ computational cost, but multiplied by the cost of the homomorphic encryption system. The more efficient the homomorphic encryption system is, the more efficient the whole thing will be. There has been a fair amount of work invested in finding more efficient homomorphic encryption algorithms to plug in here.

Reducing the Client's Input Vector #

There is another cute trick we can play, that's a natural extension of the techniques we have already seen. Suppose that we have a homomorphic encryption scheme that lets me:

Add as many encrypted values as I want
Do a single multiplication of two encrypted values

In this case, we can reduce the communication cost further, as described by Boneh, Goh, and Nissim. Instead of sending a single list of encrypted values, containing a single (encrypted) 1, the client sends a pair of lists, each containing a single (encrypted) 1. The server then computes the product of each pair of values in each list, e.g.,

$$ \begin{bmatrix} 1 \\ 0 \\ \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ \end{bmatrix} \rightarrow \begin{bmatrix} 0 & 1\\ 0 & 0 \\ \end{bmatrix} $$

We can then lay this out in a deterministic order left to right and top to bottom (though any rule will work) as a single list, like so:

$$ \begin{bmatrix} 0 \\ 1\\ 0 \\ 0 \\ \end{bmatrix} $$

This list can then be used as the input to the standard K-O protocol, and we've just reduced the total number of values the client sends from $\sqrt d$ to about $2\sqrt[4] d$ (the server to client communication remains unchanged). We can actually improve the situation further by changing the structure of the database to be non-square, instead having $\sqrt[3] d$ rows and $(\sqrt[3] d)^2$ columns. In this case, the client sends two input vectors, each of which is $\sqrt[3] d$ long, and the server maps them onto a $(\sqrt[3] d)^2$ long vector. It does the same criss-cross trick as before, producing a result that is $\sqrt[3] d$ long and sends it to the client, for a total communications cost of about $3 \sqrt[3] d$.
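Here is a minimal sketch of the expansion step, again in plain integers rather than ciphertexts. The function name is hypothetical; in the real scheme each product is the single homomorphic multiplication the encryption system permits.

```python
def expand_query(u, v):
    """Pairwise products of two short one-hot vectors, flattened row by row."""
    return [a * b for a in u for b in v]

short_a = [1, 0]
short_b = [0, 1]
long_query = expand_query(short_a, short_b)
```

Two vectors of length $n$ expand into one vector of length $n^2$ containing a single 1, which is why the client can send far fewer values for the same size of database row.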

Precomputation #

One interesting recent development in PIR is the design of systems which use precomputation to make the PIR process cheaper. The basic idea is that with a suitable homomorphic algorithm the server and client can perform some initial exchange, presumably involving some computation and the exchange of some data (a "hint"). Once the hint has been exchanged, the client can make individual queries much more cheaply. This makes sense for applications like Safe Browsing where the client is likely to make a lot of queries and so you can amortize the hint.

The specific precomputation techniques vary. In some designs, the client and server perform some client-specific precomputation and in others like SimplePIR, the server just does the computation itself and distributes the hint to every client.

Other Designs #

I've focused here specifically on designs that follow this K-O model, largely because they are intuitively easy to explain. There are also designs (for instance Cachin, Micali, and Stadler and Gentry and Ramzan) that are based on other structures and involve sending less data but at increased computation cost. The math here is a lot harder—I only somewhat understand it myself—so I'm not going to try to explain them here.

The Big Picture #

In conclusion, I'd like to make two points here. First, this is a really counterintuitive result, at least to me: we can allow a client to read some fraction of the server's data without the server learning anything about which values the client wants and in a fashion more efficient than just sending the client all the data. Hopefully, this post gives some intuition for why that's possible, thus rendering it less counterintuitive if not precisely obvious.

Second, PIR is an immensely powerful primitive. There are a whole pile of problems which would be much easier if we had efficient PIR, ranging from Safe Browsing, to messaging interoperability, to authentication for phone calls. We're not yet at the point where you can just drop in PIR the way you would drop in TLS, without really thinking about the cost, but we are getting closer to the point where some of these applications are practical. In fact we may already be there in some cases.

Acknowledgement #

Thanks to Henry Corrigan-Gibbs for assistance with this post. All mistakes are of course mine.

This is the standard leadin to this problem, as seen, for instance, in the Wikipedia article. ↩︎
Indeed, the longer hashes version of Safe Browsing is precisely this. ↩︎
As an aside, it's known that it's not possible to have information theoretic security with a single server. You have to depend on some cryptographic assumption. There are information theoretically secure versions of multi-server PIR, as long as some of the servers are not malicious. ↩︎
Yes, yes, or on an elliptic curve or something. ↩︎
Recall that you have to send two values for each ciphertext. ↩︎
Note: I am using a somewhat different presentation order which I think is easier to understand. ↩︎

Can we make Safe Browsing safer?

2022-08-16T00:00:00Z

The Web is full of bad stuff and it's the browser's job to protect you from it as best it can. For certain classes of attack, such as attempts to subvert your computer, that is a conceptually straightforward matter of hardening the browser, as described in the Web security guarantee:

users can safely visit arbitrary web sites and execute scripts provided by those sites.

In practice, of course, browsers have vulnerabilities which mean they don't always deliver on this guarantee. However, even if you ignore browser issues, there are other classes of harm, such as phishing or fraud, that aren't about attacking the computer but rather about attacking the user. Because these threats rely on users incorrectly trusting the site, hardening the browser doesn't work; instead we want to warn the user that they are about to do something unsafe. The primary tool we have available for protecting against this class of attack is to have a blocklist of dangerous sites/URLs. The most widely used such blocklist is Google's Safe Browsing, which is used by Chrome, Firefox, Safari, and other browsers (there are other similar services, but Safe Browsing is the most popular).

The Safe Browsing Database #

In order to implement Safe Browsing, Google maintains a database of potentially harmful sites that it collects via some unspecified mechanism. The Safe Browsing database^[1] consists of a list of blocked strings which consist of:

Domain names or parts of domain names
Domain and path prefixes, broken at path separators (/)
Domains, paths, and query parameters

So, for instance, for the URL https://example.com/a/b/c the database might contain example.com if the whole domain was dangerous or maybe example.com/a/b if only some parts of the domain were dangerous. In order to check a URL, you break it down into the list of prefixes and check all of them. If any of them match, then the URL is dangerous. Here's the example Google gives for the URL http://a.b.c/1/2.html?param=1:

a.b.c/1/2.html?param=1
a.b.c/1/2.html
a.b.c/
a.b.c/1/
b.c/1/2.html?param=1
b.c/1/2.html
b.c/
b.c/1/
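The decomposition above can be sketched roughly as follows. This is a simplified illustration, not Firefox's or Google's actual code: the function name is mine, the limits (up to four extra host suffixes taken from the last five labels, up to four path prefixes) are my reading of Google's documentation, and it skips URL canonicalization and the special handling of IP addresses.

```python
from urllib.parse import urlsplit

def lookup_expressions(url: str) -> list[str]:
    """Return the host-suffix/path-prefix strings to check for one URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"

    # Host suffixes: the exact host, then shorter suffixes, never the bare TLD.
    labels = host.split(".")
    hosts = [host]
    for i in range(max(1, len(labels) - 5), len(labels) - 1):
        hosts.append(".".join(labels[i:]))

    # Path prefixes: exact path with and without the query, then "/" and
    # successively longer directory prefixes.
    paths = []
    if parts.query:
        paths.append(f"{path}?{parts.query}")
    paths.append(path)
    prefix = "/"
    candidates = [prefix]
    for segment in [s for s in path.split("/") if s][:-1]:
        prefix += segment + "/"
        candidates.append(prefix)
    for candidate in candidates[:4]:
        if candidate not in paths:
            paths.append(candidate)

    return [h + p for h in hosts for p in paths]

expressions = lookup_expressions("http://a.b.c/1/2.html?param=1")
```

For the example URL this yields the same eight strings listed above, in the same order.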

If any of the substrings match, then the browser shows a warning, like this:

Pretty scary, right?

Querying the Database #

Note: There are a number of versions of Safe Browsing. This describes the Safe Browsing v4 protocol, which is what is currently implemented in Firefox and which I just call Safe Browsing for convenience.

Of course, the Safe Browsing database is on Google's servers, so the browser needs some way to query it. The obvious thing to do is for the client to send Google the URLs it is interested in and just get back a yes or no answer. Safe Browsing does have an API for this, but of course this has some obvious very serious privacy problems, in that the server gets to learn everyone's browsing history, which is something that many browsers try to stop in other contexts. AFAIK, no major safe browsing client currently operates this way by default, although Chrome offers a feature called "enhanced safe browsing" in which Chrome queries the Safe Browsing service directly for some URLs:

When you switch to Enhanced Safe Browsing, Chrome will share additional security data directly with Google Safe Browsing to enable more accurate threat assessments. For example, Chrome will check uncommon URLs in real time to detect whether the site you are about to visit may be a phishing site. Chrome will also send a small sample of pages and suspicious downloads to help discover new threats against you and other Chrome users.

However, this is not the default behavior.

The other obvious design is to just send the entire database to the client and let it do lookups locally. This is a reasonable design and one which I'll consider below, but it's not the way the current system works. Instead Safe Browsing uses a design which is intended to balance performance, privacy, and timeliness.

The basic structure of the system works as follows. For each string S_i in the database, the server computes a hash H(S_i). It then truncates each hash to 4 bytes (32 bits) and sends the truncated list to the client, as shown below:

The impact of this process is to compress the set of strings somewhat, to a total size of 4I bytes where I is the total number of strings (there is also a system to compress the database somewhat). As shown in this diagram, it's possible that multiple strings will map onto the same truncated hash (though different full hashes). As a practical matter, this is a pretty sparse space: there are only about 3 million strings (a bit under 2²²) and there are 2³² possible truncated hashes, so there will be approximately as many truncated hashes as there are input strings; the full hashes are 256 bits long and so are unique with extremely high probability.

However, the cost of this design is false positives: effectively, the hash function maps an input string onto a random 32-bit hash, and about 1/1400 of these hashes will correspond to one of the truncated hashes that the server sends to the client.^[2] Obviously, if the client were to generate an error every time there was a match this would create an unacceptable client experience, as people would regularly encounter scary warnings. However, this data structure does not have false negatives: if the hash prefix isn't in the list, then the hash won't be in the full list either.
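The 1/1400 figure is just the ratio of blocked strings to possible prefixes. Here is a quick back-of-the-envelope check, using the rough count of 3 million strings from above; the function name is mine, and it treats every string as having a distinct prefix, so it is a slight overestimate.

```python
from fractions import Fraction

def false_positive_rate(num_strings: int, prefix_bits: int) -> Fraction:
    """Chance that a random non-blocked string matches some listed prefix."""
    return Fraction(num_strings, 2**prefix_bits)

rate = false_positive_rate(3_000_000, 32)
print(f"about 1 in {round(1 / rate)}")  # prints "about 1 in 1432"
```

The same function shows how quickly the rate drops as the prefix gets longer, which matters for the longer-hash designs discussed later.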

Order of Operations #

In Firefox, Safe Browsing checks proceed partly in parallel to retrieving the URL; because the primary risk is the user inappropriately acting on the returned Web page, it's fine to contact the server as long as you don't display the result. This parallelism allows for better performance.

However, Firefox uses a similar mechanism for its Tracking Protection feature, and the purpose of that feature is (partly) to prevent trackers from using IP address-based tracking, so it's not even safe to send a request to the server before checking the blocklist. Fortunately, Tracking Protection downloads a list of full hashes and so doesn't need to wait for the server.

Instead of generating an error, the client double-checks the match by asking the server to send the full hashes corresponding to the truncated hash. In order to check a string, the client proceeds as follows.

Compute the full hash
If the 32-bit hash prefix is not in the downloaded list, then the string is OK and continue to retrieve the URL.
Otherwise, send the hash prefix to the server and ask the server to provide the list of corresponding full hashes with that prefix (typically just a single result).
If the full hash is on the list of returned hashes, then generate an error.
Otherwise, continue to retrieve the URL.
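The steps above can be sketched as follows. This is a minimal illustration rather than the real client: `is_blocked` and `fetch_full_hashes` are hypothetical names, the latter standing in for the network request to the server, and I hash the raw string without the canonicalization a real client performs.

```python
import hashlib
from typing import Callable, Iterable

def is_blocked(
    string: str,
    local_prefixes: set[bytes],
    fetch_full_hashes: Callable[[bytes], Iterable[bytes]],
) -> bool:
    """Return True if the string should produce a warning."""
    full_hash = hashlib.sha256(string.encode()).digest()
    prefix = full_hash[:4]  # the 32-bit truncated hash
    if prefix not in local_prefixes:
        return False  # no false negatives, so no need to ask the server
    # Possible match: double-check against the server's full hashes.
    return full_hash in set(fetch_full_hashes(prefix))

# A toy in-memory "server" holding one blocked string.
bad = hashlib.sha256(b"evil.example/").digest()
server_db = {bad[:4]: [bad]}
local = set(server_db)

blocked = is_blocked("evil.example/", local, lambda p: server_db.get(p, []))
allowed = is_blocked("good.example/", local, lambda p: server_db.get(p, []))
```

Because the final decision rests on the server's response, the server can retract an entry simply by returning an empty list for that prefix.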

This design has a number of advantages. First, it means that the server doesn't need to send the client the entire database, which is about four times larger than the truncated database because the hashes are four times larger (though more on this later).

Second, it allows the server to quickly retract inappropriately blocklisted sites. Suppose that the server had blocklisted a URL with hash XY where X is the 32-bit prefix and Y is the rest of the hash. The client retrieves X as part of downloading the database and then when it gets a match, asks for all the hashes starting with X. However, in the meantime, the server has decided that XY is OK. In this case, it can just return an empty list and the client will continue without error.

Conversely, however, the server cannot easily add new values between client-side database updates. Because the client never contacts the server if the prefix isn't in the database, the server won't have an opportunity to add new entries unless they happen to correspond to a prefix which is already in the database, which, as noted above, is quite unlikely. This is somewhat unfortunate because a lot of phishing attacks operate on the time scale of minutes to tens of minutes and so you would need the client to update its database impractically frequently in order to catch them (hence the reason for "enhanced safe browsing").

Finally, because most of the potential hash prefixes don't appear on the prefix list, the client mostly doesn't need to contact the server. This improves performance (because most URLs can be retrieved immediately) and privacy (because the server doesn't learn anything for most URLs). In addition, the client can cache any full hashes it has retrieved for a given prefix for some time, so it won't need to recontact the server during the cache lifetime.

Privacy Implications #

The basic privacy problem with Safe Browsing is that even though clients don't connect to the server for most URLs, they do connect for some URLs. Naively, you would expect the server to get queries for about 1/1400 of the user's browsing history keyed by the IP address (obviously the browser shouldn't send cookies!) but actually this underestimates the situation in two important ways:

As described above, the browser checks multiple strings for the same URL, with the exact number depending on the URL. Each of these might result in a query to the server. If we assume that there are 5-10 strings to check per URL, each with a roughly 1/1400 chance of matching, we're looking at more like 1/140 to 1/280 URLs.^[3] The situation is even worse if you visit multiple URLs on the same site.
This calculation assumes that the server isn't malicious. Consider a server which wants to know whenever you go to Facebook: it just needs to compute the hash prefix for facebook.com and publish that. When the client queries for that prefix, it returns a random hash (thus ensuring there is no blocking), but the server gets to learn that the client might be going to Facebook.

Checking Passwords #

Another general problem in this space is checking compromised passwords. The general setting here is that there is a server which has a list of passwords that have been in breaches, such as HaveIBeenPwned and the client wants to determine if the user's password is on the list. Naively, you can use the same protocol for this application as for Safe Browsing, but there are two complicating factors:

The server may want to keep the list of password hashes secret to prevent people from learning the list of passwords.
Because some passwords are much more common than others, the client may want to prevent the server from learning that it has one of these passwords by sending the corresponding hash prefix.

An example of a technique tuned specifically for password checking is provided in a paper by Li, Pal, Ali, Sullivan, Chatterjee, and Ristenpart which uses a combination of private set intersection to prevent the client from learning the hashes and "frequency smoothed" hash bucketing to prevent the hash from leaking information about the client's password.

Of course, the server doesn't actually learn which URLs the client is visiting because (1) it only learns hashes and (2) the hashes are truncated, so that there are many strings with the same truncated hash. Note that it's very important that the client only request hash prefixes because if the client were to ask for the full hash, it would be relatively straightforward for the server to determine most of the input strings just by computing the hashes for known URLs.

However, even though there are many strings with the same hash prefix, some of those strings (e.g., facebook.com) are more likely to be visited by users than others (e.g., 86c0cb28d2ae2b872eb52.example). An additional consideration is that a client might need to query multiple strings associated with the same site. For instance, if the client queries the hash prefix for educatedguesswork.org (hash=A) and educatedguesswork.org/posts/safe-browsing-privacy/ (hash=B) then it's more likely that the user is visiting this site than a pair of unrelated sites that happen to have hashes A and B. Providing a complete analysis of the level of privacy leakage from Safe Browsing is fairly complicated and depends on the distribution of visits to various sites and your prior expectations of which sites a user is likely to visit, but suffice to say that there is clearly some privacy leakage. Ideally, we would have no leakage.

Improving Privacy #

Trying to improve Safe Browsing, and in particular to address these privacy issues, is an active area of work and something that Google and Mozilla have collaborated on for quite some time. There are three primary known approaches to improving the privacy of this kind of system:

Proxying
Use full hashes
Crypto!

I'll discuss each of these below.

Proxying #

The most obvious technique is just to proxy the queries to the server. This conceals the IP address, which prevents the server from directly linking queries to the user. As I understand it, Apple already proxies Safe Browsing traffic, at least for iOS. Proxying is a nicely general technique which is simple to implement and reason about. Indeed, one might think that we could simplify the system by skipping the prefix list and just having the client query the server for every string (or more likely every full hash). This would provide better timeliness, including the ability to quickly add new entries, though of course at some performance cost.

There are a number of subtle points, however. First, it's important that the queries be unlinkable from the perspective of the server. Consider what happens if the client makes a long-term connection to the server (through the proxy) and then proceeds to make all its queries through that single connection. In that case, the server might be able to use the pattern of requests to infer the user's identity and then to connect that to the rest of their browsing activity. For instance, suppose user A retrieves the following URLs:

github.com/fuzzydunlopp
fuzzydunlopp.example/edit
www.instagram.com/marlo.stanfield/

It's a fair inference that the user in question is fuzzydunlopp and that they are also visiting Marlo Stanfield's Instagram.

This suggests that connection proxying systems like MASQUE are bad fits for this application and instead we would be better served by message proxying systems like Oblivious HTTP. In O-HTTP, each request is separately encrypted to the server, but requests from multiple clients are multiplexed on the same connection from the proxy, thus making it difficult to link them up. Even so, however, you need to worry about timing analysis (e.g., when potentially related requests come in close succession).

A related problem is that some servers are concerned about abuse (e.g., excessive requests). It's common to use IP addresses to detect this, for instance by looking for excessive traffic from a given IP address. It's not actually clear to me that abuse is that big a consideration in this case because serving the query is actually very cheap, as it's just a very small data value, but in any case having a proxy which conceals the client's address prevents IP addresses from being used for this purpose. This potentially makes it harder to manage misbehaving clients while providing service to legitimate clients.

There are a variety of techniques which might be usable for this application (e.g., PrivacyPass), but it's not clear how well they work in this case because you need to design a system which provides anti-abuse without linkability but which is also cheap enough to verify that it's not easier to just serve the request. For instance, if you have the choice between verifying a digital signature and then serving the request or just serving all the requests, it's probably better to just serve the requests: the vast majority of requests will be valid, and for those you need to both verify the signature and serve the requests so you have to pay both costs. Moreover, in many cases even a failed verification will be more expensive than just serving the request. In addition, if the proxy and the server have a relationship, then the proxy can do some of the work of suppressing abuse, as described in the O-HTTP spec.

Distribute Longer Hashes #

Another alternative design is to send the client longer hashes. The false positive rate is dictated by the fraction of hashes which correspond to blocked strings, and so just making the hash longer makes the false positive rate lower. If you use a sufficiently long hash, then you can make the false positive rate acceptably low and there is no need to double check with the server at all. This produces a much simpler system which is both faster (because you never need to contact the server) and more private (because the client never makes any queries to the server which depend on your browsing history).

How long a hash do you need? Safe Browsing uses 256-bit hashes (SHA-256), but you almost certainly need less. If you use a b-bit hash and there are 2²⁰ blocked strings, then the chance that a randomly chosen non-blocked string will be reported as blocked is 2^-(b-20). If we used an 80-bit hash, then the natural rate of false positives would be 2^-60, which seems acceptably low. However, this leaves open an attack in which an attacker deliberately creates a collision in order to make a site unreachable.

Consider the case where the attacker wants to block example.com. They make their own malware site and search the space of URLs until they find one which has the same hash as example.com. They then put their site up at that URL and wait for the server to detect it. Once it does, the server publishes the hash and suddenly no client can go to example.com. This attack doesn't work with the current Safe Browsing design because the client contacts the server, which uses a full hash (though the attacker can force the client to contact the server for example.com), but it works if you remove the double checking step. The natural defense against this attack is to just make the hash longer. For instance, if we were to use a 128-bit hash, then the attacker would need to do more like 2¹⁰⁰ work in order to create a collision, which is probably acceptably large.^[4]

It's important to note that the privacy guarantees of this system are better than those of the proxy system: with the proxy, privacy depends on the proxy and the server not colluding, whereas with longer hashes the privacy of the system does not require trusting anyone.

Of course, sending longer hashes means more communication cost: if we use 128-bit hashes, it will probably cost about 4 times as much to update the client. However, this is an upper bound: in the current Safe Browsing design, the client needs to make connections to the server in order to double check (this is even more expensive with proxying) and these are not necessary with longer hashes. Moreover, those connections are in the critical path for downloading URLs, whereas updating the hashes can be done in the background.

Crypto! #

Finally, we could use cryptography. This is closely related to a well-known problem called Private Information Retrieval^[5] in which the client wants to query a database without the server learning which database entry it is querying. Naively, PIR is precisely what we want here, in that it would give good privacy and yet full timeliness (we might still want to distribute the partial hashes to reduce the number of queries required for performance reasons) but the problem is that it's really hard to build a PIR scheme that has good enough performance to be in the critical path for a browser. For instance, in 2021, Kogan and Corrigan-Gibbs published a system called Checklist specifically designed for Safe Browsing, but it comes at real costs, as described in the Checklist abstract:

This paper presents Checklist, a system for private blocklist lookups. In Checklist, a client can determine whether a particular string appears on a server-held blocklist of strings, without leaking its string to the server. Checklist is the first blocklist-lookup system that (1) leaks no information about the client’s string to the server, (2) does not require the client to store the blocklist in its entirety, and (3) allows the server to respond to the client’s query in time sublinear in the blocklist size. To make this possible, we construct a new two-server private-information-retrieval protocol that is both asymptotically and concretely faster, in terms of server-side time, than those of prior work. We evaluate Checklist in the context of Google’s “Safe Browsing” blocklist, which all major browsers use to prevent web clients from visiting malware-hosting URLs. Today, lookups to this blocklist leak partial hashes of a subset of clients’ visited URLs to Google’s servers. We have modified Firefox to perform Safe-Browsing blocklist lookups via Checklist servers, which eliminates the leakage of partial URL hashes from the Firefox client to the blocklist servers. This privacy gain comes at the cost of increasing communication by a factor of 3.3×, and the server-side compute costs by 9.8×. Checklist reduces end-to-end server-side costs by 6.7×, compared to what would be possible with prior state-of-the-art two-server private information retrieval.

Of course, PIR schemes continue to improve (for instance, Henzinger, Hong, Corrigan-Gibbs, Meiklejohn, and Vaikuntanathan just published a new system called SimplePIR), so at some point it may just be possible to swap in a PIR system for all of this custom machinery. This has the potential to provide the best combination of security, timeliness, and privacy.

Final Thoughts #

Safe Browsing and similar services are a key part of protecting users on the Internet, but the current state of technology requires us to make some compromises between effectiveness, privacy, and timeliness. It's not clear to me that the current design has the optimal set of tradeoffs, but with better technology, it may also be possible to build a system which is superior on every dimension.

There are actually several databases for different categories of blockage. ↩︎
Effectively, this is a single hash Bloom Filter. ↩︎
I thought I remembered that Firefox sent some random hash prefixes to the server to create some additional deniability, but a quick skim of the code doesn't show anything. Will update if I learn more. Updated 2022-08-17: here. Thanks to Thorin for the link. ↩︎
Another potential defense would be to have the server generate the hash with a secret salt value, thus making collisions hard to find. However, this makes incremental updates hard because the attacker then learns the salt. The server could also make the problem somewhat harder by using a large number of public salts, but this just increases the work factor by the number of salts. ↩︎
We don't need private set intersection here because it's not a problem for the client to learn the server's data even if there is not a match. ↩︎

Discovery Mechanisms for Messaging and Calling Interoperability

2022-08-04T00:00:00Z

As I discussed in an earlier post, it looks like the EU [corrected an embarrassing typo that had this as UK -- EKR] Digital Markets Act (DMA) is going to require interoperability between messaging systems. That previous post focused on how to establish end-to-end encryption between messaging systems. In this post I want to talk about the problem of discovering which messaging system someone is on.

Identifier Portability #

Many messaging systems bootstrap off of existing identifiers in the form of phone numbers (jargon: "E.164 number"). Phone numbers are structured, which means that when you place a call over the Public Switched Telephone Network (PSTN) it incrementally routes the call via the country, area code, etc., but from the perspective of a messaging system, they are opaque and unstructured, which is to say that the identifier +1.415.555.0123 might be for a user who is on iMessage, WhatsApp, or even both. If all I have is someone's phone number, how do I know which service to reach them on?

Phone numbers as a shared namespace #

Phone numbers weren't originally designed to be a single namespace that was shared between carriers, but rather as a single namespace to be used by a single carrier, the Bell System (motto: "One Policy, One System, Universal Service"). Even then, numbers were structured, but the structure represented the topology of the system so that you could incrementally route calls. For instance, you could use the area code to direct traffic to the right region followed by the local office code to direct it to the right switch, and then down to the right subscriber line.

When the Bell System was broken up the breakup was done along geographic lines into what were called Regional Bell Operating Companies (RBOCs). Because the topology of the system was also roughly geographic—unlike, say, the Internet, where number prefixes do not really correspond to geographic regions—you could at least roughly align the RBOC boundaries with the number structure. However, subsequently jurisdictions started to require Local Number Portability, which allowed you to take your number from carrier to carrier. Thus, even if you were originally assigned a number out of Verizon's block, you could "port" it to T-Mobile, with the result that you have a shared namespace.

One possibility would be to simply sidestep this question by having identifiers be scoped, either by having people say "connect with me on WhatsApp at 1.415.555.0123" or by just adding an explicit scoping parameter, so your address is 1.415.555.0123@whatsapp.com (see here for more on this). This is how e-mail works and isn't the worst thing in the world, but does make it more complicated to contact someone else if all you have is their number, as well as making things confusing if they change their preferred app. By contrast, phone numbers are portable across carriers, which is to say that if you move from T-Mobile to Verizon you get to keep your phone number, and I don't need to know what carrier you have in order to call you: I just enter the phone number. This is implemented by having a giant—well, not really that giant, as the entire US number space is less than 10 billion numbers and so basically fits on a USB stick^[1]—database that knows which carrier is responsible for each number. When you want to call someone, your carrier checks this database (technical term: "dip") to see where to route the call.

So, what if you want to have this same property for instant messaging or video calling systems? This actually turns out to be surprisingly complicated.

Phone Number-Based Addressing for Single Applications #

Before trying to solve the problem of routing between applications that use phone number-based addresses, it's useful to look at the simpler problem of a single application that uses phone numbers as addresses (e.g., WhatsApp). Instead of using the number portability database, which doesn't really have the information you need here, these applications bootstrap authentication off of SMS.

How does the PSTN authenticate you? #

You might be wondering how the PSTN knows which number is associated with a given device. Back in the days of landline phones, the answer was simple: each subscriber had their own literal line. I.e., there was a separate pair of copper wires that went from the central office to the subscriber's house and the switch knew which pair of wires went with each number.

Obviously this doesn't work with mobile phones. Instead, each phone has its own cryptographic key which it uses to authenticate to the network. When your number is assigned to you, that key is then associated with the number in the carrier's database. In modern phones, that key is generally stored in a Subscriber Identity Module (SIM), which is a small chip embedded in a plastic card:

[From Wikipedia]

The SIM card is actually what gives your phone its identity, and if you swap SIM cards between devices, you will also swap their numbers.

Returning to the messaging app, SMS-based sign-up typically works like this:

The app prompts you for a password and your phone number.
The service then sends you an SMS message with a random code.
You enter that code into the app's user interface.

This demonstrates that you can receive messages at the indicated phone number.^[2]
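
The three steps above can be sketched from the service's side roughly as follows. This is a minimal illustration under stated assumptions, not how any particular app does it: `send_sms` is a hypothetical callback standing in for whatever SMS gateway the service uses, and the six-digit code, the single-attempt policy, and the in-memory `pending` table are all choices made up for the example.

```python
import hmac
import secrets


class SmsVerifier:
    """Toy server-side SMS verification (hypothetical, not a real service's API)."""

    def __init__(self, send_sms):
        # send_sms(number, text) is an assumed hook wrapping an SMS gateway.
        self.send_sms = send_sms
        self.pending = {}      # phone number -> outstanding code
        self.verified = set()  # numbers that completed the challenge

    def start(self, number: str) -> None:
        # Pick an unpredictable code and send it over the phone network,
        # out of band from the app's own connection to the server.
        code = f"{secrets.randbelow(10**6):06d}"
        self.pending[number] = code
        self.send_sms(number, f"Your verification code is {code}")

    def finish(self, number: str, submitted: str) -> bool:
        # The user typed the code into the app, which relays it here.
        # Each code is single-use: it is discarded after one attempt.
        expected = self.pending.pop(number, None)
        if expected is None:
            return False
        ok = hmac.compare_digest(expected.encode(), submitted.encode())
        if ok:
            self.verified.add(number)
        return ok
```

Note that nothing here trusts the app's claim about its own number: the only thing that counts is whether the code sent via SMS comes back.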

This authentication mechanism relies on the assumption that the PSTN correctly routes messages to the right location and that nobody else can read them. When you think about it, this is actually a bit of an odd assumption to make at the time you are installing a messaging application that offers stronger security than SMS, but that's actually a surprisingly common scenario: certificate issuance on the Web relies on the weak security properties provided by unencrypted DNS to bootstrap up to TLS, after which the DNS no longer needs to be trusted.

The general concept here is that you only trust the weaker system once to form the initial association and from then on you have strong continuity of authentication (in some systems, this is known as Trust On First Use (TOFU)). In both cases, you can build supplementary mechanisms like Certificate Transparency or Key Transparency to detect misissuance.

One natural question to ask is why the app can't just ask the device, which, after all, knows its own phone number. The problem is that the device can't be trusted. Remember that what we are trying to do is to convince the service that a given device is associated with this number, and even though the service wrote the app in question, it's very difficult for them to determine that an attacker hasn't modified the app to lie about its number. The SMS verification mechanism doesn't have this problem; because it actually checks that you can receive messages, it works even if the device and the code running on it are totally untrusted.

It's easier to see the trust relationships if we look at what's really happening, as shown in the diagram below:

In the first phase, the user is interacting with the application, which is what collects the password and the phone number and sends them to the server. The server then sends the code through the phone network to the device. The device shows it to the user, who then gives it to the app. The app then sends it back to the server, which is then able to confirm the code and verify the account. Importantly, even though the server is sending the code to the app (via the user) the SMS channel to the phone is out of band from the app's connection to the server. In fact, they may even be using different technology; for instance, if you are on WiFi, then the connection to the server will use that radio even though the SMS comes in over the mobile telephony network. Even if all the data is going over the mobile channel, the IP communications from the app aren't strongly bound to your phone number.

Note that even if you don't trust the answer, if you could ask the device for its number, you could still skip prompting the user. However, the number may not be available. Apple's security and privacy policies forbid this (presumably for privacy reasons) though it appears to be possible on Android. For similar security reasons, the app can't just reach into your SMSes—which are received by the operating system—and grab the confirmation code, as this would allow it to read any SMS.^[3] The exception here is iMessage, which uses similar techniques to verify the phone number, but because it ships as part of the operating system is able to do so silently, even though Apple doesn't permit other apps to do so.

Once the service has associated the user's account with their phone number, the rest of the system is fairly straightforward: the app connects and authenticates as the user and the service just routes messages/calls to the user; no further interaction with the PSTN is required. It is worth noting, however, that this has some funny results if the phone number is ever reassigned because the service won't be notified. The result can be that Alice has an account on some service for a number that has been reassigned to Bob. It's hard to avoid this situation with this kind of loose service coupling, but of course it's not unique to the Internet: I still get paper mail addressed to the people who lived in my house over 20 years ago.

Phone Number-Based Addressing for Multiple Applications #

The basic situation isn't that different when different users use different apps, except that you not only need to determine which device is associated with a given user but also which app they are using. As a simplification, let's assume that everyone just uses a single app (analogous to the situation with mobile phones where each subscriber just has a single carrier); we'll look at the multi-app situation below.

Consider the following three users:

User	App	Number
Alice	A	1.650.555.0011^[4]
Bob	B	1.415.555.0022
Charlie	A	1.510.555.0033

What happens if Alice gets Bob's number and wants to contact him in App A? The obvious thing would be for Alice to just SMS Bob and ask "which app are you using?" She could then tell A to contact "1.415.555.0022 via app B" (assuming that A and B can already talk to each other as discussed in my earlier post). This will work but it's clumsy and inconvenient; what you want is for Alice to put Bob's number into app A and for A to figure things out. Unfortunately, this doesn't appear to be something that A can do on its own; rather, we need some additional infrastructure.

I'm aware of two major designs here. In the first design, you have a directory service which knows which number is associated with which app. In the second design, each user (or rather their app) has to figure it out for itself.

Directory Services #

The obvious way to approach this is just to use the same approach as for number portability, i.e., to have some sort of global directory service that tells you which app to use for each number.

It's possible you could directly integrate it with the existing PSTN databases, but that's probably going to be a lot of work and it's probably easier to just use the same kind of SMS verification we discussed in the previous section. For instance, suppose you had a single global directory service. When you installed the app you would prove possession of your number to the directory service which would then create a record mapping your number to the app you were using. This directory can then be queried by other people, as shown in the diagram below.

[Update: fixed diagram -- 2022-08-04]

In this example, Alice installs app A, which automatically contacts the directory and proves possession of her number. The directory then creates a record mapping her number to app A.^[5] When Bob wants to contact Alice, he puts her number into app B, which contacts the directory and finds out that Alice uses A. B then uses whatever interoperability mechanism it has with A to establish communication.
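
As a rough sketch of the naive design just described, the directory is little more than a verified key-value store. Everything named here is hypothetical: `verify_possession` stands in for whatever proof-of-possession check the directory runs (for instance, SMS verification), and the class and method names are made up for illustration.

```python
from typing import Callable, Dict, Optional


class Directory:
    """Toy global directory mapping phone numbers to apps (illustrative only)."""

    def __init__(self, verify_possession: Callable[[str, str], bool]):
        # verify_possession(number, proof) -> bool is an assumed hook,
        # e.g. backed by an SMS challenge run by the directory itself.
        self.verify_possession = verify_possession
        self.records: Dict[str, str] = {}

    def register(self, number: str, app: str, proof: str) -> bool:
        # Only create the record if the caller proved it holds the number.
        if not self.verify_possession(number, proof):
            return False
        self.records[number] = app
        return True

    def lookup(self, number: str) -> Optional[str]:
        # Anyone may ask which app a number is on; None means "unknown".
        return self.records.get(number)


# Alice registers from app A; Bob's app B later looks her up.
directory = Directory(lambda number, proof: proof == "valid-proof")
directory.register("1.650.555.0011", "A", "valid-proof")
alice_app = directory.lookup("1.650.555.0011")  # "A"
```

The open `lookup` method is exactly where the privacy problems discussed later in this post come from.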

This system is obviously massively oversimplified. If we wanted to build something real, we'd need to address some important design questions and fix some—as-yet-unsolved—privacy issues.

Authentication #

The first question we'd need to address is the authentication structure. In the design I sketched above, the directory service is solely responsible for knowing which app a given number is associated with, but not for authenticating the user. For instance, if Alice and Charlie both use app A then when Bob tries to call Alice, A can redirect the call to Charlie. Of course, A might run some kind of certificate/key transparency type of system to prevent this kind of attack, but that requires every app to engage with that.

Note that the reverse is also true: when Bob calls Alice, Alice is relying on B's representation that it's really Bob, and B can lie. Moreover, it's important for Alice to check the directory to make sure that Bob's number is actually associated with B. Otherwise, service C could just claim to be speaking for Bob even if he's not a user of app C at all.

An alternate approach would be to have a global authentication system in which the directory issues a credential to each user binding their number to whatever cryptographic credentials their app uses (effectively, this is a certificate authority for phone numbers). In this case, it wouldn't be possible for an app to lie about a user, though of course we now have to trust the directory. The advantage of this design would be that you only have to trust one thing and maybe you could have better auditing and transparency for a global service.

It's also possible to run both kinds of systems simultaneously, where each app uses its own authentication system internally but also is able to make use of a global credential system. This allows for innovation inside an app but also provides interoperability.

Centralization #

Another problem with this design is that it seems to require a centralized directory service, or at best a small number of such services. The basic invariant here is that you need a procedure that takes in a number and outputs the app it's associated with. The easiest way to do that is to have a single service. Perhaps if there were only a small number of apps you could check them individually but if there are tens or hundreds it's a real scalability problem (and may also be a privacy problem, as discussed below).

ENUM #

For the real nerds here, there is actually an RFC documenting a less centralized design rooted in the DNS called ENUM. The idea was that you would store records in the DNS under your phone number (hilariously, reversed, because phone numbers read left to right and DNS addresses read right to left), so you might have 8.4.1.0.6.4.9.7.0.2.4.4.e164.arpa.. This never took off for a host of reasons, and I don't think it's really a viable option here because it requires DNS delegations to match the phone number structure, which seems like a lot of work for everyone involved.
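
For concreteness, the number-to-name mapping can be sketched in a few lines. This only shows the name construction described above (strip everything but the digits, reverse them, separate with dots, and append `e164.arpa.`); the function name is made up, and the sketch does not perform any DNS lookup.

```python
def enum_domain(e164_number: str) -> str:
    """Build the ENUM-style DNS name for a phone number (name only, no lookup)."""
    digits = [ch for ch in e164_number if ch.isdigit()]
    if not digits:
        raise ValueError("no digits in phone number")
    # DNS names are most-specific-first, so the digits are reversed.
    return ".".join(reversed(digits)) + ".e164.arpa."


name = enum_domain("+44 207 946 0148")
# name == "8.4.1.0.6.4.9.7.0.2.4.4.e164.arpa."
```

The example name in the text above is what you get for the number +44 207 946 0148.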

There are really two objections here: one about deployability and one about network architecture. The deployability objection is that someone has to run the service and that has to be paid for, so who is going to do that. I tend to think that this isn't that big an issue: this really isn't that big a service by modern standards, and we have a reference point for what it costs to run something similar in the form of Let's Encrypt, which has a budget of around 6 million dollars, with the costs scaling sublinearly. The whole premise of the situation is that companies like Apple and Facebook will be required to interoperate, and against that background, this isn't really that much money.

I take the network architecture objection more seriously: yet another centralized service isn't great for the Internet. I think there are some ways to make it somewhat less centralized, for instance by having each app maintain its own mirror of the database, but at the end of the day there's a tradeoff here between the good of interoperability—assuming you think it is good—and the bad of centralization. I tend to think that the balance is in favor of interoperability but it's not a slam dunk, especially if you think that there are other architectures that would do a better job (see below).

Privacy #

Probably the biggest issue with this design is that it has some fairly unfortunate privacy properties. Specifically, in the naive version of this design:

The directory service gets to see which app(s) a given phone number is associated with.
It's possible for ordinary users to scrape the directory service and learn which app(s) a given user is associated with.
The directory server gets to see every lookup and so can learn who is trying to connect with whom. (This is even worse if the user has to try every possible app.)

It's probably possible to address some of these issues, though it's not immediately obvious that they can be completely fixed. The rest of this section contains some handwaving in the direction of potential solutions. I just came up with these recently, so don't blame me if they are horrifically broken.

The last one is probably the easiest, as there are a number of reasonably efficient private information retrieval (PIR) schemes for allowing a client to retrieve a single value from a server without disclosing to the server which value was retrieved. So, if we just require those values to be retrieved over PIR (or even over a proxy!), we can probably provide some kind of privacy for who is connecting to whom.
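
To give a feel for how PIR can hide a lookup, here is a minimal sketch of the textbook two-server XOR construction. It is a toy, not any of the production-oriented schemes: it assumes two non-colluding servers holding identical copies of the directory, fixed-length records, and a query as long as the database. All function names are made up for the example.

```python
import secrets
from typing import List, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(n: int, index: int) -> Tuple[List[int], List[int]]:
    # A random bit vector goes to server 1; server 2 gets the same vector
    # with the wanted position flipped. Each vector alone looks random.
    q1 = [secrets.randbelow(2) for _ in range(n)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2


def answer(db: List[bytes], query: List[int]) -> bytes:
    # Each server XORs together the records selected by its query.
    acc = bytes(len(db[0]))
    for record, bit in zip(db, query):
        if bit:
            acc = xor_bytes(acc, record)
    return acc


def reconstruct(answer1: bytes, answer2: bytes) -> bytes:
    # Every record selected by both queries cancels out, leaving only
    # the record at the flipped position.
    return xor_bytes(answer1, answer2)


db = [b"appA", b"appB", b"appC", b"appD"]  # same copy on both servers
q1, q2 = make_queries(len(db), 2)
record = reconstruct(answer(db, q1), answer(db, q2))  # b"appC"
```

Neither server learns the index on its own, but the privacy guarantee collapses if the two servers compare notes.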

Similarly, I think it's probably possible to prevent large-scale scraping of user data by clients. This is a pretty typical rate limiting problem and it's already a problem existing apps have to face, so we could probably apply similar techniques here. This doesn't do much to prevent learning about a single individual, though: for instance, suppose I want to know if someone is on WhatsApp. There seems to be an inherent tension here between allowing seamless discovery and connection and providing privacy in this case, so I'm not sure if it's really soluble at the end of the day.

The best idea I have for the directory service getting to see which apps a given number is associated with is to split up the data between two servers. The idea would be that you would have two directory servers operated by unaffiliated entities. The client would then prove its identity to both servers (as above) and this would give it a credential that it could use to authenticate to that server. It would then encrypt its app identity and send the key to one server and the encrypted value to the other. Then when someone wanted to contact you, they would contact both servers and reconstruct the original value, as shown below.^[6]

[Update: fixed diagram --2022-08-04]

This stops the servers from being able to access the entire database, though you still need to worry about scraping attacks, either against both servers or by one against the other, so it's not perfect.
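
The split itself is easy to sketch. The code below is illustrative only: it uses a toy keystream derived with SHAKE-256 as a stand-in for a real, vetted cipher (Python's standard library has none), provides no integrity protection, and uses function names invented for this example.

```python
import hashlib
import secrets
from typing import Tuple


def _keystream(key: bytes, length: int) -> bytes:
    # Toy keystream: SHAKE-256 output keyed by `key`. A real deployment
    # would use a standard authenticated cipher instead.
    return hashlib.shake_256(key).digest(length)


def split_record(app_identity: bytes) -> Tuple[bytes, bytes]:
    """Return (key for server 1, ciphertext for server 2)."""
    key = secrets.token_bytes(32)
    stream = _keystream(key, len(app_identity))
    ciphertext = bytes(p ^ s for p, s in zip(app_identity, stream))
    return key, ciphertext


def combine_record(key: bytes, ciphertext: bytes) -> bytes:
    """What a caller does after fetching one half from each server."""
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))


key_half, data_half = split_record(b"app-A")
recovered = combine_record(key_half, data_half)  # b"app-A"
```

The key stays a fixed 32 bytes however large the record is, which is the size advantage of encryption over plain secret sharing.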

SPIN #

Recently, Jonathan Rosenberg, Cullen Jennings, Alissa Cooper, and Jon Peterson (a group of heavy hitters in real time communications if there ever was one) published an alternative design called SPIN for this problem. The idea is to replace the centralized server by having each client do its own phone number mapping via SMS. I.e., when Alice wants to contact Bob, her device sends an SMS to Bob's device (again, with some unpredictable random value). Bob's device responds with the app(s) that Bob supports and perhaps with his identities on those apps. The reasoning here is the same as with the directory service: only someone who could receive SMS at Bob's number could complete the challenge, so you must be talking to Bob.

Of course, this leaves us with the problem of Bob knowing who is calling, because Alice just asserts her number. One way to address this would be for Bob to issue a challenge in the opposite direction, but this isn't actually what SPIN does. Instead it assumes that Alice has obtained a credential—presumably using a similar issuance process to the one I indicated above—that she uses to sign her message to Bob, but that's a design choice. If you wanted to entirely eliminate centralized infrastructure you could certainly do that, and that's an obvious selling point of SPIN. Even with this kind of hybrid design, the directory service doesn't need to be available for query and so you don't have the privacy problems I discussed above (it also isn't in the critical path for calls, but availability of this kind of server system seems like a mostly solved problem at this point).

Of course, the SPIN design has a number of drawbacks (in fact, I originally started thinking about this problem because I read the draft and I wanted to try to fix them).

Offline Access #

With SPIN, you can't really do discovery of anyone who isn't online at the same time as you (more precisely, it just stalls until they are online and you can get the return message). This isn't necessarily that big an issue for real-time calls because if someone isn't online then you're not going to be able to call them anyway (though there's voicemail) but it's a big issue for instant messaging, which is inherently asynchronous. Jonathan Rosenberg argues that mobile devices are basically always connected. I'm not sure that this is really true, but if you want to extend to systems which have e-mail style identifiers, then those may be on desktop not mobile devices, so this is a drawback. This isn't an issue for the directory service design: once a user has registered with the directory service then anyone can do a lookup whether you are offline or not.

One partial mitigation for this might be for the operator of each app to record (cache) phone number validations as they happen, so that they gradually learn some of the mappings and can resolve them immediately. For instance, once Alice (on service A) has discovered that Bob is on service B, Charlie (also on service A) can learn this information from A without a new verification stage. This has the advantage that it's "soft state" in that things work without it, but the disadvantage that some things work and some don't.

It (mostly) requires changing the operating system #

Because the SPIN design involves every client doing its own phone number verification, people are going to get a lot of SMS messages requiring them to verify, which is annoying. SPIN expects to address this by having the device operating system absorb the messages and respond for you so the user doesn't see them. This isn't necessarily a bad idea, but it's kind of ugly and means that people with older operating systems will have a bad experience.

Again, this isn't an issue with the directory service version because apps can just register themselves. That version does work better if the operating system helps out with SMS verification, but even in the worst case the user is just bothered once for each app they use, not for each person who wants to call them.

Attack Resistance #

As noted above, SMS routing in the PSTN isn't really that secure, and so you have to worry about misissuance. One way to mitigate this is to have the results of verification published in a transparency log. This allows everyone to see which credentials have been assigned to each number and potentially detect misissuance. This works fine in a directory service type system but in a system where each user does their own verification, you might run into a scenario where an attacker hijacked just the connection between Alice and Bob but not between Charlie and Bob. This would need some fancier mechanisms to detect, though we could probably design something.

Privacy #

As noted above, the privacy situation is largely better without a centralized server, but there's still an issue around probing for individual user information. I.e., Alice wants to know which app(s) Bob has and so sends an SMS and looks at the results. One way to address this is for Bob to have some logic that runs on the device that determines whether to answer the query—perhaps depending on whether Alice's number is in the contact list—though it's not clear how easy that is to configure.^[7]

Multiple Apps Per User #

Multiple apps are a pretty straightforward extension to either of these systems. In both cases, you can basically think of the system as publishing a "record" attached to the phone number. I've implicitly assumed that the record would contain a single app, but there's no technical reason why they can't contain a list of apps (this is slightly more complicated in the directory service version for cryptographic reasons, but not really that hard).

The situation for the initiator is somewhat more complicated: I'm using app A and I want to call someone and learn that they have apps B and C. What now? Presumably each app is going to have a priority list of apps it would prefer to interoperate with (favoring itself!) and will just pick the top one. But this can lead to some obvious problems, such as: will you get the same app in each direction? What happens if someone installs a new app that is more preferred? These aren't strictly discovery problems but are definitely ergonomics issues that apps will need to work out somehow.
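
A minimal sketch of that selection logic, and of how it can produce a different answer in each direction, might look like the following. The priority lists are invented for the example and `pick_app` is a hypothetical helper, not part of any real app.

```python
from typing import List, Optional, Set


def pick_app(my_priority: List[str], their_apps: Set[str]) -> Optional[str]:
    """Return the first app in my priority order that the other side has."""
    for app in my_priority:
        if app in their_apps:
            return app
    return None  # no app in common that I am willing to use


# Alice and Bob both have apps B and C registered.
# Alice starts from B (which favors itself); Bob starts from C.
alice_to_bob = pick_app(["B", "C"], {"B", "C"})  # "B"
bob_to_alice = pick_app(["C", "B"], {"B", "C"})  # "C"
```

Here the conversation ends up on B when Alice initiates and on C when Bob does, which is the asymmetry described above.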

Final Thoughts #

Obviously this is a difficult problem without a single great solution. I do think it's possible to come up with something reasonably good here, especially if we're willing to make some technical compromises. That's a lot more likely if there really will be a requirement to interoperate; while there are real technical problems, many of the problems are around incentives (e.g., why should I run a server so some people can talk to my users?) and regulation provides those incentives.

This problem would be vastly easier if the addresses people were using had been structured from the very beginning: as an example, e-mail addresses already consist of a user portion and a domain portion, and so it's easy to know where to route any given message. But because instant messaging addresses are largely opaque, you're stuck with clumsier solutions. On the other hand, most e-mail addresses aren't portable—you can't take example@gmail.com over to Hotmail—so if you ever wanted that you'd be back in the soup. To the best of my knowledge there's no real way to have address portability without some kind of routing database, either an explicit one like the DNS or my directory service, or an implicit one like the PSTN fabric that powers SMS verification.

You can now buy 128 GB flash drives, so this gives us 12 bytes per record. ↩︎
Note that it does not demonstrate that this device is associated with that number. For instance, you could have two devices, one of which is associated with that number and one of which you are installing the app on. ↩︎
Yes, it's possible to design a system that doesn't require full SMS access, but that's not how these APIs work. ↩︎
See here for why I am using 555 numbers. ↩︎
In a real system, we'd probably want to prevent malicious apps on Alice's phone from registering for another app, in what's called an "identity misbinding" attack, but I'm ignoring that here. ↩︎
Update 2022-08-04: You could also use secret sharing, but encryption has the advantage that if the record you want to store is large then the total size is smaller. ↩︎
It might be possible to replicate this functionality in the directory service model. Naively, Bob could just upload the algorithm for which numbers to answer for, but this has its own privacy problems because it leaks Bob's contact list to the service. There may be some fancy cryptographic solution that addresses all these privacy problems at once, but I don't have it in my pocket. ↩︎

Pacifica Foothills Race Report

2022-07-25T00:00:00Z

On July 17th, I raced the Pacifica Foothills 30K. This wasn't really on my training calendar, but a colleague decided to run it and I offered to drive her, figuring I could fit in a catered 18 mile training run. And then at the last minute my friend Lisa decided to run the 21K, so it was a bit of a group thing.

[Photos from Runalyze]

Because this was just wedged in and not really part of my build for UTMB, my coach Emily Torrence and I decided to just train through it without really tapering at all, and just use it as a training event. The course is laid out as a (mostly) out-and-back followed by two loops, with a single aid station at the start/finish. The out-and-back is advertised as 7.5 and each loop as 5.7 for a total of 18.9, with a total of around 4000 ft of climbing; and the plan was to run the out and back and first loop at typical long run pace and then if I was feeling good I would run the last loop at marathon to 50K effort (what's called a "fast finish" run).

Usually a small local race like this would be pretty laid back but I had to fly to London right after and then from there to Philadelphia and eventually to Utah for Tushars 70K, so I had to pack all that stuff up beforehand, and then rush home afterwards to get to the airport, making the logistics a bit complicated.

The race was small and things were pretty chill at the race start and I managed to get into the bathroom right before the gun went off. It helped that they actually started a few minutes late, so even though I got out of the bathroom at about 8:29 I had a few minutes to get set. The race starts out climbing and I'm usually a pretty fast climber so I decided to start out pretty close to the front.

First Out and Back [7.26 mi, +1,709/-1,693 ft, 1:15:16] #

From about mile 1 it was surprisingly rocky and technical—not ridiculous but not your typical buttery California single-track—so I wasn't going that fast. Even so, I was quickly passing people. I wasn't quite sure where I was but figured I wasn't too far off the front.

Eventually I settled in behind a pair of women who (spoiler alert) turned out to be the first and second women. They were moving just a bit slower than me but I decided to just camp out for a little bit and keep things in the easy zone. Eventually I started to feel like it was too slow, though, so I passed the second woman and then the first, but she quickly re-passed me on a downhill. Around 2ish miles the trail opened up onto some fire road which went all the way to the top.

I hadn't read the elevation profile that carefully and was expecting the top of the hill to be halfway through the segment, but it came up quite quickly. There's just a set of flags and some rubber bands and the idea is you grab a rubber band that proves you went to the top (very secure!) and then head back down. One nice thing about this out-and-back structure is that it lets you see where you are and I counted two men and one woman in front of me, which put me in fourth overall and "on the podium" as they say (though there's no actual podium in these small races).

This was supposed to be a training run not a race so I was trying to be pretty careful on the way down, which also meant I wasn't going as fast as others. The second woman tore by me pretty quickly and about half-way down two other men passed me as well, putting me in the 5th male position. Even with being careful, the terrain was a bit tricky and I rolled my left ankle pretty far. Fortunately it was just short of real injury, so it hurt for a minute or two but I was able to shake it off. I didn't lose any more places on the way down.

The way back was quite a bit longer than the way up, due to a somewhat different route after the out and back segment. It also had a few small climbs, which wouldn't ordinarily be that big a deal but I was looking forward to the aid station. Anyway, I rolled in in good order, refilled my bottle with Tailwind and headed back out.

Loop 1 [5.39 mi, +1,168/-1,211 ft, 55:34] #

The loop portion of the course is arranged as a .8 mi/400 ft climb followed by a 1.7 mi/700 ft climb. Ordinarily these wouldn't be that hard but at this point it was starting to heat up and there wasn't much shade. Fortunately, this was nice smooth trail so it was just a matter of grinding it out. There are a lot of switchbacks and false summits on the second climb so it's a bit hard to know when it really ends.

I was starting to get a bit confused about my place because some of the 21K people had started to pass us (the start was 15 min later). At this point, though, there were three people nearby. I know this because they were moving slowly on the climb—including walking—where I was climbing pretty well, so I'd close in on them or even pass them on the way up and then they'd pass me on the way down. I wasn't really racing this but it was a little difficult not to feel competitive at this point, so I had to make some effort to hold back. I did get a chance to see what race people were running and the answer was "two 30K, one 21K".

At this point I was back at 5th overall/3rd male, and things were going well but I must have been starting to feel tired because right around the top of the loop I caught my toe and went flying. I sat on the ground for a few seconds to check myself out, concluding that I was uninjured and just bleeding a bit, so got back up and headed down the hill.

I would have descended reasonably cautiously anyway, but after this was even more cautious and so the same two guys caught me again on the downhill. I tried not to worry about it and just cruised into the aid station. I got some more Tailwind, grabbed a gel, told them I wasn't badly hurt, and headed back out. I looked over and saw that the clock was reading 2:12, which is ahead of where I expected to be (and actually turns out to be long because they started the clock at 8:30 and not at the actual start time).

Loop 2 [5.45 mi, +1,207/-1,207 ft, 54:48] #

I was feeling pretty OK at this point and I already knew what the course was like from here, so I decided it was time to pick up my pace for the fast finish section. Based on my lap times I wasn't actually going that much faster on the climbs, but everyone else was really slowing down, so the effect is still to have you going quite a bit faster than the people around you.

On the first climb I quickly passed the 3rd (who turned out to be Alastair from Trail and Kale) and 4th man and then the woman who had been first when I saw her but had slowed down and was now the second woman. In the past, both men had caught me on the first descent so I was a bit worried that might happen again, but I let myself open up some and so managed to stay ahead. I spent the next climb trying to put some more time on them—and wishing I'd paid more attention to the exact profile so I knew when it would be over—and then the descent simultaneously trying to keep my pace up and waiting for footsteps behind me, but never heard any.

I rolled into the finish with the clock reading about 3:07. My watch read 3:05:38, and the official finish reads 3:04:51. Lisa had already finished and she told me that I'd finished 3rd, and sure enough when I talked to the guy running the finish I was, so I picked up my 3rd place trophy and 2nd male coaster, took a selfie, and headed home.

[Lisa and me at the finish]

Retrospective #

I would call this race a success. I executed on the plan more or less exactly and came away feeling tired but not dead. It's just a local race but I think this is the highest place I've ever had. That's a pretty solid outcome with no taper and running most of it at my usual long distance run pace. Plus, I was even able to switch gears a bit at the end, especially in comparison to my peers: the next finisher came in at 3:09:01, so that's putting a 4 minute lead on them over a 55 minute loop, which is a pretty big gap. I think if I had tapered and tried to race the whole thing I would have got in under 3:00, though I would have needed to drop almost 8 minutes (a bit under 5%) in order to have made men's second, which might just barely be possible.

My nutrition went fine: I figure I took in about 700 calories (2.5 bottles of tailwind plus 2 gels), which is a bit light for my usual target of 300 cal/hr, but it's OK to go into the hole a bit on something this short.

As usual, footing remains a problem, especially on technical stuff. Neither my ankle nor the fall turned into that serious a problem, but either could have, and with UTMB coming up I really don't want to get injured; I've raced on an injured rib and it's no fun. It all worked out OK though.

Verifiably selecting taxpayers for random audit

2022-07-11T00:00:00Z

Note: this post contains a bunch of LaTeX math notation rendered in MathJax, but it doesn't show up right in the newsletter version. Check out the Web version where they render correctly.

The New York Times reports that both James Comey and Andrew McCabe were selected for a rare kind of IRS audit (odds of being selected about 1/20000-1/30000). These audits are supposed to be random and the Times article focuses on the suggestion that there was political influence on the selection:

Was it sheer coincidence that two close associates would randomly come under the scrutiny of the same audit program within two years of each other? Did something in their returns increase the chances of their being selected? Could the audits have been connected to criminal investigations pursued by the Trump Justice Department against both men, neither of whom was ever charged?

Or did someone in the federal government or at the I.R.S. — an agency that at times, like under the Nixon administration, was used for political purposes but says it has imposed a range of internal controls intended to thwart anyone from improperly using its powers — corrupt the process?

“Lightning strikes, and that’s unusual, and that’s what it’s like being picked for one of these audits,” said John A. Koskinen, the I.R.S. commissioner from 2013 to 2017. “The question is: Does lightning then strike again in the same area? Does it happen? Some people may see that in their lives, but most will not — so you don’t need to be an anti-Trumper to look at this and think it’s suspicious.”

How taxpayers get selected for the program of intensive audits — known as the National Research Program — is closely held. The I.R.S. is prohibited by law from discussing specific cases, further walling off from scrutiny the type of audit Mr. Comey and Mr. McCabe faced.

I don't have any special insight into this particular case. Obviously, the chance of these particular people both being selected is very small, but there are a lot of people that former President Trump didn't like and so the chances that some of them will be selected for audits are reasonably high, so it's a bit difficult to develop the right probabilistic intuition for this, though TheUpshot gives it a valiant try.

From my perspective, however, the underlying problem is that because the process is opaque, we don't have confidence that the selection is random. What we'd really like to have is a system that is provably random. Sounds like a job for cryptography! This post is an attempt to think through the problem, both as an interesting exercise in itself and as an example of how to think through the requirements for this kind of system and then build it up in pieces.

Important disclaimer: I just wrote this up and it hasn't been analyzed by anyone else—or really by me—so it quite possibly has grievous flaws that I have not identified.

Verifiable Random Selection #

Let's start with a simpler version of this problem: we have a public list of names $\mathbb{N}$ of size $n$ consisting of names $N_1, N_2, N_3... N_n$. We want to select a random subset $\mathbb{A}$ (i.e., $\mathbb{A} \subset \mathbb{N}$) for auditing. What we need to be able to do is prove that that subset was randomly selected.

Here's the basic approach. First, you publish the following information:

The list in a specified order so that each name is associated with an index from $0$ to $n-1$.
A random number generation algorithm $R(s)$ that takes seed $s$ and generates values in the range $[0..n)$ with equal probability.
A method for computing the seed that is (a) verifiable (b) unpredictable at the present time and (c) not under the control of any plausible set of people who might cheat.

The first two of these are straightforward, but the last is more complicated. We need some mechanism that's easy to explain but also verifiably fair. One simple solution, used by the IETF for selecting volunteers for its nominating committee, is to use preexisting random numbers like lottery results or the low order digits of stock prices. I cover some other approaches later.

Once you have the random seed $s$, things are pretty straightforward: you run $R$ iteratively to generate numbers in the appropriate range. Each number corresponds to a selected list entry. Typically, you sample without replacement, so if you select an entry that's already been selected, you just generate a new number and try again.^[1] The code looks something like:

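R.seed(s);               // s is the public random seed
selected = [];
remaining = k;           // k is the number of entries to select

while (remaining > 0) {
  do {
      candidate = R.next();          // uniform in [0..n)
  } while (candidate in selected);   // without replacement: redraw on a repeat

  selected.append(candidate);
  remaining -= 1;
}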
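A minimal runnable sketch of the same loop in Python, for readers who want to experiment. The generator here is an assumption, not something the scheme specifies: it hashes the seed together with a counter using SHA-256 and uses rejection sampling so that every index in $[0..n)$ is equally likely. The names `HashRNG` and `select` are hypothetical.

```python
import hashlib


class HashRNG:
    """Illustrative stand-in for R(s): SHA-256 over (seed, counter)."""

    def __init__(self, seed, n):
        self.seed = seed
        self.n = n
        self.counter = 0

    def next(self):
        # Rejection sampling: discard draws at or above the largest multiple
        # of n, so that reducing mod n does not favor small indices.
        limit = (2**256 // self.n) * self.n
        while True:
            block = hashlib.sha256(
                self.seed + self.counter.to_bytes(8, "big")
            ).digest()
            self.counter += 1
            value = int.from_bytes(block, "big")
            if value < limit:
                return value % self.n


def select(seed, n, k):
    """Pick k distinct indices from a list of n entries."""
    if k > n:
        raise ValueError("cannot select more entries than the list holds")
    rng = HashRNG(seed, n)
    selected = []
    while len(selected) < k:
        candidate = rng.next()
        if candidate not in selected:  # sample without replacement
            selected.append(candidate)
    return selected


picked = select(b"example seed", n=1000, k=5)
```

Because everything is derived from the seed, anyone who knows the seed, $n$, and the number of entries to select can rerun `select` and check that they get the same indices.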

It's worth taking a moment to see why this works. Steps 1-3 are all fair, which is to say that if you assume a random $s$, then any set of selected values are equally likely. This means that unless you know $s$ in advance, it's not possible to predict who will be selected. It's also not possible to modify the order of the list or the detailed structure of the random number generator^[2] in order to select one set of people over another. It's critically important that steps 1-3 be run before $s$ is known, otherwise you could tamper with the list order or $R$ in order to get the effect you want. The jargon here is that you commit to them in advance, and that they can't be changed afterwards. This also gives the public the opportunity to verify that the list of names is correct and that random number generator $R$ meets the correct requirements. For the same reason that they need to be committed to in advance, if you allow changes—even to correct errors—after $s$ is known it's too late because someone who detected an error might choose to strategically disclose it or not once they saw the outcome of the selection.

This simple design has several deficiencies which make it less than ideal for sampling taxpayers. First, it just bootstraps off an existing source of randomness, but why do you think you can trust that? That's relatively easy to repair, as discussed below. More importantly, it involves publishing the identities of every taxpayer who might potentially be audited. This is already not ideal, but gets even worse if you want to oversample some set of taxpayers (e.g., those who have higher net incomes), or exclude some people (e.g., those who didn't pay income tax). For obvious reasons, this shouldn't be public information. Moreover, because a lot of people have the same name, you need to identify them somehow, and that probably means their social security number (SSN). SSNs are a terrible identifier, but they're also very widely used as a form of authentication (how often have you been asked for the last 4 digits of your social as an authenticator?), so having them be public is bad news.

One possibility would be to have the list consist just of SSNs. This would ordinarily be a bad idea because, as noted above, SSNs are sensitive, but in practice a very large fraction of 9 digit numbers are actually valid^[3]: there are $10^9$ (1 billion) possible 9 digit numbers and about 330 million people in the US, so any given random 9 digit number has about a 1/3 chance of being a valid SSN for someone currently alive. Having a list of valid numbers therefore isn't that informative; it's the binding between SSNs and people's names that is sensitive. However, it's still a problem if you want to do weighting by income because you don't want someone who knows your SSN to be able to infer your income bracket.

Moreover, any cleartext list has the problem that the public can determine who was audited, which seems suboptimal. We'd like a solution that allowed you to verify that selection was fair but not who was audited. More precisely, anyone should be able to verify that the selection was fair and people who are selected should be able to verify that they—but nobody else—were selected.

Hashing Taxpayer Identities #

The obvious solution is just to hash the identities. So, we start with a list of taxpayer identities (e.g., SSNs or the pair of name and SSN) and hash each entry to make a new list, as shown below:

You then just select out of the hashed list using the method I described above. The IRS has the original list and can therefore easily determine who is to be audited. Anybody can verify that that computation was done correctly, and the people who are selected can verify their selection by hashing their identity and seeing that it matches one of the selected hashes.

Note that I've also reordered the hashes by sorting them in numeric order. This destroys any initial structure in the list. If we don't do this, then people could look at which hashed list entries had been selected and potentially learn information about who had been selected for the audit. For instance, if there are 150 million taxpayers and the first one audited has index 500,000, it's unlikely it's Aaron A. Aaronson. Because the hashes are effectively random with respect to their inputs, just sorting the hashes numerically produces a list whose order is unrelated to the original order.
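The hash-and-sort step can be sketched in a few lines of Python. SHA-256, the `name|SSN` encoding, and the made-up identities are all assumptions for illustration; `hashed_list` and `is_selected` are hypothetical names.

```python
import hashlib


def hashed_list(identities):
    """Hash each identity and sort the digests numerically."""
    digests = [hashlib.sha256(i.encode("utf-8")).hexdigest() for i in identities]
    # Fixed-length lowercase hex sorts in the same order as the numbers it encodes.
    return sorted(digests)


def is_selected(identity, selected_hashes):
    """A taxpayer checks whether their own hash is among the selected ones."""
    return hashlib.sha256(identity.encode("utf-8")).hexdigest() in selected_hashes


# Made-up identities, deliberately in alphabetical order.
taxpayers = [
    "Aaron A. Aaronson|000-00-0001",
    "Bob Boston|000-00-0002",
    "Carol Chicago|000-00-0003",
]
public_list = hashed_list(taxpayers)
```

After sorting, the position of an entry in `public_list` says nothing about where the taxpayer sat in the alphabetical input.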

This is a simple and obvious solution, but unfortunately it's also wrong. The problem here is that the input identity values are low entropy and the hash is public. Because there are only $10^9$ SSNs, it's easy to compute the hashes for any name and all possible SSNs and just compare them against a given hash. This costs roughly $2^{30}$ computations per name, which is quite cheap. There are at most $2^{29}$ distinct names in the US (there are fewer than that many people and of course many people have duplicate names), so computing every possible name/SSN pair costs less than $2^{60}$ computations, which is a lot but not at all out of the realm of a dedicated attacker.^[4] Moreover this computation just needs to be done once and then you have the whole table.^[5]
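To make the attack concrete, here is a toy Python sketch. It assumes SHA-256 and a `name|identifier` encoding, and it searches a 4-digit identifier space rather than the real 9-digit one so that it finishes instantly; `recover` is a hypothetical helper.

```python
import hashlib
import math


def h(identity):
    return hashlib.sha256(identity.encode("utf-8")).hexdigest()


def recover(target_hash, name, digits):
    """Try every identifier of the given length for a known name."""
    for i in range(10**digits):
        candidate = f"{name}|{i:0{digits}d}"
        if h(candidate) == target_hash:
            return candidate
    return None


# Toy space of 4-digit identifiers; real SSNs have 9 digits.
published = h("Alice Atlanta|0421")
found = recover(published, "Alice Atlanta", digits=4)

# Work per name for real 9-digit SSNs, in bits: log2(10^9) is just under 30.
bits_per_name = math.log2(10**9)
```

The same loop with `digits=9` is the roughly $2^{30}$ computations per name mentioned above.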

Commitments #

I said above that part of the problem here was that the hash was public, so what if we make it private instead? One way to do this is with what's called a commitment. A commitment is like a hash, except that it depends on an unknown secret value, so that it's not possible to compute the commitment without knowing it. I.e.,

$$ Commitment = C(secret, Message) $$

The way you use a commitment is that you publish the output of the commitment but not the secret value. Then you can prove that the commitment matches a given message by revealing the secret value, at which point anyone can compute the commitment for themselves. Constructing a secure commitment scheme is somewhat complicated, but you can think of it as hashing the concatenation of the secret and the message, e.g.,

$$ C(secret, Message) = H(secret + Message) $$

A commitment-based scheme works more or less the same as a hash-based scheme, except that the IRS generates a new secret for each user and stores it with the input table. It then can generate the table of commitments, as shown below:

Selection of the taxpayers to be audited proceeds exactly as with hashes. The result is that anyone can verify that the list of hashed (committed) identifiers to be audited was generated correctly. In order to convince a given taxpayer that they were selected you show them their associated secret. They can then compute the commitment themselves and verify that it's on the selected list.
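A minimal Python sketch of the simplified $H(secret + Message)$ picture, assuming SHA-256 and a fixed-length 32-byte secret so the concatenation is unambiguous. This illustrates the idea rather than a vetted commitment scheme, and `commit` and `verify` are hypothetical names.

```python
import hashlib
import hmac
import secrets


def commit(message):
    """Return (commitment, secret) for a message."""
    secret = secrets.token_bytes(32)  # fresh per-taxpayer secret
    commitment = hashlib.sha256(secret + message).hexdigest()
    return commitment, secret


def verify(commitment, secret, message):
    """Recompute the commitment from the revealed secret and compare."""
    expected = hashlib.sha256(secret + message).hexdigest()
    return hmac.compare_digest(expected, commitment)


c, s = commit(b"Alice Atlanta|000-00-0001")
```

The commitment `c` goes on the public list; `s` stays with the IRS until it needs to show Alice that the entry is hers, at which point she runs `verify`.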

This solves the problem of keeping the selected list secret, but at the cost of verifiability. Yes, you can verify that the right commitments were selected and a given taxpayer can verify that they correspond to a specific commitment, but you can't verify that the original commitments match the right list of taxpayers. For instance, suppose that the IRS wants to make sure it always audits Alice Atlanta. All it has to do is make a list that mostly consists of commitments for Alice Atlanta, like so:

Obviously, this greatly increases the chance that Alice will be selected. Because all the commitments use different secrets, they are all distinct even though they are for the same identifier, and so it's not possible for anyone other than the IRS to see that there are duplicate inputs. When Alice is selected, the IRS can just reveal the relevant commitment and she doesn't know that there were other commitments for her.
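A small Python sketch of the stuffed list, under the same simplified assumption that a commitment is SHA-256 over a fresh 32-byte secret followed by the identity. The four-entry list is made up for illustration.

```python
import hashlib
import secrets


def commit(message):
    secret = secrets.token_bytes(32)
    return hashlib.sha256(secret + message).hexdigest(), secret


# A cheating list: three of the four slots are secretly Alice.
stuffed = [b"Alice Atlanta", b"Alice Atlanta", b"Alice Atlanta", b"Bob Boston"]
public_entries = sorted(commit(m)[0] for m in stuffed)

# The public sees four distinct values and cannot tell this from an honest list.
all_distinct = len(set(public_entries)) == len(stuffed)

# Chance that a single uniform draw lands on one of Alice's entries.
alice_odds = stuffed.count(b"Alice Atlanta") / len(stuffed)
```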

One interesting thing that can happen is that Alice might be selected twice (this can just happen randomly). This isn't something that ordinary people can detect: non-selected people just see the total number of selectees and the IRS can just pick one of the commitments it selected for Alice and show her and discard the other one. Obviously, you could have an internal check that verified that the right number of people were audited, but that's not publicly verifiable.

Verifiable Random Functions #

The source of the non-verifiability in the commitment approach is that because each commitment uses a fresh secret, there isn't a unique mapping from identities to commitments. Fortunately, there is a function which has the properties we need, namely:

There is a unique mapping from identities to commitments
The mapping can't be computed by third parties
The mapping can be verified by third parties

What we need is what's called a verifiable random function (VRF). A VRF works by having a secret key $K_s$, a public key $K_p$, and a pair of functions $VRF()$ and $Verify()$. The function $VRF$ outputs two values, the output value and a proof of correctness of the output value, like so^[6]

$$ (Output, Proof) = VRF(K_s, Message) $$

The $Proof$ can be used as an input to the function $Verify(K_p, Output, Proof, Message)$, which returns $True$ if and only if $Output$ and $Proof$ match the $Message$. The result is that you can only compute the VRF if you know $K_s$ but anyone who has $K_p$ can verify the VRF given the triplet $(Output, Proof, Message)$. The details of how to construct a VRF are out of scope for this post, but see Goldberg, Reyzin, Papadopoulos, and Vcelak for a specification describing several VRFs.^[7]

In this case, we would use the $Output$ as the value in the "hashed" list used for the selection and keep the $Proof$ secret. Because the VRF is deterministic, any input value can only correspond to one output, thus preventing the kind of duplication attack we saw with commitments. As before, anybody can verify that the selection algorithm was run correctly, and you prove to the selectee that they were on the list by giving them the corresponding proof, which they can verify for themselves.^[8]
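To give a feel for the interface, here is a toy Python sketch that follows the hash-of-a-deterministic-signature intuition: the proof is a textbook RSA signature over a hash of the message, and the output is a hash of that signature. The key is built from two small Mersenne primes and is nowhere near secure, and this is not one of the constructions from the specification; all names are hypothetical.

```python
import hashlib

# Toy RSA key from two Mersenne primes. Far too small and too structured
# to be secure; chosen only so the example is self-contained.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                           # public modulus (part of K_p)
E = 65537                           # public exponent (part of K_p)
D = pow(E, -1, (P - 1) * (Q - 1))   # secret exponent (K_s)


def _hash_to_int(message):
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def vrf(message):
    """Return (output, proof); the proof is a deterministic signature."""
    proof = pow(_hash_to_int(message), D, N)
    output = hashlib.sha256(proof.to_bytes(32, "big")).hexdigest()
    return output, proof


def verify(output, proof, message):
    """Check the pair using only the public values N and E."""
    if not 0 <= proof < N:
        return False
    if pow(proof, E, N) != _hash_to_int(message):
        return False
    return hashlib.sha256(proof.to_bytes(32, "big")).hexdigest() == output


out, prf = vrf(b"Alice Atlanta")
```

Calling `vrf` twice on the same identity gives the same `out`, which is the property that blocks the duplicate-entry trick, while `verify` needs only the public values.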

Oversampling #

Having multiple entries for a given taxpayer can be used as an attack but is also potentially useful, for instance if you want to have higher-income taxpayers be more likely to be audited. One possibility here is just to have multiple lists with different selection probabilities, but this gets clumsy if you have a lot of different selection levels and also reveals the distribution of the number of taxpayers in each cohort.

An alternative design is to have multiple entries. For instance, suppose that we have two groups, Rich and Poor, and we want Rich people to be selected twice as often as Poor people. This can easily be achieved by simply having two entries for each Rich person. We can't do this directly with a VRF, but we can just have the input to the VRF be the taxpayer's identity plus a counter. E.g., for Alice we could have Alice Atlanta: 1 and Alice Atlanta: 2. This doubles the chance of selection, and when either of these identities is selected you just have to prove to Alice that she was selected and that the counter in question is within the appropriate range (in this case, either 1 or 2). Nothing stops the IRS from creating an entry for Alice Atlanta: 3 but if they show it to Alice, she can contest it because her maximum index should be 2, so it's not really different from having Alice Schmatlanta; it's just an entry that doesn't correspond to anyone. The same strategy can be applied for any set of ratios, though things get a bit messy if you want to (say) have one set of taxpayers be audited 1% more than another set, because you need them to have 100 and 101 entries respectively.

One difficulty with this strategy is that it doesn't properly handle multiple selections. For instance, we might select both Alice Atlanta: 1 and Alice Atlanta: 2. In practice, these audits are very rare, so this is pretty unlikely and so it's probably easiest to just do one less audit, but I think you can solve this problem with another layer of hashing. Specifically, to compute the list entries you would compute

$$ H(VRF(K_s, Identity) + Counter) $$

If you get a duplicate during the sampling process, you reveal the inner VRF output and prove that the two selected entries correspond to two hashes with different counters. This doesn't reveal any information about the rest of the structure of the list. Note that the attacker can't just iterate through hash inputs because the VRF output is high entropy even if the identities are not.
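A Python sketch of how the counter entries and the duplicate check might fit together. HMAC-SHA256 stands in for the VRF output here purely for illustration: it is keyed and deterministic but has no publicly checkable proof. The 4-byte counter encoding and all names are assumptions.

```python
import hashlib
import hmac

KEY = b"illustrative secret key"  # plays the role of K_s


def vrf_output(identity):
    # Stand-in only: keyed and deterministic like a VRF output,
    # but without a proof that third parties could verify.
    return hmac.new(KEY, identity.encode("utf-8"), hashlib.sha256).digest()


def _entry(inner, counter):
    return hashlib.sha256(inner + counter.to_bytes(4, "big")).hexdigest()


def entries(identity, weight):
    """One list entry per counter value 1..weight."""
    inner = vrf_output(identity)
    return [_entry(inner, c) for c in range(1, weight + 1)]


def same_taxpayer(inner, selected, max_counter):
    """Given a revealed inner value, check that every selected entry
    is that value hashed with some counter in the allowed range."""
    valid = {_entry(inner, c) for c in range(1, max_counter + 1)}
    return all(e in valid for e in selected)


# Rich taxpayers get two entries, Poor taxpayers get one.
public_list = sorted(entries("Alice Atlanta", 2) + entries("Bob Boston", 1))
```

If both of Alice's entries come up in the draw, revealing her inner value lets anyone run `same_taxpayer` and confirm the two entries are duplicates without learning anything about the other entries.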

Generating Random Seeds #

Above I sort of handwaved the random seed generation problem. For a number of options it's fine to depend on some sort of untrusted source. However, you don't need an external randomness source. The basic idea is that you have a set of parties who get to contribute randomness to the seed. Each party $i$ generates a random share $R_i$ and you concatenate them in some pre-determined order and use that as the random seed.

If each party generates their value independently, then as long as at least one of the values is random, the whole output will be random. The problem here is the word independently. Suppose that there are $n$ parties and parties $1..n-1$ all publish their seeds. Party $n$ can then iterate through a bunch of seeds until it finds one that produces the set of random numbers it wants. Fortunately, we have a pre-existing tool for fixing this, the commitment. Effectively we have a two-round protocol:

Round 1: everyone publishes their commitments to their shares $R_i$
Round 2: everyone reveals $R_i$ and shows that it matches the commitment

This protocol will work as long as there is at least one party who (1) generates a random value and (2) doesn't collude with the others by revealing their value before the commitments are published.
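The two rounds can be sketched in Python as follows. The sketch assumes SHA-256 commitments with a separate 32-byte blinding value per party and 32-byte shares; `commit` and `combine` are hypothetical names.

```python
import hashlib
import secrets


def commit(share, blind):
    """Commitment to a share, hidden by a per-party blinding value."""
    return hashlib.sha256(blind + share).hexdigest()


def combine(commitments, reveals):
    """Check every reveal against its round-1 commitment, then build the seed."""
    if len(commitments) != len(reveals):
        raise ValueError("every party must reveal")
    for c, (share, blind) in zip(commitments, reveals):
        if commit(share, blind) != c:
            raise ValueError("reveal does not match commitment")
    # Concatenate the fixed-length shares in the agreed order.
    return b"".join(share for share, _ in reveals)


# Round 1: each party picks a share and a blinding value and
# publishes only the commitment.
parties = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(3)]
round1 = [commit(share, blind) for share, blind in parties]

# Round 2: everyone reveals; anyone can recompute the seed.
seed = combine(round1, parties)
```

A party that tries to swap in a different share after seeing the others' reveals fails the check in `combine`, because its round-1 commitment is already public.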

There are, of course, a few logistical problems here: who are the parties? What happens if they publish their commitments but then decide not to reveal $R_i$ (for instance because they don't like the resulting output)? These are real problems for some instantiations of this kind of scheme, but in practice it's probably fine to just have a small number of trusted parties (e.g., the US Government, some of the Big 5 Accounting Firms, etc.) who would suffer severe reputational damage if they were to cheat or refuse to reveal their share.

Another approach that people sometimes use for public verifiability is to have people roll dice. Cordero, Wagner, and Dill describe procedures for this in a classic paper called The Role of Dice in Election Audits.

Note that you can use all of these systems together: you just run them all, glue the data together (e.g., by concatenating it in a predetermined order), and feed it into the random number generator as the seed.

Drawbacks #

This is not a perfect system. First, like many cryptographic systems, it's fairly complicated and the math required to convince yourself that it behaves as advertised is way beyond most people. On the other hand, people regularly trust their credit cards, passwords, instant messages, and importantly tax returns to systems no more complicated than this and that are based on pretty similar cryptographic primitives. Moreover, the current system is totally unverifiable, so almost anything is an improvement.

From the technical side, I'm aware of at least one notable deficiency: while this system prevents the IRS from inappropriately auditing someone, it doesn't prevent them from making sure someone doesn't get audited; all they have to do is omit them from the list. In our original design, this was easily detectable, but once we mask taxpayer identities with the VRF, it's no longer possible. I'm not aware of any simple way to fix this, because you would need a list of the valid identities to compare to, which is something I'm trying to avoid. With that said, I'm not sure how serious this is: if the IRS wants to cheat it can just not audit someone that gets selected. You need some (non-transparent) internal procedures to detect this case, so maybe you can use them to ensure the list is complete as well.

Final Thoughts #

Taking a step back from this particular case, there are a lot of types of data processing that have real impact on our lives but where we just have to trust that the entities, whether governments or corporations, are handling it correctly. This includes a broad range of applications from voting to medical records to income taxes to your search history. In each of these cases, mishandling of the data could lead to real harm; even if you trust the current entity to behave correctly there is no guarantee that they will do so in the future or that their systems will not be compromised.

The good news is that we are starting to have the technologies to allow the public to verify that these processes are conducted correctly. A good example from another field is risk-limiting audits for election verification, as pioneered by Philip Stark, which are actually starting to be used in real elections (and which also require some method of verifiable sampling, albeit a simpler one). In general, this is a good development: it's important to have good policies and trustworthy institutions, but even better if we don't have to trust them, especially in cases like this where correct behavior is important for democratic governance.

Note that this algorithm isn't efficient if you want to select a subset that's close to the size of the original list. One alternative is to instead select the list of excluded entries. ↩︎
This is not intended as a formal statement of the requirements for the RNG, but roughly speaking you want every possible sequence to be equiprobable over the ensemble of $s$ values. ↩︎
Ironically, this makes use of the property that SSNs are such a terrible identifier. ↩︎
Note that if SSNs were just a lot longer, then this system would be mostly OK. ↩︎
Hashing is often used in an attempt to conceal e-mail addresses, and doesn't work any better there. ↩︎
This is not the conventional presentation of VRFs but I believe it's a little easier for non-cryptographers to follow than the presentation in, for instance the CFRG VRF specification. ↩︎
Intuitively, you can construct a VRF by applying a hash to a deterministic digital signature function. The hash becomes the output and the full signature is the proof. ↩︎
I borrowed this general technique from CONIKS, which describes a more complicated system for assuring unique bindings between identities and cryptographic keys. ↩︎

Tenaya Loop Adventure Run 2: Redemption

2022-07-08T00:00:00Z

[Map and profile via Runalyze]

Last year, my training partner Chris Wood and I ran the Tenaya Loop route around Yosemite. This route was pioneered by former ultrarunning and current FKT star Leor Pantilat. It turned out to be harder than we expected, and we ended up bailing out partway through.

This year I was scheduled to do Old Cascadia 50 on June 18 as a warmup for Ultra-Trail du Mont-Blanc (UTMB), but that got rescheduled to October because of too much snow and so I decided to take another crack at Tenaya. In the event, I had to make a last minute trip to Brussels on the 19th, so I had to reschedule Tenaya to Saturday June 25. Flying in from Europe on Wednesday evening and then driving to Yosemite on Friday doesn't give you ideal performance, but it's what we had, and I guess good prep for how tired I expect to feel the second half of UTMB.

Logistics #

Last year Yosemite reservations only let you in after 5. This year the rules are that you need a reservation if you come in between 6 AM and 4 PM but not if you enter earlier or later, which was convenient for me because I wanted to be on the trail before 6 to maximize light. I decided to stay at Yosemite Riverside Inn, which is on Highway 120 right en route to the Tenaya Lake trailhead. It's not luxury, but it's fine.

I went to bed around 8:30, got up at 3:00 and was at the trailhead by 5:00. Last year the whole parking lot was under construction so you had to park at the side of the road and there were no bear lockers or bathrooms, but now it's been totally renovated and there are some reasonably new/clean pit toilets and a whole rack of bear lockers. This is a much better experience as you get to use the bathroom before you start. And while you're technically only forbidden to leave food in your car overnight—and I'd brought a BearVault BV500—it's a lot more reassuring to have it in the lockers. The bear canister only stops the bears from taking your food, not breaking into your car to get it.

As I mentioned above, you don't need a reservation or a permit, but you're still supposed to pay the entry fee. However, there weren't any rangers around at 4ish when I got in or 10ish when I left, so I still owe the National Park Service money. Call me!

Start to Nevada Fall [12.8 mi, +2211/-4364, 3:32] #

The first stretch quickly climbs from the trailhead up to the top of the whole route at just under 10,000 ft. It's flat at the very beginning, but I only made it about a mile or two before it headed upward and I unpacked my poles, which I ended up using for almost the whole rest of the day.

In theory this route then takes you by Cloud's Rest, but for some reason I can't seem to read the map properly and so I missed Cloud's Rest for the second time in a row. I think the confusion here is that the top of the climb is right where you turn, so I just got focused on going straight down. I did stop to put my poles away, which was already kind of a mistake because that's when a bunch of mosquitos decided it was time to swarm me. This set the pattern for the rest of the day: most times when I stopped I would get a bunch of mosquitos on me. I had brought sunscreen but not insect repellent, and just kept hoping that it would go away, so I ended up alternately ignoring it and desperately trying to swipe them away as I did whatever I stopped to do.

The descent from here is pretty nice and reasonably smooth, eventually linking up to JMT. I didn't feel as fresh for this part as I was hoping to or as I did last year, but it didn't go that badly. Things start to get a lot more crowded after the JMT merge, I suspect because of people doing Half Dome, but people are typically good about getting out of the way when they see you running down. Pro Tip: there are some bathrooms at the Little Yosemite Valley campground.

Nevada Falls past Glacier Point and to the Valley [23.9 mi, +11.1 mi, +2041/-3993, 3:06] #

The Nevada Falls junction on this route is a bit confusing because there is a short trail down to a vista point that you don't take, but you do go partway down JMT to another vista point and then turn around and head up to Glacier Point. Last time we went down way too far, but this time I just went down to the vista point and turned around. This section is on hard rock with a cliff face on the uphill side and there was quite a bit of water run-off and general spray, so it was hard to stay dry. This actually would have been nice later in the day, but not so much at 9:30. On the other hand it was reassuring to know there was plenty of water.

From this vista point you just turn around and head back to the trail junction and then up the Glacier Point trail. This is a longish uphill grind, so I got the poles back out and headed up. As you start out on the trail, there are a bunch of signs warning about how there is no way to get up and back from Glacier Point except walking, there are no rangers, no water, etc. This was slightly worrisome: I had a filter so I didn't need water taps but the higher you get the less surface water there tends to be, and I had already drunk about a liter out of the two liters I started with.

When I got to Glacier Point there were still a fair number of people there, which isn't surprising, as it's really only about 4 miles (though about 3000 ft) from the Valley, and the trail is reasonably good. As advertised, there weren't any services, so a few photos and I headed down.

Going down is the easy part #

One thing I've noticed trail running at big tourist locations like Yosemite or the Canyon is that people are super impressed when you tear by them going downhill (I know because they say something). This has always felt a little odd to me because the hard part of these events is the climbing and I'm not going down that fast (this is what fast looks like). OTOH, you're mostly hiking up these grades—or at least I am—so I guess it doesn't look that impressive, even though it's more effort.

Because the trail drops ~3000 feet in 4 miles, you—or at least I—have to be pretty cautious, so I was going to run down, but not just bomb down it. I wanted to practice descending with poles so I kept them out. I think on balance this made things easier: every time there's something a little technical or sketchy you can plant your poles and use them to stabilize. They're also useful for helping get over any rocks or whatever you might need to jump over. I don't think I put them away at all for the whole rest of the run.

By this time, the trail was starting to get reasonably hot and I was starting to worry about fluid. Fortunately, about halfway down I found a little stream that let me fill my water bottle, drink a half liter or so, and then fill it up again. I don't remember this being there last year and it gave me a little boost as I cruised down into the Valley feeling pretty good.

Valley to Yosemite Point [29.45 mi, +5.55mi, +3461/-417, 2:53] #

Last year we didn't know any better and went over to Yosemite Lodge to get water, but the route goes right through Camp Four which has bathrooms and running water, so I headed there instead. Had a slightly bad moment when I stopped at the Information booth and asked if there was any water in Yosemite Falls and she said "no", but then said "there's water in the falls but no tap" which is the answer I actually cared about. Anyway, I filled up all four of my bottles with water and Tailwind (in the process discovering that I think I lost two of my Tailwind sleeves on the trail, sorry about that!).

Threw away my trash in the nearby garbage, threw on my headphones^[1] and headed up the climb to Yosemite Point. This is by far the hardest part of the route, gaining over 3000 feet in under 6 miles, with 2700 feet coming in the first 3 miles. The trail is a lot of stair steps and stair-step like stuff, so you're using your poles a lot. I find the trick here is just to try to maintain a constant pace and back off a little if you get tired, but not actually stop. I mostly managed this, except for 5 minutes or so when I stopped in the shade and did some pack management, swapping out my bottles, grabbing food, putting on sunscreen, etc. Other than that, it's just a matter of slogging your way to the top. Fortunately, it's not too exposed, so while it's hot, you're not just baking. On this day, it actually started to drizzle a bit and I started to wonder if I was going to need my rain gear, but it never really did much.

The trail gets a bit faint between Yosemite Falls and Yosemite Point and last year we got a bit lost here, which is part of what led to bailing out. This year it was a bit easier, partly because I had seen it before, partly because I had more time left, and partly because I was less tired. In any case, I made it to Yosemite Point just fine.

Yosemite Point to Finish [32.7 mi, +3.25 mi, +994/-469, 1:12:52] #

The next segment is flattish, taking you to the North Dome trail. It was a relief to be on something runnable after all that climbing. I'd seen the first 2ish miles before, up to the intersection to Porcupine Flat, where we bailed last year, and after that it was uncharted territory.

Eventually I got to the intersection with the trail to North Dome. This is another out and back, though it seems pretty flat. I must have been getting a little low on calories or something because I stared at the map for a while and then managed to head out in precisely the wrong direction, which is to say onward to the end, rather than on the out and back to North Dome. I only realized this after I'd climbed about a mile or so and was wondering where the heck the top was; at that point I wasn't heading back, so I just missed that view, I guess.

Somewhere on this leg, but not quite sure where, I saw a bear cub cross the trail, followed by a somewhat larger bear, potentially its mother. They sort of ran around for a while with one on each side of the trail, and for obvious reasons, I wasn't excited about getting in between them. I tried making a lot of noise, singing, etc. which I wouldn't say was super successful. Eventually I just kind of stood there until they wandered off (sorry, no pictures!). Once they were out of sight I headed on past trying to sing a bit to let them know I was there; this is surprisingly hard to do at 8000 ft, and I definitely felt the altitude.

From here it's about four miles of easy gradual downhill and then another long gradual climb of about 1400 ft towards a vista point at about 41 miles. This is the last big climb, so I relaxed a little bit and enjoyed the view.

Things are pretty straightforward from here. There's a pretty gentle climb up to near Tioga Road and a final vista point where I ran into a couple of guys all set up for stargazing with chairs, tripods, and a bottle of wine (you can see some of that in the foreground of the picture below). We talked for a few minutes, I got one final shot of the sunset and then headed back down.

This last part was actually the worst; the trail was a bit rocky and faint in places and then the last mile or so is in a sort of marshy area, which meant mosquitos. This became especially obvious when I stopped to fish through my bag for my headlamp and they were immediately all over me in the minute or two I spent just getting it on. From here on, though, it was flat and easy, so I just cruised it in.

Nutrition #

	Brought	Consumed	Calories
Tailwind	10 + 4 in bottles	12?	2400
Gels	6	6	600
Powerbars	6	3	600
M&Ms	Bag	0	0
Total	-	-	3200

This seems a little on the light side: just a bit over 200 calories an hour and I usually aim for 300. I didn't keep as careful track as I would like but my sense is that I was on track in the beginning but then started to fall behind once my initial Tailwind ran out and I was just drinking out of the filter. It's not that bad to filter into bottles, but then it's a bit of a pain to put the Tailwind in and so you end up mostly drinking straight water and falling behind on your calories. Refilling my bottles with Tailwind at Camp 4 seems to have helped here, and the tap makes that easy.

Around 8-10 salt tabs at 215 mg each plus two caffeine pills, one in mid-afternoon and then one in the early evening. The caffeine definitely helped as the day went on.

Retrospective #

This went well on balance. Although it's a little hard to compare because of more detours last year, I was as fast if not faster this time through the same parts (6:47 versus 7:16 to the Valley Floor and 9:31 versus 10:09 to Yosemite Point) and only slightly slower overall despite the divergent sections being much harder this time, and I felt a lot less dead when I got to Yosemite Point and at the end. This despite being alone, not having tapered, and in fact having flown in from Europe 3 days before.

I'm increasingly getting my equipment dialed in. I was already good with the poles on the uphill and I'm starting to get the hang of using them on the downhill, where I felt more stable than before. The trick seems to be to just run with them in your hands and then lightly plant them ahead of you most of the time, but then when something is tricky you're prepared to lean on them more. This helps stabilize you if you have to make an odd foot plant or if you slip a bit. I don't know if it was the poles or not, but I managed to do the entire route without falling.

I'm still sorting out the shoe situation. I've been doing my long runs in the Salomon S/LAB Ultra 3, which has pretty good support. I've mostly been racing in the slightly lighter Salomon Sense Pro 4, which is a bit more aggressive and unsupportive. I used the Sense Pro 4s last time and my ankles were pretty sore after, and I'd been hoping to convert to Salomon's new Pulsar Trail, which I like but which feels just a hair too wide so my feet slip around a bit—not good for technical terrain. I'm going to experiment with cinching them down more, but hopefully the new Pulsar Trail Pro will be out soon and I can try that. Otherwise, I think it's the Ultra 3s for UTMB.

This was my first long run with my new Salomon Sense Pro 10 pack. Generally, it's nice and roomy and fits well. I'm still experimenting with the pole placement: you can bungee them to the front, which works well, but there are two positions: interior of the bottles (on your chest) or exterior (by your arms). I used the exterior position this time but I'm thinking the interior might be better. This event was right at the limit of what I'd want to carry: after 45 miles my shoulders were kind of sore. My loadout here was probably slightly more than I need for UTMB: I was carrying more or less the required kit, but also all of my food, the filter, an emergency beacon, my heavy light (Lupine Piko) and a spare battery, so there's probably some room to save a couple of pounds, especially if I'm willing to have a slightly less bright light.

Timing was a lot better: starting at 6 rather than 7, with it getting dark at 8:30ish, meant I was able to do the whole thing in daylight—or at least twilight. I did pull out my headlamp towards the end but mostly just because the footing was a bit dodgy in the twilight, and I could have finished without it.

The mosquito thing was not good: every time I stopped at all I got swarmed and after I finished I had to really rush to get changed and on my way. I spent the next few days slathering myself with hydrocortisone and scratching. A lesson for next time.

Overall, though, this seems like it was well executed. I kept moving well and never really had any doubt I could finish. I dragged a bit towards the middle due to what I think is nutrition but was feeling good again at the end. I walked all the climbs but was able to run most of the flats and the downhills. There were quite a few downhill sections that were technical where a jog/walk was needed but I mostly felt like I was moving well within the limits of the terrain, which is what I was looking for.

Overall: 45.5 mi, 11237 ft, 15:04:57

I don't usually use music for this kind of thing, in part because it compromises your awareness, but they're good for this kind of slow grind, especially when you're solo. ↩︎

An overview of browser privacy features

2022-07-04T00:00:00Z

Recently I was interviewed for an article about how to privately search for reproductive health services. During the discussion I found myself explaining the different privacy features available to Web users and wishing that I had something written to point to. Hence this post.

Types of Tracking #

First, it's important to be clear about what we are trying to accomplish. When we talk about Web tracking, there are two different kinds of tracking we are concerned about:

Cross-site tracking of your activity across web sites (e.g., I went to Nike and Adidas).
Same-site tracking of your activity at different times on the same site (e.g., I searched on Google for "shoes" and then later for "tofu").

Mostly, when people talk about "Web tracking" they are talking about cross-site tracking. This is clearly something that people didn't really sign up for and doesn't really provide much direct user benefit (we can argue about whether personalized ads are a user benefit, but if so they're not a very large one). For this reason, a number of browsers have started to build privacy features designed to block cross-site tracking by default.

By contrast, a lot of important Web functionality depends on the ability to link up one visit to another (for instance, this is how you stay logged in to your accounts between visits). Even in cases where users don't explicitly log in, sites use information about previous visits to personalize your experience (for instance, to make content recommendations). This isn't to say that all such tracking is desirable, but merely that we can't just turn it off because users would notice and be unhappy. This means that we need to find some way of providing privacy in cases where users want it and not when they don't.

Attacker Models #

Most browser privacy work focuses on what's called a Web attacker, which is to say an attacker who controls some set of Web sites. This is distinct from a lot of Internet security work which assumes a network attacker (see here for more on this) who can observe all of your traffic. The main reason for this is that it's a lot harder to defend against a network attacker—defending against a Web attacker is hard enough—and as we'll see below, we don't know how to do so cheaply.

Tracking Your Browsing History #

Consider the browsing history shown in the diagram below, in which the user visits the sites a.example, b.example, and c.example. If a tracker is present on each of those sites (this is not uncommon!) it will be able to get an accurate picture of your browsing history, learning which sites you visit and in which order.

Cookies #

The main mechanism that sites use to track your behavior is the cookie. Recall that a cookie is just a piece of state that a site can set in your browser and that gets sent back to that site whenever you visit it. Because the same third party can be embedded on multiple sites, and its cookie is sent back to it from each of them, this allows the third party to gradually build up a picture of your browsing behavior, and thus a fairly complete profile of your browsing history, as I described previously. This is obviously extremely bad for user privacy.

As noted above, a number of browsers—notably Firefox and Safari—have started building in anti-tracking mechanisms to reduce this privacy leakage. These mechanisms are concerned with reducing cross-site tracking and operate primarily by restricting the use of third-party cookies and other cross-site state mechanisms. The idea is that instead of allowing trackers to link up behavior on multiple sites, they just get to see behavior on individual sites. The state of the art here is what's called first-party isolation (FPI) (or "double keying") which means that the browser stores cookies separately for each top-level site (the one that appears in the URL bar). In Firefox, this feature is called Total Cookie Protection (TCP), and in Safari, I think it's just part of their Intelligent Tracking Protection^[1] suite.

With FPI, if tracker T appears on sites A and B it will get a different set of cookies on each site. The diagram below shows the usual situation without FPI. The client first visits a.example which incorporates an ad. Because this is the first time the client has encountered this ad server, the client has no cookies for it. When the server serves the ad, it also sets cookie 1234. When the client later visits b.example, which uses the same ad server, the client sends the cookie 1234 which lets the server link up the two visits. Finally, the client goes back to a.example, which again serves an ad, and the client sends the same cookie.

The next diagram shows the same browsing pattern but with FPI on. The first interaction is the same, but then when the client goes to visit b.example and loads an ad from the ad server, it doesn't have a cookie, because cookies for visits to a.example are stored separately from those for visits to b.example (no matter which origin the cookie is for!). Instead, the client makes the request without a cookie and the ad server sends a new cookie 5678. However, when the client goes back to a.example it sends the original 1234 cookie. This preserves some important functionality, such as when a web site uses multiple domains associated with the same company (e.g., the site is served off of service.example but has an API on a CDN such as service.cdn.example),^[2] as opposed to blocking third party cookies, which would break this kind of use case.
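To make the difference concrete, here is a minimal sketch of the two cookie-jar designs walked through above. It is a toy model, not how any real browser stores cookies: the class names, the `ads.example` origin, and the `id-N` cookie values are all assumptions made up for illustration (standing in for the 1234 and 5678 cookies in the text).

```python
from itertools import count


class AdServer:
    """Toy third-party ad server that sets a fresh cookie when it sees none."""

    def __init__(self):
        self._ids = count(1)
        self.log = []  # (cookie, embedding site) pairs the tracker observes

    def serve_ad(self, cookie, embedding_site):
        if cookie is None:
            cookie = f"id-{next(self._ids)}"  # hypothetical cookie value
        self.log.append((cookie, embedding_site))
        return cookie


class Browser:
    """Toy browser; fpi=True keys the cookie jar by top-level site as well."""

    def __init__(self, fpi):
        self.fpi = fpi
        self.jar = {}

    def _key(self, top_level_site, cookie_origin):
        # Without FPI the jar is keyed only by the origin that set the cookie.
        # With FPI ("double keying") the top-level site is part of the key.
        return (top_level_site, cookie_origin) if self.fpi else cookie_origin

    def visit(self, top_level_site, ad_server, ad_origin="ads.example"):
        key = self._key(top_level_site, ad_origin)
        self.jar[key] = ad_server.serve_ad(self.jar.get(key), top_level_site)


def tracker_view(fpi):
    """Replay the a.example, b.example, a.example pattern; return the tracker's log."""
    server = AdServer()
    browser = Browser(fpi)
    for site in ["a.example", "b.example", "a.example"]:
        browser.visit(site, server)
    return server.log
```

With `tracker_view(False)` the ad server sees the same cookie on all three visits, so it can link them. With `tracker_view(True)` it sees one cookie on both visits to a.example and a different one on b.example, which matches the FPI diagram described above.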

Because FPI allows trackers to link up two visits to the same site, but not to different sites, our original user's browsing history would appear to the tracker as three separate traces, like so:

Ideally, the tracker has no way of knowing whether these traces are all from the same browser or from different browsers. How much privacy this provides depends on how much time you spend on a given site. For instance, because people spend a lot of time on Google and Facebook, they get a pretty good idea of your activity and interests, and, depending on that activity, they may be able to tie it to your personal identity. On the other hand, if you go to a site once, then that site doesn't learn a lot about you.

Other Tracking Mechanisms #

Unfortunately, cookies are not the only way to track users. There are two much harder to block mechanisms:

The IP address
Fingerprinting

The IP address is largely tied to a given device, though devices—especially mobile devices—can change their IP address, so it serves as a pretty strong/stable long-term identifier. Because the IP address is necessary for communicating with the server, there's not a whole lot that browsers can do about it directly without relaying traffic through some other node (more on this below).

The other major non-state mechanism for tracking users is fingerprinting. Fingerprinting exploits natural variation in the hardware and software that users run. The Web provides a number of APIs that allow sites to learn information about a user's machine, such as what browser and version they are running, what operating system it is on, what language it is set to, and even the number of logical processor cores it has. Any individual value like this isn't particularly identifying, but when you add them up they provide a significant amount of information about user identity. Estimates of precisely how much vary widely, but everyone agrees it's nonzero and probably at least enough to reduce the set of possible users by a factor of 1000 or more, depending on how unusual a given user's configuration is.
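As a rough illustration of how individually weak signals add up, the sketch below hashes a handful of attributes into a single identifier and counts how many users in a tiny made-up population share it. The attribute names and values are hypothetical, and real fingerprinting scripts collect far more signals than this.

```python
import hashlib
import json
from collections import Counter


def fingerprint(attributes):
    """Hash a dict of observable attributes into a short, stable identifier."""
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def crowd_sizes(users):
    """Count how many users share each fingerprint."""
    return Counter(fingerprint(u) for u in users)


# Hypothetical population: five users with a common setup, one with a rare one.
common = {"browser": "Firefox 101", "os": "Windows 10", "language": "en-US", "cores": 8}
unusual = {"browser": "Firefox 101", "os": "Linux", "language": "is-IS", "cores": 64}
users = [dict(common) for _ in range(5)] + [unusual]

sizes = crowd_sizes(users)
```

Here the five users with the common configuration hide in a crowd of five, while the user with the unusual configuration is alone in theirs, even though no single attribute names them.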

Countering fingerprinting is a difficult problem, and requires compromising between providing maximal privacy and breaking functionality. For instance, a number of Web APIs can be—and are—used for fingerprinting, but they are also widely used for non-fingerprinting purposes, so restricting their use is difficult.

Private Browsing Modes #

Most browsers include some kind of mode ("Private Browsing" on Firefox and Safari, "Incognito" on Chrome) that is designed to provide a somewhat more private experience. Historically, these modes were mostly designed not to prevent web tracking but rather to protect against local attack. The idea here is largely that you might have some kind of shared computer and you don't want whoever you share it with to know what sites you are going to. The official motivating use case for private browsing is often phrased as buying presents for someone, with the unofficial use case being pornography.

At a high level, private browsing modes work by not storing browsing state past the lifetime of the browsing session (though the definition of session varies somewhat). Here's Firefox's list of what it doesn't store:

Visited pages (history)
Form and search bar entries
Download list entries
Cookies
Cached Web content and Offline web content and user data

If this is all working correctly, then someone who uses your computer after you have closed the browser should not be able to learn what sites you have gone to.

Because cookies and cached content are deleted, private browsing also inherently provides some protection against tracking by Web sites in both the first and third party contexts. This protection operates at the level of preventing linkage between sessions. In particular, it should prevent the use of these mechanisms for tracking between private and non-private contexts, such as when you visit a site in private browsing and then go back to it in regular browsing. It also prevents sites from using these mechanisms to track you between multiple private browsing sessions. If our example browsing activity above had used private browsing along with FPI, then what trackers would see is shown below:

The thing to notice here is that the browsing activity before and after the browser restart are disconnected, so the trackers (in theory) can't link them up.
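A toy model of this session behavior, assuming a site that hands out a fresh identifier whenever it sees no cookie. The class names and the `visitor-N` values are invented for illustration; real private browsing modes clear much more than cookies.

```python
from itertools import count


class Site:
    """Toy site that assigns a new visitor cookie when none is presented."""

    def __init__(self):
        self._ids = count(1)
        self.seen = []  # cookie observed on each request, in order

    def handle(self, cookie):
        if cookie is None:
            cookie = f"visitor-{next(self._ids)}"  # hypothetical cookie value
        self.seen.append(cookie)
        return cookie


class PrivateSession:
    """Holds cookies in memory only; everything is dropped on close()."""

    def __init__(self):
        self.jar = {}

    def visit(self, name, site):
        self.jar[name] = site.handle(self.jar.get(name))

    def close(self):
        self.jar.clear()


site = Site()
for _ in range(2):               # two separate private browsing sessions
    session = PrivateSession()
    session.visit("a.example", site)
    session.visit("a.example", site)
    session.close()
```

After this runs, `site.seen` is `["visitor-1", "visitor-1", "visitor-2", "visitor-2"]`: visits within a session are linked, but the two sessions look like two different visitors, at least as far as cookies are concerned.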

Of course this also means that you don't stay logged in to sites that you logged into in private browsing, which is obviously a pain. And if you do log in, then of course the site is able to link up your behavior before and after, obviating the value of using private browsing for those sites. This makes private browsing mode of limited usefulness for a lot of browsing activities (e.g., shopping).

In addition to clearing state, browsers have started to add more explicit anti-tracking mechanisms to private browsing mode. For instance, Firefox Private Browsing mode automatically enables Enhanced Tracking Protection Strict Mode (not the world's least confusing name), which stops the browser from even connecting to many known third party trackers, thus preventing them from tracking you by IP address or via fingerprinting (see below). The theory here is that users who have selected private browsing have shown they care more about privacy than breakage compared to the usual person so the browser can take a more aggressive posture in terms of enabling privacy features. Thus, private browsing modes may provide some additional protection against cross-site tracking within a session as well as between sessions. This is something that varies a lot between browsers.

Beyond Private Browsing Mode #

For the reasons described above, private browsing only provides partial protection against tracking, either by first parties or across sites. In order to get more complete protection, you need to do something about IP-based tracking and probably about fingerprinting.

How Stable Are IP Addresses? #

Most devices use IP addresses that are assigned by their local network, for instance using DHCP. In principle, the network can change these addresses frequently, but as a practical matter they appear to change infrequently. Note that this does not mean that IP addresses are uniquely identifying: it's common for multiple devices to share the same home IP address via NAT, in which case sites may or may not be able to distinguish multiple devices behind the NAT. Specifically, if the devices are of different types, then the site probably can, but if you have two identical iPhones, they might not be able to.

The situation with mobile devices is generally a bit better because, well, they move around. The way that Internet routing works is that the address helps determine where to send the packets, so if you move around physically—e.g., to really different cell towers—your address should change too in order to allow the data to be delivered correctly.^[3] Though of course, if you are using a mobile device from your home WiFi, that address is likely to be fairly stable.

Preventing IP-Based Tracking #

Addressing IP-based tracking requires routing your traffic through some service that will conceal your IP address. At present, there are three main alternatives:

VPNs
iCloud Private Relay
Tor

Technically these are all somewhat different but at a high level they all work by hiding your IP address behind that of the service, so that the site can't track you over long periods of time (depending on how often your IP address changes). Because your traffic is encrypted to the proxy, these mechanisms also provide some privacy against network attackers, though that protection is somewhat limited. For instance an attacker who controls the network on both sides of the proxy might be able to link up your traffic on either side via timing and packet sizes.

From the perspective of Web tracking, these systems are all mostly equally good. The main difference between the designs comes down to how worried you are about other kinds of tracking. For instance, in a typical VPN design, you connect to the VPN service and it forwards your packets to the server. This means that the VPN sees both your address—and presumably has your account information anyway—and the site you are going to, so it is able to track you even if the site doesn't; you're just trusting them not to.

iCloud Private Relay addresses this by having two proxies as shown below:

[Source: Apple white paper]

Those proxies are operated by different providers and so neither has both your identity and the site you are going to and would therefore have to collude in order to learn your browsing history. You could potentially accomplish^[4] the same thing by getting two VPN accounts with different providers, but that's not the usual configuration and would require you to do the work yourself. With Private Relay you just engage with Apple and they take care of the arrangements with the providers (using some somewhat fancy crypto to authorize you to the provider without revealing your identity). Tor takes this one step further by having three hops chosen out of a set of community operated servers. In both cases, the idea is that your behavior is private as long as one of the servers is honest—or hasn't been subverted.
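The "who learns what" argument can be sketched as a small bookkeeping exercise. This is only a toy model: it assumes layered encryption works as intended, so each hop learns just the address traffic arrives from and the address it forwards to, and it does no actual cryptography. The hop names and addresses are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Hop:
    name: str
    learned: set = field(default_factory=set)


def relay(client_ip, destination, hops):
    """Record what each hop learns when traffic is forwarded along `hops`."""
    previous = client_ip
    for i, hop in enumerate(hops):
        is_last = i == len(hops) - 1
        next_addr = destination if is_last else hops[i + 1].name
        hop.learned.add(("source", previous))  # where the traffic came from
        hop.learned.add(("next", next_addr))   # where this hop sends it on
        previous = hop.name


def knows_both(hop, client_ip, destination):
    """True if a single hop can link the client's address to the site."""
    return ("source", client_ip) in hop.learned and ("next", destination) in hop.learned


CLIENT, SITE = "203.0.113.7", "site.example"  # placeholder addresses

vpn = [Hop("vpn")]
relay(CLIENT, SITE, vpn)

two_hop = [Hop("ingress"), Hop("egress")]
relay(CLIENT, SITE, two_hop)
```

In this model `knows_both(vpn[0], CLIENT, SITE)` is true, while it is false for each hop of `two_hop` taken alone; only by pooling what the ingress and egress hops learned (that is, by colluding) can the two facts be joined. Adding a third hop, as Tor does, just extends the list.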

The basic problem with all of these designs is that they require some server (or servers) which relay the traffic and someone has to pay for those servers and their associated bandwidth. iCloud Private Relay and most VPNs are not free, so the user is the one who pays. Tor is different: instead of having a single provider such as Apple or your VPN provider, Tor servers are operated by the Tor community on a volunteer basis and are free to users (this is one of the reasons why Tor performance is generally not great).

Preventing Fingerprinting #

As I said above, a browser's fingerprint depends on a combination of the client software and the hardware it's running on: if you run the same browser on the same hardware, you'll have a fairly stable fingerprinting result. If you run a different browser on the same hardware, you'll have a somewhat different fingerprinting result. This means that if you use one browser for your usual browsing and another browser on the same machine for your "embarrassing" browsing, then each set of activity will have a consistent fingerprint and may be somewhat linkable; it may also be possible to partially link up the two sets of activity based on the fingerprint; I would not generally assume that if you use (say) Chrome for your regular browsing and Firefox for your private browsing, you are entirely safe from fingerprinting. It's probably worse if you use the same browser engine^[5] type (e.g., Chrome and Edge^[6]) or the same browser but in regular vs. private browsing mode, in part because they will expose the same hardware affordances and so have similar fingerprints in that respect.

A number of browsers have explicit anti-fingerprinting mechanisms with varying degrees of effectiveness. These include:

Blocking connections to origins which perform fingerprinting (Firefox)
Adding noise to API return values to make fingerprinting harder (Brave)
Removing APIs which can be used for fingerprinting and trying to make other APIs return consistent results across devices (TorBrowser)

Chrome has also proposed something called the Privacy Budget, in which sites would be allowed to access some data but the browser would throttle access after they had obtained a certain amount (see here for our analysis of this proposal). I don't believe it's been implemented.

This is an area of research that I'm not super familiar with, but my sense is that it's not really that clear how much information can be obtained from fingerprinting. There have been a number of papers on this topic but they generally fall into two categories:

Specific new fingerprinting techniques
Attempts to measure the amount of information available via fingerprinting.

Estimates of the total amount of fingerprinting surface vary a fair bit but generally hover around 18-20 bits of information. Naively, this would be enough to reduce the size of the crowd you are hiding in by a factor of about a million, which is obviously bad, but not enough to identify you specifically in many cases. This is kind of misleading because some people's configurations are more unusual than others. For instance, work by Gómez-Boix, Laperdrix, and Baudry found that out of a data set of around 2 million users 29% of mobile users are unique, whereas 56% of personal computers are.^[7] On the other hand, if you have a very popular device that is configured in a common way—e.g., an out of the box iPhone—then this might leak a lot less than 18-20 bits. I'm not aware of much academic research on this question or on the effectiveness of anti-fingerprinting mechanisms (please let me know if you have any!). Presumably it's better than nothing, but I don't know by how much.
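The arithmetic behind "bits" and "crowd size" is short enough to write down. The function names here are my own, and the model is the naive one in which every bit halves the crowd.

```python
import math


def crowd_reduction(bits):
    """Factor by which `bits` of identifying information shrinks the crowd."""
    return 2 ** bits


def bits_revealed(share_of_users):
    """Bits learned by observing a configuration held by this fraction of users."""
    return -math.log2(share_of_users)


low, high = crowd_reduction(18), crowd_reduction(20)
rare_config_bits = bits_revealed(1 / 1000)
```

So 18 bits corresponds to a factor of 262,144 and 20 bits to 1,048,576, which is where "about a million" comes from. In the other direction, a configuration shared by one user in a thousand reveals just under 10 bits, while one shared by half of all users reveals only a single bit.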

Final Thoughts #

The bottom line here is that there are a lot of tracking mechanisms on the Web, and I've just covered the main ones. It's possible to do quite a bit to mitigate tracking, but the more you do, the bigger impact it has on your browsing experience, both in terms of functionality and performance. Everyone has to sort of choose their own level of comfort here, but if you don't at least do something to protect yourself from IP-based tracking, then the level of privacy is going to be limited, especially for a single site. Finally, if you want to actually browse privately, then you actually have to be anonymous, which means not logging into stuff, not buying things, etc. You can still watch cat videos, though.

This is the best link I could find, but a better one would be appreciated. ↩︎
You might say that people shouldn't architect their systems that way, but this kind of thing happens and if the browser breaks them, then the browser gets blamed. ↩︎
Don't even get me started on mobile IP. ↩︎
I say "potentially" because those two providers might have their equipment in the same data center or cloud provider, in which case you need to worry about that provider. ↩︎
For those who don't know, a lot of browsers are built on the Chromium open source code base that Chrome is based on, which means that they are internally very similar. In addition, every browser on iOS is based on the same engine because Apple forbids other engines on iOS. ↩︎
Brave is a potential exception here because of their anti-fingerprinting features. ↩︎
This is also a bit misleading because in a larger data set, these might not be unique. ↩︎

Understanding The Web Security Model, Part VI: Browser Architecture

2022-06-27T00:00:00Z

This is part VI of my series on the Web security model (parts I, II, outtake, III, IV, V). I'd been planning to talk about microarchitectural attacks next, but it's pretty hard to understand without some background on overall browser architecture, so I'll be covering that first.

Background: Operating System Processes #

We actually have to start even earlier, with the structure of programs in a computer. In early computers, you would just have one program running at a time and that program had sole control of the processor.

Modern computers can of course run multiple programs at once, but they do that by having them share the processor. The operating system is responsible for managing this. Each program runs in what's called a process. The operating system lets a process run for a little while (what's called a time slice), then stops it and hands control to the next process, which gets to run for its own time slice before control is handed to the next process, etc. This is called multitasking and allows multiple programs to share the same computer.^[1] In modern computers, time slices are very short and the processor switches between programs very quickly so it gives the illusion that everything is running in parallel.^[2]
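To make the time-slice idea concrete, here is a toy simulation in Python. It is only a sketch: the names program and run_round_robin are made up for illustration, and each "program" is a generator that gives up control by yielding, which is really the cooperative flavor of multitasking. A real operating system interrupts processes itself rather than waiting for them to yield.

```python
from collections import deque

def program(name, steps):
    """A toy 'program': each yield marks the end of one time slice."""
    for i in range(steps):
        yield f"{name}{i}"

def run_round_robin(programs):
    """Give each program one time slice in turn until all have finished."""
    ready = deque(programs)
    trace = []
    while ready:
        current = ready.popleft()
        try:
            trace.append(next(current))  # let it run for one time slice
            ready.append(current)        # then send it to the back of the queue
        except StopIteration:
            pass                         # this program has finished
    return trace

trace = run_round_robin([program("A", 3), program("B", 2)])
print(trace)
```

The trace interleaves the two programs, which is the "illusion of parallelism" described above: neither program contains any code for switching to the other.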

In a modern OS, programs don't need to do anything special to make this happen; they just act as if they have full control of the processor and the operating system takes care of switching between them. In particular, each process has its own view of the computer's memory and so process A can't just address process B's memory, either by accident or intentionally. This isn't to say that they can't interact at all, but the operating system is responsible for mediating that interaction, allowing some things and forbidding others.

It's also possible for a single program to run multiple processes. One reason to do this is to let two operations run in parallel. Consider a networking process like a Web server. The basic code for something like this might look something like:

loop {
   request = read_request();
   response = create_response(request);
   write_response(response);
}

So what happens if a Web server wants to serve two clients at once? This is fine if the requests come in quickly, but what happens if the request from client A trickles in over a few seconds and then client B sends its request? The server can't process it until it has finished handling client A. If instead the server runs in two processes, however, then process 1 can handle client A and process 2 is available to handle client B when its request comes in. The operating system takes care of making sure that each process gets time to run, so this works fine without any extra effort by the server, as shown below:

You can also get multitasking inside a single process, using a mechanism called threads. Threads inside a process get scheduled independently, so that you can write the same kind of linear code as above and have it run in parallel, but they aren't isolated like processes are. This means that, for instance, thread 1 can accidentally corrupt thread 2's memory, or, if thread 1 crashes it can crash the whole program. On the other hand, switching between threads tends to be cheaper than switching between processes, so each mechanism has its place. Finally, a process with multiple threads tends to consume less memory than the same number of processes because the threads can share a lot of runtime state.
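As a rough sketch of the thread version of the Web server scenario above, the following Python code serves two clients with one thread each. Everything here is an illustrative assumption: handle_client and serve_two_clients are invented names, and a threading.Event stands in for a request that is still trickling in over the network.

```python
import threading

def serve_two_clients():
    completed = []                     # shared by both threads: threads share memory
    completed_lock = threading.Lock()  # protects the shared list

    def handle_client(name, request_ready):
        request_ready.wait()               # stand-in for read_request()
        response = f"hello {name}"         # stand-in for create_response()
        with completed_lock:
            completed.append(name)         # stand-in for write_response()
        return response

    a_ready, b_ready = threading.Event(), threading.Event()
    thread_a = threading.Thread(target=handle_client, args=("A", a_ready))
    thread_b = threading.Thread(target=handle_client, args=("B", b_ready))
    thread_a.start()
    thread_b.start()

    b_ready.set()    # B's request arrives while A's is still incomplete
    thread_b.join()  # B is fully served even though A is still waiting
    a_ready.set()    # A's request finally finishes arriving
    thread_a.join()
    return completed

print(serve_two_clients())
```

Each handler is written as ordinary linear code, yet client B is not stuck behind client A. The shared completed list is also the kind of state that one buggy thread could corrupt for the other, which is the isolation tradeoff described above.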

Single-Process Browsers #

Originally, browsers just had everything in a single process. This included not only the user interface and networking code but also all the code that rendered the Web page and the JavaScript that ran in the page. Moreover, they often ran almost everything in a single thread,^[3] with the program being responsible for multiplexing keyboard input, network activity, etc. (see the side bar for more on this). Because each thread can only do one thing at once, this tended to produce a lot of situations where the browser would become temporarily unresponsive (the technical term here is jank) because it was doing something else rather than responding to the user or playing your video, so gradually more and more of the browser migrated into other threads in order to reduce the impact on the user experience.

Event-Based Programming #

If you don't have threads, it's still possible to multiplex between different tasks. The basic technique is what's called an event loop. The idea behind an event loop is that you have a piece of code that allows you to register event handlers for when certain things happen (e.g., a packet comes in or someone types a key). An event handler is just a function that runs when that event happens.

So, for instance, you might have something like:

function onKeyPressed() {
   ...
}

function onMouseMovement() {
   ...
}

register(KEY_PRESSED, onKeyPressed);
register(MOUSE_MOVED, onMouseMovement);

run_event_loop();

The run_event_loop() function just runs forever, waiting for something interesting to happen (where "interesting" is defined as "some event that has a handler registered"), and when it does, it runs the associated handler function. When the handler function completes, the event loop resumes waiting until something else happens.
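To make the register/run_event_loop pseudocode above concrete, here is a minimal sketch in Python. The EventLoop class and its post method are assumptions made for illustration. Unlike a real event loop, which blocks waiting for the operating system to deliver input, this one just drains a queue of events that were posted in advance and then returns.

```python
from collections import deque

class EventLoop:
    def __init__(self):
        self.handlers = {}      # event type -> handler function
        self.pending = deque()  # events waiting to be processed

    def register(self, event_type, handler):
        self.handlers[event_type] = handler

    def post(self, event_type, payload=None):
        """Queue an event; in a real system the OS would supply these."""
        self.pending.append((event_type, payload))

    def run(self):
        """Run handlers one at a time; nothing else happens while one runs."""
        while self.pending:
            event_type, payload = self.pending.popleft()
            handler = self.handlers.get(event_type)
            if handler is not None:  # events without a handler are ignored
                handler(payload)

log = []
loop = EventLoop()
loop.register("KEY_PRESSED", lambda key: log.append(f"key {key}"))
loop.register("MOUSE_MOVED", lambda pos: log.append(f"mouse {pos}"))
loop.post("KEY_PRESSED", "a")
loop.post("WINDOW_RESIZED")      # no handler registered, so not "interesting"
loop.post("MOUSE_MOVED", (3, 4))
loop.run()
print(log)
```

Because run() calls handlers strictly one after another, a slow handler delays every event queued behind it.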

This works fine and is still common—for instance, the popular Node.js JavaScript runtime works this way—but it's a lot of work to program in. First, because nothing happens while the event handler is running, you constantly have to worry about whether you accidentally are taking up too much time with some operation. For instance, if someone presses a key and then clicks a button and your key press handler takes 500ms, then the button click doesn't get processed for 500ms, which is obviously very unpleasant for users.

This means that you have to break up anything long-running into multiple pieces, but every time you switch from one logical operation to another, you have to arrange to save your state so it's there when you come back to it, which is annoying. By contrast, if you are writing multi-process or multi-threaded code, then the scheduler takes care of pausing one logical operation and letting another run, so you don't need to worry about saving your state and coming back to it. In fact, it's so annoying to program this way that some event-driven systems (in particular JavaScript in both Web browsers and Node.js) have developed mechanisms like async/await that let the programmer write code that appears to be linear but is secretly event-driven.
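As an illustration of code that "appears to be linear but is secretly event-driven", here is a small sketch using Python's asyncio, which provides the same async/await style as JavaScript. The fetch function and its delays are invented for the example, and asyncio.sleep stands in for waiting on the network.

```python
import asyncio

async def fetch(name, delay, log):
    """Reads like straight-line code, but each await hands control to the event loop."""
    log.append(f"{name}: request sent")
    await asyncio.sleep(delay)  # stand-in for waiting on the network
    log.append(f"{name}: response received")
    return name.upper()

async def main():
    log = []
    # Both operations are in flight at once on a single thread.
    results = await asyncio.gather(fetch("a", 0.05, log), fetch("b", 0.01, log))
    return results, log

results, log = asyncio.run(main())
print(results)
print(log)
```

The local variables of each fetch call survive across the await without any explicit state saving by the programmer, which is exactly the bookkeeping that hand-written event handlers require.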

As an example, until 2016 Firefox had an architecture with a single process containing a number of threads for tasks that could be run asynchronously like networking and media. For instance, the user interface runs on one thread, but what happens if the user asks to do something that takes a long time, like load a Web page? The way this happens is that the UI thread dispatches a request to a different thread which is responsible for networking. The networking thread can then connect to the Web site and download the content in the background. This allows the UI to continue to be responsive to the user while the Web page downloads.

This architecture is straightforward and has a number of advantages. In particular, it is easier to share state between the different threads. For example, consider the case I just gave above, in which the UI thread needs to send a request to the network thread. It would assemble a request structure and pass it to the network thread, which could look something like this (this is not real Firefox code):

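#include <string>

enum HttpMethod { HTTP_GET, HTTP_POST };

struct NetworkRequest {
    HttpMethod method;
    std::string url;
    std::string referer;
};

// On the UI thread: build the request, then hand a pointer to it to the
// networking thread (networkThread stands in for the browser's thread object).
NetworkRequest *msg = new NetworkRequest();
msg->method = HTTP_GET;
msg->url = std::string("https://example.com/");
msg->referer = std::string("https://referer.example/");

networkThread->Dispatch(msg);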

When this code calls networkThread->Dispatch() it passes a pointer to (i.e., the memory address of) the NetworkRequest object to the networking thread, which then can access the contents of that object. In C++, the NetworkRequest object does not consist of a contiguous block of memory. Instead, the url and referer members are likely to be separate blocks of memory, with the NetworkRequest object just holding pointers to those objects. This all works because threads share memory, which means that a memory address that is valid on the main thread is also valid on the networking thread. Therefore, you can just pass a pointer to the structure itself and everything works fine.

By contrast, if there were a separate networking process, then this wouldn't work because the pointer to the structure wouldn't point to a valid memory region in the networking process. Instead you have to serialize the structure by turning it into a single message, e.g., by concatenating the method, the URL, and the referer. You then send that message to the network process which deserializes it back into its original components. Any responses from that process would have to come back the same way.
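Here is a minimal sketch of that serialize/deserialize step in Python. The wire format (a 4-byte length prefix followed by JSON) is purely an assumption for illustration and is not what any real browser uses for IPC; the point is only that the message is one self-contained block of bytes with no pointers in it.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class NetworkRequest:
    method: str
    url: str
    referer: str

def serialize(request: NetworkRequest) -> bytes:
    """Flatten the structure into a single self-contained message."""
    body = json.dumps(asdict(request)).encode("utf-8")
    return len(body).to_bytes(4, "big") + body  # length prefix, then payload

def deserialize(message: bytes) -> NetworkRequest:
    """Rebuild an equivalent structure on the receiving side."""
    length = int.from_bytes(message[:4], "big")
    fields = json.loads(message[4:4 + length].decode("utf-8"))
    return NetworkRequest(**fields)

original = NetworkRequest("GET", "https://example.com/", "https://referer.example/")
wire = serialize(original)
received = deserialize(wire)
print(received == original, received is original)
```

The receiver ends up with a copy that has the same contents but is a different object, so changes made on one side are invisible to the other unless another message is sent.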

This is a huge advantage when you have a single threaded program that you want to make multithreaded, because memory sharing makes it comparatively easy to move an operation to another thread. I say "comparatively" because it's still not easy. If you have multiple threads trying to touch the same data at the same time you can get corruption and other horrible problems, so you have to go to a lot of work^[4] to make sure that doesn't happen.^[5] This kind of problem, called a data race, can be incredibly hard to debug, especially as it often won't happen in your tests but only in some scenario where things are operating in a way you didn't expect; but even uncommon things happen a lot when you have a piece of software used by millions of people.
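As a small illustration of the extra work involved, here is a sketch in Python of several threads updating one shared counter. The Counter class and run_workers function are invented for the example. The increment is a read-modify-write, so without the lock two threads can interleave and an update may be lost; the lock makes each increment happen as a unit.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may read, add, and write back.
        with self._lock:
            self.value += 1

def run_workers(num_threads=4, increments=10_000):
    counter = Counter()

    def work():
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value

print(run_workers())
```

The hard part in a real program is not writing the lock but remembering to take it on every path that touches the shared data, which is why this kind of bug tends to show up only in production.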

With processes, by contrast, you mostly get this kind of protection for free, because memory isn't usually shared, but you have to pay the cost upfront of restructuring the code so it doesn't depend on shared memory. This tends to make threads look more attractive than they actually would be if you counted the total cost including diagnosing issues once the software is deployed. In any case, it's quite common to see big programs with a lot of threads.

Stability and Security Issues #

Because all the threads in the same process share the same execution environment, defects that occur in one thread have a tendency to impact the whole program. For example, consider what happens if part of your program tries to access an invalid region of memory. On UNIX systems this generally results in what's called a segmentation fault, which causes the process to terminate. If your entire program is in a single process, then the user just sees your entire program crash. Web browsers are very complicated systems that therefore have a lot of bugs, and it used to be very common for people to just have the whole browser crash.^[6]

Another example is that it's possible for one Web site to starve another Web site. Because the JavaScript engine runs on a single thread, if site A writes some JavaScript that runs for a long time, then site B's JavaScript doesn't get to run. On Firefox, this issue was even worse because the browser UI also ran on the same thread, so it was possible for a Web site to prevent the browser UI from working well. Firefox had some code to detect these cases and alert the user, but it could still cause detectable UI jank.

A single process can also lead to security issues: if an attacker manages to compromise the code running in part of the program, then they can use it to access any memory in the process. For instance, in a Web browser they might steal your cookies and use them to impersonate you to Web sites. In a Web server, they might steal the cryptographic keys that authenticate the server and use that to impersonate the server to other clients. In addition, because any code they manage to execute has the privileges of the whole program, they can do anything the program can do, such as read or write files on your disk, access your camera or microphone, etc.

Process Separation #

As discussed in a previous post, there is a standard approach to dealing with this issue:

Take the most dangerous/vulnerable code and run it in its own process (process separation).
Lock down that process so that it has the minimum privileges needed to do its job (sandboxing). The details of this vary from operating system to operating system but the general idea is that a process can give up its privileges to do things like access the filesystem or the network.
If the process needs extra privileges, have it talk to another process which has more privileges but is (theoretically) less vulnerable.

This strategy was introduced in SSHD and then first shipped in a mainstream browser by Chrome/Chromium. The way that Chromium originally worked was that the HTML/JS renderer ran in a sandbox, but the UI and the network access ran in the "parent" process (what Chromium called the "browser kernel"). The following figure from Barth et al.'s original paper on Chromium shows how this works:

In this figure "IPC" refers to "interprocess communication" which just means a bidirectional channel that the two processes can use to talk to each other. As noted above, that requires serializing the messages for transmission over the wire and decoding them on receipt.

As you would expect, this architecture has a number of stability and security advantages.

Stability #

On the stability side, if the renderer process crashes, the parent process can detect this and restart it. This isn't an entirely glitch-free experience because the site the user is viewing still crashes, but because Chrome can run multiple processes, it doesn't necessarily impact every browser tab. Similarly, because each tab is running in its own process, if tab A has some kind of long-running script it doesn't necessarily impact tab B, and won't impact the main browser UI.

Security #

Because the renderer is sandboxed, compromise of the renderer is less serious. For instance, the renderer would not be able to read files off the filesystem directly but would have to ask the parent to do it. Of course, if the renderer can ask the parent to read any file, then this isn't much of an improvement, so instead the renderer asks the parent to bring up a file picker dialog and then only the selected file will be accessible. This is a specific case of a general pattern, which is that the parent only partly trusts the renderer and has to perform access control checks when the renderer asks for something.
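The file picker pattern can be sketched as follows in Python. This is a toy model and not Chromium's actual interface: ParentProcess, show_file_picker, and read_file are hypothetical names, a dictionary stands in for the disk, and the user's choice is passed in as an argument rather than made through real UI.

```python
class ParentProcess:
    """The privileged side: the only code allowed to touch the filesystem."""

    def __init__(self, files):
        self._files = files  # path -> contents; stand-in for the real disk
        self._granted = {}   # renderer id -> set of paths the user picked

    def show_file_picker(self, renderer_id, user_choice):
        # The parent draws the dialog, so the user (not the renderer) decides
        # which file becomes accessible. user_choice models that decision.
        self._granted.setdefault(renderer_id, set()).add(user_choice)
        return user_choice

    def read_file(self, renderer_id, path):
        # Access control check: the renderer is only partly trusted.
        if path not in self._granted.get(renderer_id, set()):
            raise PermissionError(f"renderer {renderer_id} may not read {path}")
        return self._files[path]

parent = ParentProcess({"/home/u/photo.jpg": b"jpeg", "/home/u/.ssh/id_rsa": b"secret"})
picked = parent.show_file_picker(renderer_id=1, user_choice="/home/u/photo.jpg")
print(parent.read_file(1, picked))
```

A compromised renderer that asks for /home/u/.ssh/id_rsa gets a PermissionError, because the check lives in the parent and does not depend on the renderer behaving correctly.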

In order to gain full control of the computer, an attacker who compromises the renderer must first escape the sandbox. This tends to happen in one of two ways:

The attacker uses a vulnerability in the operating system to elevate its privileges beyond those it is supposed to have.
The attacker uses a vulnerability in the parent process to subvert that process or to cause it to do something it shouldn't.

Sandbox escapes do happen with some regularity but you've now raised the bar on the attacker by requiring them to have two vulnerabilities rather than one.

Of course this does not provide perfect security. First, much of the browser runs outside the sandbox, so compromise of these portions can lead directly to compromise of your machine. A good example of this is networking code, which is exposed directly to the attacker and is easy to get wrong.^[7]

Second, sites are not protected from each other because the same process may serve multiple sites, either consecutively (for instance, if the user navigates between sites) or simultaneously (for instance, if the browser uses the same process for multiple tabs or because a site loads a resource from another site). If a site is able to successfully attack the renderer, it can then access state associated with another site, including cookie state and the like. Thus, the browser protects the user's computer, but not any Web-associated data. As more and more of the work people do moved to the Web, this became a bigger problem: an attacker who can't take over your computer but can read all your banking data and your mail is still a serious threat.

Site Isolation #

The natural way to address the problem of sites attacking each other via browser vulnerabilities is to isolate each site^[8] in its own process. This is called site isolation, and unfortunately it turns out to be a lot harder than it sounds, for a number of reasons.

First, there are a number of Web APIs that allow for synchronous access between windows or IFRAMEs. For instance, if site A does window.open() then it gets a handle it can use to access the new window, for instance to navigate it to a different site or—if it's the same site—to access its data. Similarly, the opened window gets a window.opener property that it can use to access the window that opened it. The APIs that use these values are expected to behave synchronously, so for instance, if you want to look at some property of window.opener this has to happen immediately. If each site is in its own process, then that becomes tricky, so you have to implement some way of allowing that. There are a fair number of similar scenarios and converting a browser to site isolation requires finding and fixing each of them.

Second, unlike the simpler process separation design, you need to ensure that each process is constrained to only do the things that are allowed for that site. For instance, the process for site A cannot access the cookies for site B. This means that every single request to access data that isn't local to the process's memory not only needs to go through the parent, as in process separation, but the parent also needs to check that the process that is making it is entitled to do so, first by keeping track of which process goes with which site and second by doing the right permissions checks. Previously, these permissions checks could be in the renderer process, which was a lot easier, especially if, as in Firefox, they had started there in the first place.
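A rough sketch of that bookkeeping in Python follows. The CookieBroker class and its methods are hypothetical and much simpler than anything in a real browser; the sketch only shows the two pieces named above, a table mapping each renderer process to its site and a permissions check on every request.

```python
class CookieBroker:
    """Lives in the parent process and guards per-site cookie state."""

    def __init__(self):
        self._site_for_process = {}  # process id -> the one site it may act for
        self._cookies = {}           # site -> {cookie name: value}

    def launch_renderer(self, process_id, site):
        self._site_for_process[process_id] = site

    def _check(self, process_id, site):
        # The parent decides from its own records, not from what the
        # (possibly compromised) renderer claims about itself.
        if self._site_for_process.get(process_id) != site:
            raise PermissionError(f"process {process_id} is not locked to {site}")

    def set_cookie(self, process_id, site, name, value):
        self._check(process_id, site)
        self._cookies.setdefault(site, {})[name] = value

    def get_cookies(self, process_id, site):
        self._check(process_id, site)
        return dict(self._cookies.get(site, {}))

broker = CookieBroker()
broker.launch_renderer(1, "a.example")
broker.launch_renderer(2, "b.example")
broker.set_cookie(2, "b.example", "session", "xyz")
print(broker.get_cookies(2, "b.example"))
```

Here a request from process 1 for the cookies of b.example raises PermissionError. Under plain process separation, the equivalent check could sit inside the renderer, which is no help once the renderer itself is compromised.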

Finally, because having a lot of processes consumes a lot more memory, a lot of work was required to try to shrink the overall memory consumption of the system. This also means that it is harder to deploy full site isolation on mobile devices, which tend to have less memory.

At present, Chrome—and other Chromium derived browsers such as Edge and Brave—and Firefox have full site isolation, but to the best of my knowledge, Safari does not yet have it.

Inside Baseball: Multiprocess Firefox #

Unlike Chrome, which was designed from the beginning as a multiprocess browser, Firefox originally had a more traditional "monolithic" architecture. This made converting to a multiprocess architecture much more painful because it meant unwinding all the assumptions about how things would be mutually accessible. In particular, Firefox had a very extensive "add-on" ecosystem that let add-ons make all sorts of changes to how Firefox operated. In many cases, these add-ons depended on having access to many different parts of the browser and so weren't easily compatible with a multi-process system.

At the same time as Chrome was building a multiprocess architecture, Mozilla was developing a new programming language, Rust, which was specifically designed for the kind of systems programming that is required to make a browser engine. Rust had two key features:

Memory safety so that it was much harder to write memory unsafe code, thus eliminating a broad class of serious vulnerabilities.
Thread safety so that it was much easier to write multithreaded code without creating data races that lead to vulnerabilities and unpredictable behavior.

Instead of converting Firefox to a multiprocess architecture, Mozilla focused on the idea of rewriting much of the browser engine in Rust (a project called Servo). If successful, this would have addressed many of the same issues as a multiprocess system: you could easily write multithreaded code and because it was memory safe you wouldn't need to worry as much about compromises of one thread leading to compromises of the process as a whole. If this had worked it would have been very convenient because it would have allowed for a gradual transition without breaking add-ons (which was considered a big deal). It would also have used less memory and quite likely been faster.

The Big Rewrite ultimately didn't work out, for two major reasons. First, it just wasn't practical to rewrite enough of the browser in Rust to make a real difference. Firefox is over 20 million lines of code, a huge fraction of it in C++, reflecting over 20 years of software engineering by a team of hundreds. Even if writing in Rust was dramatically faster it would still be very expensive to replace all that code. Firefox eventually did incorporate several big chunks of new tech from Servo, such as the Stylo Style engine and the WebRender rendering system, and a lot of new Firefox code is written in Rust, but it just wasn't practical to replace everything.

The second reason comes down to JavaScript. A huge fraction of the memory vulnerabilities in browser engines actually isn't due to the memory unsafety of the browser but rather to logic errors in the JavaScript VM that lead to the code it generates being unsafe. Writing in Rust doesn't inherently fix these problems—though of course a rewrite might lead to simpler or easier to verify code.

In any case, Mozilla eventually decided to introduce process separation, in a project called Electrolysis. At first Firefox only had one content process, and even later, after it added multiple processes, it was far more conservative than Chrome about the number of processes that it started, in an attempt to conserve memory (see here for some spin on why having 4 processes was perfect rather than just easy). And those add-ons? Eventually Firefox deprecated them, in favor of WebExtensions.

In retrospect, the decision to do Electrolysis was fortunate because, as we'll discuss next time, multithreaded architectures simply can't properly defend against Spectre-type attacks, so Firefox would have had to move to multiprocess in any case and having already done Electrolysis at least got it part of the way there.

Next Up: Microarchitectural Attacks #

Because site isolation was so much work, converting browsers from process separation to site isolation took a really long time. Chrome was the first browser to start working on site isolation back in 2015 but they were still far from finished in 2018 when an entirely new class of attacks that exploited microarchitectural features of modern processors was discovered. The only known viable long-term defense against these attacks is to move to full site isolation, leading Chrome to increase their level of urgency and Firefox to launch Project Fission to add site isolation to Firefox. I'll be covering these attacks in the next post.

Technically, what I'm describing here is "preemptive multitasking", because the operating system switches programs out without their cooperation. The alternative is "cooperative multitasking", in which programs give up control of the processor. ↩︎
Newer computers also have multiple processors and/or multiple cores and can really do some stuff simultaneously, but that's not that relevant here. ↩︎
As far as I can tell Mosaic actually was completely single-threaded. ↩︎
Much of the machinery of languages like Rust and Erlang is designed to make it possible to safely write multithreaded code without a lot of mental overhead. ↩︎
The Mozilla San Francisco offices used to have a sign set about 8 feet off the floor that read "Must be this tall to write multithreaded code". ↩︎
Technically it's possible to recover from memory violations in the sense that you can just tell the program to ignore the error and keep executing—the Emacs editor used to allow this—but once you've had some kind of memory issue like this, your program is in an uncertain state so all bets are off. ↩︎
Firefox and Chrome are both moving networking into a separate process, and I believe Chrome may have recently completed this on some systems. ↩︎
Perhaps surprisingly, the unit of isolation is not the origin but rather the site, which is to say the registrable domain, aka "eTLD+1". So, for instance, mail.example.com and web.example.com are part of the same site. The reason for this is that sites can set the document.domain property to set their domain to the parent domain, e.g., from mail.example.com to example.com. This puts them in the same origin. See the Chromium design document on site isolation for more detail. ↩︎

First impressions of Web5

2022-06-13T00:00:00Z

Recently Jack Dorsey announced a new project called Web5 which is billed as "an extra decentralized web platform". I've now had time to take a look at the pitch deck and some of the specifications. This post provides some initial impressions.

Overall Idea #

Although Web5 bills itself as for the "decentralized Web", it seems to be addressing a somewhat different set of applications than those I explored previously (helping to make the case that "decentralized Web" is an unhelpful term). In that post, we mostly looked at the problem of how one could publish Web sites and apps without having to use some kind of centralized service. Web5, however, seems to be trying to solve the problem of how to use various Web services (e.g., Spotify or Twitter) while still maintaining control of your data. To that end, the site lists two main use cases:

Control Your Identity Alice holds a digital wallet that securely manages her identity, data, and authorizations for external apps and connections. Alice uses her wallet to sign in to a new decentralized social media app. Because Alice has connected to the app with her decentralized identity, she does not need to create a profile, and all the connections, relationships, and posts she creates through the app are stored with her, in her decentralized web node. Now Alice can switch apps whenever she wants, taking her social persona with her.

Own Your Data Bob is a music lover and hates having his personal data locked to a single vendor. It forces him to regurgitate his playlists and songs over and over again across different music apps. Thankfully there's a way out of this maze of vendor-locked silos: Bob can keep this data in his decentralized web node. This way Bob is able to grant any music app access to his settings and preferences, enabling him to take his personalized music experience wherever he chooses.

The system defines a number of technical components to address these use cases.

Decentralized Web Nodes #

The core idea seems to be that instead of storing your data on the service, you instead store it in a Decentralized Web Node (DWN), which is a network element that is somehow associated with you and that you trust with your data. When services want to use your data—for instance, when Spotify wants to look at your playlist—they contact your DWN and request it. Because the data is stored on your DWN, you nominally control it and how it is used. In other words, this is a federated system.

The diagram below shows the main idea:

In a conventional Web application, each site has its own storage, typically some kind of database (see here for an overview of this kind of Web app). The site stores all of your data/state and you don't have any real access to it. In Web5, each Web site will instead store its data on your DWN. This gives you access to and control of the data but also in theory means that it's portable and/or shareable. For instance, if you want to change from using Spotify to using Apple Music, you just give Apple access to the playlist data on your DWN—and, I suppose, revoke Spotify's access. It's also intended to allow multiple sites concurrent access to the data. There certainly are use cases where this would be valuable, for instance, sharing your travel reservations between Kayak and TripIt.

Note that this kind of element isn't a new idea. For instance Tim Berners-Lee's Solid project has a very similar concept called "Pods":

Solid is a specification that lets people store their data securely in decentralized data stores called Pods. Pods are like secure personal web servers for data. When data is stored in someone's Pod, they control which people and applications can access it.

Of course the technical details of Web5 and Solid are completely different (for instance, the APIs are different and Web5 is based on DIDs whereas Solid uses OIDC for authentication^[1]) but at the big-picture level these ideas seem to be pretty similar.

More generally, the basic idea of Bring Your Own Storage (BYOS) is quite old. Prior to the great Webification of everything—closely followed by the mobile appification of everything—this is how applications were generally built: you would have some network protocol like IMAP (for mail) or CalDAV (for calendaring) that everyone implemented, you would sign up for an account with a service, and then separately download a client. You could switch clients at any time because the whole system was interoperable.

One thing that the Web5 documentation is pretty vague on is where the DWNs come from. What I mean here is not the code (they have some open-source implementation you can download) but the server. It's important to recognize that this system depends on trusting the DWN. Although there is some cryptography the primary security and privacy protections are provided by the DWN doing access control and so this isn't something you can just run on some totally decentralized system. I think it's a safe assumption that most people aren't going to run their own physical DWN server—the inconvenience of that sort of thing is what kicked off our current round of centralization—so we need some other alternative. I guess the idea is that there will be some DWN service that you can subscribe to like you do with Dropbox or gSuite, but it would be nice if the plan here were clearer. There's also some stuff in the spec about how DWNs should be based on IPFS, but I don't really understand that at all. As far as I can tell, how the DWN stores data should be largely invisible.

Data Model #

A DWN mostly presents a fairly generic data storage interface, with two main concepts:

Collections of objects attached to a given JSON "schema" (i.e., a definition of the elements that need to appear in a JSON object, such as a playlist).
Threads of messages attached to each other. It's not entirely clear to me how these are supposed to work, but the idea seems to be to provide a generalized peer-to-peer messaging facility (the slide deck says "send and receive messages over a DID-encrypted universal network").

There's also the concept of Permissions: an entity can request access to a given set of objects (such as a collection) and the owner of the DWN can grant and revoke access.

I don't want to spend too much time on the details here other than to say that this whole part of the system seems fairly thin and would probably benefit from engaging more with prior work. For example, WebDAV provides a fairly sophisticated data management and access control model that is quite a bit more advanced than that presented here, including hierarchical collections, locking, metadata, and access control lists. This isn't to single out WebDAV as ideal but merely to observe that there's a lot of prior art in terms of what kind of capabilities distributed data stores need and my sense is that what's presented here is largely insufficient. As a specific example, real data stores need some way to deal with conflict resolution and concurrent editing, especially if you have multiple uncoordinated applications writing to the same data, and Last-Write Wins, which is the only specified mechanism, is really not enough.
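To see why Last-Write Wins falls short, consider this small sketch in Python. It does not use the Web5 API or its data formats; the record layout and the last_write_wins function are assumptions made up for the example. Two uncoordinated apps each read the same playlist, add a different song, and write the whole record back.

```python
def last_write_wins(stored, incoming):
    """Keep whichever whole record has the later timestamp; nothing is merged."""
    return incoming if incoming["timestamp"] >= stored["timestamp"] else stored

# Both apps start from the same stored playlist.
base = {"timestamp": 0, "songs": ["song1"]}

# App A adds song2 and app B adds song3, each unaware of the other.
write_a = {"timestamp": 1, "songs": base["songs"] + ["song2"]}
write_b = {"timestamp": 2, "songs": base["songs"] + ["song3"]}

record = base
for write in (write_a, write_b):
    record = last_write_wins(record, write)

print(record["songs"])
```

The surviving playlist contains song1 and song3 only. App A's addition is silently discarded even though the two edits did not actually conflict, which is the sort of outcome that locking or merge-aware conflict resolution is meant to prevent.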

Similarly, the whole threads concept seems pretty underspecified. If the idea is to provide some kind of generic secure messaging structure, there's a lot more to do here than just encrypt to people's DIDs—which I think is how it is supposed to work. Modern secure messaging systems like IETF Messaging Layer Security (MLS) incorporate a whole bunch of security and interoperability features (e.g., ratcheting).

My point isn't that these are fatal flaws—all of these are details which could in principle be fixed—but rather that building a system like this correctly is very complicated and that there's a big difference between what we've seen so far and a real system. Moreover, the fact that this initial specification is so incomplete should not inspire confidence that it can be turned into something as generic as it seems to aspire to be.

Distributed Web Apps (DWAs) #

The other big idea here is that apps will be written as what the document calls Distributed Web Apps (DWAs). This part is pretty handwavy, but the basic idea seems to be that they are an extension of what's called a Progressive Web App (PWA). PWAs are a sort of confusing topic, but at a high level, a PWA is a Web app that has been designed to act more like a native app. This means things like:

An icon on the home screen
Working offline
Storing data on the client (this is required to work offline)

While PWAs run in the user's browser, they still ultimately depend on the main Web site for their data and potentially for some of their logic. It seems that a DWA will instead directly access the DWN to get the user's data, but under the authority of the site. So, for instance, if you granted example.com access to your music playlists, it could either contact the DWN directly or empower the DWA to do it directly from your browser. The technical details here are a bit fuzzy, but this also seems pretty clearly doable via some combination of tokens, delegation, etc. so I don't think we should worry too much about that.

DWAs seem like kind of a separable idea from DWNs. Looking at PWAs, we see that some sites build native apps and not PWAs, some build both, and some build neither (my impression is that it's quite uncommon to build just a PWA); it's really a design choice by the site. Similarly, if you managed to make the shift from site-based storage to DWNs, I would expect sites to do some combination of native apps, DWAs, and regular Web sites based on what worked best for them (there's no reason why DWNs can't be used with native apps, even though that's not how it's presented). I don't think DWAs make or break the vision of Web5.

DIDs #

Finally, I should mention that all the identities in Web5 are phrased as Decentralized Identifiers (DID) (see here for some background on DIDs). At some level, this is just a detail: you need some way to talk about principals, there are a lot of potential options here, and DIDs are entirely generic.

In order to participate in Web5, the DID document has to contain a DecentralizedWebNode service endpoint that contains one or more HTTPS URLs, like so:

{
  "id": "did:example:123",
  "service": [{
    "id":"#dwn",
    "type": "DecentralizedWebNode",
    "serviceEndpoint": {
      "nodes": ["https://dwn.example.com", "https://example.org/dwn"]
    }
  }]
}

[Source: DWN specification]

Note that because the security of this system depends on the security of the DWNs, and the DWNs are accessed over HTTPS, the security of this system depends on the DNS. This means that the security value you are getting out of generic DIDs is somewhat limited. The cost of supporting generic DIDs is the interoperability risk of having a DID method that isn't supported by one of the services you want to use. As a practical matter, if Web5 takes off, I'd expect those services to mostly converge on a small number of methods.
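As a rough illustration of the DNS dependency, the following Python sketch pulls the DWN node URLs out of the example DID document above and lists the hostnames a client would have to resolve and authenticate over HTTPS. The function names `dwn_nodes` and `dns_names` are made up for this example and are not part of any Web5 library.

```python
import json
from typing import List
from urllib.parse import urlparse

# The example DID document from the DWN specification, as shown above.
DID_DOCUMENT = json.loads("""
{
  "id": "did:example:123",
  "service": [{
    "id": "#dwn",
    "type": "DecentralizedWebNode",
    "serviceEndpoint": {
      "nodes": ["https://dwn.example.com", "https://example.org/dwn"]
    }
  }]
}
""")


def dwn_nodes(did_document: dict) -> List[str]:
    """Collect the node URLs from every DecentralizedWebNode service entry."""
    nodes: List[str] = []
    for service in did_document.get("service", []):
        if service.get("type") == "DecentralizedWebNode":
            nodes.extend(service.get("serviceEndpoint", {}).get("nodes", []))
    return nodes


def dns_names(urls: List[str]) -> List[str]:
    """Return the hostnames that HTTPS (and hence the DNS) must vouch for."""
    return [urlparse(url).hostname for url in urls]


nodes = dwn_nodes(DID_DOCUMENT)
hosts = dns_names(nodes)
```

Whatever method produced the DID document, the last hop to the data is an ordinary HTTPS connection to one of these names.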

How to present new technical proposals #

As an aside, the way Web5 is presented requires a fairly large amount of filling in the blanks. Basically we have a Web site, a slide deck with an overview of the system as a whole, and then some detailed protocol specifications and code on Github. This is all fine, I guess, but what's really needed is a document describing the system architecture, how the technical components fit in, and how it meets the use cases. Over the years I have reviewed a lot of early-stage specifications and the details of those specifications rarely matter, as they usually get extensively revised during development and standardization. What's necessary at this stage is to give readers enough of an understanding of your overall vision that they can see how it's going to work, figure out if it's worthwhile, know how you've solved the hard problems, and know what problems remain to be solved. Too many details actually get in the way of that, and a slide deck like this is way too high level. My take on how to write these is found in RFC 4101, but there are obviously lots of ways to do that. But a slide deck isn't it.

Building a Full System #

I said a number of times above that this is pretty thin on details. That's not uncommon with early stage proposals, but can make it very hard to assess the viability of the ideas because you don't know what's hiding behind the vagueness. Things can be vague for at least three major reasons:

It's obvious how to fill them in but someone needs to do so. For instance, you are pushing around JSON and so you'll need some formal definition of the contents. Nobody thinks that's impractical, but it's just work.
There are a number of viable ways to do something and it's a lot of engineering to work it out, often because there are conflicting requirements which have to be balanced, so you've put it off.
You actually don't know how to do it.

Reason (1) isn't a problem at this stage, though it will eventually be one if you actually want people to build interoperable systems. Reason (2) is generally a sign that it's going to take quite some time to get to production. Reason (3) potentially represents an existential threat to the project, especially if you actually have to solve the problem in order for it to succeed.^[2] It can often be hard to distinguish cases (2) and (3), and it's also very often the case that people think they have case (2)—or even case (1)—but they actually have case (3).

It's clear that this document has a bunch of case (1), which, as I said, I'm not too worried about. More worrisome, however, is that it has a lot of (2) and some stuff that's either actually in category (3) or at least requires so much work that it's practically in (3), even though we sort of could figure out how to do it.

Interoperability #

My first concern here is interoperability. One of the primary use cases seems to be that two similar sites will share the same data on your DWN. The slide deck gives two examples: (1) two music services sharing your music playlist and (2) sharing your travel reservations between sites. In order for this to work properly, the sites that are sharing the same data need to agree on the data format and semantics.

Data Model #

In the examples provided in the slide deck, the data format is identified by a link to a JSON schema on schema.org, which is a registry of schemas (definitions of data structures). For instance, in the music playlist example, playlists would be rendered as MusicPlaylist. Here's a slightly trimmed version of the example from schema.org (I also fixed their misspelling of "Lynyrd Skynyrd").

{
  "@context": "https://schema.org",
  "@type": "MusicPlaylist",
  "name": "Classic Rock Playlist",
  "numTracks": "2",
  "track": [
    {
      "@type": "MusicRecording",
      "byArtist": "Lynyrd Skynyrd",
      "duration": "PT4M45S",
      "inAlbum": "Second Helping",
      "name": "Sweet Home Alabama",
      "url": "sweet-home-alabama"
    },
    {
      "@type": "MusicRecording",
      "byArtist": "Bob Seger",
      "duration": "PT3M12S",
      "inAlbum": "Stranger In Town",
      "name": "Old Time Rock and Roll",
      "url": "old-time-rock-and-roll"
    }
  ]
}

This is actually not what I expected to see, because the definition of the byArtist in track is actually of type Person, but "Lynyrd Skynyrd" is clearly a text field. This appears to be a known problem in schema.org.

We expect schema.org properties to be used with new types, both from schema.org and from external extensions. We also expect that often, where we expect a property value of type Person, Place, Organization or some other subClassOf Thing, we will get a text string, even if our schemas don't formally document that expectation. In the spirit of "some data is better than none", search engines will often accept this markup and do the best we can. Similarly, some types such as Role and URL can be used with all properties, and we encourage this kind of experimentation amongst data consumers.

This sort of makes sense in a system which seems to be mostly devoted to publishing metadata that can be consumed if available and ignored if not, but it's not sufficient for bidirectional interoperability. Obviously, if Spotify expects to use personal names and TIDAL expects to use Person we're going to have problems. It gets worse, though. There are at least three separate ways to render the artist who performed "Old Time Rock and Roll":

Bob Seger
Seger, Bob
Bob Seger & The Silver Bullet Band (this is what Amazon Music uses, incidentally).

You could also have "and" instead of "&" in both the name of the band and the name of the song. This isn't a problem with playlists produced and consumed by the same entity because they can be consistent about their choices—or more likely have the identifiers refer to actual assets (e.g., Spotify has resource identifiers that look like this 6rqhFgbbKwnb9MLmUQDhG6) and just have human-readable metadata—but it's critical for interoperability, where mismatches will result in mysterious failures.
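Here is a small Python sketch of why this bites. It compares the three renderings listed above, first as raw strings and then after a deliberately naive normalization; the `naive_normalize` function is a made-up heuristic for illustration, not something any of these services is known to use.

```python
import re

# The three renderings of the same artist discussed above.
RENDERINGS = [
    "Bob Seger",
    "Seger, Bob",
    "Bob Seger & The Silver Bullet Band",
]


def naive_normalize(name: str) -> str:
    """Lowercase, treat '&' as 'and', and undo a 'Last, First' ordering."""
    name = name.lower().replace("&", "and")
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    # Collapse any runs of whitespace left over from the rewriting.
    return re.sub(r"\s+", " ", name).strip()


# Exact string comparison sees three different artists.
distinct_raw = set(RENDERINGS)

# Heuristics collapse two of them, but the band rendering still differs.
distinct_normalized = {naive_normalize(name) for name in RENDERINGS}
```

Even with the heuristic, two distinct values remain for what a listener would consider one artist, and every consumer would have to apply the same heuristic to get even that far.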

The situation with Reservation is equally bad. To take one example, it contains departureAirport (nested under reservationFor which is of type Airport). Airports can be listed either by IATA code or ICAO code, so what happens if site A uses the IATA code (YYZ) and the other site uses the ICAO code (CYYZ)? I guess you need to be prepared to accept both. At a higher level, how do you link up multiple reservations attached to the same trip? The schema doesn't tell you, so you have to invent something (use Trip? Create an identifier that you attach to each reservation?) and you can expect different providers to invent different things. Similarly, if Expedia and United create separate trips, how do you join them?
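"Be prepared to accept both" for airport codes might look like the Python sketch below. The one-entry `ICAO_TO_IATA` table and the `canonical_airport` function are hypothetical; a real consumer would need a complete, maintained mapping, and every consumer would need to agree on which form is canonical.

```python
# Hypothetical lookup table with the single airport used in the text;
# a real table would have to cover every airport.
ICAO_TO_IATA = {"CYYZ": "YYZ"}


def canonical_airport(code: str) -> str:
    """Map either an IATA (3-letter) or ICAO (4-letter) code to IATA form."""
    code = code.strip().upper()
    if len(code) == 3:
        return code
    if len(code) == 4:
        if code not in ICAO_TO_IATA:
            raise ValueError(f"unknown ICAO code: {code}")
        return ICAO_TO_IATA[code]
    raise ValueError(f"not an IATA or ICAO airport code: {code!r}")


# Site A wrote the IATA code and site B wrote the ICAO code.
same_airport = canonical_airport("YYZ") == canonical_airport("CYYZ")
```

This handles one field. The harder questions in the paragraph above, such as how to tie several reservations to one trip, have no comparable mechanical fix because the schema does not say what the right answer is.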

The point isn't that this kind of schema is bad but that it's insufficient in that it mostly defines syntax and not semantics and there are many structures that are compatible with these schemas (to some extent deliberately because it allows for flexibility!). If you want to have interoperability, you need to rigorously define the semantics of everything. As a good example of how this plays out in practice, look at the CalDAV specification, which contains 99 pages of specification about how precisely calendaring systems should interoperate, all assuming that you already have a WebDAV-based data store. This is the kind of thing you need to do if you actually want multiple sites to interoperate with the same data values, and you'll need to do it one at a time for each application, not just point at schema.org and hope. It's not impossible, it's just a lot of work, and it has to be done for every single application domain where you want to interoperate.

It's worth noting that these are actually the easy cases because they mostly involve multiple sites computing on your data. The problem of how to have a consistent data model for something complicated like Twitter or Facebook where people's viewing experience is assembled out of other people's data and you want to have a consistent experience when viewing a mixture of content sourced by services A, B, and C—even when you are on service D—is likely to be a lot harder.

Application Architecture #

Consider the case of photo sharing, which seems like an obvious example of owning your own data. So you have all your photos on your DWN and now you want to give Flickr access to them so that you can share them with other people. What now?

The first question we have to answer is where the data will be served from when people go to look at your albums. One answer is that it's served off of your DWN, but this actually puts enormously high requirements on the DWN in that it has to be able to serve very high volumes of traffic. Serving that amount of traffic is one reason you use a photo sharing site like Flickr in the first place, so that's no good. This means that the data has to be served off of Flickr, not your node,^[3] but how does that work?

The obvious thing for Flickr to do is to just suck all the data off of your DWN and replicate it locally. So, instead of having the architecture I showed above, we actually have something more like the diagram below:

In this case, Flickr has a copy of your data which is what it uses to serve to other people, and then—at least in theory—it periodically syncs your data with the DWN. This sync has to be bidirectional, so that Flickr can discover when new pictures have been created, and in practice, it will actually need some way to be notified when that has happened. This probably means some kind of publish/subscribe framework for these notifications. Again, not impossible, but it needs to be specified.

Note that even in cases where the site doesn't need to serve high volumes of traffic, it's extremely convenient to have a site-local copy of the data. For instance, it lets you run algorithms (face recognition, machine learning, etc.) over the data quickly without having to constantly retrieve it from the DWN.

Another advantage of having a local copy is that it allows you to make changes that happen immediately without being dependent on the DWN for performance (remember that users will blame the site when it's slow, not the DWN). But then you have to worry about what happens when the user makes a big pile of changes on one site that conflict with changes on some other site. Those changes have to somehow be resolved, and each site will have to implement all of this logic. The situation is somewhat better if you just write everything right to the DWN but you still have to deal with conflict resolution for any change that's not instantaneous.

This is of course a problem for any system that has multiple readers and writers, and while we do see systems that have shared data that multiple clients can concurrently write to (e.g., the Strava API), application authors have to take real care not to step on each other. One common pattern you see in practice is for site A to import data from site B but not to write it back and just to keep any changes locally. For obvious reasons, this is a lot easier, especially if you already have to keep a local copy anyway for other reasons.

Access Control #

Next, we need to ask how access control will work. As noted above, the DWN is responsible for denying or granting access to your data, but the unit of access control is the site, not the user. Consider the case of the photo site from the previous section: once you have shared your photos with the site it is free to show them to anyone it wants without any involvement from your DWN. Of course, the site will likely have its own access control settings, but you're trusting the site to enforce those, not the DWN.

Moreover, those access control settings have to be stored somewhere. If it's in the site's database, then you've just lost control of some of your data; if it's in the DWN then we have to specify how access control is stored, which is likely to be very complicated given that each site has its own access control model (share with friends, share with specific people, etc.). Of course, the site could just store some site-specific blob on the DWN, but that's hardly better than storing it locally.

It's possible to imagine having the DWN make every access control decision somehow, either in an "advisory" capacity by serving as an oracle for the site, or in a "mandatory" capacity by requiring cryptographic controls for every action. For instance, every photo could be encrypted and if the site asks to share a photo with DID XYZ, you (or the DWN) then share the encryption key with that DID. People have tried to build this kind of system (e.g., Tahoe-LAFS), but the results are technically complex and likely not easy to map onto everyone's existing access control systems. To take just one problem: how do you map the existing identifier space of Flickr (or Twitter) to DIDs?

This is just a specific instance of a general situation, which is that even if the data is stored on a device you control, the behavior of an application is dictated by the application logic, which is largely out of your control.

Final Thoughts #

I certainly understand the motivation for this work. Having all of your data locked up in various silos sucks—don't even get me started on streaming apps—and it would be great to have interoperability. With that said, I don't think this is a very promising technical direction. Long experience with standardizing protocols for applications as diverse as e-mail, calendaring, directories, and telephony teaches us that if you want to have interoperability you need to produce detailed specifications that encode the semantics of the application domain, and that this, not the mechanics of data storage and retrieval, is the hard part. The Web5 specifications—at least at present—almost exclusively focus on those generic mechanics, leaving the real problems unsolved.

In my opinion a better way to attack this problem would be to attempt to solve some specific set of application domains (start with Twitter-like microblogging, perhaps?) and see if you can build a protocol or protocol suite that would enable interoperability there. This would also require getting actual buy-in from the various sites that you expect to be consumers of this protocol, which seems like it will be very challenging under the best of circumstances. Once you've done a few application domains, you can try to figure out what the common ideas are and perhaps try to build them into some generic infrastructure that makes future protocols easier. This is obviously a lot more effort, but I think it's far more likely to succeed than trying to build a generic system and hoping people will somehow make it work.

Though Solid apparently has a DID method as well. ↩︎
Sometimes you don't know how to build some feature, but at the end of the day you could ship without it. This is what happened with TLS Encrypted Client Hello, but then we actually figured out how to do it later. ↩︎
See here for some problems with more decentralized options. ↩︎

On Blockchains/Ledgers and Identity Systems

2022-06-06T00:00:00Z

OK, so I managed to get through my post on identity while only using the word "blockchain" twice. However, the story of self-sovereign identity/decentralized identity is inextricably intertwined with blockchains: much of the interest in decentralized identity comes out of the blockchain/Web3 quarter and a very large fraction of the proposals in this space involve blockchain in one way or another. This post tries to explain the role of the ledger in these systems, which, as we'll see, is surprisingly limited.

Background: DID #

The main specification in this space is something called (unsurprisingly) Decentralized Identifiers (DID). You don't actually need to know about DIDs to talk about decentralized identity, but most of the mechanisms are now defined in terms of DID, so it's most convenient to use DID terminology. The DID specification isn't actually a type of identifier but rather a generic framework for identifiers. A DID is a kind of URI that has the scheme did:, like so:

[Source: DID Core specification]

Each DID has a method and then a method-specific identifier. The method describes how you use the method-specific identifier to look up what's called a DID document which contains the actual identity information you are interested in in a JSON-LD structure, for instance:

{
  "@context": [
    "https://www.w3.org/ns/did/v1",
    "https://w3id.org/security/suites/ed25519-2020/v1"
  ],
  "id": "did:example:123456789abcdefghi",
  "authentication": [{
    "id": "did:example:123456789abcdefghi#keys-1",
    "type": "Ed25519VerificationKey2020",
    "controller": "did:example:123456789abcdefghi",
    "publicKeyMultibase": "zH3C2AVvLMv6gmMNam3uVAjZpfkcJCwDwnZn6z3wXmqPV"
  }]
}

[Source: DID Core specification]

What's sort of unusual about DID is that the specification defines the format of the DID document but not the methods. So, for instance, in the example above, if you know the method example then you can obtain (technical term: resolve) the DID document, but if you don't know that method, then you can't do anything with the DID.
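The following Python sketch shows what this method dispatch looks like from the resolver's side. The `RESOLVERS` registry and the `resolve_example` stand-in are hypothetical (the `example` method is just the placeholder used in the specification); the only part taken from DID itself is the `did:<method>:<method-specific-id>` layout.

```python
from typing import Callable, Dict, Tuple

Resolver = Callable[[str], dict]


def parse_did(did: str) -> Tuple[str, str]:
    """Split a DID into (method, method-specific identifier)."""
    # maxsplit=2 because the method-specific identifier may contain colons.
    parts = did.split(":", 2)
    if len(parts) != 3 or parts[0] != "did" or not parts[1] or not parts[2]:
        raise ValueError(f"not a DID: {did!r}")
    return parts[1], parts[2]


def resolve_example(method_specific_id: str) -> dict:
    """Stand-in resolver for the placeholder `example` method."""
    return {"id": f"did:example:{method_specific_id}"}


# Hypothetical registry: one resolver per DID method this client understands.
RESOLVERS: Dict[str, Resolver] = {"example": resolve_example}


def resolve(did: str, resolvers: Dict[str, Resolver] = RESOLVERS) -> dict:
    """Resolve a DID to its DID document, if the method is supported."""
    method, method_specific_id = parse_did(did)
    if method not in resolvers:
        # Without a resolver for this method the DID is an opaque string.
        raise LookupError(f"no resolver for DID method {method!r}")
    return resolvers[method](method_specific_id)


document = resolve("did:example:123456789abcdefghi")
```

The `LookupError` branch is the interoperability risk in miniature: two parties can both "support DIDs" and still be unable to resolve each other's identifiers.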

`did:key` #

Pretty much the simplest kind of identifier here is just a bare public key, which is approximately what did:key provides. did:key DIDs look like this:

did:key:z6LSeu9HkTHSfLLeUs2nnzUSNedgDUevfNQgQjQC23ZCit6F

Keys vs. Hashes #

Note that for authentication purposes, it's not necessary to have the key; you could just carry a digest of the key and then have the signer supply the key along with their signature. However, DIDs also allow you to carry keys for encryption, where this trick doesn't work. Of course, for modern elliptic curve algorithms, the public key is essentially the same size as the hash would be, so this is less useful. For post-quantum algorithms, though, the keys may be bigger.

The identifier is basically just a type specifier that indicates what algorithm the key is associated with (z6LS means X25519; z6Mk would mean Ed25519) followed by the public key. Because the concept of DIDs is that you resolve the DID into a DID document, there is a (somewhat hazily defined)^[1] way to expand the key into a DID document. However, for our purposes we can just think of this as a public key.
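To see the "type specifier followed by the public key" structure directly, here is a short Python sketch that unpacks the did:key shown above. It assumes the usual encoding: a `z` multibase prefix meaning base58btc, then a two-byte multicodec prefix (`0xec 0x01` for X25519, `0xed 0x01` for Ed25519), then the raw key bytes. The function names are invented for this example, and only those two key types are handled.

```python
from typing import Tuple

BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

# Varint-encoded multicodec prefixes for the two key types handled here.
KEY_TYPES = {
    b"\xec\x01": "X25519",
    b"\xed\x01": "Ed25519",
}


def base58_decode(text: str) -> bytes:
    """Decode a base58btc string (Bitcoin alphabet) into bytes."""
    value = 0
    for char in text:
        value = value * 58 + BASE58_ALPHABET.index(char)
    body = value.to_bytes((value.bit_length() + 7) // 8, "big")
    # Each leading '1' character encodes one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body


def parse_did_key(did: str) -> Tuple[str, bytes]:
    """Return (key type, raw public key bytes) for a base58btc did:key."""
    prefix = "did:key:z"
    if not did.startswith(prefix):
        raise ValueError("expected a did:key with a 'z' (base58btc) value")
    raw = base58_decode(did[len(prefix):])
    codec, key = raw[:2], raw[2:]
    if codec not in KEY_TYPES:
        raise ValueError(f"unsupported multicodec prefix: {codec.hex()}")
    return KEY_TYPES[codec], key


key_type, public_key = parse_did_key(
    "did:key:z6LSeu9HkTHSfLLeUs2nnzUSNedgDUevfNQgQjQC23ZCit6F"
)
```

The leading characters of the identifier are simply what the multicodec prefix looks like once it has been base58-encoded together with a 32-byte key, which is why they work as a visual type tag.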

As described previously, this kind of explicit key system isn't very flexible. Because it binds your identity to your public key, it doesn't easily allow you to (for instance) update your keys.

`did:web` #

At the other end of the spectrum we have did:web in which the DID is effectively a URI that points to the DID document, as in:

did:web:example.com:user:alice

For some reason the slashes in the path are converted to colons, so this refers to:

https://example.com/user/alice/did.json
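The mapping is mechanical enough to sketch in a few lines of Python. This follows my reading of the did:web method specification: colons become path separators, a percent-encoded colon in the first component is a port, and a DID with no path falls back to `/.well-known/did.json`. The function name `did_web_to_url` is made up for the example.

```python
from urllib.parse import unquote


def did_web_to_url(did: str) -> str:
    """Translate a did:web identifier into the URL of its DID document."""
    prefix = "did:web:"
    if not did.startswith(prefix):
        raise ValueError(f"not a did:web identifier: {did!r}")
    parts = did[len(prefix):].split(":")
    # The first component is the host; a port, if present, arrives as %3A.
    host = unquote(parts[0])
    path = parts[1:]
    if not host or any(not segment for segment in path):
        raise ValueError(f"malformed did:web identifier: {did!r}")
    if not path:
        # A bare domain uses the well-known location.
        return f"https://{host}/.well-known/did.json"
    return f"https://{host}/" + "/".join(path) + "/did.json"


alice_url = did_web_to_url("did:web:example.com:user:alice")
```

Note that the result is always an `https://` URL, so everything a relying party learns from it rests on the DNS and the WebPKI.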

In order to resolve the DID, the RP connects to the server and retrieves the indicated document. This document can of course be arbitrarily rich and contain not only keying material but also other assertions about the user, including third party assertions such as "the State of California asserts that this user's personal name is Alan Smithee." This is all left kind of vague in the specs, but I think the idea is that if you were to obtain such an assertion, you would add it to your DID document and upload it to the server. Incidentally, this has horrifying privacy properties, but maybe you could find some way to encrypt them or have some other kind of access control.

It's important to realize at this point that the server is actually doing two things: authenticating the identity document and publishing it. That's sort of natural in the Web context, but there's nothing inherent about it, and it's quite possible to have systems in which authentication and publication are totally separate; you just need some way to distribute the data. Because Web servers exist to serve data, it feels natural to combine them into one service, but, as we'll see below, it's not as good an idea when the identity is rooted in a different system.

did:web and similar Web-based identifiers have effectively the opposite properties from did:key and other key-based identifiers. If you want to replace your key all you need to do is update the DID document. Regrettably, the did:web specification punts on this issue but presumably one could invent something, whether it's a Web page that one could update manually, a Web API, or a standardized protocol like WebDAV. On the other hand, like an e-mail address, the identifier is completely controlled by the operator of the Web site it's being served off of. For instance, if you have the identifier did:web:example.com:users:fuzzy-dunlop, and the operator of example.com decides to change your public key, they can just do so, whenever they want.

This isn't really an issue for identifiers that are supposed to represent the domain operator, as in a vaccine passport system, but if you are a user of a system operated by someone else it means you don't control your own identity. Of course, in principle you could register your own domain and host your identity there, but few people do; as a practical matter if did:web were ever to be popular with ordinary users we'd expect most of the identifiers to be did:web:gmail.com:<username> or the like, with Google (or Yahoo or whoever) controlling the identities for those users.

Even for users who do register their own domain, at the end of the day did:web identities are just bootstrapping off the existing DNS namespace and the WebPKI, which asserts identities within it. Because the namespace is hierarchical, this means that your identity can be taken away by someone who can control the relevant part of the DNS, for instance if a government seizes your domain. If what you want is "self-sovereign identities" in which "A person’s digital existence is now independent of any organization: no-one can take their identity away." then this isn't it.

Online vs. Offline Authentication #

It's worth noting that even if we ignore these inherent structural issues in a system like this, it's somewhat odd to have an identity assertion require access to an online resource, in this case the Web server. By contrast, WebPKI certificates can be verified offline, which means you don't need to contact the certificate authority in order to validate them.^[2] The reason for this is that the certificates are digitally signed by the CA and that signature can be verified by anyone. Operationally, this means that if you send someone a message signed with the key corresponding to your DID, that person can't verify it without contacting the Web server; if that server—or the relying party—is offline, the relying party will have to wait.

did:web assertions aren't just online: because of the specific way in which TLS works, they are also deniable.^[3] What I mean by that is that if I connect to a Web server—any Web server—and retrieve some data, I have no way of proving the contents of that data (depending on the protocol details I may be able to prove that I made the connection). The reason for this is that the data is protected with a symmetric key which is jointly known both to me and the server and so either side can produce the same protocol messages (technical term: protocol trace).

There's no way to prove that the Web server made a specific identity assertion, so, for instance, if you send me a document signed with a did:web identity and then change your key pair, you can just deny that the key that signed the document was yours and I can't prove otherwise. This is an attractive property in some situations, but limits the use of this kind of identity.

A related property is that it makes it harder to detect malfeasance by the Web server. Suppose that the Web server occasionally lies about the public key of a given identity (e.g., so that the attacker can impersonate Alice). How would you detect this? In a certificate system like the WebPKI you can use a transparency log which—at least in theory—allows you to detect this, but the problem is much harder when the data is retrieved from the Web. In principle you could have transparency just for the keys, but without a way to prove that the server actually sent you a given key, anyone can frame the server for sending a bogus key, which means that malfeasance is deniable. The transparency log is still of some value, but it needs to be checked in real time and it's still unclear what you do if a key isn't in the log.

Even if we discount malice by the server operator, we also have to deal with server compromise: if the Web server is compromised the attacker can serve any DID responses it wants. By contrast, if the DIDs were signed then the signature key could be kept offline and not be subject to online attack. For obvious reasons the Web server's authentication key cannot be kept offline, as it must be used with more or less every transaction.^[4]

Ledger-Based DID Methods #

With the above as background, it's useful to look at the systems we now see being proposed, and see what's going on. As I mentioned above, the DID specification is really a framework for different methods, each of which has their own way of resolving the DID document from the identifier. Quite a few of these are tied to some blockchain or another and while the details differ, the general concepts seem to be fairly similar. The following description is sort of a mashup of did:v1 and did:indy that hopefully captures the general flavor.

The main idea is that the ledger, however that's implemented, provides some set of functions, such as:

Creating an identity document
Updating a given identity document
Reading an identity document

Effectively, what this does is take did:web, cross-out web, and write <insert-ledger-here> in its place; it's kind of a rough fit.

As with did:key, each identity document is associated with a given cryptographic key and so the identifier is derived from the key, for instance by hashing it. The ledger is supposed to enforce this requirement.

Once a document has been created, it is possible to update it using the update function. Updates are authorized by the current key, which, again, is enforced by the ledger. Importantly, you can change the current key but this doesn't change the identifier, so it's possible for the public key to become totally decoupled from the identifier in such a way that there's no relationship between them.

You can resolve an identifier by doing a read operation, which returns the identity document.

In any system like this, it's important to understand what the ledger is doing for us, because there is a tendency to think of ledgers/blockchains as magic. At a high level, then, the ledger is providing three services:

Storing the user's identity document(s)
Authenticating the user's identity document(s) to the RP
Providing a consensus timeline for changes to the identity document(s)

Arguably, the first of these is largely unnecessary, the second is bad, but the third is essential.

Let's start with storing the user's identity document(s). In the majority of authentication contexts, whether online authentication like login or messaging applications, the entity being authenticated is sending some set of data to the RP and that data is then signed with the appropriate key. The straightforward thing to do, then, is to provide the identity document(s) at the same time; this is how both channel security systems like TLS and messaging systems like OpenPGP or S/MIME work. Aside from being self-contained, this also has privacy advantages because it doesn't require the RP to query some service for the authenticating identity, which would leak who was talking to who.

There are, of course, some applications in which you want to send an asynchronous encrypted message to someone you haven't talked to before, in which case it's useful to have some way to look up their key. However, those systems typically already have some kind of key lookup service that's a lot more efficient than a blockchain, and there's no good reason to have that data on a permanent public ledger; instead you'd just publish the relevant encryption keys on the key lookup system and sign them with the authentication key, in which case you can bundle the identity documents along with the signed object. Even if there wasn't an existing key lookup service, it would be better to store this data in some kind of high performance non-ledger system like IPFS, because you don't need the ledger to attest to it (recall that the data is self-validating). Moreover, this has better privacy properties than the ledger, which is inherently public.

To understand why I say that authenticating the identity document in the ledger is bad, you have to think about the problem of updating your keys. This is special because unlike other kinds of metadata you can't just sign it yourself.

How to update your keys #

If you never allow anyone to update their keys, then life is very simple because the key is self-authenticating and you can sign any updates to the identity documents with that key. However, there are good reasons to want to update keys. For instance, you might have started with a 2048-bit RSA key A and want to move to a 256-bit Elliptic Curve key B. The secure way to update a key in a system like this is to have the old key sign the new one. This creates a chain of identities, like so:

A → B

When you want to authenticate with identity A you then present something like:

The original identity which points to key A
The signature using A over key B

The semantics of this is that the RP accepts key B as representing identity A even though it's a totally different key. You then use B for whatever you would use A for, for instance to sign a document or authenticate your login.

This can obviously be extended to have B sign C, in which case you have:

A → B → C

When the RP receives something like this, it needs to verify the chain of assertions going back to A. Note that the original key need not be online: you just use it to make the delegation to the next key and then you may never need to use it again—indeed in a true replacement you may want to destroy A so it can't be stolen. I say may because you might have some kind of limited delegation in which you aren't actually replacing A with B but instead authorizing B to be used in some contexts, or for a limited time, as in TLS delegated credentials. For this reason^[5] you probably won't be signing the bare key but some data structure (e.g., a JSON document) that describes the semantics of the delegation. This is effectively the same structure as a WebPKI certificate chain, except for a single identity.
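The chain check the RP performs can be sketched roughly as follows. This is a minimal illustration, not any particular DID method: toy Lamport one-time signatures built from `hashlib` stand in for RSA or elliptic-curve signatures purely so the example runs on the standard library, the delegation signs the bare key rather than a richer document, and the names `keygen`, `delegate`, and `resolve` are my own.

```python
import hashlib
import secrets

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# Toy Lamport one-time signatures: a stand-in for RSA/EC signatures so the
# sketch needs only the standard library. Each key should sign only once.
def keygen():
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    pk = [(H(zero), H(one)) for zero, one in sk]
    return sk, pk

def encode(pk) -> bytes:
    return b"".join(zero + one for zero, one in pk)

def digest_bits(msg: bytes):
    d = H(msg)
    return [(d[i // 8] >> (i % 8)) & 1 for i in range(256)]

def sign(sk, msg: bytes):
    # Reveal one secret per bit of the message digest.
    return [sk[i][bit] for i, bit in enumerate(digest_bits(msg))]

def verify(pk, msg: bytes, sig) -> bool:
    if len(sig) != 256:
        return False
    return all(H(s) == pk[i][bit]
               for i, (s, bit) in enumerate(zip(sig, digest_bits(msg))))

def delegate(old_sk, new_pk):
    """The old key signs the new key (a real system would sign a richer document)."""
    return {"new_key": new_pk, "sig": sign(old_sk, encode(new_pk))}

def resolve(root_pk, chain):
    """Walk the chain from the original key; return the key that now speaks for it."""
    current = root_pk
    for link in chain:
        if not verify(current, encode(link["new_key"]), link["sig"]):
            raise ValueError("delegation not signed by the preceding key")
        current = link["new_key"]
    return current

sk_a, pk_a = keygen()
sk_b, pk_b = keygen()
sk_c, pk_c = keygen()
chain = [delegate(sk_a, pk_b), delegate(sk_b, pk_c)]  # A → B → C
assert resolve(pk_a, chain) == pk_c
```

Note that `resolve` never needs any private key, which is why the original key can stay offline (or be destroyed) once it has signed its successor.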

Compromise of the Original Key #

This all seems fine, but what happens if key A is compromised (I'll get to the compromise of key B in a moment)? The attacker can then mint a new key X which he knows and sign it with key A, thus creating:

A → X

This is just as good an assertion as the actual one over B, so the attacker has just taken over your identity. Of course, this doesn't invalidate your delegation to B; it's just that you and the attacker now jointly control identity A.

One way of analyzing this situation is that the source of the problem is that you didn't actually replace A with B, because A is still valid. By this way of thinking, what we need to do is memorialize that transaction, which is where ledgers/blockchains come in.^[6] The idea is that when you delegate to another key, you record the transaction on the ledger, which thus provides a (partial) ordering of operations. I'll leave the details of how a blockchain-type distributed ledger works for another day, but briefly a ledger is a cryptographic data structure that's constructed in such a way that everyone agrees on what events occurred and the order in which they happened. So, in this case, we would have a situation like this:

The RP would then consult the ledger (either directly, or by being provided with the relevant portions as part of the authentication transaction) and verify that each delegation that it was following was the first one chronologically. Note that the key must specify which ledger will be used to prevent confusion about which timeline is authoritative. Because the delegation A → B was recorded first, it is the valid one and A → X would be rejected. Note that unlike many blockchain applications, you actually need to verify that there are no future blocks that contain new delegations; otherwise you might miss that a key had been deprecated.
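Here is a rough sketch of the "first delegation wins" rule. It assumes the ledger is simply an ordered list of delegation records and that the signatures on each record are checked separately; the `Delegation` type and `current_key` function are hypothetical names for illustration only.

```python
from typing import NamedTuple

class Delegation(NamedTuple):
    old_key: str
    new_key: str

def current_key(root, ledger):
    """Scan the whole timeline, following only the first delegation made by
    whichever key is valid at that point."""
    current = root
    for entry in ledger:  # ledger order is the agreed-upon timeline
        if entry.old_key == current:
            current = entry.new_key
        # Anything signed by a key that is not currently valid, such as a
        # later A → X from a compromised and already replaced A, is ignored.
    return current

ledger = [
    Delegation("A", "B"),  # the legitimate rotation, recorded first
    Delegation("A", "X"),  # attacker's later delegation from compromised A
    Delegation("B", "C"),
]
assert current_key("A", ledger) == "C"
```

The loop deliberately runs to the end of the ledger rather than stopping at the first match: that is the "no future blocks" check, since a later entry may have replaced the key you think is current.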

It's important that the actual delegation signatures be recorded on the ledger and then checked by the RP (this seems to be a point of some confusion in existing designs). You do want the ledger to check the signatures on the delegation to avoid spam, but the critical service the ledger is providing is temporal ordering. It's not possible for even a compromised ledger to make a fake delegation unless the currently valid key has been compromised. Of course, if RPs don't check the delegation signatures then they're just trusting the ledger to behave correctly. More on this shortly.

Compromise of Current Keys #

However, if the currently valid key is compromised, the attacker can redelegate the key to themselves and there's nothing you can do in this system, because that delegation will be chronologically first and so your redelegation will be perceived as an attack. Some systems, such as KERI, attempt to address this by pre-committing to key K_{i+1} when delegating to key K_i. In the example above, when A was first registered it would come with a commitment to B in the form of a hash. Similarly, when B delegates to C, it publishes a hash of D, as below:

(A, H(B)) → (B, H(C)) → (C, H(D))

This provides protection against cryptographic attack under the assumption that the hash is irreversible: the attacker can break the current key but can't attack the next one because it only has the hash. However, it does not provide security against compromise of whatever device holds the next key, so it's only a partial solution, especially if users—as many users will—store all of their keys in one place.
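The commitment check itself is simple, as the sketch below shows. It covers only the hash pre-commitment, omits the delegation signatures, and is not the actual KERI format; the byte strings standing in for keys and the function name `commitments_hold` are assumptions for illustration.

```python
import hashlib

def H(key: bytes) -> str:
    return hashlib.sha256(key).hexdigest()

def commitments_hold(chain) -> bool:
    """chain is a list of (key, commitment_to_next_key) pairs in ledger order.
    Each new key must match the hash published by its predecessor."""
    for (_, commitment), (next_key, _) in zip(chain, chain[1:]):
        if H(next_key) != commitment:
            return False
    return True

# Placeholder byte strings stand in for real public keys.
A, B, C, D, X, Y = b"key-A", b"key-B", b"key-C", b"key-D", b"key-X", b"key-Y"

honest = [(A, H(B)), (B, H(C)), (C, H(D))]
assert commitments_hold(honest)

# An attacker who has broken A can sign anything with it, but X does not
# match the commitment H(B) that was published alongside A.
hijacked = [(A, H(B)), (X, H(Y))]
assert not commitments_hold(hijacked)
```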

Another approach is to have some sort of recovery key which can be used to override other transactions. Presumably that key is then kept in some super secure location. This key can then be used to recover your identity if the currently valid key is compromised. Note that this is semantically the same as a partial delegation to the new key which can then be revoked by the original key.

Signature Chain Verification #

We now have enough background to understand why I said above that we don't want the ledger to authenticate the identity documents to the RP: the validity of those documents is being defined by there being an unbroken chain of signatures from the original key, which is itself cryptographically bound to the identifier. If the RP doesn't check those signatures, then it's relying on correct behavior by the ledger, and you have no way of knowing if the nodes that added the latest entries actually went to the trouble of checking the signatures.

Worse yet, if it doesn't validate the correctness of the ledger (e.g., by authenticating the current state from multiple nodes), then it's just trusting whatever ledger node it queried. This is even worse than the situation with did:web because at least with did:web the server you are querying is nominally responsible for the identity. With a blockchain-based ledger you're just asking some random node you've never heard of.

If the RP is going to validate the signature chain anyway, then there's no security reason for the ledger to do so, though there may be a performance reason. It's probably useful if the ledger does some basic checking—especially before creating documents—in order to prevent DoS attacks on the ledger, but there may be other potential mechanisms for doing that, such as charging for ledger updates (this is what Bitcoin does).

Temporal Ordering #

What we do need the ledger to do, however, is guarantee the temporal ordering of events, because that's what prevents redelegation by the attacker in case of key compromise. Ideally, the RP would check the ledger for every transaction associated with a given identity and verify that each delegation was correctly constructed based on the chronology. However, as a practical matter this requires having access to a very large portion of the transactions on the ledger (naively, all of them!) which may run to tens of millions, so this presents a scaling problem.

In practice, it's common for clients to just trust that the ledger enforced consistency for a given transaction. For payment applications this means that the ledger accepted the payment transaction. In this case, it would presumably mean that the ledger had checked the chain of apparent delegations and gave you the latest valid document. In this case, it is necessary that the ledger verify the signature chain because otherwise an attacker could inject a bogus delegation, which is then sent to the client. As long as the client checks the signature itself, this won't cause the client to get the wrong key, but it will cause it to be unable to get the right key because it will get the bogus delegation and then reject it. Effectively, this is a DoS attack on the valid user.

Even so, it's probably better for the ledger node you are communicating with directly to do the checks on read rather than having the checks be done on write. The reason for this is extensibility: if ledger nodes check the signature chain on write/update, then you can't roll out a new signature algorithm until you are guaranteed that every ledger node that is checking accepts it, which precludes incremental deployment. By contrast, if checks are done on read, then full clients which have the whole ledger will be fine as long as they have the new algorithm, and even "light" nodes which don't have the whole ledger will be OK if they pick a ledger node which supports the new algorithm.

As described above, as long as the client verifies the signature chain, if the ledger cheats, then it can cause you to accept the wrong version of history but it can't cause you to accept the wrong key unless the keys are compromised.

Lost Keys #

Of course, none of this addresses the case where the user loses their keying material. The conventional response to this problem in the decentralized identity world is that users should make arrangements in advance, for instance by keeping their recovery key in a really safe place or maybe by sharing it with their friends via something like Shamir Secret Sharing. However, as I mentioned previously, we know that in practice many users do not do a good job of managing their keys, even when a lot is at stake (e.g., millions of dollars in Bitcoin), so while surely some users will in fact follow this kind of practice, many will just store all their keys in one place and may lose them.

As with blockchain-based name systems, if you want to have a system which lets you recover your identity even if you've lost all your keying material—for instance you dropped your phone in the toilet and you don't have a backup—you need some mechanism for recovery that ultimately depends on human discretion not technology.

The Bigger Picture #

To go back to the question I asked at the beginning, what is the ledger doing here?

The primary value proposition of these designs is, as in the passage I quoted above, that you're not dependent on others for your identity:

This is called “self-sovereign” identity because each person is now in control of their own identity—they are their own sovereign nation. People can control their own information and relationships. A person’s digital existence is now independent of any organization: no-one can take their identity away.

The technical feature that provides this property is not the ledger.^[7] Rather, it's that your identity is bound to—indeed, defined by—a cryptographic key pair. Similarly, if there are assertions bound to the key, as people seem to expect, then what makes that work is that those assertions include signatures over your identity (in this case the key). None of this requires any kind of ledger; you could just do it with did:key.

The main necessary function of a ledger in this kind of system is that it allows you to verifiably transfer control of an identity from one key to another in a way that is secure even if the initial key is later compromised. In practice, the ledger also seems to be used as a publication mechanism for identity information, but that's actually something that is better done by other mechanisms to the extent to which it's necessary at all. Publishing data in ledgers is super-expensive and so should be a last resort, not a first one.

Unfortunately, the ledger only provides a partial solution to recovering from key compromise and loss: if you lose all of your keys and/or the attacker gains control of them, then this is still unrecoverable without some mechanism external to the system that allows you to assign a new key to a given identity without any signature chain from the original key, which, of course, violates the value proposition stated above.

But once you have such a mechanism, then why not just use it all the time? What I mean here is to assign people human-readable identifiers (e.g., e-mail address or phone number) rather than random high-entropy ones and then have a mechanism to bind those identifiers to keys, a la the WebPKI or DNSSEC. If someone wants to change keys, you issue a new credential and invalidate the old one. This lets you avoid the bad ergonomics of key-based identities and the scaling (and privacy) issues of the ledger. My point here is not that whoever is empowered to issue those credentials isn't a weak point in the system; of course it is. But it's also a necessary one unless you're willing to accept having people be occasionally—or maybe not so occasionally—locked out of the system entirely.

By which I mean that there is an example that presumably you're supposed to imitate, but no actual specification for how to do it as far as I can tell. A number of the DID specs are like this. ↩︎
Yes, I know about revocation and OCSP, but I think this story mostly holds up in the face of OCSP stapling, CRLite, and CRLSets. ↩︎
The classic term here is "non-repudiation" but that comes with a lot of philosophical baggage. ↩︎
Ignore TLS session resumption for now. ↩︎
And for others, such as cross-protocol attacks. ↩︎
I owe this observation to Manu Sporny. ↩︎
Yes, there are systems where the ledger is used to establish the original identity in a FCFS fashion, like the various proposed DNS replacements, but that's not what I'm talking about here. ↩︎

Understanding Online Identity

2022-06-02T00:00:00Z

You often hear a lot about "identity" on the Internet, but in my experience, the situation tends to be pretty muddled. This post is my attempt to try to unpack a number of different concepts surrounding identity as well as some of the relevant technologies.

The most basic function that people think of when they think of identity is what might more properly be called authentication, which is to say proving that you are who you say you are. In typical applications, this means proving that you own/are associated with a specific identifier, whether it is an account name (e.g., ekr on Github), an e-mail address (e.g., ekr@rtfm.com), or a personal name ("Eric Rescorla").

This kind of identifier mapping is good enough for a wide variety of applications, but in a number of cases people also want to be able to prove other facts about themselves, such as that they are over 21, have a license to drive, or have a given address.

As an example of these concepts, consider a driver's license:

This driver's license contains two identifiers:

The driver's name: "Alexander J. Sample"
The driver's license number: I1234562

Authentication of the license holder is performed by matching the biometrics on the license (mostly the picture, but also the various listed characteristics such as sex, hair color, etc.) to the person in front of you.

The license also carries a number of other attributes that might be interesting, such as the date of birth, whether you're an organ donor, what driver's license class you hold, etc. The way that this all fits together is that you show the driver's license to the TSA agents, the cop who pulled you over, or your bartender. They compare the biometrics to your appearance and assuming they match, they know—or at least have reason to believe—that the identifier and the attributes apply to you.

Identity on the Internet #

The situation on the Internet is somewhat different: most sites don't really need your legal name and biometric authentication mechanisms don't translate well into mechanical verification systems. Instead, most services use a different metaphor: the account.

Driver's Licenses on the Internet #

It's actually worth a moment to think about why your driver's license isn't a useful form of identity on the Internet. The problem isn't that the information on the license isn't relevant, but rather that there's no really good way to use it for authentication: pretty much all of the information on the license is public, so anyone who has seen your license knows it and so it can't be used for authentication. In most contexts, there's no good way to check the biometrics (it's not like you had to do a video call to make a GMail account, though some systems do actually require this). Finally, although licenses do have anti-forgery mechanisms, they're mostly tied to the physical plastic and so don't really work in online contexts. This all adds up to it not being a very useful form of online authentication.

Accounts #

The basic idea behind an account is fairly simple. For each service you interact with, you have:

An identifier (i.e., an account ID).
Some authentication mechanism. Historically, this is a password (see my series on passwords for more on the deficiencies of passwords).

When you first interact with a given service, you register, creating an account. The service then assigns you an identifier (sometimes you are allowed to choose one, as long as it isn't already in use), collects your authentication information (password), and creates the account. From then on, you can log in to the account using your authenticator.
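As a rough sketch of what the service side of this looks like, here is a toy account store. The in-memory dictionary, the function names, and the PBKDF2 parameters are all illustrative assumptions rather than a description of how any particular site does it.

```python
import hashlib
import hmac
import os

accounts = {}  # identifier -> (salt, password hash); a real service uses a database

def hash_password(password: str, salt: bytes) -> bytes:
    # Iteration count chosen for illustration only.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

def register(identifier: str, password: str) -> None:
    if identifier in accounts:
        raise ValueError("identifier already in use")
    salt = os.urandom(16)
    accounts[identifier] = (salt, hash_password(password, salt))

def login(identifier: str, password: str) -> bool:
    record = accounts.get(identifier)
    if record is None:
        return False
    salt, stored = record
    # Constant-time comparison of the stored and recomputed hashes.
    return hmac.compare_digest(stored, hash_password(password, salt))

register("alice", "correct horse battery staple")
assert login("alice", "correct horse battery staple")
assert not login("alice", "guess")
```

The service stores only a salted hash, so it can check the authenticator without keeping the password itself.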

Example: Gmail #

For example, suppose you want to use Gmail. You go to the site and pick a username and a password, as shown below:

The username becomes your email address (with @gmail.com appended to the end) and your password becomes the authenticator.

But what about those other fields you enter, like your name? Even though you're providing your name to Google and it gets attached to your identity in some sense (e.g., it's in the From line of your email), Google isn't actually doing anything to verify that it's yours; if you want to call yourself Alan Smithee or Fuzzy Dunlop, that's your choice and Google will happily attach it to your account.

By contrast, Google is authoritative for your email address, so they know that's right: if they say it's postmaster@gmail.com then it is. If Google wants to take away your address and give it to someone else, then they can just do so.

Example: Amazon #

As another example, consider Amazon. You go to their site and click the right buttons and get the following:

Superficially this is just like the Gmail account creation dialog, with your email address acting as your account identifier, but there's actually one very important difference: Amazon doesn't just let you pick an account name; they ask you to provide a preexisting identifier in the form of either an email address or a mobile number, which they then use as your account identifier (i.e., username).

Amazon doesn't just trust that you have the e-mail address you claim to have: they check it as part of the account creation process. Moreover, in an important sense the email address is used as an authenticator because if you lose your password, Amazon can reset your account with your email address. That's not something that works with Gmail (if you lose your password you can't read your mail!), which is why they encourage you to set a recovery account with a separate address.

Of course, on the Web once you've logged in with whatever mechanism (passwords, SMS, etc.) you need to authenticate subsequent requests. This is done with a cookie. Cookies can be incredibly long-lived, so in some sense the cookie is the authenticator.

What I'm getting at here is that Amazon is bootstrapping their identities off of another identity system, in this case either the email system or the public switched telephone network (PSTN). They rely on those systems to maintain people's identities, ensure they are unique, and ultimately for authentication. A system like this really has two kinds of authenticators:

The password
The ability to receive a message at the indicated address.

It's not uncommon to see systems where you have to demonstrate both of these in order to log in; this is one form of multi-factor authentication (MFA). I've also seen systems which don't have passwords at all and just require you to demonstrate the ability to receive at a given address.
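The second authenticator, the ability to receive a message, usually boils down to a one-time code. The following is a minimal sketch: the `outbox` list stands in for actually sending email or SMS, and the function names and six-digit format are assumptions for illustration.

```python
import hmac
import secrets

pending = {}  # address -> code we are waiting to hear back
outbox = []   # stands in for the email/SMS delivery system

def send_code(address: str) -> None:
    code = f"{secrets.randbelow(10**6):06d}"
    pending[address] = code
    outbox.append((address, code))  # only the owner of the address sees this

def check_code(address: str, code: str) -> bool:
    expected = pending.pop(address, None)  # each code can be tried only once
    if expected is None:
        return False
    return hmac.compare_digest(expected.encode(), code.encode())

send_code("user@example.com")
_, delivered = outbox[-1]
assert check_code("user@example.com", delivered)
assert not check_code("user@example.com", delivered)  # already consumed
```

A site that requires both a password and a successful `check_code` is doing the MFA variant; one that requires only `check_code` is the passwordless variant.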

Federated Authentication #

In the example above, the Amazon account is bootstrapped off of your email or phone number, but once that's happened, you authenticate to Amazon directly using your password. In other words, Amazon has outsourced your identity to the e-mail/phone system but still controls authentication for itself. It's possible to go further, however, and outsource authentication as well. Consider, for example, the account creation interface for the popular sports social network Strava:

The "Use my email" option is basically the same as with Amazon, where they use your e-mail address as your identifier but thereafter use a password, but "Sign up with Google" (or Facebook or Apple) is different. In this case, you authenticate with Google (or Facebook or Apple) as well. The way this works is that if you already have an account with one of these big services they can act as an identity provider (IdP) which authenticates you to third parties. The technical details are fairly complicated (see OAuth and/or OpenID [Edited to add OpenID 2022-06-02]) but at a high level, what happens is that the service either (1) exposes an API called by the third party site (the technical term here is relying party (RP)) or (2) provides the client with a token [Edited to add tokens -- 2022-06-03] which it gives to the RP. In either case, this allows the third party site to:

Verify that the browser contacting it is associated with a particular account on the IdP
Learn some details about that account.
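To make the token variant a bit more concrete, here is a heavily simplified sketch. It is not OAuth or OpenID Connect: real deployments typically use tokens signed with the IdP's private key, whereas this sketch assumes a key shared pairwise between the IdP and the RP, and the claim names and function names are just illustrative.

```python
import base64
import hashlib
import hmac
import json
import time

def _tag(key: bytes, body: bytes) -> bytes:
    return hmac.new(key, body, hashlib.sha256).hexdigest().encode()

def issue_token(shared_key, account, rp, lifetime=300, now=None):
    """IdP side: bind an account to one RP for a short time."""
    now = time.time() if now is None else now
    claims = {"sub": account, "aud": rp, "exp": now + lifetime}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    return (body + b"." + _tag(shared_key, body)).decode()

def verify_token(shared_key, token, rp, now=None):
    """RP side: return the claims if the token is authentic, for us, and fresh."""
    now = time.time() if now is None else now
    body, _, tag = token.encode().rpartition(b".")
    if not hmac.compare_digest(tag, _tag(shared_key, body)):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["aud"] != rp or claims["exp"] < now:
        return None
    return claims

key = b"hypothetical key shared by the IdP and this RP"
token = issue_token(key, "ekr@example.com", "rp.example", now=1000)
assert verify_token(key, token, "rp.example", now=1100)["sub"] == "ekr@example.com"
assert verify_token(key, token, "other.example", now=1100) is None  # wrong RP
assert verify_token(key, token, "rp.example", now=2000) is None     # expired
```

The browser just carries the token from the IdP to the RP; it never needs to see the key, and the RP never sees the user's password.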

When you first register with the RP, they will typically bounce you to the IdP so you can approve information sharing with the RP and then from then on, they can talk to the IdP without further explicit consent. For instance, here's what Google shares with Strava:

This mechanism, generally referred to as "federated authentication",^[1] has a number of important advantages from the perspective of the RP. First, it avoids needing to create your own credential management system: you don't need to check password quality, store passwords (and worry about the password hashes leaking), or deal with users losing their passwords and needing to reset them (this is surprisingly common!). In addition, it streamlines the user account creation process, by eliminating the need to create a password (or often an account name) as well as the need to process the email verification from the RP, which can be a place that user account creation can stall, causing you to lose potential users.

Finally, the IdP may also offer APIs that give the RP additional capabilities, such as learning more information about the user's account (for instance, your name and your social contacts) or even to interact with the IdP on the user's behalf. For instance, it's common for developer services sites like CircleCI to use GitHub authentication and then ask for fairly broad permissions such as to read from and write to your git repositories. This allows them to integrate tightly with your developer experience, but of course without having your password.

As with a direct 1:1 authentication system like a password, sites will generally persist the user's information in a cookie. However, if the user clears their history, moves to a new computer, or the cookie just expires, instead of asking for the user's password the site will re-validate the user with the IdP.

Enterprise Single Sign-On (SSO) #

The previous examples were largely for end-users, but suppose that you operate a company and want to outsource employee services such as payroll or expenses. These services are now frequently packaged as what's called Software as a Service (SaaS) which is a fancy name for "we have a Web site that your employees use".

Obviously, your users need to authenticate to these SaaS services, and in principle you could have them create an account on each of these services, have the service check their e-mail addresses, and move forward. However, this has a number of obvious drawbacks, including:

Increased friction for each user, especially if you have a lot of these services, which is not at all uncommon.
Lack of unified access control policies. For instance, if you want to require 2FA, you have to reach out to every SaaS provider you use rather than being able to enforce this centrally.
Lack of control. For instance, if a user quits, how do you notify each SaaS provider to terminate their account?

These drawbacks can be addressed by using essentially the same technologies as described in the previous section. In this case, the company (or more likely some third party like Auth0 or Okta [Edited to add Okta -- 2022-06-03]) acts as the IdP, with each of the SaaS providers acting as the RP.^[2] When an employee wants to use one of your SaaS providers (e.g., to do their expenses), they first authenticate to your IdP and then use the IdP to authenticate to the provider. The IdP login can be long-lived, allowing the user to authenticate to multiple RPs without logging in repeatedly (hence the "single sign-on" name). This kind of system also allows the company to track logins, manage access, and disable/suspend accounts.^[3]

Real-World Identities #

You may have noticed that none of the above does much about your real world identity. As a general matter, sites just take your assertions about your identity at face value, allowing you to use whatever name you want, as well as to claim to be any age you want etc. Some social networks try to require you to use your "real name" (see, for instance, Facebook's real name policy), but not too much hangs on this and they generally don't try super hard unless you claim to be someone famous or your name looks fake (though, as the link above indicates, "looks fake" is a subjective standard and lots of people have names that someone—or some algorithm—at Facebook might think were fake.)

In some cases, sites will make an attempt to actually verify your name, but the mechanisms are often kind of weak. For instance, in order to get a Twitter "blue Verified badge" you can send Twitter a photo of your driver's license. This isn't nothing, but it's also not at all difficult to photoshop yourself a fake driver's license, given that it doesn't have to pass much scrutiny and the anti-counterfeiting mechanisms such as holograms and the like don't work through the Internet.

There are a few situations in which a service will attempt to create a stronger binding between your legal identity and your account, typically where money is involved. For instance, you might need to provide your social security number, account number, mother's maiden name, your ATM PIN, or demonstrate that you know the amounts of some recent transactions. Often, these mechanisms work by leveraging some preexisting relationship (account) you have with the service and then linking your online account to that preexisting account, so it's not like they are trying to authenticate someone they have never heard of.

What's wrong with this picture? #

As noted above, the ergonomics of having to make an account on every new system are fairly bad: it requires the user to have a large number of passwords, which means more opportunities to use a bad password or to lose your password and have to recover. There are some opportunities for improvement around the margin (e.g., WebAuthn instead of passwords for authentication, better form fill-in so users don't have to type their name over and over, etc.), but at the end of the day, there's only so much you can do.

On the other hand, the existing federated authentication mechanisms have a number of pretty serious drawbacks.^[4]

Centralized Control #

The first big problem with the existing federated identity systems is that they inherently tie you to a small number of centralized identity systems. First, for RP A to accept an identity from IdP B, A needs to actually make some kind of arrangement with B. This is typically pretty lightweight, but probably involves establishing some kind of pairwise API key. Second, because A has no way of knowing which IdPs a user has accounts with, it has to offer the user a separate button for each one, like so:

Fixing the NASCAR Problem #

The reason that the NASCAR problem is hard to fix is that these federated identity systems use existing Web technologies and there's no way with those technologies to know which IdPs the client has an account with, so it just has to show all the logos. If there were such a way then we would have a privacy problem, because then you could use the set of IdPs the client had an account with to track them, or, worse yet, use the same mechanism to encode the user's identity by creating a pattern of account/no-account states with various sites you controlled.
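To see why the account/no-account pattern is so dangerous, consider the following sketch. It assumes a hypothetical capability that browsers deliberately do not offer (asking "does this client have an account with site S?"), and the site names and function names are made up.

```python
def accounts_for(user_id: int, sites):
    """Tracker side: create accounts only on the sites matching the 1 bits of the ID."""
    return {site for bit, site in enumerate(sites) if (user_id >> bit) & 1}

def recover_user_id(visible_accounts, sites) -> int:
    """Any page able to query account status on each site reads the ID back."""
    return sum(1 << bit for bit, site in enumerate(sites) if site in visible_accounts)

# Sixteen tracker-controlled "IdPs" are enough to tell apart 65,536 users.
sites = [f"idp{i}.example" for i in range(16)]
pattern = accounts_for(48879, sites)
assert recover_user_id(pattern, sites) == 48879
```

Each extra site the tracker controls doubles the number of users it can distinguish, so even a modest number of sites yields a unique identifier.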

This is sometimes called the NASCAR problem because it resembles the various advertiser logos you see on NASCAR cars. This of course contributes to a lousy user experience but also discourages the site from adding additional IdPs, because each one adds to user confusion.

When put together, existing federated authentication systems provide a strong incentive to only accept identities from the biggest IdPs, which promotes centralization and makes it hard for new providers to enter the market.

Privacy #

In general, the privacy properties of existing federated authentication systems are quite bad. Every time you log into site A with IdP B, B learns about it. This allows your IdP to track you around the Internet whenever you use it to log in. This is made worse by the high level of centralization in two ways. First, because it is hard to start a new IdP it is hard for users to find one that has better privacy, whether in terms of better policies or better technology. Second, because there are a small number of IdPs, this creates concentration of this tracking information. In addition, many of the existing IdPs already do a lot of Web tracking via other mechanisms.

Another privacy problem is that IdPs typically provide the same identifier (e.g., your e-mail address) to each RP. Sites can use these identifiers to track users (see this post by Steve Englehardt on this topic). This is actually technically soluble by having the IdP give a new identifier to each site, but this is not general practice, in part because sites want the user's true identifier so that they can contact the user. This problem also exists with conventional e-mail/password systems but can be addressed with e-mail masking systems like Firefox Relay or Apple's Private Email Relay.
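One plausible way for an IdP to hand out a different identifier to each site is to derive it from a secret only the IdP knows. This is a sketch of the general idea rather than a description of what any existing IdP does; the function name and the 128-bit truncation are assumptions.

```python
import hashlib
import hmac

def pairwise_id(idp_secret: bytes, account: str, rp: str) -> str:
    """Stable for a given (account, RP) pair, but unlinkable across RPs
    by anyone who does not hold the IdP's secret."""
    msg = rp.encode() + b"\x00" + account.encode()
    return hmac.new(idp_secret, msg, hashlib.sha256).hexdigest()[:32]

secret = b"hypothetical IdP-wide secret"
shop = pairwise_id(secret, "ekr@example.com", "shop.example")
news = pairwise_id(secret, "ekr@example.com", "news.example")

assert shop != news                                                     # can't be joined
assert shop == pairwise_id(secret, "ekr@example.com", "shop.example")  # stable per site
```

Two colluding sites comparing their user tables see unrelated values, though of course neither of them can use such an identifier to send you mail.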

Improving Federated Identity #

There has been a fair amount of work over the years on building federated identity systems with better properties.

End-User Certificates #

In the early days of the Web—well before things like Google Login existed—a lot of people thought that users would authenticate with certificates: every user would be issued a certificate with their identity, much like Web sites have certificates that attest to theirs. Presumably these certificates would have the user's e-mail address and maybe their name. They would then be able to use TLS certificate-based client authentication to authenticate to every server. This has much the same identity properties as federated identity, but has better privacy properties because the CA doesn't need to be involved in the authentication transaction and so doesn't learn what sites you are going to.

Client certificates also potentially have better centralization properties.^[5] In particular, client certificates have the potential to fix the NASCAR problem because the client knows which certificates you have, so the site doesn't need to display the logos of every CA you might have a certificate with.

Needless to say, this never happened. TLS client authentication is in use in some settings, typically enterprises which issue their own certificates, but it never really became a plausible competitor to passwords, and then federated authentication came along. There are quite a number of reasons for the failure of client certificates, but any list would probably include:

The lack of certificate authorities which would issue convenient free client certificates (this was true for server certificates too until Let's Encrypt).
The TLS interaction is pretty bad in a number of ways, such as playing badly with TLS intermediaries such as CDNs and, prior to TLS 1.3, leaking the client's certificate if you did authentication at the beginning of the connection.
A truly hideous UI. I've shown the Edge UI below but all of the browser client auth UIs are pretty bad.

[Source: Eric Lawrence]

In addition, because you use the same certificate for every site, it can be used to track you across sites, which is obviously a privacy problem, though, as noted above, this is not a property unique to client certificates.

Persona and FedCM #

Although client certificates never really took off, they have a number of good properties and are a natural starting point for trying to improve the situation.

Mozilla took a fairly serious run at this some years back with Persona. Effectively Persona worked by making every site its own certificate authority; they could then issue certificates to browsers which used them for authentication, so for instance example.com could issue certificates for addresses ending in @example.com. The browser would then use those certificates to sign into sites. This was intended to have the benefits of certificate-based authentication but be easier to deploy and more compatible with Web technologies. One very important property was that because the browser could use the certificate to authenticate to any server, it didn't allow the IdP to track the user.

The obvious way to implement Persona was with browser support: when the user creates an account with an IdP, the browser would keep track of it. When the user wants to log into a site, the site calls a browser API, which causes the browser to present a list of acceptable IdPs which the user can choose from, thus avoiding the NASCAR problem and giving the user more direct control over how their information is being used. In practice, the initial Persona deployments depended on a trusted web site to help mediate this interaction, thus avoiding the need to modify browsers.

Persona ultimately failed to gain much market traction and Mozilla stopped working on it, but it inspired other designs, such as Chrome's Federated Credential Management API (FedCM). FedCM is a more modest increment on the current federated authentication model intended largely to make federated identity continue to work in environments where third party cookies have been removed, but also to have some additional privacy benefits. Unlike Persona, it doesn't really address centralization, though it's possible that it could be extended to do so.

FedCM is relatively new and so hasn't seen any real deployment. It's an open question whether it will get any deployment or whether any of the big IdPs such as Google or Facebook will support it (see deployment below).

Other Cryptographic Identity Systems #

Recently there has been increasing interest in the use of cryptographic identity systems that are often called "self-sovereign" or "decentralized" identity. Here's how Sovrin describes this:

Everyone (including businesses and IoT) has different relationships or unique sets of identifying information. This information could be things like birth date, citizenship, university degrees, or business licenses. In the physical world, these are represented as cards and certificates that are held by the identity holder in their wallet or safe place like a safety deposit box, and are presented when the person needs to prove their identity or something about their identity.

Self-sovereign identity (SSI) brings the same freedoms and personal autonomy to the internet in a safe and trustworthy system of identity management. SSI means the individual (or organization) manages the elements that make up their identity and controls access to those credentials– digitally. With SSI, the power to control personal data resides with the individual, and not an administrative third party granting or tracking access to these credentials.

The SSI identity system gives you the ability to use your digital wallet and authenticate your own identity using the credentials you have been issued. You no longer have to give up control of personal information to dozens of databases each time you want to access new goods and services, with the risk of your identity being stolen by hackers.

This is called “self-sovereign” identity because each person is now in control of their own identity—they are their own sovereign nation. People can control their own information and relationships. A person’s digital existence is now independent of any organization: no-one can take their identity away.

Controlling your own identity sounds good, but it's remarkably difficult to get a clear picture of precisely what people have in mind here. For example, in an early post on the topic, Christopher Allen writes:

With all that said, what is self-sovereign identity exactly? The truth is that there’s no consensus. As much as anything, this article is intended to begin a dialogue on that topic. However, I wish to offer a starting position.

Rather than try to offer a definition, the rest of this section instead focuses on what's technically possible in this space.

In general, the starting point for these systems is to root identity in a cryptographic key. I.e., I create a public/private key pair and my public key then becomes my identity. This has the convenient property that it's self-authenticating: I don't need to use a password or any other authenticator because I can prove my identity just by signing a challenge with my private key. In principle I could just create an account by giving you my public key and having that be the account ID.^[6]
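As a rough illustration of the "sign a challenge" flow, here is a sketch that uses a Lamport one-time signature built from SHA-256. That choice is purely so the example runs with the standard library; a real system would use an ordinary signature scheme such as Ed25519, and a Lamport key must never sign more than one message. All function names here are made up for the example.

```python
import hashlib
import secrets

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def keygen():
    # Private key: 256 pairs of random values. Public key: their hashes.
    sk = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    pk = [(H(a), H(b)) for a, b in sk]
    return sk, pk

def message_bits(msg: bytes):
    digest = H(msg)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]

def sign(sk, msg: bytes):
    # Reveal one secret from each pair, selected by the bits of H(msg).
    return [sk[i][b] for i, b in enumerate(message_bits(msg))]

def verify(pk, msg: bytes, sig) -> bool:
    if len(sig) != 256:
        return False
    return all(H(s) == pk[i][b]
               for (i, b), s in zip(enumerate(message_bits(msg)), sig))

def account_id(pk) -> str:
    # The account identifier is just a hash of the public key.
    return H(b"".join(a + b for a, b in pk)).hex()

# The user creates a key pair; the site stores only the public key.
sk, pk = keygen()
user_id = account_id(pk)

# Login: the site sends a fresh random challenge and the user signs it.
challenge = secrets.token_bytes(32)
response = sign(sk, challenge)
logged_in = verify(pk, challenge, response)
```

There is no password anywhere in this exchange: possession of the private key is the whole of the authentication.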

Attributes #

Unfortunately, as-is this system also has a number of significant drawbacks. First, as we've seen throughout this post, sites don't want to address users through opaque identifiers, they want to attach them to some means of contacting them, like an e-mail address or a phone number. This is partly because sites want to actually be able to contact their users and—at least at present—it's not really practical to message users via their public key pair and partly because it lets them deal with exceptional cases like account recovery.

Most of the decentralized identity systems I have seen proposed have some mechanism to attach more meaningful attributes to a given identity. The simplest version is effectively a certificate, i.e., a signed statement that a given public key belongs to someone with the following properties (e-mail, name, date of birth, etc.). A number of these systems use fancy cryptography to allow for selectively disclosing pieces of these attributes (e.g., "I am over 21" but not my birthday). However, it's a bit unclear who would do this signing; for instance, who would you trust to attest to my personal name? The government? Which government? How about my email address?
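Selective disclosure doesn't necessarily need very fancy cryptography. The sketch below uses salted hash commitments, which is one of the simpler approaches and not the design of any specific system mentioned here. The issuer's signature over the digest list is deliberately omitted, and the function and attribute names are hypothetical.

```python
import hashlib
import json
import secrets

def commit(attributes: dict):
    """Issuer side: produce one salted digest per attribute.

    In a real system the issuer would sign the digest list together with
    the holder's public key; that signing step is left out of this sketch.
    """
    disclosures = {}
    digests = []
    for name, value in attributes.items():
        salt = secrets.token_hex(16)  # prevents guessing low-entropy values
        disclosure = json.dumps([salt, name, value])
        disclosures[name] = disclosure
        digests.append(hashlib.sha256(disclosure.encode()).hexdigest())
    return sorted(digests), disclosures

def check(disclosure: str, signed_digests):
    """Verifier side: accept a disclosure only if its digest was committed."""
    digest = hashlib.sha256(disclosure.encode()).hexdigest()
    if digest not in signed_digests:
        return None
    _salt, name, value = json.loads(disclosure)
    return name, value

attrs = {"email": "alice@example.com", "birth_date": "1990-01-01", "over_21": True}
digests, disclosures = commit(attrs)

# The holder reveals only the over_21 attribute, not the birth date.
revealed = check(disclosures["over_21"], digests)
```

The verifier learns that the issuer attested to "over_21" but sees only opaque digests for the other attributes. None of this answers the harder question raised above, which is who the issuer would be.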

Key Recovery #

The second problem with this kind of system is that if you lose your private key you lose access to your account—or more likely, all your accounts. There are a lot of proposed mechanisms for addressing private key loss (e.g., secret share your key with 10 of your closest friends) but you can be sure that plenty of people won't do them. Long painful experience shows that users lose their credentials quite frequently, don't do much to plan ahead for that event, and any system that doesn't recover gracefully if the user drops their phone in the toilet is going to have a lot of dissatisfied customers.
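For reference, the "secret share your key with your friends" mechanism is usually something like Shamir secret sharing. Here is a minimal sketch, assuming a 3-of-10 threshold (my choice for illustration) and a key that fits in the field defined by the Mersenne prime 2^127 - 1. It is not hardened for real use.

```python
import secrets

P = 2**127 - 1  # a Mersenne prime; all arithmetic is done modulo P

def split(secret: int, n: int, k: int):
    """Split secret into n shares such that any k of them recover it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(k - 1)]

    def f(x: int) -> int:
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % P
        return y

    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0 recovers the constant term."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

private_key = secrets.randbelow(P)
shares = split(private_key, n=10, k=3)   # one share per friend
restored = recover([shares[1], shares[4], shares[8]])
```

The math is the easy part. The hard part is the one described above: getting users to actually set this up before they lose their phone, and getting enough friends to still have their shares afterwards.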

Of course, you can always create a new key and then get the same attributes attached to it—and potentially detached from the old key. Depending on the precise structure of the system, this may or may not be technically possible (for instance, you could have a system where each e-mail address was registered on the blockchain and nobody could ever re-register it). However, as we saw with blockchain-based DNS systems, the problem becomes that the same mechanisms which are designed to give you complete control of your identities independent of third parties also make it difficult for those third parties to help you recover your identity if you lose your keys. Obviously, this makes a lot more sense for attributes which aren't unique, such as your age, but at the end of the day you're still at the mercy of the people attesting to your attributes, and those, not your key, become your true identity.

Independence #

At the end of the day, I'm not sure how much these systems really deliver on the independence value proposition of self-sovereignty that I quoted above. The problem here is that there are two kinds of identities in play:

A trivial form of identity which is basically "I am the person with this public key".
A deeper form of identity which ties that key pair to other attributes which people actually care about, such as your name or e-mail address.

The first type of identity is indeed independent in the sense that it's hard to take away from you and you don't need anyone's help to exercise it. The second, however, depends on a whole infrastructure of third parties who are busily attesting to various properties that are then somehow attached to your key pair. And for the system to function properly, you need them to do that attestation not just once but regularly. This statement may come as a surprise, but in real identity systems you generally need some way to revoke assertions when you discover (for instance) that people's keys have been compromised or that the assertion was issued incorrectly. You need to be able to do this without the cooperation of the subject, and so that means that in practice the attesting entity needs to be involved pretty regularly and so you're not really able to exercise those forms of identity independently from them.

This is not to say that you can't use cryptography to build identity systems that will have better properties than our current third-party identity systems, especially in the area of privacy and tracking by the IdPs. However, it seems to me that it's mostly the decoupling of the identity assertion from the IdP—as in Persona—that provides that value, not having them be decentralized or rooted in an identity tied to a specific cryptographic key.

Deployment #

A major challenge with any new identity system is getting broad-scale deployment. Specifically, it's not worth it for RPs to support a new IdP unless that IdP has a lot of existing users. Conversely, it's not worth users creating accounts with an IdP unless a lot of RPs accept that IdP. This deadlock makes it hard to get going with something new, and it should come as no surprise that all the major public IdP systems are associated with services like Google, Facebook, or Twitter which already have large user bases of people who use the service for some other reason. This allows them to easily offer a valuable authentication service and makes it worthwhile for RPs to accept them. Any new identity system will somehow have to get past this.

Right now, this dynamic makes it difficult for a new IdP to enter the market even if its APIs are basically identical to an existing IdP, both because the existing systems tend to need prior arrangement and because the NASCAR problem makes it expensive for RPs to support a new IdP. However, this need not be the case: it's possible to design an identity protocol which works with any IdP without prearrangement—indeed Persona was such a protocol—but in order for that to get off the ground you'd still need some large IdP to support it in order to bootstrap RP support. For obvious reasons, that kind of interoperability is not really in the interest of existing IdPs, and most of the proposals I have seen for improving the situation don't come from IdPs.

The same basic situation applies to cryptographic identity systems. It takes extra work on the part of the RP to support such a system and that work is hard to justify if there's no additional benefit, either in terms of getting a lot of users that you couldn't get before, or in terms of some new capability that you can get for a lot of existing users (like learning information you couldn't learn before).

It's important to recognize that this dynamic applies even if the new systems are better for users, because the users can only really choose between the systems supported by the RPs. For instance, if you as a user use some new identity system X that has much better privacy, but the site you want to go to only supports Google Login, you can either use Google Login or not, but you can't force it to use X. Once an IdP is well established and widely supported then users choosing it has some impact at the margin, but it's hard to make a system take off through user choice alone.

Summary #

Identity on the Internet is a difficult problem. Having to make an individual account for each site is clearly bad. On the other hand, the combination of a high level of centralization and a low level of privacy provided by third-party authentication systems is also not great. However, the network effect dynamics of identity systems make it very hard to deploy something new without the cooperation of some system that has a lot of users, which is to say the services who are benefiting from the existing system. For that reason, whenever someone proposes deploying a new identity system, my first question is "who is going to provide the identities and how many users do they have already?"

The terminology here is a bit confusing. For instance some people draw a distinction between "delegated" identity systems in which the RP is outsourcing identity to a given IdP and ones in which the RP can use any IdP. In practice, it seems to me that most of the deployed RPs allow a small number of IdPs but not any IdP. To some extent there is a policy decision about which IdPs to support, but as described in this post, it's also the case that some technological approaches are more suited to allowing an arbitrary number of IdPs than others. My sense is "federated" is the more common term, so I'm using that here. [2022-06-03] ↩︎
In the third party case, the third party would somehow hook into your identity system so it could authenticate users. ↩︎
Because of cookies, this doesn't necessarily happen instantaneously, but you can configure things so that the RP requires the user to re-authenticate frequently, thus giving the IdP a chance to say that the user's account is suspended. ↩︎
I'm largely excluding enterprise SSO systems, as they serve a different purpose, and while in my experience they're a bit clunky, it's more just generic software kludginess than it is architectural/ecosystem issues. ↩︎
Given that roughly half the Web certificates in the world are issued by Let's Encrypt, we shouldn't get too optimistic about decentralization in the certificate market. ↩︎
We actually do see the use of public keys for authentication in practice, but usually in the form of attaching a public key to an existing account, rather than using it as the account identifier. ↩︎

Notes on Multiple Encryption and Content Filtering

2022-05-22T00:00:00Z

As I mentioned in my post on EU's proposed CSAM regulation, any content filtering system has to worry about nonconforming clients which are trying to evade filtering. One obvious approach is to lie about message contents or the output of filtering algorithms. Another method of nonconformance that is often proposed is multiple encryption, in which you use an ordinary messaging system like WhatsApp or iMessage, but before you send messages you first encrypt them yourself, so that even if the main messaging system were broken, your data would still be secure.

Why not just use a different system? #

As noted in the Wikipedia page I linked to above, one reason to do multiple encryption is just to provide defense in depth in case the outer system is broken, but in this case, we are assuming that the outer system is broken because it is subject to some detection/monitoring requirement, so it's not adding much security value. It's not that hard to build your own messaging system, so why not just use one that isn't being monitored, for instance because it's too small to be subject to regulations, is located outside of the relevant jurisdiction, or has just decided not to comply?

The most obvious reason for using a common system is to conceal your activities: if most people use a messaging system that is subject to monitoring and you choose to use one that is not, that's a potential signal that you really want to hide and so are worth investigating in some other fashion.^[1] This is especially true if you are using a program that is explicitly associated with an activity that the authorities want to investigate as with something like Mujahedeen Secrets. Moreover, if you have to run your own messaging servers, then that's a point of attack, which you don't have if you encrypt messages and just send them over WhatsApp.

Detection and Steganography #

One obvious problem with multiple encryption is that the messaging system—which, recall, we assume is compromised—can just change their filtering algorithms to detect your inner encrypted messages and block or report them. How effective this is depends on precisely how the monitoring is done. At a high level, there are two main possibilities:

Targeted monitoring in which communications are generally not monitored but the authorities can target specific people or messages for monitoring. This is sometimes referred to as "exceptional access".
Continuous monitoring in which much or all of the content is scanned (this is what the EU regulation seems to contemplate).

In an exceptional access regime, because communications are generally encrypted and therefore can't be routinely scanned, your use of multiple encryption won't ordinarily be detected. Of course, if you are one of the people who is subject to surveillance, then that will be detected, but then all that is revealed is that you are using an inner layer of encryption, which may look suspicious, but then you wouldn't (at least in theory) be subject to exceptional access unless you were already suspected. It may even not result in your messages being blocked because law enforcement and intelligence agencies often want surveillance to be secret, and blocking your messages would reveal that they had been decrypting them.

By contrast, in a continuous monitoring regime, most if not all messages will be scanned and so just encrypting will be easily detected and can be blocked. This blocking doesn't reveal anything useful to the people using inner encryption because the fact of monitoring isn't a secret.

This doesn't mean that it's not possible to multiply encrypt in these situations, but it does mean that you have to do more than just encrypt; you need to have the encrypted data look like ordinary messages. There has been a fair amount of work on what's called steganography, which involves hiding messages in other messages. For instance, one might hide the true message in the first word of each line, like so:

[Source: Yahoo News, original by Sairam Gudiseva]

There are a lot of possible techniques here, such as hiding data in the low order bits of images or audio files. In general, anywhere that there is room for variation there is room to conceal data. The rise of machine learning techniques for generating content (e.g., GPT-3) also makes it easy to generate new plausible content which you can then hide your message in, as opposed to requiring you to take some existing content and tweak it (thus making it susceptible to detection based on comparing it to the original template).
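Here is a minimal sketch of the low-order-bit technique, operating on a raw buffer of 8-bit pixel values rather than a real image file (parsing an actual image format is out of scope, and the function names are made up). In practice the message would be encrypted before embedding.

```python
def embed(pixels: bytes, message: bytes) -> bytes:
    """Hide message in the least significant bit of successive pixel bytes."""
    bits = [(byte >> (7 - i)) & 1 for byte in message for i in range(8)]
    if len(bits) > len(pixels):
        raise ValueError("cover data is too small for this message")
    out = bytearray(pixels)
    for idx, bit in enumerate(bits):
        out[idx] = (out[idx] & 0xFE) | bit  # overwrite only the low bit
    return bytes(out)

def extract(pixels: bytes, length: int) -> bytes:
    """Read back `length` bytes from the low bits of the pixel bytes."""
    out = bytearray()
    for n in range(length):
        byte = 0
        for i in range(8):
            byte = (byte << 1) | (pixels[n * 8 + i] & 1)
        out.append(byte)
    return bytes(out)

cover = bytes(range(256))          # stand-in for grayscale pixel data
secret = b"meet at noon"
stego = embed(cover, secret)
recovered = extract(stego, len(secret))
largest_change = max(abs(a - b) for a, b in zip(cover, stego))
```

Each pixel value changes by at most 1, which is invisible to a person looking at the image. It is not necessarily invisible to a statistical detector, which is where the arms race comes in.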

Steganography has seen less work than other areas of communications security, so if this kind of thing sees wide use it will probably be a bit of an arms race for a while between concealment and detection, but I would expect concealment to win most of the time, just because there is already so much natural variation in messages and so many ways to conceal information. False positives are even more of a problem here because, unlike with CSAM, it won't really be possible to manually determine whether something is steganography or not and so you're just left blocking a bunch of users.

Key Management #

If you're going to encrypt data, you need to have encryption keys that aren't known to the attacker, otherwise they will just try to decrypt everything that goes by with each key and see what works (this is known as "trial decryption"). Naively, this involves setting up a whole new identity system, as you're effectively running your own messaging system on top of someone else's (see here for a bit on what this involves) which is really a pain, but actually I think you could get a lot of value with much less.

More on Active Attacks #

Suppose that the multiple encryption system works by embedding DH keys in the low order bits of specific pixels in each image. When Alice and Bob first exchange messages, an active attacker could just stomp them with its own bits, which would result in either (1) establishing a pair of keys with Alice and Bob or (2) establishing what is apparently a pair of shared keys but is actually nothing (we could in principle have some kind of error check but obviously we don't want to do that because it makes inner encryption easy to detect). The attacker then looks for the first message that should be encrypted and tries to decrypt it: if it works, then multiple encryption was probably in use; if it's garbage, then probably not.

But now what happens if there is another kind of multiple encryption which encodes a different kind of key in the same bits? The service can only try one of these, and if they get it wrong, then people can't establish keys, which they might notice, at which point word gets out that they are mounting active attacks. Similarly, if there is any method for double-checking the established keys (e.g., something like Signal's "safety numbers") then this will be quickly detected.

The basic idea would be to just do unauthenticated key establishment over the existing messaging system. What this means is that you use the same cryptographic protocols that you would use to set up keys (e.g., Diffie-Hellman) but you don't bother to authenticate the other side. This is much technically easier because you don't need an identity system at all; you're just relying on the identities provided by the existing messaging system you are running on top of (another good reason to use an existing messaging system rather than building your own). One could also imagine something intermediate where people publish their keys on Facebook or Twitter.

Of course, unauthenticated encryption leaves you open to active attack by the messaging system where it tries to establish its own keys with each side, but this kind of attack is going to be a lot more work than just passively monitoring each message, and they'll have to do it for every potential kind of inner encryption and for every pair of users. Moreover, this inherently involves damaging the messages, which is something that is likely to get noticed quite quickly if anybody bothers to check. So, while you're potentially vulnerable to a very dedicated attacker, in practice this would give you a lot of security.
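To see both the convenience and the weakness, here is a toy sketch of unauthenticated Diffie-Hellman followed by the active attack. The group (integers modulo the prime 2^255 - 19 with base 2) is chosen only to keep the example short and should not be treated as a vetted parameter choice; a real implementation would use something like X25519 from a cryptographic library. "Mallory" stands in for the messaging service.

```python
import hashlib
import secrets

P = 2**255 - 19   # a well-known prime; toy parameters, not a recommendation
G = 2

def keypair():
    priv = secrets.randbelow(P - 2) + 1
    return priv, pow(G, priv, P)

def shared_key(priv: int, peer_pub: int) -> bytes:
    secret = pow(peer_pub, priv, P)
    return hashlib.sha256(secret.to_bytes(32, "big")).digest()

# Honest run: the public values travel over the existing messaging system.
a_priv, a_pub = keypair()
b_priv, b_pub = keypair()
alice_key = shared_key(a_priv, b_pub)
bob_key = shared_key(b_priv, a_pub)          # same value as alice_key

# Active attack: the service swaps in its own public value in each direction.
m_priv, m_pub = keypair()
alice_key_attacked = shared_key(a_priv, m_pub)   # Alice thinks this is Bob
bob_key_attacked = shared_key(b_priv, m_pub)     # Bob thinks this is Alice
mallory_with_alice = shared_key(m_priv, a_pub)
mallory_with_bob = shared_key(m_priv, b_pub)
```

In the attacked run the service holds a key with each side and can decrypt and re-encrypt everything. But it has to do this actively, for every pair of users and every inner encryption scheme, and anyone who compares the public values out of band will notice that Alice and Bob each received something different from what the other sent.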

One Versus Two-Sided Systems #

One very important limitation of multiple encryption systems like this is that they only work when both sides participate: each user needs to install some kind of new software that will handle the multiple encryption, and if you are just running the standard software, you'll either get something that looks like random junk or like whatever innocuous cover traffic is being used to hide the encrypted data in, depending on whether steganography is in use. This means that multiple encryption can be used to evade filtering in contexts like trading CSAM or buying drugs where (presumably) both sides have an interest in concealment, but can't really be used to evade filtering in cases like solicitation of minors because the minor isn't going to have installed the new program (and of course the service can fairly easily scan for a suggestion that they do so).

Of course some people just like their privacy, but the question is whether this is a useful signal on average. ↩︎

End-to-End Encryption and the EU's new proposed CSAM Regulation

2022-05-19T00:00:00Z

Last week the European Commission published a new "Proposal for a Regulation laying down rules to prevent and combat child sexual abuse". This regulation would require Internet communications platforms to take various actions intended to prevent or at least reduce what it terms "online sexual abuse".

Proposal Summary #

The proposed regulation runs to 135 pages and is somewhat light on detail, but here's a brief summary of the most relevant points (with the disclaimer that I am not a lawyer).

Requires all "hosting services and providers of interpersonal communications services" to perform a risk assessment of the risk of use of their service (Article 3) for online sexual abuse and to take "risk mitigation" measures (Article 4), said measures being required to be "effective in mitigating the identified risk".
Allows the "Coordinating Authority" of a member state to issue a "detection order" (Article 7) which would require the service to set in place technical measures that are "effective in detecting the dissemination of known or new child sexual abuse material or the solicitation of children, as applicable" (Article 10(3)(a)) based on indicators created by a new EU Centre.
Creates a new EU Centre which will develop technologies for detecting the above types of content and make them available to providers as well as generating indicators of contraband content (Article 44).
Imposes various transparency and takedown requirements on providers, for instance requiring them to block/takedown specific pieces of content.

It's a bit unclear to me what the line is between being required to have measures that are "effective in mitigating the identified risk" versus "effective in detecting the dissemination of known or new child sexual abuse material or the solicitation of children", but I would expect that any significant-sized service is likely to be served with a detection order, given that the standard for issuing the orders, as set out in Article 7 (4) is that "there is evidence of a significant risk of the service being used for the purpose of online child sexual abuse", which is probably the case for any major service, just because there is so much traffic; even if detection were perfect (which it isn't) there would always be new users wanting to exchange prohibited material. For that reason, it's probably most useful to focus on the implications of the detection order requirement.

Technologies for Detecting Online Sexual Abuse #

This proposal is concerned with three main types of material:

Known child sexual abuse material (CSAM).
New CSAM that hasn't before been seen.
Solicitation of children

The standard techniques for detecting known CSAM mostly depend on perceptual hashing, in which we compute a short value that is characteristic of the image (or video). You start with a database of known CSAM objects and compute their perceptual hashes. The idea is supposed to be that:

If two images look "the same" then they will have the same hash, even if they are slightly different. For instance, a color and black-and-white version of the same image.
If two images are "different" then they will have different hashes with very high probability.

Note that this is different from cryptographic hashing because similar looking images will have the same hash, whereas with a cryptographic hash even a single bit difference should produce a new hash. In order to scan a new piece of content you compute its hash and then look up the hash in the table of known hashes. If there's a match, then the content is potentially CSAM and you take some action, such as alerting the authorities. (see here for some limitations of this kind of system).
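To illustrate the difference, here is a sketch using "average hash", a very simple perceptual hash. This is not PhotoDNA or any algorithm used in production CSAM detection, and real perceptual hashes typically match on closeness rather than exact equality. The images are synthetic 8x8 grayscale grids.

```python
import hashlib

def average_hash(pixels) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def crypto_hash(pixels) -> str:
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()

# A synthetic gradient image, a slightly brightened copy, and a checkerboard.
original = [[(r * 8 + c) * 3 for c in range(8)] for r in range(8)]
brightened = [[p + 10 for p in row] for row in original]
unrelated = [[255 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]

known_hashes = {average_hash(original)}      # the database of known content

match_brightened = average_hash(brightened) in known_hashes
match_unrelated = average_hash(unrelated) in known_hashes
same_sha256 = crypto_hash(original) == crypto_hash(brightened)
```

The brightened copy still matches the database entry because every pixel keeps the same relationship to the mean, whereas its SHA-256 hash is completely different. The checkerboard doesn't match at all.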

Hashing doesn't work for unknown images, however, because you won't have their hashes, and won't work for detecting text messages and the like that are designed to solicit children. The state of the art for detecting this kind of material is to train machine learning models ("classifiers") that attempt to distinguish innocuous material from contraband. This kind of technique is already in wide use for spam filtering, but there are also technologies like this that attempt to identify CSAM and solicitation; as I understand it, these technologies are already in use in some systems.

Traffic Encryption #

Most services do encrypt traffic, but often it's only in transit between the client and the server, which doesn't prevent the service from doing any analysis on it they want. You'll also often hear that services store data encrypted, but that usually just means it's encrypted with keys they know. This isn't worthless: it might protect you if someone steals one of their hard drives, and depending on how things are built might make certain forms of inside attack difficult (for instance if administrators can't get the keys), but it doesn't do anything to get in the way of the service itself inspecting your data.

It's important to recognize that these technologies require having access to the content itself, whether to compute the hash or to run the classifier. If you have a system where the service sees the data in plaintext, then this is straightforward, but if the data is end-to-end encrypted, meaning that the service doesn't see it, then life gets more complicated, by which I mean "there isn't really a good solution".

Content Filtering on Encrypted Data #

The obvious way to address the problem of content filtering on encrypted data is just not to encrypt it, but of course this has a very negative impact on the security of people's communications (see my previous post on E2EE and encrypted messaging for more on this), and so there has been quite a bit of work on content filtering with encrypted data. The EU proposal relies heavily on an EU-sponsored Experts Report (see Annex 9 of their impact analysis) describing their analysis of the situation and making some recommendations. I'll address this report below, but at a high level, there are two main approaches:

Filter on the client and report results back to the server.
Filter on the server or some other central point.

However, neither of these really works very well, for reasons I'll go into below.

Client-Side Filtering #

Aside from just not encrypting at all, the obvious solution is to have the client filter the data; after all, it already has the plaintext. However, there are a number of challenges to making client-side filtering work in practice.

Algorithmic Secrecy #

The first major challenge for client-side filtering is the desire to keep secret the algorithms used to determine whether to flag a given piece of content. For instance, many server-side filtering systems use a perceptual hashing technology called PhotoDNA. Although the general structure of the algorithm is known, the precise details are secret. In addition, the hashes themselves are secret.

As far as I can tell, there are two major reasons for this secrecy. The first is that it's intended to deter evasion. If you have the hash algorithm and the list of hashes, then you can check for yourself whether a given piece of content is on the list and either avoid transmitting it or alter the content so that it has a different hash that's not on the list. Even if you just know the hash algorithm and you have a piece of content that might be on the list, you can easily alter the content so that it has a different hash, thus reducing the chance of detection. Or, in the case of a detector for solicitation, the client might warn the user to cut off the conversation when the classifier score got too high.

If the algorithm is secret, it's harder to know if two slightly different inputs will have the same hash (recall that the idea of a perceptual hash is that visually similar inputs produce the same hash), but if you know the algorithm, it's trivial. It's also possible to go in the other direction, where you generate a piece of innocuous content that matches a hash and send it to someone to "frame" them. This is much easier if you know the hash.

The second reason is that it might be possible to use the hashes themselves to reconstruct a low-res version of the original image, which would obviously be undesirable, as it would mean that distributing the hash database was kind of like distributing a low-fi version of the original images with an unusual compression format.

Apple's proposed client-side CSAM scanning system (see my writeup here) partly addresses these issues by using advanced cryptographic techniques to conceal the hash list from the client. Briefly, the way this works is that the service provides the client with an encrypted copy of the hash database. The client computes a "voucher" based on the content and the hash database, and sends it to the service, but the service can only decrypt the voucher if the content matched one of the hashes. This prevents the client from knowing whether their content matched a hash but actually requires the client software to know the hash algorithm, which they have to be able to compute locally^[1] so it would still be possible for an attacker to change content so it has a different hash.

Moreover, Apple's system only works for known hashes, and it's not known how to extend it to the problem of having a client-side classifier that is itself secret (unlike NeuralHash). As we'll see later in this document, the need to run arbitrary computation rather than just hash matching makes this whole problem space a lot harder. It's maybe possible you could use some kind of encrypted computation solution in which some server ran a classifier on an encrypted copy of the content and then told the client whether it was contraband, but then we'd have the problem that the client could use the server as an oracle for whether a given piece of content was OK, which, as noted above, is undesirable.

Client Nonconformance #

The other major problem with executing the classifier on the client is that there's nothing requiring the client to actually run the classifier on the true input, or on any input at all. For example, in the Apple system, the client sends an (image, voucher) pair up to iCloud but there's nothing in the system that forces the image to match the voucher.^[2] Instead, the client can just compute a voucher on an innocuous image (in the Apple system, it can actually just produce a random voucher, but one might imagine a different design where that was not possible) and upload that voucher along with the image.

The major barrier to this kind of attack is how inconvenient it is for the user—who recall, is the attacker in this system—to run a nonconformant client. Of course, if you're using an iOS device, then you're running Apple's software, which is designed to behave correctly, and it's a pain to replace it with your own (though nothing like impossible), and in any case, this isn't a generic solution to the problem of tens to hundreds of apps, including those which run on systems much less locked down than iOS (including MacOS). This problem is much worse for "open" systems in which the protocols are public or in which the clients are open source because in those systems anyone can build their own client that interoperates with the system but doesn't correctly run the classifier (i.e., it lies!), which makes the system far less useful. Of course, some people will still use the default client, but in many of the scenarios of interest, people know that they are sending contraband and so will be willing to use custom tools that evade filtering, in which case almost any system other than having the client send the data in the clear won't work.^[3]

Server-Side Filtering #

The other set of designs uses a server for filtering ("don't encrypt" is the trivial version of this). Alternatively, you could send a copy of the data (or, in the hash version of the system, a copy of the hash) to some "trusted" server which does the filtering. The nominal advantage of such a design is that the service provider (e.g., WhatsApp) can't see your data (or the hash) but of course this third party would and it's not clear how that's better, as it comes down to trusting some server operated by someone you don't know not to spy on you.

The EU Experts Report^[4] proposes two fancy cryptographic mechanisms for addressing this problem:

Having the client upload encrypted hashes and use multiparty computation (MPC) to determine whether one of the hashes matches.
Using fully homomorphic encryption (FHE) to compute the perceptual hash over the content and determine if it matches the hash list.

As far as I can tell, the encrypted hash/MPC design is inferior to Apple's proposal in that it's more complicated and still only does hashes. The EU report frames the FHE system as being about hashes, but if it works at all, I think it's likely to work with classifiers too, because it involves the server running an arbitrary computation. With that said, it's also not clear to me how it's intended to work. Here's the diagram from their report:

FHE is a bit outside my main area of expertise, but I'm having trouble making sense of this. The point of homomorphic encryption is that you can perform a computation on encrypted data. In the typical FHE setting, the client encrypts the data and sends it to the server which operates on the encrypted data and returns the result, as shown below:

Partially Homomorphic Encryption #

It's been known for a very long time how to do partially homomorphic encryption. As a concrete example, consider the case where you encrypt some data by XORing it with a key, i.e.,

$$Ciphertext = Plaintext \oplus Key$$

With this system, you can have the server compute the XOR of two plaintexts, $P_1$ and $P_2$. The client sends:

$$ (C_1, C_2) = (P_1 \oplus K_1, P_2 \oplus K_2)$$

The server returns:

$$ C_1 \oplus C_2 $$

Which the client XORs with $K_1 \oplus K_2$, i.e.,

$$P_1 \oplus K_1 \oplus P_2 \oplus K_2 \oplus K_1 \oplus K_2 $$

When you cancel out the keys ($A \oplus A = 0$) you get:

$$ P_1 \oplus P_2$$

The idea here is that the client has some input that it wants some expensive computation done on. It could just run the computation in some cloud service like AWS but it doesn't want the cloud service to see the data. Instead, it encrypts the data and sends the encrypted version to the server. The server then performs the computation on the encrypted data, but without seeing the data (ordinarily this would not be possible but there is some extremely fancy math involved). The computation is structured so that the server doesn't get to see the result but just an encrypted version of the result, which it sends back to the client. The client then decrypts the result and learns the answer.

What makes this use of FHE weird is that the response doesn't go back to the client but rather the server somehow sees an encrypted hash that it compares with a list of other encrypted hashes, which doesn't seem to be the customary FHE setting. It's possible I'm missing something, but as described, it seems like this design would allow the server to learn the actual content, not just whether it matches a given hash. The issue is that the server determines the algorithm that it runs on the encrypted data, and so it can design an algorithm that allows it to extract the data. For instance, suppose you have an algorithm that looks at a single pixel of an image and emits:

The hash of a known piece of CSAM if the pixel is black.^[5]
A random value if the pixel is white.

You then run the algorithm over the image one pixel at a time and you've extracted the content (assuming it's black and white).^[6] You could obviously extend this technique to be more efficient, or to work on text, etc.

It's possible that the design of the system might be able to somehow restrict the algorithms that the server can run—though usually homomorphic encryption does so at a lower level, like that you can only multiply but not add—but that restriction would have to be enforced by having the client encode data in a certain way, such that it was just partially homomorphic. This seems impractically inflexible, especially in light of the fact that we don't just want the server to compute perceptual hashes but to run generic classifiers, which tend to be fairly complicated systems, and that they are supposed to be based on whatever indicators are provided by the EU Centre. Restricting the classifier algorithm by controlling the inputs seems even more problematic if you want to keep it secret from the client, which, as noted above, is important for preventing evasion; if the client wants to evade and knows that only certain classifiers can be run, it can tune its content to evade those classifiers.

You could of course build a more traditional FHE-style system in which the server just told the client whether the content had been flagged, and count on the client to report the user. However, with that design, you're telling the user whether they have been flagged, which, as above, is undesirable, and you still have to worry about client nonconformance (i.e., just ignoring that the user was flagged). If the response is encrypted, then the server has no way of knowing that the client is behaving correctly.

I should also mention at this point that even the piece where you build the classifier using homomorphic encryption is kind of a research problem, as stated in the report:

Another possible encryption related solution would be to use machine learning and build classifiers to apply on homomorphically encrypted data for instant classification. Microsoft has been doing research on this but the solution is still far from being functional.

The bottom line here is that I don't think we're at the point where fancy crypto is going to help. Even if it's possible in principle to build something that allows the server just to tell if something is contraband without seeing the content (which is far from clear), it's not practical to do so with our current cryptographic tools.

Trusted Execution Environments #

One approach that has recently become popular for dealing with this kind of complicated trust problem—especially when it feels too hard for crypto—is to use what's called a Trusted Execution Environment (TEE) or an "enclave". A TEE is a processor feature that allows the operator of the processor to run computations on data without being able to see the data.

The basic way a TEE works is that:

The processor manufacturer installs a signing key when the processor is manufactured. This key is signed by the manufacturer's key.
The TEE internally generates a secret encryption key pair.
The operator installs a program onto the TEE.
The TEE then signs a statement (using the signing key) that attests to the program and to the public half of the encryption key pair.

The operator can then send this statement to someone else who knows that (1) they are interacting with the TEE rather than with the operator and (2) precisely what program the operator is running on the TEE. That someone else verifies the signature chain and compares the program to its expectations.

It's easy to see why a TEE is attractive, as in theory it ought to offer a generic solution to a huge number of privacy and security problems: there's no fancy crypto to be concerned with, you just write your program to do whatever you want and shove it in the TEE. You do have to be a little (well, more than a little) careful to write the program on the TEE so it doesn't leak information about the data it's operating on via side channels and the like (remember what I said about the difficulty of safely computing on secret data), but one might hope that that's a problem that could be solved with the right programming practices and then you just have a magic box that securely executes any program you want.

Given such a box, the problem becomes a lot easier. For instance, the EU report suggests that the client send the encrypted messages, along with the encryption keys, to the TEE, which would run whatever filtering algorithms were needed on them and then either forward the encrypted message (if it was OK) or report a violation (if it was not). You could also use the TEE to run filtering on the client because you could run the classifier secretly in the TEE without disclosing it to the user. (You won't be surprised to hear that one of the big uses of TEEs is for DRM for media.) Running a secret classifier is somewhat tricky, but you might imagine a system in which the classifier was revealed to some set of experts who would then attest that it was OK and publish a hash of it that clients could check.

There's just one tiny problem: TEEs are a lot less secure than one would actually like. There is a whole line of papers attacking the best-known TEE, Intel SGX (see here for a survey). Moreover, these attacks are all based on running code on the processor, which is a fairly weak form of attack. TEEs generally don't provide defenses against physical attacks by someone who has physical control of the processor, in part because this is hard to do in a processor-sized package.^[7] For instance here's what Intel says:

Side-channel attacks are based on using information such as power states, emissions and wait times directly from the processor to indirectly infer data use patterns. These attacks are very complex and difficult to execute, potentially requiring breaches of a company’s data center at multiple levels: physical, network and system.

Hackers typically follow the path of least resistance. Today, that usually means attacking software. While Intel® SGX is not specifically designed to protect against side channel attacks, it provides a form of isolation for code and data that significantly raises the bar for attackers. Intel continues to work diligently with our customers and the research community to identify potential side-channel risks and mitigate them. Despite the existence of side-channel vulnerabilities, Intel® SGX remains a valuable tool because it offers a powerful additional layer of protection.

The problem here is that this is very high value data and so you have to worry about very motivated attackers. For instance, in the server-side TEE system described in the EU report, the TEE would effectively have access to the plaintext of everyone's messages, which means that any effective attack on the TEE breaks E2EE and enables universal surveillance by the server. Given the history of successful attacks on systems like this, assuming that it cannot be broken even given the resources of a state-level adversary who wants to read everyone's communications seems unreasonably optimistic.

Finally, the whole security of a TEE system relies on the processor manufacturer not cheating, but those processor manufacturers are big companies, so users also have to worry about the manufacturers being compelled to assist in surveillance, for instance by signing a processor key for a processor which didn't actually provide the TEE security functions.

Algorithms and Systems Design #

Even if we ignore the security pieces, this is still a hard problem. Although automated content scanning is widely employed, these systems routinely misclassify data, which is why you still get spam messages in your mailbox even with best-in-class spam filters and why any big content system has to employ—or more likely subcontract—an army of humans to manually go through stuff that's been flagged by their algorithms. How well these algorithms work seems to vary a fair bit depending on what they are asked to do, and the EU impact analysis is fairly light on details:

Thorn’s CSAM Classifier can be set at a 99.9% precision rate. With that precision rate, 99.9% of the content that the classifier identifies as CSAM is CSAM, and it identifies 80% of the total CSAM in the data set. With this precision rate, only .1% of the content flagged as CSAM will end up being non-CSAM. These metrics are very likely to improve with increased utilization and feedback.

This 99.9% number is reported as "Data from bench tests". Thorn itself reports a 99% number, but doesn't provide details of how the tests are conducted.

By contrast, the problem of classifying "solicitation" seems to be much harder. The EU references some work by Microsoft and says "Microsoft has reported that, in its own deployment of this tool in its services, its accuracy is 88%.".

Reporting Test Accuracy #

I just want to take a moment here to complain about the way these numbers are being reported, which is really confusing. Any given test has two types of errors:

false positives in which you report a positive test (in this case a violation) when there is none.
false negatives in which you report a negative test (in this case no violation) when there is one.

The typical way to report these is just like that. I.e., the false positive rate is the fraction of positives you would get if you performed tests on inputs which were truly negative. For example the iHealth COVID test "correctly identified 94.3% of positive specimens and 98.1% of negative specimens", which means that if you are negative, there is a 1.9% chance the test will report positive.

It's important to recognize that this number is different from the fraction of positives which are actually negative, because that number depends on the population you are testing. For example, if you went back in time and administered COVID tests to people in 2010, then every positive test would be a false positive because nobody had COVID. The lesson here is that the use of a test is dependent on the properties of the population in which it's being used; even a very accurate test can have a lot of false positives (to the point where most of the positives will actually be false positives) if the number of true positives is very low (see Schneier on the base rate fallacy).

Conversely, it's not possible to determine the accuracy of a test by reporting the fraction of errors without knowing the sample it was tested on. For instance, I could have a CSAM filter test that just reported "is CSAM" for everything and if I tested it only on CSAM inputs, it would look to be 100% accurate, even though it's obviously useless. So in this case, that 99.9% number on bench tests is useless without knowing the set of inputs it was tried on. The 88% number is even worse because "accuracy" could mean anything, and I wasn't able to find anywhere where Microsoft reported their own research.

Without this kind of information we can't tell how effective a system like this will be. Only a tiny fraction of the content on the Internet is CSAM or solicitation, and so even a very accurate filter is still going to produce a large number of false positives. Knowing about how many there will be is critical to understanding the practical effectiveness of this kind of system.

Manual Review #

As noted above, the possibility of false positives usually means that you need manual filtering as a backup. In a system without end-to-end encryption, this is straightforward: you already have the data because you ran a filter on it, so you just send a copy to whoever is doing the double checking.

If the data is end-to-end encrypted, however, the problem becomes much harder, because—with the exception of the TEE-type systems, which have other problems—the server doesn't have the data in the clear, so it needs to obtain either the encryption keys for the content or the content itself. The Apple system solves this problem automatically but as I mentioned above, it only works for hash matching, not for general classification algorithms. Of course, if the classifier shows a positive result, the server can always ask the client to send a copy of the plaintext, but then this isn't secret from the client, and of course a nonconforming client might lie about the content, so this doesn't seem like a great solution.

Policy Implications for End-to-End Encryption #

Both the proposal and the public communications around it have been fairly vague about the implication for end-to-end encryption, instead framing this as a "technology neutral" set of regulations:

Assuming the Commission proposal gets adopted (and the European Parliament and Council have to weigh in before that can happen), one major question for the EU is absolutely what happens if/when services ordered to carry out detection of CSAM are using end-to-end encryption — meaning they are not in a position to scan message content to detect CSAM/potential grooming in progress since they do not hold keys to decrypt the data.

Johansson was asked about encryption during today’s presser — and specifically whether the regulation poses the risk of backdooring encryption? She sought to close down the concern but the Commission’s circuitous logic on this topic makes that task perhaps as difficult as inventing a perfectly effective and privacy safe CSAM detecting technology.

“I know there are rumors on my proposal but this is not a proposal on encryption. This is a proposal on child sexual abuse material,” she responded. “CSAM is always illegal in the European Union, no matter the context it is in. [The proposal is] only about detecting CSAM — it’s not about reading or communication or anything. It’s just about finding this specific illegal content, report it and to remove it. And it has to be done with technologies that have been consulted with data protection authorities. It has to be with the least privacy intrusive technology.

However, for the reasons discussed above, designing a communications system that combines end-to-end encryption with robust content filtering is basically an open research question.^[8] This is not to say that it can never be solved, but rather that it's not something we know how to do today, even at the level of "we have a prototype that just needs to be tech transferred". Whatever the intent, it's hard to see how a mandate of this form that applies to all platforms isn't effectively a prohibition on end-to-end encryption.

Apple didn't publish NeuralHash but it was quickly reverse engineered and published and people started demonstrating the kind of attacks I mention above. ↩︎
Note that Apple doesn't presently E2E encrypt data in iCloud, so presumably they could check that the voucher matches, but the whole point of this system is to ensure that they don't need to scan the image, so we should model the problem as if the images were encrypted. ↩︎
For instance, the use of Tor Hidden Services to distribute CSAM in the 2014 "Playpen" case. ↩︎
Oddly, I couldn't find an author list, so I don't know which experts. ↩︎
This requires the server to know such a hash, but these hashes are fairly widely known, so shouldn't be an obstacle to a state-level attacker. ↩︎
You may recall this technique from its appearance in my post on side channels. ↩︎
You can purchase "hardware security modules" which aren't just part of the processor but rather are a separate computer in a tamper-resistant casing (the IBM 4758 is an early example). These do better at resisting physical attack but are a lot less convenient to use, due to limited processing power and large size. ↩︎
Note that the EU report's recommendations implicitly concede this: "Immediate: on-device hashing with server side matching (1b). Use a hashing algorithm other than PhotoDNA to not compromise it. If partial hashing is confirmed as not reversible, add that for improved security (1c)." They recommend further research on the other avenues. ↩︎

Understanding The Web Security Model, Part V: Side Channels

2022-05-09T00:00:00Z

This is part V of my series on the Web security model (parts I, II, outtake, III, IV). In this post, I cover data leaks via side channels.

Recall the discussion from part III about the basic guarantee of the Web security model, which is that it is safe to visit even malicious sites. As discussed in that post, the browser enforces a set of rules that are designed to provide that guarantee. It's of course possible to have vulnerabilities in the browser which allow the attacker to bypass those rules; for instance, there might be a memory issue that allows the attacker to subvert the browser, at which point it can read the data directly. However, there is another important class of issue that has long been a problem in the Web, which is "side channel attacks".

Colloquially a side channel is a mechanism that isn't part of the specified API surface but which can be used to leak information. In a side channel attack, the program can be behaving correctly but there is some unintended observable behavior that allows an attacker to learn secret information it should not have. Historically, side channel attacks in browsers have had two main targets:

User browsing history (i.e., what sites the user has visited), in violation of the browser's basic privacy guarantees.
Data from other sites, in violation of the same origin policy.

As we'll see below, side channel attacks can be very hard to find and eliminate.

A Simple Timing Channel #

The general structure of most side channel attacks is that there is some secret data that the attacker can't see directly but the attacker is able to observe some computation on the secret and use that information to learn about the secret. Consider, for example, the following code to check the correctness of a password.

#include <stdbool.h>
#include <stddef.h>

bool checkPassword(const char *userPassword, const char *actualPassword) {
    size_t i = 0;
    
    for (;;) {
      // If the ith character doesn't match, then
      // return false.
      if (userPassword[i] != actualPassword[i]) {
        return false;
      }
        
      // If the ith character is '\0', then we are at
      // the end of the string and they match, so
      // return true.
      if (actualPassword[i] == '\0') {
        // userPassword must also be `\0` or we would
        // have returned false above.
        return true;
      }
      
      i++;
    }
}

The logic of this code is that you go through both passwords one character at a time and if there is a mismatch at any character, you return false. It takes advantage of the fact that C strings don't have an attached length but instead use a character with value \0 to indicate the end of the string.^[1]

The explicit API of this function is that it just tells you whether a given password is valid or not. If you wanted to use this API to guess the user's password, you would in principle just have to check every password one at a time, and you don't learn any information unless you guess exactly the right password. If you assume 8 character passwords with only letters and numbers, then there are 62 possible values for each position and there are 62^8 (about 2^{48}) possible passwords. If you just try them one at a time, you'll find the right password about halfway through on average, so that is 2^{47} attempts, which will take quite some time (though is also practical on modern computers, which is why people tell you to use longer passwords).

Unfortunately, this function leaks more information than just that explicitly provided by the API. The problem is that this code is not constant time: because it checks the characters one at a time, the time to run the function depends on the number of characters that match. This means that if an attacker can very precisely measure the running time of the checkPassword() function, they can learn information not provided by the API, namely the position of the first character which doesn't match.

Lockpicking #

Lockpicking exploits the same basic intuition about combinatorics. Your typical lock has a set of pins which prevent the lock barrel from turning. When you insert the key into the lock, the key pushes the pins up, as shown in the picture below. If the part of the key aligned with a given pin is the right height, it will push the pin up the correct amount, so it no longer blocks the lock barrel. If you get all the pins right, you can turn the key and the lock opens.

[Source: Wikipedia]

This would all be fine if the lock were perfect, because you'd have to get the key completely right and so you would have to try every possible combination in sequence, but in practice there's always some individual variation: When you try to turn the lock barrel, one pin will usually be the one that prevents it from turning ("binding"). You can exploit this by applying torque to the lock barrel and then using a tool to push up each pin in sequence. If you push up the pin that is binding the right amount (so that the break in the pin aligns with the lock barrel) the lock barrel will turn slightly until it binds on the next pin. You can repeat this process until you have all the pins and the lock opens.

If you just think naively about this function, you would expect its running time to be proportional to the number of matching characters. For instance, if it takes one nanosecond to check each character, then if the first three characters match, the function will take 4ns (to check the first three and then reject on the fourth). Note that real processors are much more complicated, as we'll see later in this post. You can use this fact to attack passwords very quickly. The basic idea is that you just generate a random candidate password and measure the running time of the function. If, for instance, it is 1ns, then this tells you that the first character is wrong.

You can then just iterate through all of the possible values for character 1 until the function runs in more than 1ns. When that happens, you know you have the first character right. You then keep the first character constant and iterate through the second character, and so on until you have broken the entire password. On average, you'll get the right character for each position about halfway through and so the total attack time is something like 31 * 8 (~250) attempts, which is obviously much faster than 2⁴⁷ attempts!

Of course, the signal here is very small: modern processors are very fast and there are other things happening on the computer besides just your task, so small timing differences can be hard to measure. However, there is now a more or less standard set of techniques for making this kind of attack work better. First, you can run the measurement a lot of times, which helps separate out the signal from the noise. Second, you can find ways to amplify the signal so that the slower operation gets a lot slower. We'll see an example of this below.^[2]

Cross-Site State #

The first class of side channel attacks I want to talk about takes advantage of browser features which share state across sites. Consider the situation where site A wants to know whether the user has visited site B. For obvious reasons, this is sensitive information: we don't want arbitrary attackers to be able to see your browsing history. Thus, the Web platform doesn't allow sites to ask directly about browsing history, but that doesn't mean they can't get the answer indirectly.

The simplest mechanism is via the browser cache. As I discussed last week, performance is a very high priority for Web sites, and downloading big files from a remote site takes time and bandwidth. One way to address this problem is for the browser to cache data from the server. When the browser first downloads a resource, it stores it locally and can just reuse the local copy rather than the one retrieved from the server.

The actual details of HTTP caching are quite complicated because sometimes the cached value will be usable, but sometimes the server will change the resource and the client has to re-retrieve it. Under some conditions the client has to contact the server and ask if the resource has changed, e.g., via the If-Modified-Since header, and in others the server can just say this resource will never change. The common thread, however, is that getting resources from the local cache is (hopefully) faster than retrieving files from the server. That's the point of caching but when combined with the fact that it's possible to measure the load time for a cross-site resource, it gives us a timing leak.

The basic idea here is really simple: suppose that attacker.com wants to know if you have gone to example.com. It adds a large resource from example.com to its own site and measures how long it takes to load. If the load is fast, then it is likely that the data is in cache, which suggests that the user has been to example.com.^[3] This attack has been known at least since a 2000 paper by Felten and Schneider, and turns out to be part of a giant class of such issues, with browser state targets including: HTTP connections, DNS caching, TLS session IDs, HSTS state, etc. The general problem is that any time there is state that is shared between site A and B, activity on A can potentially affect behavior on B. The right solution was published by Jackson, Bortz, Boneh, and Mitchell in 2006: partition client-side state by the top-level origin as well as by the origin. For instance, in this case resources loaded from example.com would be in a different cache from those loaded from attacker.com, which means that when attacker.com goes to load the test resource it will have to retrieve it separately, even if example.com has already done so.
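
The difference between a shared and a partitioned cache can be sketched in a few lines. This is a toy model, not how any browser implements its cache: the class `ToyCache`, the "hit"/"miss" return values (standing in for fast and slow loads), and the URL `https://example.com/big.js` are all invented for illustration.

```javascript
// A toy HTTP cache keyed either by resource URL alone (shared) or by the
// pair (top-level site, resource URL) (partitioned).
class ToyCache {
  constructor(partitioned) {
    this.partitioned = partitioned;
    this.entries = new Map();
  }

  key(topLevelSite, url) {
    return this.partitioned ? `${topLevelSite} ${url}` : url;
  }

  // Returns "hit" if the resource was already cached, otherwise stores it
  // and returns "miss" (standing in for a slow network fetch).
  load(topLevelSite, url) {
    const k = this.key(topLevelSite, url);
    if (this.entries.has(k)) return "hit";
    this.entries.set(k, true);
    return "miss";
  }
}

const resource = "https://example.com/big.js";

const sharedCache = new ToyCache(false);
sharedCache.load("example.com", resource); // the user's earlier visit
const sharedProbe = sharedCache.load("attacker.com", resource); // "hit": leaks

const splitCache = new ToyCache(true);
splitCache.load("example.com", resource); // the user's earlier visit
const splitProbe = splitCache.load("attacker.com", resource); // "miss": no leak
```

With the shared cache, the attacker's probe is fast exactly when the user has been to example.com. With the partitioned cache, the probe looks the same either way, at the cost of fetching the resource a second time.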

At this point you might ask why this hasn't been fixed. As far as I can tell, there are three main reasons: (1) the widespread use of cookie-based tracking made fixing these slower attacks less interesting; (2) it's actually fairly complicated to address everything, in part because some of the required changes do change the observable behavior of Web browsers; and (3) there were concerns about the performance impact of reducing the effectiveness of caching. However, as Web privacy has become a bigger issue, browsers have started making a serious effort to address this class of attack, mostly via work in the W3C Privacy Community Group. This presentation by Anne van Kesteren does a good job of describing the situation.

Computation on Secret Data #

As discussed in part III, the same origin policy allows cross-origin use of data (for instance, embedding an image from another site) but forbids access to the data. In addition, it allows you to operate on that data in a variety of ways that are intended to be safe because they don't allow you to see the result. It should surprise nobody to learn that these aren't actually safe.

Link Decoration #

Let's warm up with a simple example: link coloring. CSS allows Web pages to apply styles (e.g., colors, underlining, etc.) to links depending on whether they have been visited or not. This helps the user know whether they need to click on a link or not. For instance, this fragment of CSS will turn all links red except those you have visited, which are blue.

a {
  color: red;
}

a:visited {
  color: blue;
}

The Basic Attack #

This would all be fine except that it turns out that the Web also lets you inspect the color of elements in the DOM using the getComputedStyle() function. In 2002, Andrew Clover observed that this combination of features creates a trivial attack in which the attacker puts a bunch of links on their page to sites they think you might have visited and then inspects the color to see which ones you actually have visited. Obviously, the attacker only gets to learn about pages it actually knows about, so in some ways this isn't as good as cookie-based tracking, but the attacker can send you a very big page with a lot of links, so it can extract quite a bit of information. Moreover, unlike cookie-based tracking, this attack can be used to learn whether you have visited sites which aren't cooperating with the attacker, such as their competitors!
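
For illustration, the original attack might have looked something like the sketch below. This is a reconstruction under assumptions, not Clover's code, and it does not work in current browsers. The function name `sniffVisited` is made up, and `doc` and `getStyle` are passed in as parameters only so the logic can be exercised outside a browser; in a page they would be `document` and `window.getComputedStyle`.

```javascript
// Historical sketch of color-based history sniffing. `visitedColor` is the
// computed color the attacker's stylesheet assigns to visited links, e.g.
// "rgb(0, 0, 255)" for the blue rule in the CSS fragment above.
function sniffVisited(urls, doc, getStyle, visitedColor) {
  const visited = [];
  for (const url of urls) {
    // Create a link to the candidate URL and attach it to the page.
    const link = doc.createElement("a");
    link.href = url;
    doc.body.appendChild(link);
    // If the browser styled it as visited, the URL is in the history.
    if (getStyle(link).color === visitedColor) visited.push(url);
    doc.body.removeChild(link);
  }
  return visited;
}
```

The attacker would call this with a long list of candidate URLs and send the resulting list back to its server.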

This isn't just a theoretical issue. In 2010, Jang, Jhala, Lerner, and Shacham scanned the Alexa top 50,000 sites and discovered a number of sites doing history sniffing, including two companies which provided it as a service. For example, they found that the popular adult site Youporn used history sniffing to discover whether people were visiting their competitor Pornhub, and third party ads on a number of sites checked to see if users had gone to various car-related sites.

The basic URL color attack is now fixed, though only as of about 2010. The fixes turn out to be fairly complicated, as described by David Baron in this post describing the fixes deployed in Firefox. The basic defense is to have the browser lie about various CSS selectors that let you query whether links were visited, by acting as if they were unvisited. However, this isn't enough because there are other CSS mechanisms that would let you (for instance) perturb the layout of the page and thus observe whether it reflowed. The complete fix requires also limiting the style changes that CSS can apply based on whether a link is visited to those which (hopefully) do not leak information. Other browsers have followed suit, in part due to pressure from the US Federal Trade Commission after Jang et al. published their work.

Side Channels #

Arguably this isn't even a side channel attack, because we're using an official API: the problem is just an unexpected result of combining two APIs. So, when we remove those APIs the problem will be solved, right? Of course not. Even without these APIs, the same data turns out to be accessible via a number of side channels. Many of these side channels work by observing that if you change the appearance of a link, this can cause the page to be repainted, which can be detected by the attacker's script. This means that if you have a link which is unvisited and then change the URL to be one that is visited, it causes a repaint, allowing you to differentiate visited from unvisited links. Initially, the repaint was directly measurable in Firefox with the mozAfterPaint event, but it was later removed to avoid exactly this kind of leak.

However, even without an explicit signal, it's still possible to detect repaints, as described by Paul Stone. Normally repainting is fast, but if you can make the repaint slower, then you can measure it. The trick is to apply some CSS effects to the link (e.g., drop shadows) that take time to compute. These effects aren't conditional on whether the link is visited, so they are allowed, but are slow to compute, thus allowing the attacker to measure the time taken to repaint. You can make things even slower by including multiple copies of the same link, thus making the attack work better even with fast browsers.

The hits just keep coming. In 2019, Smith et al. published three new side channel attacks on browser history via link styling:

Via the CSS Paint API
Via CSS transforms
Via SVG

For example, the CSS Paint API allows you to register a JavaScript "paintlet" which can draw the background image for a given element, like a link. If you change the foreground element in certain ways (including changing the color), then this requires the paintlet to be re-run. The paintlet runs in a little sandbox that can't talk to the outside world, so you shouldn't be able to directly tell if it ran, but it turns out that you can measure how long it takes to run, using code like the following (adapted from Smith et al.):

const target = document.getElementById("target");
// `threshold` (in milliseconds) is assumed to have been calibrated beforehand.

const start = performance.now();         // Get the current time
target.href = "https://example.com/";    // Blocks until the DOM has changed
const delta = performance.now() - start; // Time taken by the change
if (delta > threshold) {
  alert("Victim visited https://example.com/");
}

What makes this work is that when you change the DOM using JavaScript, those changes happen synchronously: the line of code changing the href field blocks until the DOM has changed, and the next line only executes after the change has happened. In this case, if the link has been visited, then the repaint has to happen which takes more time, and so you can measure it using this code.

Because browsers are quite fast, it would ordinarily be fairly difficult to measure the time difference, but Smith et al. observe that it's possible to deliberately make the paintlet slow by adding a loop in the paintlet code that takes extra time, which makes the difference easier to measure. This is a fairly simple technique for amplifying the size of a timing signal, and in some cases you need something fancier. For instance, later in this paper, Smith et al. describe a technique (due initially to Stone) in which they rapidly change a link back and forth (as above), thus forcing the browser to do a lot of computation, and measure the frame rate of the browser's renderer. Ordinarily the browser would render about 60 frames per second, but if you give it too much work to do, it will fall behind and this is detectable from JS.

Of course browsers fixed these issues (and the CSS paint issue only happened in Chrome because other browsers hadn't implemented CSS Paint, and Chrome eventually disabled CSS Paint for links). However, we still see new attacks on link history, such as CVE-2022-29916, fixed in the recently released Firefox 100, just as I was working on this post.

Pixel Stealing #

Another example of the risks of allowing sites to compute on data from other origins is what's known as "pixel stealing" attacks. Recall that it's possible for site A to embed content from site B (e.g., in an IFRAME or an <img> tag), but it's not allowed to inspect the content. However, site A is allowed to apply filters to that content to change its appearance; they just can't see the output of the filters. If this sounds like bad news, you're developing the right intuition.

A good example of what can go wrong here is provided by Paul Stone in the same white paper where he disclosed timing-based measurements.^[4] The basic idea is that you design an SVG Filter which runs at different speeds on black and white pixels (based on the feMorphology primitive). You load the target content in an IFRAME^[5] and then apply the filter to one pixel at a time, measuring the time it takes to run (as before, we can use a bunch of techniques like running the filter a lot of times and magnifying the image so that each pixel is actually a lot of pixels in order to make the time difference bigger). This lets you extract the contents of the image one pixel at a time. Obviously, this isn't super efficient, but as Stone observes, if you want to read text out of a page, then you don't need that many bits because you only need to read some of the pixels to distinguish characters.

Browsers responded to these bug reports by rewriting the primitives in question so that they were closer to constant time, or by moving them to the graphics processor, where it was hoped they would be more constant time (though see here), but it shouldn't surprise you that these are not the only cases where attackers can compute on cross-origin content with data-dependent results. A great example of this is a 2015 paper by Andrysco, Kohlbrenner, Mowery, Jhala, Lerner, and Shacham describing how to resurrect the SVG filter technique using a new timing channel based on floating point numbers.^[6] If nothing else, this serves as evidence of how difficult it is to remove this kind of timing channel.

Input #

The final class of attacks I want to discuss targets user input. The basic observation here is that when people are typing into the browser or moving the mouse, this takes time to process, which temporarily stalls the processor. If you set up a loop in which you ask the browser to increment a counter very frequently, and measure the actual rate at which the counter increments, you find that it increments slightly more slowly during periods where the user has typed a keystroke, as shown in the following image from a 2017 paper by Lipp et al.

Just knowing when someone is typing doesn't seem that useful, but it turns out that by measuring the time between keystrokes, it is possible to learn a fair amount of information about what people are typing. The basic intuition is that people don't type at a constant rate and that different key combinations take different amounts of time (consider the case where there are two keys typed with the same finger). This kind of problem has received a fair amount of study: in their original paper, Lipp et al. show how to determine with some confidence which URL people are typing; in 2001, Song et al. showed that it was possible to narrow down the range of user passwords in SSH from network traces; and there have been several papers about using accelerometers to measure typing on mobile phones or on adjacent keyboards using a mobile phone.

Because there is a lot of redundancy in the characters people type (for instance, in English, the "q" is generally followed by "u" and not by "x"), some character combinations are more likely than others. This makes it possible to train a machine learning model that estimates which characters are being typed based on the available timing. The results aren't amazing, with accuracy rates in the 70-80% range, but they're a lot better than chance, and as Bruce Schneier observes, attacks only get better.

One very interesting thing about this class of attacks is that they aren't the result of deliberate browser decisions to mix data across origins. Rather, they're the natural result of some quite reasonable implementation decisions about how to share computing resources between sites. This is bad news because fixing them requires a lot of rethinking of the design of the browser.

The Bigger Picture #

Side channel attacks in browsers are a big topic, but a few common themes recur throughout the discussion.

State needs to be partitioned #

The main source of the various history sniffing attacks is that there is some piece of state (e.g., cached data, history) that is shared between site A and site B. As soon as you are in this state, you're going to have side channels and individually removing them is likely to be very expensive. It's now been recognized that the basic fix is to partition state by the top-level site. Unfortunately, there are a number of cases where this breaks functionality that people are used to, which is part of why it's taken so long to do. Moreover, as is the case with keystroke timing, there turn out to be resources which are unintentionally shared and hard to partition.

Safe computation on secret data is hard #

As I mentioned early on in this series, one of the key properties of the Web is the ability to make mash-ups of content from your site and from other sites, while still having them isolated by the same origin policy. However, the modern Web includes a lot of features that allow you not only to incorporate content from other origins but to compute on it. This is a very powerful mechanism but is also incredibly hard to do safely because that computation has to behave identically no matter what data is being computed on. The lesson of the subnormal floating point case is that this is extremely tricky to do and depends on having very detailed knowledge of the processor and the operating system, all of which might change in some future version.

High resolution timing is dangerous #

A major building block of all of these attacks is the ability to precisely measure the duration of events. The more precisely you can measure events the smaller signals you can detect and thus the more careful the implementation has to be to suppress every difference between different code paths. You can often improve attacks by amplifying one of the code paths so that the timing difference is bigger and so less precise timing works, but the consequence is that attacks get slower and so it takes the attacker longer to extract a given amount of information.

There have been a number of attempts to provide systematic solutions to the timing side channel problem, such as Fuzzyfox by Kohlbrenner and Shacham. Techniques like this have the potential to really improve resistance to side channel attacks, but at a real performance cost and as far as I know no browser has yet been willing to deploy them in production.

Next Up: systematic solutions and microarchitectural attacks #

The history of side channel attacks in browsers—like many other security stories—is one of repeated cycles of attacks followed by ad hoc fixes for those specific attacks, followed by new techniques that resurrect those attacks, which themselves need to be fixed. The fundamental problem is that the behavior of the browser is simply too complicated a system to analyze with any confidence. The best known techniques for preventing this kind of attack depend on simplifying the problem so that security depends on a relatively small number of assumptions that are easier to verify and enforce. This is where techniques like partitioning come in.

This point was driven home in 2018 when it was discovered that a number of assumptions about the behavior of common processors were wrong, leading to a series of side channel attacks based on exploiting common processor optimizations. Defending against these attacks has forced browsers to make fundamental architectural changes. Those attacks and the changes they required will be the topic of the next post in this series.

Ordinarily, this feature, called "null termination", is considered a misfeature in C but in this case it's a bit convenient. ↩︎
A variant of this particular password checking bug was responsible for one of the very earliest side channel attacks, on the TENEX system. The attack, described in detail by Sjoerd Langkemper, took advantage of the fact that TENEX had virtual memory, in which the operating system could page out some data from memory to the disk and then bring it back in when needed. The attacker can exploit this bug by arranging the password so it crosses a page boundary with the second page having been paged out. The attacker can then learn the first mismatching character by observing whether the password check function tried to touch a page which had been paged out and needed to be paged back in (a "page fault"). ↩︎
Note that this measurement itself loads the data into cache, so repeated measurements will be fast, but the attacker can set a cookie to detect this case or try loading multiple resources. ↩︎
Similar attacks appear to have been discovered contemporaneously by Kotcher, Pei, Jumde, and Jackson, but their paper is behind a paywall, so this discussion focuses on Stone's work. ↩︎
You can also use this for link-based history sniffing, btw. ↩︎
It turns out that some processors have multiple representations for floating point numbers and that computations with one such representation ("subnormal" or "denormal") are slower than those with the regular representation. The attack involves applying a filter that translates black pixels into zero (which is normal) and non-black pixels into a subnormal value. If you then compute with the results, the non-black pixels are slower, which gives you the signal you need. ↩︎

Challenges in Building a Decentralized Web

2022-04-25T00:00:00Z

There's been a lot of interest lately in what's often termed the Decentralized Web (dWeb), though now it's quite common to hear the term Web3 used as well. Mapping out the precise distinctions between these terms—assuming that's possible—is outside the scope of this post (though it seems that Web3 somehow involves blockchains), but the common thread here seems to be replacing the existing rather centralized Web ecosystem with one that is, well, less centralized. This post looks at the challenges of actually building a system like this.

The infrastructure of the Web is centralized in at least two major ways:

There are relatively few major user-facing content distribution platforms (Google, YouTube, Facebook, Twitter, TikTok, etc.) and they clearly have outsized power over people's ability to get their message amplified.
Even if you're willing to forego posting on one of those content platforms, the easiest way to build any large-scale system—and almost the only economical way unless you are very well-funded—is to run it on one of a relatively small number of infrastructure providers, such as Amazon Web Services, Google Cloud Platform, Cloudflare, Fastly, etc., who already have highly scalable geographically distributed systems.

In this context, decentralizing can mean anything from building analogs to those specific content platforms that operate in a less centralized fashion (e.g., Mastodon or Diaspora) to rebuilding the entire structure of the Web on a peer to peer platform like IPFS or Beaker. Naturally, in the second case, you would also want to make it possible to reproduce these content platforms—only better!—using a mostly or fully peer-to-peer system; at least it shouldn't be required to have a bunch of big servers somewhere to make it all work. This second, more ambitious, project is the topic of this post.

Distributed Versus Decentralized #

An important distinction to draw here is between systems which are distributed (also often called federated) and those which are decentralized (often called peer-to-peer). As an example, the Web is a distributed system: it consists of lots of different sites operated by different entities, but those sites run on servers and operating a site requires running a server yourself or outsourcing that to someone else. Those servers have to be prepared to handle the load for all your users, which means they have to be somewhere with a lot of bandwidth, scale gracefully as more users try to connect, etc.

By contrast, BitTorrent is a decentralized system: it uses the resources of BitTorrent users themselves to serve data, which means that you don't need a giant server to publish data into the BitTorrent network, even if a lot of other people want to download it. This has some obvious operational advantages even in a world where bandwidth is cheap, but especially if you want to publish something which others would prefer wasn't published, perhaps because of government censorship or more frequently for copyright reasons. If you run a server, it's pretty hard to conceal that a million people just connected to download John Wick: Chapter 3 - Parabellum (a pretty solid outing by Keanu, btw), and you should expect the copyright police to come after you (see here, Kim Dotcom) but if you just publish your copy into the BitTorrent network, it's a lot harder to figure out who it was, especially if 50 other people did the same.

Note that it's possible to have mixed systems that are largely decentralized but depend on centralized components. For instance, in a peer-to-peer system, new peers often need to connect to some "introduction server" to help them join the network; those servers need to be easy to find, and one way to do that (though not the only way) is to have them be operated centrally.

Historically, peer-to-peer systems have seen deployment in relatively limited domains, mostly those associated with the aforementioned censorship-resistance use case. However, there has certainly been plenty of interest in broader use cases, up to and including displacing large pieces of the Web. This is a very difficult problem, in part because this kind of system is inherently less efficient and flexible than a centralized or federated system. This post looks at the challenges involved in building such a system. This isn't to say it's not also challenging to build something like Twitter or Facebook in a more federated fashion, but the problems are of a different scale (and perhaps the subject of a different post).

Peer-to-Peer versus Client/Server #

The opposite of peer-to-peer is client/server, i.e., a system in which the elements take on asymmetrical roles, with one element (often that belonging to the user^[1]) being the "client" and the other element (often some kind of shared resource associated with an organization) being the "server". This is, for instance, how the Web works, with the client being the browser. By contrast, peer-to-peer systems are thought of as symmetrical.

In practice, however, the lines can be quite blurry. For instance, it's common to have systems in which the same protocols are used to talk between clients and servers and also between servers, with the second mode more like a typical "peer-to-peer" configuration. For instance, mail clients use SMTP to send e-mail but mail servers also use SMTP to send e-mail to each other, with the sender taking on the "client" role; obviously in this case, each "server" is both client and server, depending on which direction the mail is flowing. Even in systems which are nominally peer-to-peer, it's common to use protocols which were designed for client/server applications (e.g., TLS), in which case the nodes may take on client/server roles for those protocol purposes even if the application above is symmetrical.

Basics of Peer-to-Peer Systems #

We all (hopefully) know how a client/server publishing system like the Web works (if not, review my intro post), but how does a peer-to-peer (henceforth P2P) publishing system work? Let's start by discussing the simplest case, which is just publishing opaque binary resources (documents, movies, whatever). This section tries to describe just enough basics of such a system to have the rest of this post make sense.

In a client/server system, the resource to be published is stored on the server, but in a P2P system, there are no servers, so the resource is stored "in the network". What this means operationally is that it's stored on the computers of some subset of the users who happen to be online at the moment. In order to make this work, then, we need a set of rules (i.e., a protocol) that describes which endpoints store a specific piece of content and how to find them when you want to retrieve it. A common design here is what's called a Distributed Hash Table, which is basically an abstraction in which every resource has a "key" (i.e., an address) which is used to reference it and a "value" which is its actual content. The key determines which node(s) are responsible for storing the value and is used by other nodes to store and/or retrieve it.

As an intuition pump, consider the following toy DHT system. This is an oversimplified version of Chord, one of the first DHTs, so let's call it "Note". In Note, every node in the system has a randomly generated identifier which is just a number from $0$ to $2^{256}-1$ (sorry for the LaTeX notation, newsletter folks). It's conventional to think of these being organized in a circle, with the ids being assigned clockwise, so that node $2^{256}-1$ is right next to (before) node $0$, as shown in the following diagram:

Each node in the network (the "ring") maintains a set of connections to some other set of nodes in the ring (the arrows are colored according to the node maintaining the connection). I won't go into detail about the algorithms here, except to say that having that work efficiently is a lot of the science of making a DHT. In Note, we'll just assume that each node has a connection to the next node (i.e., the one with the next highest identity) and to some other nodes further along the ring, as shown in the figure above.

In order to communicate with a node with id $i$, a node sends a message to the node that it is connected to with id $j$ that is closest to but not greater than $i$ (i.e., that if you went around the circle clockwise, there would be no node that you were connected to that was in between them). Node $j$ does the same. When you finally reach a node that is connected directly to $i$, it delivers the message. For instance, if node 0 wanted to send a message to node c it would send it to b who would send it to c. When c wants to reply, it sends it to node e which is connected to node 0 and so sends it directly. Note that this means that a request/response pair takes an entire trip around the ring.
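
The forwarding rule can be sketched as follows. This is a toy under assumptions: the ring has only 16 identifiers rather than $2^{256}$, the five node ids and their connections are invented (they are not the nodes in the figure), and `clockwise`, `nextHop`, and `route` are names made up for this example.

```javascript
const RING_SIZE = 16; // toy ring; the text uses 2^256 identifiers

// Clockwise distance from id `a` to id `b` on the ring.
function clockwise(a, b) {
  return (b - a + RING_SIZE) % RING_SIZE;
}

// Hypothetical topology: each node is connected to the next node on the
// ring plus one node further along.
const connections = new Map([
  [0, [3, 8]],
  [3, [6, 11]],
  [6, [8, 0]],
  [8, [11, 3]],
  [11, [0, 6]],
]);

// Pick the connected node closest to `dest` without going past it.
function nextHop(self, dest) {
  let best = null;
  for (const n of connections.get(self)) {
    if (clockwise(self, n) > clockwise(self, dest)) continue; // overshoots
    if (best === null || clockwise(n, dest) < clockwise(best, dest)) best = n;
  }
  return best;
}

// Follow next hops until the message arrives; returns the path taken.
// Assumes `dest` is the id of a node that is actually on the ring.
function route(src, dest) {
  const path = [src];
  let current = src;
  while (current !== dest) {
    current = nextHop(current, dest);
    path.push(current);
  }
  return path;
}
```

In this topology `route(0, 6)` goes 0, 3, 6 and the reply `route(6, 0)` goes directly from 6 to 0, so the request and response together make one full trip around the ring.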

Storing Data #

So far we just have a communications system, but it's (relatively) easy to turn it into a storage system: we give each piece of data an address in the same namespace as the node identifiers and each node is responsible for storing any data with an address that falls between it and the previous node. So, for instance, in the diagram below, node c would be responsible for storing the resource with address k and node e would be responsible for storing the resource with address l.

If node a wants to store a value with address k it would craft a message to c asking to store it. Similarly, if node d wants to retrieve it, it would send a message to c.
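
Here is a minimal sketch of that responsibility rule, again on a toy 16-identifier ring with invented node ids. The names `responsibleNode`, `putValue`, and `getValue` are hypothetical, and the sketch skips routing entirely by looking up the responsible node directly.

```javascript
const RING = 16; // toy ring size (the text uses 2^256)
const nodeIds = [0, 3, 6, 8, 11]; // hypothetical node identifiers

// Clockwise distance from `a` to `b` on the ring.
function cwDistance(a, b) {
  return (b - a + RING) % RING;
}

// A node stores every address between the previous node (exclusive) and
// itself (inclusive), so the responsible node is the first node reached
// when walking clockwise from the address.
function responsibleNode(address) {
  let best = nodeIds[0];
  for (const id of nodeIds) {
    if (cwDistance(address, id) < cwDistance(address, best)) best = id;
  }
  return best;
}

// Each node's local key/value store.
const stores = new Map(nodeIds.map((id) => [id, new Map()]));

function putValue(address, value) {
  stores.get(responsibleNode(address)).set(address, value);
}

function getValue(address) {
  return stores.get(responsibleNode(address)).get(address);
}
```

Any node can compute `responsibleNode(address)` for itself, which is what lets a storing node and a retrieving node agree on where a value lives without coordinating.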

Of course, there are several obvious problems here. First, what happens if node c drops off the network? After all, it's somebody's personal computer, so they might turn it off at any moment. The natural answer to this is to replicate the data to some other set of nodes so that there is a suitably low probability that they will all go offline at once. The precise replication strategy is also a complicated topic that varies depending on the DHT, and we don't need to go into it here.

Second, what if some value is both large and popular? In that case, the node(s) storing it might suddenly have to transfer a lot of data all at once. It's easy for this to totally saturate someone's link, even if they have a fast Internet connection. The only real fix is to distribute the load, which you can do in two ways. First, you can shard the resource (e.g., break up your movie into 5 minute chunks) and then store each shard under a different address; this has the impact that different nodes will be responsible for sending each chunk and so their share of the bandwidth is correspondingly reduced. You can also try to make more nodes responsible for popular content, which also spreads out the load.

Finally, if every message has to traverse several nodes in order to be delivered, this increases the total load on the network proportional to the path length (the number of nodes) as well as decreasing performance due to latency. One way to deal with that is to have the two communicating nodes establish a direct connection for the bulk data transfer and just use the DHT to get them in contact so they can do that. This significantly reduces the overall load.

Naming Things #

In the previous description, I've handwaved how the addresses for things are derived.

One common design is to compute the address from the content of the object, for instance by hashing it. This is what's called Content Addressable Storage (CAS) and is convenient in a number of situations because it doesn't require any additional content integrity in the DHT. If you know the hash of the object you can retrieve it and then if the hash comes out wrong, you know there has been a problem retrieving it.
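
A minimal sketch of content addressing, assuming SHA-256 as the hash and Node.js as the runtime. The `network` map stands in for the DHT, and `addressOf`, `publish`, and `fetchVerified` are names invented for this example.

```javascript
const { createHash } = require("node:crypto");

// The address of an object is the SHA-256 hash of its content (hex-encoded).
function addressOf(content) {
  return createHash("sha256").update(content).digest("hex");
}

const network = new Map(); // stand-in for "the network"

// Store the content under its own hash and return that address.
function publish(content) {
  const address = addressOf(content);
  network.set(address, content);
  return address;
}

// Retrieve and verify: reject anything whose hash does not match the address.
function fetchVerified(address) {
  const content = network.get(address);
  if (content === undefined) throw new Error("not found");
  if (addressOf(content) !== address) throw new Error("hash mismatch");
  return content;
}
```

The retrieving node does not need to trust whoever answered the request: if a storing node returns different bytes, `fetchVerified` throws rather than returning them.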

Of course, given that you need the object in order to compute its hash, this kind of design means that you need some service to map objects whose names you know (e.g., "John Wick") onto their hashes, so now we either have a centralized service that does that or we need to build a peer-to-peer version of that service and we're back where we started.

Another common approach is to have names that are derived from cryptographic keys. For instance, we might say that all of my data is stored at the hash of my public key (again, maybe with some suitable sharding system). When the data gets stored we would require it to be signed and nodes would discard stored values whose signatures didn't validate. This has a number of advantages, but one critical one is that you can have the data at a given address change because the address is tied to the cryptographic key not the content. For instance, supposing that what's being stored is my Web site; I might want to change that and not want to have to publish a new address. With an address tied to keys this is possible.

Obviously, cryptographic keys don't make great identifiers either, because they are hard to remember, but presumably you would layer some kind of decentralized naming layer on top, for instance one based on a blockchain.

Security #

Any real system needs some way of ensuring the integrity of the content. Unlike the Web, it's not enough to establish a TLS connection to the storing node, because that's just someone's computer and it could lie (though you still may want to for privacy reasons). Instead, each object needs to be somehow integrity protected, either by having its address be its hash or by being digitally signed.

Aside from the integrity of the content, there's still a lot to go wrong here. For instance, what happens if the responsible node claims that a given object (or a node you are trying to route to) doesn't exist? Or what if a set of nodes try to saturate the network with traffic via a DDoS attack? How do you deal with people trying to store or retrieve more than their "fair share" (whatever that is) of data? There are various approaches people have talked about to try to address these issues, but our operational experience with DHTs is at a smaller scale than our operational experience with the Web, and in a setting that was much more tolerant of failure (Disney doesn't lose a lot of money if people suddenly can't download Frozen from BitTorrent) and so it's not clear that they can be made to be really secure at scale.

A Decentralized Web Publishing System #

Now that we have a way to store data and find it again, we have the start of how one might imagine building a decentralized version of the Web. As we did when looking at how the Web works let's just start with publishing static documents.

Recall the structure of URIs:

What we need to do is to map this structure onto resources in our P2P storage system. So we might end up with a URL like the following:

The Origin #

A critical security requirement in this system is that data associated with different authorities has different origins (see here for background). If data published by multiple users has ~~different origins~~ the same origin [2022-04-25 -- EKR], then they could attack each other via the browser, which is an obvious problem.

The note: at the start tells us that we need to retrieve the data using Note and not via HTTP. In the middle section, instead of having a "host" field which tells us where to retrieve the content in an ordinary HTTPS URI, we instead have an "authority" field which just tells us the identity of the user whose key will be used to sign the data for the URL. As above, I'm assuming we have some way of mapping user friendly identities to keys; some systems don't have that, which seems pretty user-hostile, but feel free to just think of the authority as being a key hash if you prefer.

The resource itself is stored at an address given by Hash(URL) (this is a small but simple change from my description above), and as above, is signed by the key associated with the authority.

This is all pretty straightforward if you assume the existence of the P2P system in the first place. In order to publish something, I do a store into the DHT at the address indicated by the URL and sign it with my key. I can then hand the URL to people who can retrieve the data from the DHT by computing the address and then verifying the signed resource. Note that because the address is computed from the URL and not from the content, it can be updated in place just by doing a new store.
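Here is a minimal sketch of that publish/retrieve flow. Everything specific in it is an assumption on my part: the Map standing in for the DHT, SHA-256 as Hash(), Ed25519 as the signature scheme, and the made-up URL. For simplicity the signature covers only the body; a real design would presumably bind more context.

```javascript
const nodeCrypto = require("node:crypto");

const dht = new Map(); // stand-in for the P2P store
const addressFor = (url) => nodeCrypto.createHash("sha256").update(url).digest("hex");

// Publisher: sign the content and store it at Hash(URL).
function publish(url, content, privateKey) {
    const body = Buffer.from(content);
    const signature = nodeCrypto.sign(null, body, privateKey);
    dht.set(addressFor(url), { body, signature });
}

// Reader: compute the address from the URL, then verify the signature
// against the key associated with the authority.
function retrieve(url, authorityPublicKey) {
    const entry = dht.get(addressFor(url));
    if (!entry) throw new Error("not found");
    if (!nodeCrypto.verify(null, entry.body, authorityPublicKey, entry.signature)) {
        throw new Error("bad signature");
    }
    return entry.body.toString();
}

const { publicKey, privateKey } = nodeCrypto.generateKeyPairSync("ed25519");
const url = "note://author.example/index.html"; // hypothetical URL
publish(url, "version 1", privateKey);
publish(url, "version 2", privateKey); // update in place: same address
console.log(retrieve(url, publicKey));
```

The second publish() overwrites the first at the same address, which is the "update in place" property that content addressing can't give you.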

Taking a step back, this really does sort of deliver on the value proposition I described above: anyone can publish a site into the network without having to have a room full of computers or pay Amazon/Google/Fastly, etc. And so if you don't look too closely, it seems like mission accomplished and it's easy to understand the enthusiasm. Unfortunately this system also has some pretty serious drawbacks.

Performance #

Performance—in this case the time it takes a page to load—is a major consideration for Web browsers and servers. What mostly matters for Web performance is the time it takes to retrieve each resource. This is different from, say, videoconferencing or gaming, where latency (the time it takes your packets to get to the other side) or jitter (variation in latency) really matter. In the Web it's mostly about download speed.

Connections #

In order to understand the performance implications of a shift from client/server to peer-to-peer it's necessary to understand a little bit about how networking and data transfer works. The Internet is a packet-switched network, which means that it carries individually addressed messages that are on the order of 1000 bytes. Because Web resources are generally larger than 1K, clients and servers transfer data by establishing a connection, which is a persistent association on both sides that maps a set of packets into what looks like a stream of data that each side can read and write to. The sender breaks the file up into packets and sends them and the receiver is responsible for reassembling them on receipt. Historically this was done by TCP, though we are now seeing increased use of QUIC, which operates on similar principles (at least at the level we need to talk about here).

The figure below shows the beginning of an HTTPS connection using TCP and TLS 1.3 for security.

Increasing the number of HTTP Requests on a Connection #

When HTTP was originally designed, you could only have one request on a single connection. This was horribly inefficient for the reasons I've described here, and—in large part due to the work of Jeff Mogul—a feature was added that allowed multiple requests to be issued on the same connection. Unfortunately, those requests could only be issued serially, which created a new bottleneck. In response, browsers started creating multiple connections in parallel to the same site, which let them make multiple requests at once (as well as sometimes grab a larger fraction of the available bandwidth, due to TCP dynamics). In 2015, HTTP/2 added the ability to multiplex multiple requests on the same TCP connection, with the responses being interleaved, but still had the problem that a packet lost for response A stalled every other response (a property called head-of-line blocking), which didn't happen between multiple connections. Finally, QUIC, published in 2021, added multiplexing without head-of-line blocking, even over a single QUIC connection.

As you can see, the first two round trips are entirely consumed with setting up the connection. After two round trips, the client can finally ask for the resource and it's another round trip before it finally gets any data. Depending on the network details, each round trip can be anywhere from a few milliseconds to 200 milliseconds, so it can be up to 600ms before the browser sees the first byte of data. This is a big deal and over the past few years the IETF has expended considerable effort to shave round trips from connection setup time for the Web (with TLS 1.3 and QUIC).

Once the connection has been established, you then need to deliver the data, which doesn't happen all at once. As I mentioned before, it gets broken up into a stream of packets which are sent to the other side over time. This is where things get a little bit tricky because neither the sender nor the receiver knows the capacity of the network (i.e., how many bits/second it can carry) and if the sender tries to send too fast, then the extra packets get dropped. To avoid this, TCP (or QUIC) tries to work out a safe sending rate by gradually sending faster and faster until there are signs of congestion (e.g., packets getting lost or delayed) and then backs off. Importantly, this means that initially you won't be using the full capacity of the network until the connection warms up (this is called "slow start"), so the data transfer rate tends to get faster over time until a steady state is reached.^[2]
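A toy model may help show why a warmed-up connection is so much better. The sketch below assumes the sending window simply doubles every round trip until it hits the path capacity; real TCP and QUIC congestion control is considerably more subtle, and the packet counts are illustrative numbers I picked.

```javascript
// Simplified slow start: the congestion window (in packets) doubles each
// round trip until it reaches the path capacity, then stays there.
function roundTripsToSend(totalPackets, initialWindow, pathCapacity) {
    let cwnd = initialWindow;
    let sent = 0;
    let roundTrips = 0;
    while (sent < totalPackets) {
        sent += cwnd;
        roundTrips++;
        cwnd = Math.min(cwnd * 2, pathCapacity);
    }
    return roundTrips;
}

// A cold connection versus one that has already reached steady state.
console.log(roundTripsToSend(100, 10, 80)); // cold: windows of 10, 20, 40, 80
console.log(roundTripsToSend(100, 80, 80)); // warm: starts at full capacity
```

In this model the cold connection needs four round trips to move 100 packets and the warm one needs two, even though the underlying network is identical.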

The implication of all this is that new connections are expensive and you want to send as much data over a single connection as you can. In fact, much of the evolution of HTTP over the past 30 years has been finding ways to use fewer and fewer connections for a single Web page.

Peer-to-Peer Performance #

This brings us to the question of performance in peer-to-peer systems. As I mentioned above, if you want to move significant amounts of data, you really want to have the client connect directly to the node which is storing the data. This presents several problems.

First, we have the latency involved in just sending the first message through the P2P network and back. This will generally be slower than a direct message because it can't take a direct path. Then, it's not generally possible to simply initiate a connection directly to other people's personal computers, as they are often behind network elements like NATs and Firewalls. So-called "hole punching" protocols like ICE allow you to establish direct connections in many cases, but they introduce additional latency (minimum one round trip, but often much more). And once that's done you then still have to establish an encrypted connection, so we're talking anywhere upward from 2 additional round trips. To make matters worse, there will be many cases where the storing node is quite topologically far from you and therefore has a long round trip time; big sites and CDNs deliberately locate points of presence close to users, but this is a much harder problem with P2P systems. And of course, even once the connection has been established, we're still in slow start.

This is all kind of a bad fit for Web sites, which tend to consist of a lot of small files. For example, the Google home page, which is generally designed to be lightweight, currently consists of 36 separate resources, with the largest being 811 KB. If each of these resources is stored separately in the DHT, then you're going to be running the inefficient setup phase of the protocol a lot and will almost never be in the efficient data transfer phase. This is by contrast to HTTP and QUIC, which try to keep the connection to the server open so that they can amortize out the startup phase.
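To get a feel for the numbers, here is a back-of-the-envelope sketch using the figures from earlier: two round trips of setup per new connection and one more for the request. It assumes resources are fetched one after another and ignores transfer time, slow start, parallelism, and the extra P2P costs (routing, hole punching), so treat the output as a rough illustration only.

```javascript
const SETUP_ROUND_TRIPS = 2;   // TCP + TLS 1.3 handshakes, as described above
const REQUEST_ROUND_TRIPS = 1; // request out, first byte back

// Total latency spent on round trips when fetching resources serially.
function serialFetchLatencyMs(resourceCount, rttMs, reuseConnection) {
    const setups = reuseConnection ? 1 : resourceCount;
    return (setups * SETUP_ROUND_TRIPS + resourceCount * REQUEST_ROUND_TRIPS) * rttMs;
}

// 36 resources at an assumed 100 ms round trip time.
console.log(serialFetchLatencyMs(36, 100, false)); // new connection per resource
console.log(serialFetchLatencyMs(36, 100, true));  // one connection, reused
```

With these assumptions the per-resource-connection case spends 10800 ms on round trips versus 3800 ms when one connection is reused, and almost two thirds of the former is pure setup.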

It's obviously possible to bundle up some of the resources on a site into a single object, but this has other problems. First, it's hard on the browser cache because many of those objects will be reused on subsequent loads. Second, it makes the connection to a single node the rate limiting step in the download, which is bad if that node (which, recall, is just someone else's computer) doesn't have a good network connection or is temporarily overloaded. The result is that we have a tension between what we want to do to minimize individual fetch latency, which is to send everything over a single connection, and what we want to do in order to avoid bottlenecking on single elements, which is to download from a lot of servers at once, like BitTorrent does.

All of this is less of an issue in contexts like movie downloading, where the object is big and so overall throughput is more important than latency. In that case, you can parallelize your connections and keep the pipe full. However, this isn't the situation with the Web, where people really notice page load time. As far as I know, building a large P2P network with comparable load-time performance to the Web is a mostly unsolved problem.

Security and Privacy #

Even if we assume that the P2P network itself is secure in the sense that attackers can't bring it down and the data is signed, this system still has some concerning properties.

Privacy #

In any system like the Web, the node that serves data to the client learns which data a given client is interested in, at least to the level of the client's IP address. This isn't an ideal situation in the current Web, hence IP address concealment techniques like Tor, VPNs, Private Relay, etc., but at least it's somewhat limited to identifiable entities that you chose to interact with (though of course the ubiquitous tracking in Web advertising makes the situation pretty bad).

The situation with P2P systems is even worse: downloading a piece of content means contacting a more or less random computer on the Internet and telling it what you want. As I noted above, you could route all the traffic through the P2P network but only by seriously compromising performance, so realistically you're going to be sharing your IP address with the node. Worse yet, in most cases the data is going to be sharded over multiple nodes, which means that a lot of different random people are seeing your browsing behavior. Finally, in many networks it's possible for nodes to influence which data they are responsible for, in which case one might imagine entities who wished to do surveillance trying to become responsible for particular kinds of sensitive data and then recording who came to retrieve it; indeed, it appears this is already happening with BitTorrent.

Access control—putting the public in publishing #

Much of the Web is available to everyone, but it's also quite common to have situations in which you want to restrict access to a piece of data. This can be the site's data, such as the paywalls operated by sites like the New York Times, or the user's data, such as with Facebook or Gmail. These are implemented in the obvious way, by having an access control list on the server which states which users can access each piece of data and refusing to serve data to unauthorized users. This won't work in a P2P system, however, in that there's no server to do the enforcement: the data is just stored on people's computers and even if the site published access control rules, the site can't trust the storing node to follow them. It might even be controlled by the attacker.

The traditional answer to this problem is to encrypt the content before it's stored in the DHT. Even if the data in the DHT is public, that's just the ciphertext. This actually works modestly well when the content is the user's and they don't want to share it with anyone because they can encrypt it to a key they know and then just store it in the DHT. This could even be done with existing APIs (e.g., WebCrypto), and the key is stored on the user's computer. It works a lot less well if they want to share it with other people (especially with read/write applications like Google Docs) because you need cryptographic enforcement mechanisms for all of the access rules. There has been some real work on this with cryptographic file systems like SiRiUS and Tahoe-LAFS, but it's a complicated problem and I'm not aware of any really large scale deployments.
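For the simple single-user case, the encrypt-then-store step might look roughly like this. I'm using Node's built-in crypto module rather than WebCrypto to keep the sketch synchronous, and AES-256-GCM is my choice of cipher, not something the systems above mandate.

```javascript
const nodeCrypto = require("node:crypto");

// Encrypt under a key only the user knows; the result is what goes in the DHT.
function encryptForStorage(key, plaintext) {
    const iv = nodeCrypto.randomBytes(12); // fresh nonce for every object
    const cipher = nodeCrypto.createCipheriv("aes-256-gcm", key, iv);
    const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
    return { iv, ciphertext, tag: cipher.getAuthTag() };
}

// Decrypt after retrieval; final() throws if the key or the data is wrong.
function decryptFromStorage(key, stored) {
    const decipher = nodeCrypto.createDecipheriv("aes-256-gcm", key, stored.iv);
    decipher.setAuthTag(stored.tag);
    return Buffer.concat([decipher.update(stored.ciphertext), decipher.final()]).toString("utf8");
}

const key = nodeCrypto.randomBytes(32); // stays on the user's computer
const stored = encryptForStorage(key, "my private notes");
console.log(decryptFromStorage(key, stored));
```

The storing node only ever sees iv, ciphertext, and tag. Sharing with other people is exactly the part this sketch does not address.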

The paywall problem is actually somewhat harder. For instance, the New York Times could encrypt all its content and then give every subscriber a key which could be used to decrypt it, but given the number of subscribers, and that only one has to leak the key,^[3] the chance that that key will leak is essentially 100%.^[4] Of course, people share NYT passwords too, but what makes this problem harder is that the password then has to be used on the NYT site and it's possible to detect misbehavior, such as when 20 people use the same password. I'm not aware of any really good P2P-only solution here.
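The "essentially 100%" claim is easy to check with a little arithmetic. The sketch below assumes each subscriber leaks independently with the same small probability; the subscriber count and per-subscriber probability are numbers I invented for illustration.

```javascript
// Probability that at least one of n subscribers leaks the shared key,
// assuming each leaks independently with probability p.
function leakProbability(subscribers, perSubscriberLeakProbability) {
    return 1 - Math.pow(1 - perSubscriberLeakProbability, subscribers);
}

// Illustrative numbers: a million subscribers, each with a 1-in-10,000
// chance of leaking the key.
console.log(leakProbability(1e6, 1e-4));
```

Even with a very small per-subscriber probability, the result is indistinguishable from 1 once the subscriber count gets large.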

Non-Static Content #

Access control is actually a special case of a more general problem: many if not most Web sites do more than simple publishing of static content and those sites depend on server side processing that is hard to replicate in a decentralized system.

Non-Secret Computation #

As a warm-up, let's take a comparatively easy problem, the shopping site I described in part II of my Web security model series. Effectively, this site has three server-side functions that need to be replicated:

Product search
Shopping cart maintenance
Purchasing

The second and third of these are actually reasonably straightforward: the shopping cart can be stored entirely on the client or, alternately, stored self-encrypted by the client in the P2P system, as described in the previous section. The purchasing piece can be handled by some kind of cryptocurrency (though things are more complicated if you want to take credit cards). However, product search is more difficult. The obvious solution would just be to publish the entire product catalog in the network, have the client download it, and do search locally. This obviously has some pretty undesirable performance consequences: consider how much data is in Amazon's catalog and how often it changes.

Obviously, the way this works in the Web 2.0 world is that the server just runs the computation and returns the result, and at this point you usually hear someone propose some kind of distributed computation system a la Ethereum smart contracts (though you probably don't want the outcome recorded on the blockchain). In this case, instead of publishing a static resource, the site would publish a program to be executed that returned the results (often these programs are written in WebAssembly).

Aside from the obvious problem that this still requires the node executing the program to have all the data, it's hard for the end-user client to determine that the node has executed the program correctly. Even in a simple case like searching for matching records: if those records are signed then the node can't substitute their own values, but they can potentially conceal matching ones. There are, of course, cryptographic techniques that potentially make it possible to prove that the computation was correct, but they are far from trivial. So, this doesn't have a really great solution.

Secret Information #

A shopping site is actually a relatively simple case because the information is basically public—though in some cases the site might not want their catalog to be public—but there are a lot of cases where the site wants to compute with secret information. There are two primary situations here:

The site's secret information, for instance Twitter's recommendation algorithm is not public.
The user's secret information, for instance which other users they have "swiped right" on in a dating app, or even just users' profile details.

In Web 2.0, the way this works is that the server knows the secret information and uses it for the computation but doesn't reveal it to the users. As with the search case, though, that doesn't port easily to the P2P case because it's not safe to reveal the information to random people's personal computers.

There are, of course, cryptographic mechanisms for computing specific functions with encrypted data. For instance, Private Set Intersection techniques make it possible to determine whether Alice and Bob both swiped right on each other and only tell them if they both did, but they're complicated and more importantly task specific, so you need a solution for each application, and sometimes that means inventing new cryptography (to be clear, this is far from all that is required to implement a secure P2P dating system!).

This is actually a general problem with cryptographic replacements for computations performed on "trusted" servers. The positive side of cryptographic approaches is that they can provide strong security guarantees, but the negative side is that essentially each new computation task requires some new cryptography, which makes changes very slow and expensive. By contrast, if you're doing computation on a server, then changing your computations is just a matter of writing new code and loading it onto the server. The obvious downside is that people have to trust the server, but clearly a lot of people are willing to do that.

Hybrid Architectures #

One idea that is sometimes floated for addressing this kind of functional issue is to have a hybrid architecture. For instance, one might imagine implementing the shopping site by having the static content of the catalog served via the P2P network but having a server which handled the searches and returned pointers to the relevant sections of the catalog. You could even encrypt each individual catalog chunk so that it was hard for a competitor to see your entire catalog. You could even imagine building a dating site with—handwaving alert!—some combination of P2P and server technology, with the logic for determining which profiles you could see and which to match you with implemented on the server, but the (encrypted) profiles distributed P2P.

At this point, though, you have a pretty substantial server component that is in the critical path of your site and so you're mostly using the P2P network as a kind of not-very-fast CDN (see, for instance, PeerCDN). This gives up most of the benefits of having your system decentralized in the first place: you still have the problem of hosting your server somewhere, which probably means some cloud service, and at that point why not just use a CDN for your static content anyway? Similarly, if you're worried about censorship, then you need to worry about your server being censored, which makes your site unusable even if the P2P piece still works.

Closing Thoughts #

It's easy to see the appeal of a more decentralized Web: who wants to have a bunch of faceless mega-corporations deciding what you can or cannot say? And there certainly are plenty of jurisdictions that censor people's access to the Web and to information more generally. It's easy to look at the success of P2P content distribution systems—albeit to a great extent for distributing content for which other people hold the copyrights—and come to the conclusion that it's a solution to the Web centralization problem.

Unfortunately, for the reasons described above, I don't think that's really the right conclusion. While the Web sort of superficially resembles a content distribution system, it's actually something quite different, with both a far broader variety of use cases and much tighter security and performance requirements. It's probably possible to rebuild some simpler systems on a P2P substrate, but the Web as a whole is a different story, and even systems that appear simple are often quite complex internally. Of course, the Web has had almost 30 years to grow into what it is, and it's possible that there are technological improvements that would let us build a decentralized system with similar properties, but I don't think this is something we really understand how to do today.

Though see X in which these roles are sort of reversed. ↩︎
Interestingly, within certain limits latency doesn't have that much impact on how fast you can send the data because the rate control algorithms can adjust for latency. ↩︎
Allan Schiffman used to call this a "distributed single point of failure". ↩︎
Or, as the nerds say, "unity". ↩︎

Understanding The Web Security Model, Part IV: Cross-Origin Resource Sharing (CORS)

2022-04-19T00:00:00Z

This is part IV of my series on the Web security model (parts I, II, outtake, III). In this post, I cover cross-origin resource sharing (CORS), a mechanism for reading data from a different site.

As discussed in part III, the Web security model allows sites to import content from another site but generally isolates that content from the importing site. For instance, example.com can pull in an image from example.net and display it to the user, but it can't access the contents of the image. This is a necessary security requirement because it prevents attackers from exploiting ambient authority to access sensitive data but it also prevents legitimate uses for cross-origin data, such as a cross-origin API.

Cross-Origin APIs #

Consider the case where there is a Web service that has an API, like Wikipedia or Bugzilla, and you want to write a Web application which takes advantage of that API. For instance, suppose I have a little Web service which lets you get the weather at a specific location indicated by ZIP code. This service might have an API endpoint at the following URL.

https://weather.example/temperature?94303

With the response being a JSON structure:

{
  "temperature":"25",
  "units":"C"
}

A Web site could access this API and display the local temperature using the fetch() API like so, with the zip code being 94303 (Palo Alto).

fetch("https://weather.example/temperature?94303")
    .then(a => a.json()).then(a => {
        console.log("Temperature is "+ a.temperature + " degrees " + a.units);
    })
    .catch(a => {
        console.log("Error " + a)
    });

Obviously, a real application would do something more interesting, but I'm just giving an example here; as with many things Web, the platform capability is simple but the complexity is in the application logic.

Server-to-Server APIs #

It's mostly possible to replace all of these client-side APIs with server-to-server APIs in which the API-using Web site talks directly to the Web service. This is a pretty common pattern on the Web: the user authorizes site A to perform operations on their behalf on site B (typically using OAuth) and then send the data to the client. This is, for instance, how GitHub integrations work.

However, there are plenty of situations where it's more efficient to send the data directly to the client, especially if there is a lot of data. Note that from the perspective of site B it's not really safer to have the data sent to a Web page served off of site A than it is to send it to site A directly, because the JS is of course under control of site A and can always just send it back to A.

This all works fine if the site that is consuming the temperature API is the same as the one hosting it, but what if it's not? There are a number of ways this can happen:

The sites are operated by the same entity, but the site is built as a Web app that runs in the browser and consumes data from the API. The app might be downloaded from one server and the API be on another server.
The sites are operated by different entities, for instance if the Web service is public, as in my temperature example.

However, if the sites are different, then this request violates the same-origin policy, as described in part III. If I try to do this, the browser will generate an error (on Firefox, TypeError: NetworkError when attempting to fetch resource) triggering the catch clause in the code above.

This restriction exists for a good reason. Even though this particular application seems safe, because the temperature API is public, others might not be. This is because (1) the Web threat model assumes that any site can be malicious and (2) requests from the browser contain the ambient authority of the client. If you allow an attacker to use the ambient authority of the client, you are asking for problems. For example, Gmail is a "single page app" in which the server loads a JS program onto the browser and then that program uses Web APIs to read your messages. If other Web sites could do that, then this would obviously be bad!

Instead of restricting what you can do with the cross-origin requests, you might think that browsers could get away with just removing cookies whenever you use cross-origin fetch().^[1] This is only a partial solution though, because cookies are not the only kind of ambient authority. A particularly important case is where the victim browser is able to connect to network resources that the attacker cannot directly, for instance if the browser is on the same local network as the server and there is a firewall preventing external access, but the server doesn't use cookies for access control. In this case, if an attacker could do cross-origin fetch() then they might be able to steal data from the server even if the browser strips cookies.

Even with the same-origin policy it is still possible to attack machines behind the firewall under certain conditions. For instance, if they are not using HTTPS, then it is possible to mount something called a DNS rebinding attack in which the attacker loads their page and then changes their DNS so that their site (e.g., attacker.example) points to the server behind the firewall. This causes the browser to think that the behind-the-firewall server is actually the attacker's server and hence same-origin to the attacker's site (another reason to use HTTPS).

What we need here is a controlled way of allowing cross-origin requests that ensures they can't be used for attack.

JSONP #

It turns out that even without CORS, the Web platform actually had a mechanism that lets you make cross-origin requests; it's just super-hacky. You may recall from part III that JavaScript executes in the context of the loading page, even when it's loaded from another origin. This means that you can simulate a Web services API by having the main Web page load a script from the Web services site. That script then inserts the data into the context of the loading Web page.

In order to make this work, you need to do two things:

Instead of using fetch() the API-using page needs to use <script src=""> to load the API point from the server.
Instead of returning JSON, the server needs to return actual JavaScript which then inserts the data in the page.

For instance, the API-using page might do:

<script>
function temperatureReady(a) {
    console.log("Temperature is "+ a.temperature + " degrees " + a.units);
}
</script>

<script src="https://weather.example/temperature?94303&callback=temperatureReady"></script>

And then the Web service API would return:

temperatureReady(
    {
        "temperature":"25",
        "units":"C"
    }
);

This code just calls the temperatureReady() function that already exists in the page (the way the Web service knows which function to call is that it's passed in a query parameter in the URL) with the data as the argument to the function. Because the script runs in the context of the page, this is permitted and the result is that the data gets imported into the page as well. Mission accomplished!
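On the server side, producing that response is just string wrapping. Here is a minimal sketch; the function name is hypothetical, and the check that the callback is a plain identifier is a precaution I'm adding as an assumption, since the callback name comes straight from the URL.

```javascript
// Build the JavaScript body for a JSONP response: callbackName(<json>);
function jsonpResponseBody(callbackName, data) {
    // Assumption: only accept simple identifiers, so the query parameter
    // can't be used to smuggle arbitrary script into the response.
    if (!/^[A-Za-z_$][A-Za-z0-9_$]*$/.test(callbackName)) {
        throw new Error("invalid callback name");
    }
    return `${callbackName}(${JSON.stringify(data)});`;
}

const body = jsonpResponseBody("temperatureReady", { temperature: "25", units: "C" });
console.log(body);
```

The server would send this string with a JavaScript content type instead of the raw JSON.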

Note that in the real world the API-using page wouldn't just statically include the script. Rather, when you wanted to make an API call, JS on the page would dynamically insert the script tag (remember that JS can manipulate the DOM), inserting whatever URL was necessary to make the correct API call.
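A rough sketch of that dynamic insertion follows. The helper name is made up, and the optional doc parameter exists only so the function can be exercised outside a browser; in a page you would just let it default to the global document.

```javascript
// Make a JSONP "call" by inserting a <script> tag whose URL names the callback.
function jsonpRequest(baseUrl, callbackName, doc = document) {
    const script = doc.createElement("script");
    const separator = baseUrl.includes("?") ? "&" : "?";
    script.src = baseUrl + separator + "callback=" + encodeURIComponent(callbackName);
    doc.head.appendChild(script); // the browser fetches and runs it
    return script;
}

// In a browser, with temperatureReady() already defined on the page:
//   jsonpRequest("https://weather.example/temperature?94303", "temperatureReady");
```

Each call adds a new script element, so a real implementation would probably also remove the tag once the callback has fired.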

This idiom, invented (or at least popularized) by Bob Ippolito, is conventionally called JSONP, because it's commonly used to wrap APIs which use JSON-formatted data and that JSON data is "padded" by wrapping it to make it valid JavaScript (otherwise it will be rejected by the browser as JSON is not well-formed JavaScript). However, there is no rule that the JavaScript returned by the site has to have embedded JSON in it. For instance it could return XML and invoke the XML parser, or just return a bare value such as the temperature as an integer. The API contract just requires that the JS served by the server calls the callback function that the API-using page indicates; as long as it does that everything will work.

Attacks by the API Server #

Moreover, nothing restricts the Web services server from doing other things besides calling the indicated callback: it can do anything it wants, including changing the DOM in any way it pleases, stealing the user's cookies, or making API calls to the Web site that the page was served off of. In other words, a naive use of JSONP requires large amounts of trust in the Web service you are using; this is obviously not ideal.

It's possible to address these issues by adding a third origin into the mix, as shown in the diagram below:

The idea here is that instead of loading the JavaScript directly from the API server into your page, you instead load it into an IFRAME which is hosted on a second origin that you control (e.g., proxy.example.com). That IFRAME ends up with the data but because it's cross-origin to your site it can't impact your site, and thus it is safer to load potentially malicious JS into it. You then use the postMessage() API to talk to the IFRAME to get the data in and out. Effectively, this creates a little proxy which protects you against the Web services API JS.^[2] I've actually never seen this trick written down (readers: if you're aware of a published description, please send me pointers) but I'm pretty confident it will work.
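
To make the proxy idea a bit more concrete, here is a sketch of the message handler that could run inside the IFRAME. It is only an illustration under stated assumptions: `makeProxyHandler` and `callApi` are hypothetical names, and `callApi(query, done)` stands in for whatever JSONP call the IFRAME makes to the Web service.

```javascript
// Sketch of the handler that would run inside the proxy IFRAME.
// callApi(query, done) is a hypothetical stand-in for the JSONP call.
function makeProxyHandler(parentOrigin, callApi) {
  return function onMessage(event) {
    // Only accept requests from the site that embedded this IFRAME.
    if (event.origin !== parentOrigin) return;
    const { id, query } = event.data;
    callApi(query, (result) => {
      // Reply with data only, addressed to the parent origin.
      event.source.postMessage({ id, result }, parentOrigin);
    });
  };
}

// In a browser, the IFRAME page would register it roughly like this:
// window.addEventListener("message",
//   makeProxyHandler("https://example.com", callTemperatureApi));
```

The parent page would do the mirror image: post a request to the IFRAME's window and check that replies really come from the proxy origin before using them.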

Of course, this is all a bit clunky, but it works (to quote Spinal Tap, "it's such a fine line between stupid and clever."). If you wanted to do cross-origin queries before CORS you didn't have a lot of options.

Attacks by the API Client #

Maybe the API-using site trusts the Web service site or uses something like the proxy technique above to protect itself, but that just gets us back to where we were without JSONP, with the need to find some way to protect the Web service from the API client.

There are actually two related problems:

Preventing the API client from reading data it shouldn't from the service.
Preventing the API client from causing unwanted side effects on the service.

The way to think about both of these is that the attacker is abusing the user's authority to talk to the Web service, and so is able to cause the Web service to do things on behalf of the user. It's important to understand that the server is trusting the browser to follow the rules; if the browser behaves incorrectly then all bets are off. The reason this is (mostly) OK is that the threat model is that the attacker is attempting to abuse the user's access to the service. Nothing stops the user from extracting the cookies themselves and making any requests they want. The server has to have its own access control checks that prevent abuse by the user.

The basic defense here is to ensure that the client site which is making the request is authorized to do so. A common pattern is for the service to require you to authorize that site, with a dialog like the one below. Note: this dialog is actually for a different kind of access where CircleCI talks directly to GitHub, but the idea is the same and how would you know if I didn't tell you?

If you approve access for site circleci.com, then the Web service (in this case GitHub) would add an access control entry to your account indicating that the other site (in this case circleci.com) could make requests on your behalf. Of course, it then has to actually enforce those rules, which is where things get a little bit tricky. This is done using either the Referer header or the newer Origin header to determine which site is making the request. The service then looks that up against the access control list to determine whether to allow the request or not. Neither of these headers can normally be controlled by the attacker (they are on the forbidden header list of headers which JS cannot modify) and therefore can be trusted by the server (remember that you're worried about attack by a site, not by the user, who can of course make their browser do whatever they want).
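
A minimal sketch of that server-side check, assuming the access control list is a Map from user ID to a Set of approved origins (that shape, and the function names, are hypothetical). It prefers Origin, falls back to the origin portion of Referer, and denies the request when neither is usable.

```javascript
// Work out which site is making the request from the Origin header, falling
// back to the Referer header. Header names are assumed to be lower-cased.
function requestingOrigin(headers) {
  if (headers.origin && headers.origin !== "null") return headers.origin;
  if (headers.referer) {
    try {
      return new URL(headers.referer).origin;
    } catch {
      return null; // unparseable Referer
    }
  }
  return null;
}

// acl: Map of user ID -> Set of origins the user has approved (hypothetical).
function isAuthorized(headers, userId, acl) {
  const origin = requestingOrigin(headers);
  if (origin === null) return false; // missing headers: deny rather than guess
  const allowed = acl.get(userId);
  return allowed !== undefined && allowed.has(origin);
}

const acl = new Map([["alice", new Set(["https://circleci.com"])]]);
isAuthorized({ origin: "https://circleci.com" }, "alice", acl);     // allowed
isAuthorized({ origin: "https://attacker.example" }, "alice", acl); // denied
```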

The major drawback of using Referer or Origin in this way is that they are sometimes missing and the checks can be tricky to get right in which case you will inadvertently deny service to a legitimate client. As far as I can tell, however, they fail "safe" in that if you implement them correctly you won't accidentally give access to someone who should not have access.^[3]

From one perspective, JSONP solves our problem: it lets us make cross-origin API requests. In principle, we probably could build everything we want with JSONP, but in practice it's a seriously clunky mechanism (especially the part where we inject <script> tags into the DOM) that takes a huge amount of care to use correctly, and has big risks if used incorrectly. A lot of that can be hidden with libraries but we still know it's there. With that said, many big sites (e.g., Google, Twitter, LinkedIn, etc.) deployed JSONP APIs, which just shows how useful a capability it is. What we needed was a mechanism that did much the same thing but was simpler and safer. This brings us to CORS.

CORS #

The basic idea behind CORS is that it allows the site from which the resource is being retrieved to make limited exceptions to the same-origin policy.

Simple Requests #

The simplest version of CORS allows the API-using site to read back the results of its cross-origin requests, which, you'll recall, is normally forbidden. In order to allow this, the server sends back an Access-Control-Allow-Origin header listing the origin that is allowed to read back the data. There are two main options here:

* indicating that any origin is permitted
An actual origin, such as https://example.com indicating that only that origin is permitted

Here is an example of a successful CORS request, in which example.com serves a page that makes a fetch() request to service.example. In this case, the service wants to allow the request so it sends an appropriate Access-Control-Allow-Origin header, with the result that the browser delivers the data to the JS.

Sites use the * value when they don't care who can read their data (effectively for public data) and an actual origin if they want to restrict it to certain origins (or to authenticated users, as described below). You're only allowed to specify a single origin, so as a practical matter the server needs to look at the client's Origin header and provide something matching in response. This is already useful as it allows for effectively public data, and it mostly doesn't enhance the attacker's capabilities as in most cases the attacker can just connect directly to the server and retrieve the data (with the exception of topological controls as described above).
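
The "look at the client's Origin header and provide something matching" step can be sketched as follows. The function name and the shape of the allow-list are assumptions for illustration; the Vary: Origin header is my addition, included because the response now differs depending on who asked and caches need to know that.

```javascript
// allowedOrigins is either the string "*" (public data) or a Set of origins.
function corsHeaders(requestOrigin, allowedOrigins) {
  if (allowedOrigins === "*") {
    return { "Access-Control-Allow-Origin": "*" };
  }
  if (requestOrigin && allowedOrigins.has(requestOrigin)) {
    // Echo back the single matching origin.
    return { "Access-Control-Allow-Origin": requestOrigin, "Vary": "Origin" };
  }
  // No CORS headers at all: the browser will not hand the response to the JS.
  return {};
}

const allowed = new Set(["https://example.com", "https://partner.example"]);
corsHeaders("https://example.com", allowed);      // echoes https://example.com
corsHeaders("https://attacker.example", allowed); // no CORS headers
```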

Where things get interesting is if the client provides a cookie, because that cookie is (likely) tied to the user's authentication and therefore is not something that an attacking Web site could get unless they had compromised the user's credentials. Allowing cross-origin reads in these circumstances is more dangerous and CORS requires the service to add another header, Access-Control-Allow-Credentials, in order for the data to be readable. By default, cross-origin requests don't include a cookie, which means that if the server sets a cookie for some other reason (this is quite common) and no authentication is required, things will still work even if the server doesn't set this header.
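
From the browser's side, the decision about whether the JS gets to read the response can be approximated by the check below. This is a simplified sketch rather than the actual algorithm from the specification, and the function name is hypothetical. One detail that goes beyond the text above: for requests that carry credentials, browsers do not accept the * wildcard, so the server has to name the origin explicitly.

```javascript
// responseHeaders: object with lower-cased header names (an assumption of
// this sketch). Returns true if the JS may read the cross-origin response.
function responseReadable(requestOrigin, withCredentials, responseHeaders) {
  const allowOrigin = responseHeaders["access-control-allow-origin"];
  if (allowOrigin === undefined) return false; // server did not opt in
  if (!withCredentials) {
    return allowOrigin === "*" || allowOrigin === requestOrigin;
  }
  // Credentialed request: the origin must match exactly and the server must
  // also send Access-Control-Allow-Credentials: true.
  return (
    allowOrigin === requestOrigin &&
    responseHeaders["access-control-allow-credentials"] === "true"
  );
}
```

Note that a response with no CORS headers at all is never readable, which is the pre-CORS behavior.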

Non-Simple Requests #

This all works fine for situations where the security property you need to enforce is that the client can't read data from the server, but what about cases where what you're concerned about is not the site reading back the data but that the request itself is dangerous even if the client can't read back the response (for instance, the request might delete some of the user's data)?

For this category of requests, CORS requires what's called a "preflight", which is basically an HTTP request in which the browser asks "Is it OK if I were to make this request?", and then only makes the request if the server says "yes", as shown in the diagram below.

Note that the preflight uses the OPTIONS method. Because OPTIONS is not used for ordinary HTTP requests, this prevents side effects from the preflight itself.
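
A server's answer to a preflight might be computed along these lines. This is a sketch under assumptions: the `policy` object shape and the function name are hypothetical, header names are assumed to be lower-cased, and the 403 status for a refusal is just one reasonable choice (what matters to the browser is that the permission headers are absent).

```javascript
// policy: { origins: Set, methods: Set, headers: Set of lower-cased names }
function handlePreflight(headers, policy) {
  const origin = headers["origin"];
  const method = headers["access-control-request-method"];
  const requested = (headers["access-control-request-headers"] || "")
    .split(",")
    .map((h) => h.trim().toLowerCase())
    .filter((h) => h !== "");

  const ok =
    policy.origins.has(origin) &&
    policy.methods.has(method) &&
    requested.every((h) => policy.headers.has(h));

  // Refusal: no CORS headers, so the browser never sends the real request.
  if (!ok) return { status: 403, headers: {} };

  return {
    status: 204,
    headers: {
      "Access-Control-Allow-Origin": origin,
      "Access-Control-Allow-Methods": [...policy.methods].join(", "),
      "Access-Control-Allow-Headers": [...policy.headers].join(", "),
      "Vary": "Origin",
    },
  };
}
```

The handler only answers the question; it never performs the requested action itself, so the preflight has no side effects.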

So, what requests need preflighting? Those which meet any of the following conditions:

Using an HTTP method other than GET, HEAD, or POST
Using non-automatic values for any headers other than Accept, Accept-Language, Content-Language, Content-Type, Range
Having any media type other than application/x-www-form-urlencoded, multipart/form-data or text/plain
Having any event listeners registered for the upload
Using a ReadableStream on the request
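
These conditions can be written as a predicate. The sketch below is deliberately simplified: the request object shape and the `hasUploadListeners` and `bodyIsStream` flags are hypothetical, and the real rules also constrain header values, which is ignored here. It treats a registered upload listener or a streamed request body as triggering a preflight, which is how I understand the simple-request definition.

```javascript
const SIMPLE_METHODS = new Set(["GET", "HEAD", "POST"]);
const SIMPLE_HEADERS = new Set([
  "accept", "accept-language", "content-language", "content-type", "range",
]);
const SIMPLE_TYPES = new Set([
  "application/x-www-form-urlencoded", "multipart/form-data", "text/plain",
]);

// req: { method, headers, hasUploadListeners, bodyIsStream } (hypothetical)
function needsPreflight(req) {
  if (!SIMPLE_METHODS.has(req.method.toUpperCase())) return true;

  const headers = req.headers || {};
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!SIMPLE_HEADERS.has(lower)) return true;
    if (lower === "content-type") {
      // Compare only the media type, ignoring parameters such as charset.
      const mediaType = value.split(";")[0].trim().toLowerCase();
      if (!SIMPLE_TYPES.has(mediaType)) return true;
    }
  }

  if (req.hasUploadListeners) return true;
  if (req.bodyIsStream) return true;
  return false;
}

needsPreflight({ method: "GET" });    // false: a simple request
needsPreflight({ method: "DELETE" }); // true: method is not GET/HEAD/POST
```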

This is a sort of odd list, isn't it? Take the method for example. You can do plenty of damage using the POST method. And why can you do POST and not PUT, for instance? For many of these properties, the answer is that these are the capabilities that JavaScript already had pre-CORS. For example, if you have an HTML form, you can generate an HTTP request with any of these methods and the allowed media types. I haven't checked the other restrictions in detail, but I believe they map onto similar "you can already do it" contours: for instance, HTML forms let the site upload stuff, but if you can track the progress of the upload, then you can see if the server processed some part of it and then took some action (for instance, rejected it). This would let you learn some information about the behavior of the server in response to this request, which you otherwise would not be permitted to do.

In other words, simple requests are (approximately) those you could do without CORS, which means that they are safe to do with CORS, as long as the server agrees to the JS having access to the data. However, if you couldn't have done it without CORS the client needs to do a preflight.

Failing Safe #

One thing that's key to note here is that the server has to opt-in to any of the new CORS behavior. For simple requests, if the server doesn't respond with the appropriate header, then the response won't be available to the JS, as shown in the example below.

For non-simple requests, if the server doesn't accept the preflight, then the request never happens at all. Because not sending these headers is just the existing pre-CORS behavior, this means that CORS fails safe: if you have a server which you didn't update then the browser just falls back to the pre-CORS behavior. This is a really critical property when rolling out a new Web feature: we don't want that feature to be a threat to existing sites.

The Web's Design Values #

Pulling back, the story of CORS is a good example of how the Web platform evolves.

Don't Break Anything #

As detailed in part III, the basic structure of the same-origin policy and the capabilities it gives sites was well in place before we really understood the security implications. This means that sites had come to depend on those properties and that made them really hard to change. Because those properties were hard to change, sites had to build defenses under the assumption that browsers weren't going to change their behavior, hence compatible hacks like anti-CSRF tokens rather than more principled solutions like SameSite Cookies that depended on the browser changing.

Conversely, when we are rolling out a new feature, it's critically important that it not create a new security threat for the Web. In particular, sites depend on the existing browser behavior, so you can't change that in a way that would make existing behavior unsafe.^[4] However, this means that it's generally safe to deploy new functionality as long as it stays within the existing assumptions that sites have made about browser behavior, which is how you get to the design of CORS.

Paving the Cowpaths #

If there's any consistent pattern in the Web, it's that if there is something people want to do and there is a way to do it—no matter how hacky—people will find that way and use it; hence JSONP (see also, long poll).

Much of the job of evolving the Web platform consists of looking at what people do with the Web in a hacky way and designing better mechanisms that (1) do what people want, (2) are convenient, or at least more convenient than whatever they are doing now, and (3) don't create new risks. If this is done right, the new mechanism will gradually replace the old hacky one and the Web gets a little better.

Next Up: Side Channels #

Everything I've written so far assumed that browsers actually do enforce the guarantees that they are supposed to enforce. Unfortunately, this turns out to be a lot harder to do than you might think. In particular, there are a number of situations where attackers can use side channels (e.g., timing) to learn information that they can't learn directly. I'll be covering that in the next post.

Removing them from any cross-origin load would break cases where sites load cross-origin images and the like. ↩︎
I believe it's also possible for the Web Service to know that it will be loaded inside an IFRAME and thus dispense with the extra site, but I'm not 100% sure. ↩︎
Referer checking is also a common defense-in-depth measure against Cross-Site Request Forgery (CSRF) attacks, but it's not entirely sufficient because of the way HTTP handles redirects. Specifically, if a victim site redirects a page to an attacker site and the attacker re-redirects back to the victim site to mount a CSRF, the Referer header will be the victim site, which creates an attack vector. This is not really an issue for JSONP because if you load JS off an attacker site, you already have much bigger problems than CSRF. ↩︎
WebSockets was delayed for some time after Huang, Chen, Barth, Jackson, and I found a low incidence risk from deploying it as-is and the WG had to add a defense called "masking". ↩︎

Lake Sonoma 50 Race Report

2022-04-12T00:00:00Z

Last weekend I raced the Lake Sonoma 50 mile up in Northern California. In ultra circles, Sonoma is well known for being very runnable, which—in the ultra context—means that there aren't a lot of long or steep hills and it mostly consists of dirt fire roads and smooth non-technical single-track (i.e., one person wide) trails, so you can plausibly run almost the whole thing if you are strong. This is by contrast to some other races I've done like Bigfoot 73, which were steeper and had more difficult footing, so as a practical matter you were going to be doing a lot of hiking.

There's almost nothing in Sonoma that I couldn't have run on its own or in a 25 mile event, but it has around 10,500 ft (3200m) of elevation gain (and also 10,500 ft of loss because it's an out and back course), which means that it's full of rolling hills and small creek crossings and you're almost never running on the flats. To do well you have to have good fitness and the discipline to keep the right pace, so it seemed like a good opportunity to test out my early season fitness. I put my name into the lottery and got waitlisted, but then apparently a lot of people decided not to do it, as they cleared the waitlist and then re-opened entries to everyone. This gave my training partner Chris a chance to sign up and we ran most of the race together.

[Screenshots from Runalyze]

My plan here was to run the first 25-30 miles at "long run" pace, which is basically what people would call an "easy" effort level (for me this ranges from about 8:00/mile on the flats to 12:00/mile on a very hilly course) and then try to maintain it for the second half, which is of course progressively harder as the fatigue builds up. I had been doing my long runs on comparable courses at about 11:00/mile, so I was hoping for low 9 hrs (50 miles at 11:00 is 9:10). This didn't entirely work out and I definitely slowed down throughout the race, coming in at 9:44:09, which was good enough for 47th (out of 252 finishers, 310 starters). This is about the 40th percentile of my expectations. In retrospect, having seen the course, low 9 hours seems too aggressive, but I do think I could have done <9:30 if I had paced things better. On the other hand, this is quite a bit faster than my previous 50 PR, which was on the easier Firetrails 50.

Pre-Race #

Sonoma logistics are pretty easy. It's only a few hours away and Chris and I drove up the afternoon before and stayed in Healdsburg about 20 miles from the race start. We managed to pick up our race packets (including your race number) that afternoon so it was possible to prep everything the night before and then just show up at the race start. Regrettably we got there just as main parking closed so had to drive about a quarter mile to overflow parking (up a hill, which was really not amazing to walk up afterwards). Got to the start in plenty of time to use the bathroom (twice!) and take a pre-race photo (not online yet) with Chris, my friend Lisa, and some of her friends, who were doing their first 50.

It was about 45-50 at the start so I got a bit cold standing around for 25 min, but of course the day warmed up soon enough and I'd rather be cold at the start than really hot later in the day.

Start to Island View [4.26 mi, +725/-988 ft] #

The first 2.4 mi or so are on the road, so even easy distance pace is fairly fast. This was good because we started out a bit too far back in the pack and ended up gradually working our way up through the pack by the time we hit the singletrack and the sharp downhill. It wasn't too congested at this point and we mostly just settled into a pace with the other people in our general pace range. Generally, I'm a little faster than average on the flats and uphill and slower on downhill, so there was some yoyoing, but we tried not to do too much passing unless it was a real problem, because we'd just get passed right back.

We rolled through Island View at a really hot pace (<10:00/mile) and were still feeling good. It's water only on the way out so we didn't even bother to stop.

Drinks and Gels #

If you're gonna run for 10 hours you're going to need to eat some stuff. Each race serves different stuff at their aid stations but generally there will be at minimum some kind of sports drink (basically carbohydrates + electrolytes) and some kind of "gel", which is basically a carbohydrate paste. There are a lot of different companies that make this stuff and each one has a different mix of macronutrients and different flavors, so it's very possible you'll like one just fine and find another disgusting. My drink preference is Tailwind, which is pretty common but not ubiquitous; I'm less picky about gels. Before a race I usually figure out what they are serving and try it out beforehand to see if I can stomach it (literally). In this case, Sonoma was serving Gu Roctane which I've had before and like OK.

Island View to Warm Springs [6.97 mi, +1,421/-1,447 ft] #

This next section is pretty much all single-track rollers, and we were still feeling strong. We ended up in a paceline behind a group of women who were all working together and given that the pace seemed about right, we just sat behind them through the next aid station. As before, the basic pattern is we'd pull back a bit on the downhills but then catch up on the uphills and flats. During this section we were running the uphills until we were caught up; they were hiking some of the uphills so we would hike behind them to the top of the climb, then repeat.

We were still going really fast into Warm Springs, though even at this point it was starting to feel warmer. Had a little bit of a glitch at the aid station because they were (at least I thought) only serving the Strawberry Lemonade Roctane, which is caffeinated and I didn't want to start on caffeine this early. I was down to 200 or so ml Tailwind at this point so I just filled up with water and then had a gel + water, which should be roughly equivalent to 250ml Tailwind.

Warm Springs to Wulfow [5.05 mi, +1,138/-909 ft] #

We were a bit slower coming out of the aid station but quickly caught back up with the pack we had been running with. This section was on average more up than down and you can see our pace starting to fall off a bit to 11:30/mi (10:05/mi GAP) but it still looks pretty good. This section was still quite smooth and I was still feeling strong. Because of the Tailwind issue, I was consuming more like 200 cal/hr than my target of 300 cal/hr here but otherwise things were pretty fine. Wulfow is water only, so I just refilled on water and (I think) grabbed a gel, as it was only 2 miles to Madrone.

Wulfow to Madrone [2.06 mi, +302/-331 ft] #

This time we got out of the aid station ahead of the pack, but there is a sharp downhill right after, so our previous pack was on our heels pretty quickly. There didn't seem to be too much interest in passing us, so I just led almost all the way to Madrone. Towards the very end it opened up into uphill fire road and so things got a little jumbled. This is actually the steepest climb, but it was early enough in the day that it didn't feel too bad.

Madrone had decaf Roctane so I was able to completely fill my bottles. At this point it was starting to get a fair bit warmer, so I was starting to drink some fluid at the aid station and then fill my bottles.

Madrone to No Name [5.86 mi, +1312/-1066 ft] #

The pack sort of separated at this point and Chris and I found ourselves pretty alone for the big descent out of Madrone. This is when we started to see the first people coming the other way, which meant they were about 7-8 miles ahead of us at this point.

We knew that there was a big climb and then the lollipop around the halfway mark, so we were just kind of anticipating the climb, and it was a relief when we finally got there. It's just a long trudge up that and we naturally hiked. It's fire road so we just passed some people and got passed by others. We were still seeing a substantial number of people going the other way, but we also knew we were ahead of the main body of people. It was definitely a relief to get into the lollipop, though, because then you're no longer having people pass you going the other way (except for a short out and back to the aid station).

We rolled into No Name at 4:24, which was pretty far ahead of schedule and I was starting to have visions of a sub-9 finish (4:25 * 2 = 8:50, right?). I stopped at the bathroom and drank a bunch of fluid as I was definitely starting to feel hot and dehydrated. I also was able to grab my drop bags which had extra Tailwind bottles, so I could be back on Tailwind for the next few hours. Also grabbed my buff and had some ice put in it. This aid station stop was pretty long, 5:14, but we were still out right at 4:29, so ahead of plan.

No Name to Madrone [5.22 mi, +933/-1230 ft] #

Chris and I did this section pretty much on our own again, and it was slower than it should have been. The rollers from the lollipop to the big descent were starting to get to me and the descent was steep enough that we mostly just jogged down it without taking it too fast, which did nothing for our pace. Then it's some rollers and the climb back up to Madrone, which we hiked.

At Madrone I had the opposite problem as before which is that I wanted caffeine but they didn't have either caffeinated Roctane or Coke, so I ended up just grabbing a caffeinated Gu, which has only 35 mg of caffeine.

Madrone to Wulfow [2.09 mi, +348/-315 ft] #

This section is where we really noticeably started to slow down. As opposed to before, we were hiking any significant uphill, rather than just when we were behind someone or it was really steep. My theory here is I was starting to get tired and that I wouldn't be moving much faster—if at all faster—if I was running, so I was conserving energy a bit. At this point I was definitely starting to feel pretty hot and dehydrated, and also maybe a little stomach discomfort from drinking a lot of water at Madrone. Wulfow was water and gels but unfortunately no salt, and I was running out of my own salt tabs. Can't remember if I grabbed another caffeinated Gu here.

Wulfow to Warm Springs [5.07 mi, +919/-1125 ft] #

This was probably the hardest section for me, both in how I felt and in terms of my pace, which was the worst of the race, both in absolute terms and in grade adjusted pace (GAP). As above, I was running out of salt and just generally starting to feel kind of beat. We were hiking anything that was even modestly uphill and even so it was tough. Was just generally feeling kind of wobbly and the log bridge that was a little iffy on the way out felt downright scary. However, I also started to notice that I was gapping Chris more and more on the uphills, though he'd mostly catch up on the downhills. This isn't too unexpected as I'm a stronger hiker, but it was the first time it was really happening.

Fortunately, this section was a little shorter than we expected, so we managed to get into Warm Springs OK. This was the longest aid station stop at 5:23, mostly because we were messing around with drinks, etc. This was the last set of drop bags and so I had another Tailwind bottle. They also had Coke so I pulled out my third bottle and ended up with one Coke, one Tailwind, and water (?). Had to wait a bit for Chris to leave this aid station as he was still getting ready to go.

Warm Springs to Island View [7.09 mi, +1417/-1470 ft] #

Started to feel better in this section, probably due to the caffeine, and I shifted out of "hike when it won't be much slower" mode into "run whenever you can" mode. About 3 miles in I noticed that I was really starting to gap Chris and so he gave me the car keys and I went ahead on my own, trying to push the pace as much as I felt comfortable with, consistent with still having 9ish miles to go. You can see this in the pace, which was faster than the previous two segments and with the GAP being quite a bit better. At this point I was starting to really pass a lot of people, including finally catching the last of the women from the pack we were running with.

Still was pretty glad to see the turn off down to Island, as that meant I was <5 to go. Hit the aid station and was frankly a little disoriented and spent some time filling up on Coke and trying to figure out which gels had caffeine even though I had Coke in my bottles. Left the aid station right as Chris rolled in.

Island View to Finish [4.66 mi, +1010/-676 ft] #

Hiked the hill out of Island View and then really tried to get into the vibe of "fast finish", given that I had less than 5 miles to go and I've done plenty of fast finish runs where you run the last few miles harder. Was still a bit unstable on my feet and tripped a bunch of times. Stayed up but it made me cautious. Was really just feeling like I needed to get to the big climb out and then into the final rollers. Hiked that part and then just tried to push through to the finish. Spent the last two miles chasing the two guys in front of me and felt like I closed on them a bit but never quite enough to catch them.

Right leg started to cramp a bit in the last mile or so but just toughed it out and it went away. Was able to finish strong, and it's nice to be under the round number of 9:45 (9:44:09). Chris came in at 9:46:69, so he must not have lost much if anything on me the last segment.

Retrospective #

A bit of a mixed result. On the one hand, I think it's clear I went in with too high expectations about what I could do here; I don't think sub-9 or even 9:15 was in reach, at least on this day. It wasn't crazy hot, but it did get to 80ish and I hadn't done any heat training. Sunday was a lot cooler and I think that might have shaved 10-15 min off my final time.

My pacing was a bit off here. I think if I hadn't gone out as hard and gotten to halfway in more like 4:40, I would have had a decent shot at 9:30 even on this day. I also wonder whether it would have been better to push more in the third quarter. I lost a lot of time there and clearly I was able to pick up the pace when I needed to in the fourth quarter. I'm not sure how much longer I could have sustained that, but maybe it would have been better to go more evenly in the last half. I did know that the rollers would be tiring but I don't think I anticipated how tiring they would be in the second half and how tempting it would be to hike.

I more or less hit my nutrition plan. My target was to drink half a bottle of Tailwind or Roctane every 3 miles (as a proxy for every half hour) and 100 cals of gel or bar every 6 miles, for a total of ~300 cal/hr. I mostly managed this except where I got thrown off by aid station logistics and then towards the end when I was subbing in Coke. I rotated my gels reasonably well so I never got too tired of anything and was glad to have Spring gels so it wasn't quite so much all space food. I had some Maurten gels in my drop bag but opted not to use them because of not trying new stuff on race day.

Wore my Salomon Pulsars the whole way and I have mixed feelings here. On the one hand they're super light, but the platform is really narrow, they're more for toe strikers, and the traction isn't great, so I slipped a bunch of times that I don't think I would have in (say) the Sense Pro/4s, which are my usual race shoe. Also, you're really not going that fast so having a super lightweight shoe seems less important than it would be on a shorter race; you're not going to be going all-out. This will probably be my last race with them as the new Salomon shoes are out soon and the Pulsars are definitely too light for UTMB.

My going in expectations aside, this is arguably a pretty good result. Top 25% of finishers and almost top 15% of starters is better than I've finished in a long time. I was top 3rd at SOB and just barely top half at Bigfoot, so this seems like an indicator that this is actually a comparatively better performance than usual, even if the time isn't quite what I was hoping for.

Results Summary #

Finish Time: 9:44:09
Actual distance: 48.4 miles
Finish Place: 47th overall, 37th male, 310 starters

Segment	Distance	Elevation	Time	Pace	GAP
Island View	4.26 mi	+725/-988 ft	39:29	9:16/mi	8:25/mi
Warm Springs	6.97 mi	+1,421/-1,447 ft	1:12:27	10:23/mi	9:17/mi
Aid	-	-	1:30	-	-
Wulfow	5.05 mi	+1,138/-909 ft	58:03	11:30/mi	10:05/mi
Aid	-	-	0:26	-	-
Madrone	2.06 mi	+302/-331 ft	21:48	10:36/mi	9:52/mi
Aid	-	-	1:49	-	-
No Name	5.86 mi	+1,312/-1,066 ft	1:08:18	11:39/mi	9:54/mi
Aid	-	-	5:14	-	-
Madrone	5.22 mi	+988/-1,230 ft	1:03:12	12:06/mi	10:36/mi
Aid	-	-	2:09	-	-
Wulfow	2.08 mi	+348/-315 ft	26:42	12:51/mi	11:45/mi
Aid	-	-	1:02	-	-
Warm Springs	5.07 mi	+919/-1,125 ft	1:07:13	13:16/mi	11:56/mi
Aid	-	-	5:23	-	-
Island View	7.09 mi	+1,417/-1,470 ft	1:29:26	12:37/mi	11:08/mi
Aid	-	-	2:44	-	-
Finish	4.66 mi	+1,010/-676 ft	57:12	12:16/mi	10:32/mi

End-to-End Encryption and Messaging Interoperability

2022-04-07T00:00:00Z

The news that the EU will require that messaging companies provide interoperability has gotten a lot of attention, both positive (matrix.org) and negative (Alex Stamos, Alec Muffett, Steve Bellovin), as detailed in this Wired article (see also this ISOC white paper). At a high level, I'm more positive on the idea of interoperability for messaging systems than some others are, but it's certainly not a trivial problem and at least some of the EU timelines seem pretty unreasonable. Read on for more.

Critiques #

At a high level, there seem to be three broad critiques of messaging system interoperability:

It will weaken security, for instance by requiring decryption and re-encryption at system boundaries or by creating confusion about user identities.
It will hold back innovation by forcing messages to be sent using only features that are common to all systems.
It will make abuse (especially spam) worse.

It's useful to keep these in mind throughout the rest of the discussion.

Before covering messaging, however, it's helpful to look at an existing system that has had interoperability for a long time, where we can see the resulting dynamics: e-mail.

An Interoperable System: E-mail #

E-mail has the opposite problem from messaging: where messaging consists of a number of independent islands of encrypted messaging with no way to talk between them, email is a globally interoperable system that—despite a number of attempts—doesn't have anything like universal encryption.^[1]

E-mail operates on a hub-and-spoke model in which every user is associated with a given mail domain, represented by a domain name (e.g., example.com) as shown below:

Telephone Addressing #

Telephone numbers actually are hierarchically structured but don't map 1-1 with providers.

The basic structure of a phone number is given by the E.164 standard and consists of a country code followed by a subscriber number, with the structure of the subscriber number being defined by the country code. For instance, in the North American Numbering Plan, identified by country code 1, numbers look like: 415.555.1111.

Description	Digits	Example
Numbering plan area (aka area code)	3	`415`
Central office prefix	3	`555`
Line number, denoting subscriber	4	`1111`

I don't know too much about the non-North American setting, so the remainder of this aside is about North America. Until 1984, North American telephony was basically monopolized by the Bell System. In that system, the number hierarchy was geographic, with the area codes and central office prefixes corresponding to geographic regions and specific switches and the line number corresponding to lines on a given switch. However, with the advent of local number competition following the breakup of the Bell System and then mobile telephony, things started to get more complicated.

Initially, central offices were controlled by a single carrier and so the phone number could be used straightforwardly for routing. However, subsequently the US required carriers to provide Local Number Portability, which allowed you to take your number from carrier to carrier. Thus, even if you were originally assigned a number out of Verizon's block, you could "port" it to T-Mobile, which means that this kind of hierarchical routing no longer works. Instead, there's basically a giant—well, not so giant, given that there are only 10 billion possible numbers—database that indicates which carrier has responsibility for each number.
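To make the difference between prefix-based routing and a portability database concrete, here is a minimal sketch. The carrier names and table contents are made up for illustration, and the fallback to the original block owner is my simplification, not a description of how real carriers implement Local Number Portability.

```python
# Toy model of NANP routing. All carrier names and table entries are
# hypothetical; real number portability systems are far more involved.

def parse_nanp(number: str) -> dict:
    """Split a NANP number such as '415.555.1111' into its three fields."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        raise ValueError(f"not a 10-digit NANP number: {number!r}")
    return {"area_code": digits[:3], "prefix": digits[3:6], "line": digits[6:]}

# Original hierarchical assignment: (area code, prefix) -> carrier.
BLOCK_OWNER = {("415", "555"): "carrier-a"}

# Portability database: full number -> current carrier (ported numbers only).
PORTED = {"4155551111": "carrier-b"}

def route(number: str) -> str:
    """Return the carrier responsible for a number."""
    parts = parse_nanp(number)
    full = parts["area_code"] + parts["prefix"] + parts["line"]
    # The per-number entry wins; the prefix alone is no longer authoritative.
    if full in PORTED:
        return PORTED[full]
    return BLOCK_OWNER[(parts["area_code"], parts["prefix"])]
```

Two numbers in the same block can now resolve to different carriers, which is exactly why the prefix can't be used for routing on its own.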

E-mail addresses are hierarchically assigned, which means that if your mail service is example.com, then your address will end in @example.com, as in alice@example.com. It's helpful to work through an example here. For instance, here is what happens when Alice (alice@hotmail.com) wants to send a message to Bob (bob@gmail.com):

First, she transmits the message to her mail server over a protocol called the Simple Mail Transfer Protocol (SMTP), along with the addressing information for bob@gmail.com.
The sending mail server looks up the receiving domain name—in this case gmail.com—in the DNS to get the server associated with it.^[2] It then connects to that server—again over SMTP—and transfers the message, along with the addressing information bob@gmail.com.
Assuming that bob@gmail.com is actually a valid user on the receiving server, that server stores the message somewhere (on disk, in a database, whatever) and waits for Bob to come pick it up.
Finally, Bob connects to his mail server (historically over a protocol called Internet Message Access Protocol (IMAP)) and retrieves any new messages.
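The four steps above can be sketched as a toy simulation. This doesn't speak SMTP or IMAP at all: the "DNS" and the mailbox stores are in-memory dictionaries I made up, and the server names are hypothetical. The point is just that routing only depends on the domain part of the address.

```python
# Hypothetical stand-ins for DNS and for each provider's mailbox storage.
MX_RECORDS = {
    "gmail.com": "mx.gmail.example",
    "hotmail.com": "mx.hotmail.example",
}
MAILBOXES = {
    "mx.gmail.example": {"bob@gmail.com": []},
    "mx.hotmail.example": {"alice@hotmail.com": []},
}

def domain_of(address: str) -> str:
    """Return the part after the last '@', which decides where mail goes."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a valid address: {address!r}")
    return domain.lower()

def submit(sender: str, recipient: str, body: str) -> bool:
    """Steps 1-3: hand off the message, look up the domain, store it."""
    server = MX_RECORDS[domain_of(recipient)]   # step 2: "DNS" lookup
    mailbox = MAILBOXES[server].get(recipient)  # step 3: is this a valid user?
    if mailbox is None:
        return False
    mailbox.append((sender, body))
    return True

def fetch(address: str) -> list:
    """Step 4: the recipient picks up (and here, removes) waiting messages."""
    mailbox = MAILBOXES[MX_RECORDS[domain_of(address)]][address]
    messages = list(mailbox)
    mailbox.clear()
    return messages
```

Note that `submit` never needs to know anything about how Bob reads his mail, and `fetch` never needs to know how Alice sent it.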

This structure has a number of important properties:

Addresses #

Because addresses are scoped by the mail domain they are associated with, it's possible to immediately know where a given message should be delivered just by looking at the right-hand side (RHS) of the address, namely the stuff after the @-sign. That tells you which domain an address is associated with. This is in contrast to addresses on most popular services (e.g., Twitter), which are unqualified: if all I have is the identifier ekr____ I don't know if that corresponds to Twitter, Github, or LinkedIn.^[3]

Conversely, the fact that names are hierarchical means that two people can have the same left-hand side (LHS) as long as the RHS is different (and vice versa). So, bob@gmail.com and bob@hotmail.com are totally distinct addresses and quite likely belong to different people. This is of course true with Twitter handles and the like, but because they are unqualified, the bare address isn't enough to tell you who is who. This becomes a real issue when you want to import identities from another namespace, for example, when your address for messaging is actually your telephone number.

Finally, it means that the semantics of the LHS are opaque to the other end. For instance, if you had your own mail domain (for instance your-lastname.name) you might have every address that ends in @your-lastname.name delivered into the same mailbox. Another example is that Gmail allows you to create new addresses by adding a plus sign and a tag to the end of the LHS of your actual address, so example@gmail.com and example+newsletter@gmail.com go to the same place. This is a useful trick to let you sort your email by giving different addresses to each sender.
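Here is a minimal sketch of what "opaque LHS" means in practice. The `mailbox_for` function is a hypothetical receiver-side policy modeled loosely on the plus-sign trick; it is not Gmail's actual implementation, and only the receiving domain would ever apply it.

```python
def split_address(address: str) -> tuple:
    """Split an address into (LHS, RHS) at the last '@'."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not a valid address: {address!r}")
    return local, domain.lower()

def mailbox_for(address: str) -> str:
    """Hypothetical receiver policy: ignore anything after '+' in the LHS."""
    local, domain = split_address(address)
    base, _, _tag = local.partition("+")
    return f"{base}@{domain}"
```

A sending server only ever looks at the second element returned by `split_address`; whether `example+newsletter` and `example` are the same mailbox is entirely the receiving domain's business.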

Hosted Domains #

Although mail is scoped by domain, as a practical matter many domains are actually hosted by the same service. For instance, Gmail allows you to host your "custom domain" on Gmail (that is how rtfm.com works), but your address can still have your domain in it rather than gmail.com. It's also possible to have your mail delivered to service A and have most of your accounts there but send mail from service B. This is useful if you want to send bulk email using a service like Mailgun.

Interoperability #

Because SMTP and IMAP are standardized, any mail endpoint can talk to any other mail endpoint. If you own example.com and want to send and receive mail there, all you have to do is stand up a server (or, more likely, use an existing hosting service), set up the right DNS records, and you're good to go. Similarly, most mail services will provide IMAP service and so you can use any number of clients (the built in mail client on your Mac, Thunderbird, etc.) to read your mail.

Conversely, nothing says that a mail system has to have a separate client at all. For instance, instead of having people use IMAP to read their email you can just put up a Web front end that accesses it directly and, tada, you have Gmail. Or, as is common, you can both have a Web interface and an IMAP interface. As long as you properly speak SMTP, everything will work fine and the other end doesn't even need to know how you have everything set up; it's just a matter of having the right protocol interfaces. In particular, it doesn't matter to the receiver how the sender talks to their mail server and it doesn't matter to the sender how the receiver talks to their mail server. All that's required is that the servers speak SMTP to each other.

This is in contrast to most messaging systems, which are basically silos that don't interoperate with each other.

Extensibility #

The cost of interoperable protocols is a limited range of format extensibility. The format of the emails is standardized using a format called MIME, and if you send a compliant MIME message the receiver should be able to process it, at least to figure out what the type of the message is.

Identifying the type of the message is only the first step. Suppose that you want to introduce a new mail feature, say memoji in emails. Even if you write a new standard for it and Alice adds it to her email client, what happens if Bob hasn't upgraded? Ideally, the client would get some clear message that something was wrong, and yet would still see the part that was interpretable, but this doesn't always work. Depending on exactly how the new feature is designed, it either might not work properly (for instance, the memoji might be replaced with some unknown character like �; for a long time, emails from Outlook would render the :) emoji as "J" on non-Outlook systems) or the message might just not be readable at all (though hopefully you wouldn't design a feature like that). At the end of the day, this kind of mismatch can create a pretty degraded experience and change the meaning of the message.

The converse of this property however, is that email processing is highly extensible. Because mail formats are open and standardized, any client that speaks the protocol will work. I gave the example of Webmail before, but this also means that if you want to use a mail client which offers some new feature—automatic email summarization say—that's your business. By contrast, most messaging systems are closed and so you're limited to the features supported by the official client.

Security? #

Like many things on the Internet, the e-mail system was designed before modern encryption and so initially everything was in the clear. This allowed for a broad range of attacks:

Anyone on the connection between you and the mail server or between mail servers could read or modify your messages.
Senders weren't authenticated and so it was trivial to forge messages that appeared to come from someone else.
If your mail server was compromised, then it could read your messages in transit or change them.

Some of these issues have been gradually sort-of addressed with partial solutions such as TLS encrypting the traffic between you and the mail server, TLS encrypting the traffic between the mail servers, and server-based signing mechanisms like DKIM. However, they're incompletely applied (for instance, the client-server connection is generally strongly authenticated but the server-server connection often is not) and still don't provide any protection against a malicious or compromised mail server. For that you need end-to-end encryption (E2EE), in which the messages are encrypted (and authenticated) between the sending and receiving endpoints.

There have been quite a few attempts to provide end-to-end encryption for e-mail (PGP, S/MIME, etc.) but I think it's fair to describe them as having largely failed. This isn't to say that there isn't any encrypted mail but it's a fairly small fraction of overall traffic. The reasons for the failure of encrypted email are complicated, but there were a number of deployment problems that most likely contributed.

Key Management #

Like any cryptographic system, encrypted email depends on knowing the cryptographic keys of the people you are talking to. In e-mail, you use keys in two ways:

You sign your messages in order to authenticate them
People who want to send you secure messages need to encrypt them to your key.

It's technically possible to just start sending people messages with unauthenticated keys, for instance by signing all of your messages and expecting people to remember that this is your key (this is often called trust on first use (TOFU)). Once they have received a message from you, they can use your key to encrypt the return message. Obviously, TOFU is susceptible to attack if the attacker is the first person to send you a message while pretending to be someone else, which makes the system less than ideal, especially for interactions with people you don't talk to frequently. If my bank sends me a signed message, then I want to know it's my bank right away. It's also a problem if you want to send an encrypted message to someone you have never talked to before. What you really want is some system that lets you find out what people's keys are, which means solving two problems:

You need to somehow associate your key(s) with your email address.
You need some way to look up people's keys so that you can send them encrypted messages.

Deploying the infrastructure for both of these has proven to be quite challenging. The basic problem is that there was never a good way to automatically issue the credentials. This meant that people had to go to a lot of effort to get credentials, which of course meant that most people didn't get them. On the other side of the equation, there was never really a great way to discover people's credentials, which meant that you couldn't send encrypted email to new people. It's in principle possible to build mechanisms for this (ACME and WebFinger respectively are examples of the kind of thing I'm talking about), but we have the usual deployment network effect problems.
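To see why TOFU falls short of a real key discovery system, here is a minimal sketch of a TOFU key store. The class and its return values are my own invention for illustration and don't correspond to any particular mail client; keys are just opaque byte strings here.

```python
import hashlib

class TofuStore:
    """Remember the first key seen for each address and flag later changes."""

    def __init__(self):
        self._pins = {}  # address -> fingerprint of the first key seen

    @staticmethod
    def fingerprint(public_key: bytes) -> str:
        return hashlib.sha256(public_key).hexdigest()

    def check(self, address: str, public_key: bytes) -> str:
        fp = self.fingerprint(public_key)
        pinned = self._pins.get(address)
        if pinned is None:
            # First contact: we have no way to tell whether this key is
            # genuine, so we simply accept and remember it.
            self._pins[address] = fp
            return "first-use"
        return "match" if pinned == fp else "mismatch"
```

The weakness is visible in the `first-use` branch: if an attacker gets there first, their key is the one that gets pinned, and the real sender's key later shows up as a `mismatch`.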

Confusing Semantics #

In addition to the keying problems, the fact that email encryption was added after the fact to an established system has resulted in some confusing semantics.

For example, the major extension point in e-mail is via the message body. As noted above, the bodies use an extensible message format called MIME. However the message subject line isn't extensible. This means that the subject line that appears in the email is neither encrypted nor authenticated. It's of course possible to have an inner subject line inside the encryption envelope, but it's an obvious challenge for users to understand that they can trust the body but not the subject.

Second, because some messages are protected and some are not, you need some way to indicate to the user which are which. This kind of indicator is a notorious source of confusion, especially in a situation where most messages are unprotected, because you don't want a big scary warning for nearly every message. But this also reduces the incentive for people to use secure e-mail, especially to send signed e-mail: if recipients don't notice or care whether messages are signed, then signing them doesn't add a lot of value, as an attacker can just impersonate you with the recipient being none the wiser.

Network Effects #

All of this should be a familiar story to EG readers: you have a situation where it's inconvenient for people to do something—in this case, deploy encryption—and there's not much benefit to doing it. In these cases, you get the expected result which is limited or minimal deployment. By contrast, most modern messaging systems were either built with E2EE from the start or underwent some mass upgrade that enabled it for everyone, rather than relying on people to do it themselves.

Messaging Systems #

Modern messaging systems have addressed these issues by making encryption both mandatory and automatic. This is comparatively easy because the messaging service is (usually) vertically integrated: all—or nearly all—users have clients which are provided by the service operator and can be updated as desired. The service operator also provides message routing and identity. This kind of uniform integrated system has a number of operational advantages:

The service can automatically issue credentials based on the user's account information, thus ensuring that every user has a credential. They can also run a directory which makes it easy for any client to learn the credentials for every other client.
When the service wants to add a new feature it can automatically upgrade everyone's client to support it. This means that they don't need to deal with massive heterogeneity of client functionality for very long, and can eventually just refuse to support older clients.^[4]
Spam and other kinds of abuse are easier to handle because all messages are authenticated by a user in the system. Of course, if you have a single central point where all messages are handled, and no end-to-end encryption, then content filtering is also easier; with end-to-end encryption it becomes more difficult.

Of course, many of these advantages depend on having a closed system: if a significant fraction of people use third party clients to talk to such a system then you can no longer update the clients whenever you want to, which makes central extensibility much more difficult. In other words, you're trading off user control and extensibility for users for control and extensibility by the system operator. This is in stark contrast to the design of the Web, which is dominated by the principle of end-user control as documented in the HTML Priority of Constituencies and the Mozilla Web Vision.

Another consequence of a closed system is a lack of universal connectivity: with e-mail—or telephony—you can contact anyone no matter which service provider they are on. In fact, you don't even have to think about it: you just e-mail (or dial). Messaging, however, is different: if I want to send a message to someone on WhatsApp, I need to have a WhatsApp account myself. And because people choose different messaging systems, this means that it's now common to have accounts on a variety of messaging systems (I myself use three regular messaging systems, plus countless Slacks).

All of this creates a set of market dynamics dominated by network effects (Metcalfe's Law) and getting big: if you have a lot of users, then people have a strong incentive to join so they can talk to their friends. Conversely, if you are a new entrant into the market it is hard to break in because your early users don't have that many people to talk to. This is probably why we see a lot of regional variation in which apps are popular, because people want to use whatever app their friends use. Unsurprisingly, this produces some fairly lopsided market numbers, with Meta controlling two of the top three messaging platforms (WhatsApp and Facebook Messenger):

This brings us to the topic of interoperability: if it were possible for anyone to start a new messenger app that could still talk to WhatsApp and Messenger users, then this would remove a big barrier to entry into the market. I don't want to sound too optimistic here: even in a nominally open system like e-mail, we still see a huge amount of market concentration on the big mail systems like Gmail, Outlook, and Yahoo. This isn't too surprising: it's a lot of work to run a good mail system and so we'd expect well-funded players to dominate. However, it's also quite possible to use one of the smaller services like Fastmail, ProtonMail, or DreamHost or even run your own server, whereas there's really no way to run your own WhatsApp server.

Technical Interoperability for Messaging #

The details of what the DMA will actually require are extraordinarily sketchy; as I understand it they would need to be filled out by some regulatory agency. However, broadly speaking, there seem to be two options for providing interoperability, as laid out by ISOC:

Require services to offer stable APIs.
Require services to actually interoperate over a standardized protocol.

These require a bit of unpacking.

Stable APIs #

The idea behind a stable API is that the service would design and publish interfaces that others could use. There are actually two ways to offer stable APIs:

To clients, allowing someone else's messenger client to work with your service.
To services, allowing someone else's messenger service to gateway messages in and out of your service.

The first of these is actually a familiar concept in instant messaging: because there was never a single standardized protocol, it was fairly common to have messaging clients, such as Trillian, which would speak multiple protocols but provide a unified interface to the user that hid the details. This isn't really a conceptual change in the architecture of the system as it would still be a monolithic identifier space and the clients would still have to conform to whatever rules the service laid out; indeed, some services have open source clients, and so this is already possible for them, though of course third party clients might not get upgraded when the official clients do, potentially resulting in stability problems. The main result would be some decreased flexibility for the service because they would need to get users of the API to update when they wanted to change something that affected interoperability. However, as a practical matter, this probably wouldn't have that much of an impact on interoperability and market concentration because most people will just use the official client, and people who don't will be annoyed when the service changes something and breaks them.

The second version is less familiar, but the idea is presumably that WhatsApp would have some published API that would allow ekrMessage (TM pending!) to gateway messages into and out of WhatsApp. As with e-mail, each side would handle messages according to its own rules, with the gateway just transiting messages between the systems. This comes with two main problems:

How do you handle identities? For instance, if ekrMessage and WhatsApp both use phone numbers for identities, how do you know which messages stay on WhatsApp and which go to ekrMessage?
How do you manage different encryption protocols? Currently, each messenger has their own encryption protocol; while many of these are built along similar lines, they're not necessarily identical. Making this work either requires gatewaying at the provider—thus breaking end-to-end encryption, which is extremely undesirable from a security perspective—or having each client speak multiple encryption protocols, as in the multi-protocol client case.

Of course, this would all be a lot easier if there was some standardized protocol that everyone spoke, as with e-mail. Note: the difference between a stable API and a standardized protocol isn't really technical so much as social and depends on whether there is some standard or just a document published by the service.^[5]

Standardized Protocol #

Having a standardized protocol is not an all-or-nothing proposition: there are actually a number of levels at which one might have standardization, with the other levels potentially not being standardized:

Key establishment and message encryption
User identity
Message transport
Message contents and features

I go into these in some more detail below.

Key Establishment and Message Encryption #

The basic structure of most messaging encryption systems is that you have an identity (e.g., your phone number) which is tied to a cryptographic key or keys. When Alice and Bob want to exchange messages, there is some protocol that lets them use their keys to establish a pairwise (or groupwise in the case of more than two people) cryptographic key which they then use to encrypt messages.^[6] Obviously, if Alice and Bob don't speak the same protocol, then they will not be able to establish pairwise keys and will not be able to encrypt messages end-to-end, so this is probably the most important place for everyone to use a common protocol.
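As a rough illustration of what "establish a pairwise key" means, here is a toy Diffie-Hellman exchange. To be clear about the assumptions: this is not the Signal protocol or MLS, it has no authentication tying keys to identities, and the group parameters (a 127-bit prime and generator 3) are my own choice and far too small to be secure. It only shows why both sides have to speak the same protocol to end up with the same key.

```python
import hashlib
import secrets

# Toy parameters, NOT secure: 2**127 - 1 is prime but much too small.
P = 2**127 - 1
G = 3

def keypair() -> tuple:
    """Generate a (private, public) pair for the toy group."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public

def shared_key(my_private: int, their_public: int) -> bytes:
    """Combine my private value with the peer's public value."""
    secret = pow(their_public, my_private, P)
    # Hash the shared secret down to a fixed-size symmetric key.
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()

# Alice and Bob each publish a public value and keep the private one.
alice_private, alice_public = keypair()
bob_private, bob_public = keypair()
alice_view = shared_key(alice_private, bob_public)
bob_view = shared_key(bob_private, alice_public)
```

Both `alice_view` and `bob_view` come out identical, but only because both sides agreed on `P`, `G`, and the hashing step; change any of those on one side and the keys no longer match.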

Fortunately, while there are technical differences between the various protocols in use, they're similar enough that it would probably not be prohibitive for everyone to converge on a common protocol: a number of the existing messaging systems are based on the Signal protocol or one of its variants such as Proteus or Megolm, and the IETF is currently in the final stages of standardizing a protocol called Messaging Layer Security (MLS) which contains a number of similar concepts but is intended to be more optimized for group communication. It's too soon to know how much adoption MLS will get, but the WG has had participation from a number of messaging services such as Facebook Messenger, Matrix, Wickr, and Wire (full disclosure: I have also been heavily involved in this effort). It would be a big lift for companies to change out their protocols, but, because right now they're noninteroperable silos, it's still technically feasible.

Identity #

As I said above, we need to have some notion of user identity. Identity is used for two purposes:

By the end-user clients (in an end-to-end system) to establish the keys to use to encrypt a message.
By the service to know how to route messages.

Both of these require identifying other people you want to exchange messages with.

iMessage #

iMessage is actually quite an interesting case because the Apple client is actually two clients in one, containing both an SMS client for talking to non-Apple users (the green bubble) and an iMessage client for talking to Apple users (the blue bubble). iMessages are sent over the Internet ("over the top") and are end-to-end encrypted. SMS messages are sent over the phone network and are not. However, both categories of users have the same type of address, in the form of phone numbers (iMessage also supports email addresses), and Apple automatically detects the capabilities of the message recipient and sends a message of the appropriate type.

iMessage might be one of the strongest cases for the benefits of interoperability because it already interoperates with Android devices, just in the clear over SMS. If iMessage was forced to interoperate and Android played along, then a large fraction of traffic would suddenly be encrypted.

At a high level, there are two main identity architectures we can have:

Hierarchical naming in which a given identity indicates which service it is attached to, as in e-mail.
A shared namespace in which a given identity could be attached to any service (like phone numbers).

With messaging, the situation is even more complicated because multiple messaging services use the same identifier (e.g., WhatsApp and iMessage both use phone numbers) so that means that even in an interoperable system, we'd need to find some way to manage that case, which seems like a real open question (though of course we already have that problem now when you tell someone "I'm 1.415.555.1111 on WhatsApp", so in the worst case scenario, we could just punt the problem to the user.) We also have the potential problem that alice on system A may be a different person from alice on system B; this shouldn't happen with phone numbers because they are uniquely assigned but it happens all the time with user-chosen handles.

The hierarchical design is obviously easier to manage, but it may be quite hard to retrofit to the existing non-hierarchical system.^[7] One possible approach is to have a hierarchical system under the hood but have UIs present unqualified namespaces, e.g., "Connect with 1.415.555.1111 on WhatsApp" in the UI turns into "Connect with 1.415.555.1111@whatsapp.com at the protocol layer." This is likely to work OK if there are a small number of messaging systems but less well if there are hundreds because the UI gets too cluttered. It's also possible to have a kind of hybrid UI like existing e-mail systems do for their accounts where you have a chooser for the common systems and then people can enter something freeform:

This brings us to the question of how users learn other users' keying material. In a fully distributed/federated world like e-mail, you'd need some sort of analog to the WebPKI in which there was a set of agreed-upon roots of trust and those roots then somehow were able to attest to identities in a uniform manner, no matter which messaging service people used. This is in contrast to the current situation where each service runs its own disconnected identity service. If there is a totally shared namespace, then this has a lot of the same problems as the WebPKI in which anyone can attest to any name, but if the names are arranged hierarchically (even if that's not visible to the user) then we could potentially dodge some of those problems, as only WhatsApp would be able to attest to names for @whatsapp.com, etc.^[8]
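The "hierarchical under the hood" idea from the last two paragraphs can be sketched in a few lines. The service-to-domain table is hypothetical (in particular, `ekrmessage.example` is a placeholder I made up), and `may_attest` is just the scoping rule, not a real credential check.

```python
# Hypothetical mapping from the service a user picks in the UI to the
# domain used at the protocol layer.
SERVICE_DOMAINS = {
    "WhatsApp": "whatsapp.com",
    "ekrMessage": "ekrmessage.example",
}

def qualify(identifier: str, service: str) -> str:
    """Turn 'X on <service>' from the UI into a qualified protocol identity."""
    return f"{identifier}@{SERVICE_DOMAINS[service]}"

def may_attest(attester_domain: str, qualified_identity: str) -> bool:
    """A service may only vouch for identities inside its own domain."""
    _, _, domain = qualified_identity.rpartition("@")
    return domain == attester_domain
```

With a flat shared namespace there is no equivalent of `may_attest`: nothing in a bare phone number says which service is allowed to vouch for it.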

It's also possible that one could do something less universal: if there are only a modest number of messaging services, and you have to make special arrangements to federate between services, then each service could continue to maintain its own identity system and just publish documentation about how it works, forcing the other systems to implement it. The likely outcome here would be that the big gatekeeper systems would each have something and if you wanted to talk to them, you would need to both consume and publish that, which is a burden on the smaller systems, but perhaps a bearable one (the tricky part is when Alice has accounts on WhatsApp and iMessage and wants to talk to someone on ekrMessage: which credentials does she use for the ekrMessage user?).

Message Transport #

Once we have established keys and are sending messages, we still need some way to transport them. There have been attempts to design standardized protocols for this, in particular XMPP and SIMPLE (which is not), but neither has seen the kind of adoption that would make it the obvious choice here.^[9]

As with identity, while it would be convenient to offer something standardized, it's probably not a dealbreaker not to have it, as long as services are required to offer interoperable APIs for message sending and delivery. The good news here is that unlike the cryptographic pieces, those APIs can largely be handled by the messaging service, rather than the client, so my ekrMessage client just needs to know that a given message is destined for someone on WhatsApp and it can route it there.

Message Contents and Features #

All of the above is just concerned with getting messages from point A to point B, but what people actually care about is the messages themselves. In order for messaging to work properly, when the messages finally get to the recipient, they need to be readable, which won't work if (say) system A uses ASCII messages and system B encodes them as images. Moreover, if system B wants to add some new feature, it's a problem if system A doesn't have it (critique 2).

As noted above, this is a sort-of solved problem in e-mail in that you can send MIME-encoded messages that describe their contents. But of course, describing the contents doesn't help if someone sends me a message of type image/avif and I don't know how to parse that. The conventional solution here is to have some common format that it's assumed that everyone can read (in e-mail this is 7-bit ASCII text). The sender then sends two copies of the content bundled in the same message: (1) the "basic" version that everyone should be able to read and (2) the "enhanced" version that only newer clients can read.
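In e-mail, the "two copies in one message" approach is what MIME's multipart/alternative container is for. Here is a small sketch using Python's standard `email` package; the subject, the HTML body, and the `best_body` helper are just illustrative choices of mine.

```python
from email.message import EmailMessage

def build_message(basic_text: str, enhanced_html: str) -> EmailMessage:
    """Bundle a plain-text body and an HTML body into one message."""
    msg = EmailMessage()
    msg["Subject"] = "Example"
    msg.set_content(basic_text)                         # the "basic" version
    msg.add_alternative(enhanced_html, subtype="html")  # the "enhanced" version
    return msg

def best_body(msg: EmailMessage, supported=("html", "plain")):
    """Pick the most preferred body part that this client understands."""
    part = msg.get_body(preferencelist=supported)
    if part is None:
        return None
    return part.get_content_type(), part.get_content()

msg = build_message("hello", "<p><b>hello</b></p>")
```

A client that only understands plain text would call `best_body(msg, ("plain",))` and still get something readable, while a newer client gets the HTML version.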

This is a workable, if not ideal, solution, but actually it's probably possible to do quite a bit better. The reason is that unlike e-mail, where you send messages to people based solely on their address, in order to send someone an encrypted message you need their key. When people publish their keys they can also publish other capabilities such as the various media types they understand, which gives senders some information about what messages are safe to send (Rohan Mahy has described such a mechanism for MLS). Unfortunately, it's still possible to get into trouble with larger groups with mixed capabilities, where you probably end up having to send a lowest common denominator version. This isn't ideal for ordinary features, but is potentially more problematic for security features, as discussed below.
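The lowest-common-denominator effect is easy to see in a sketch. This assumes each member advertises a set of media types alongside their key and that `text/plain` is the agreed baseline; both the data layout and the baseline are my assumptions, not the mechanism proposed for MLS.

```python
# Assumed baseline format that every client is expected to understand.
BASELINE = frozenset({"text/plain"})

def safe_media_types(advertised: dict) -> set:
    """Return the media types that every member of the group can read.

    `advertised` maps a member name to the set of media types that member
    published along with their key.
    """
    common = None
    for types in advertised.values():
        common = set(types) if common is None else common & set(types)
    return (common or set()) | BASELINE

group = {
    "alice": {"text/plain", "image/png", "image/avif"},
    "bob": {"text/plain", "image/png"},
    "carol": {"text/plain"},
}
```

With just Alice and Bob, `image/png` is safe to send; add Carol and the group is back down to plain text.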

As should be clear from the discussion above, any form of interoperability places some limits on the freedom of each service to change their offerings whenever they want. Some of these costs—like using a standardized encryption protocol—are relatively modest, but others may be larger. It's certainly a lot more work to detect the capabilities of every client and carefully craft messages which will work for all of them than it is to just generate messages for one client type which you know works.

Security Implications of Interoperability #

As discussed above, if connecting service A and service B requires some kind of bridge that decrypts and reencrypts messages, then this has a pretty negative impact on security (critique 1). However, it's also possible to have interoperable end-to-end encryption; I would also argue that with sufficient care it's even possible to design an identity infrastructure that doesn't badly weaken the system as a whole. However, that isn't to say that there are no security implications of requiring interoperability.

First, even if you have a common protocol, there may be differences in application semantics. For example, when WhatsApp detects that a recipient has changed their keys and so a message is undecryptable, it automatically re-sends the message. This is a usability feature but is a difference from Signal, which does not automatically re-send, even though it uses the same protocol as WhatsApp, because Signal is concerned that the new key might be compromised. This is an application behavior and it's of course harder to frame the security guarantees of a system where there is more than one kind of client; in this case, the security decision is made by the sender, but in other cases it might not be.

One case where that's so is that messaging systems support "disappearing messages" which get automatically deleted after a certain time. This is not a cryptographic feature but rather a client side feature and depends on the receiving client complying with the sender's request to delete the message. Obviously, if the remote client doesn't comply, then it's not going to work. I'm less sympathetic to this case because this kind of feature is mostly an example of hope-based security: even in a closed system you have no way of knowing what software is running on the receiver's computer; it could have been hacked or they could be running a reverse-engineered non-compliant client (the virtue of standards is that they allow for interoperability without reverse engineering). Even if that's not the case, nothing stops them from taking a photo of the screen, or, depending on the system, a screenshot. This seems like a case where the recipient can advertise its capabilities and you just have to trust them.

There might also be new security features that would not end up in whatever new standardized protocol was settled on, such as metadata protection or post-quantum security. This isn't ideal, of course, but standardized protocols do evolve, and it's possible for messaging services to use private protocol extensions for groups that just consist of their users on new clients, so this doesn't seem like a fatal objection.

Probably the most serious problem is spam and abuse (critique 3). As I mentioned earlier, this is a much easier problem if you have relationships with all the users and don't need to accept messages from arbitrary counterparties. End-to-end encryption also presents a problem here because it means you can't do content filtering centrally. I'm not sure how serious this would actually be in practice: a lot of what makes email spam work is that you have to accept email from non-contacts, which is somewhat less of an issue in messaging systems, but this still seems like a problem that needs more work.

Critique Recap #

It's probably useful to recap the critiques from the beginning of this post. I don't think they are entirely without merit, but I also believe that interoperability would have real benefits that need to be weighed against these concerns.

Interoperability will weaken security #

It's certainly true that there are ways to implement interoperability which would have a very negative impact on security. However, as I argue above, I think it's also possible to implement interoperability in ways which would minimize those impacts, in particular by maintaining end-to-end encryption across system boundaries. Clearly, the resulting system would be more complex, which is bad for security, but having a common system would provide a single target for analysis and improvement, which is good.

It's also important to look at the non-technical picture here: right now users largely choose their messaging systems based on who they want to talk to and get whatever security properties those systems have. Interoperability would allow people to choose systems based on security properties (for instance, that they have key transparency and reproducible builds) while still talking to people who have made other choices. Of course, those mixed conversations tend to have the security properties of the weaker system, but at least it would be easy to also talk to people who had made stronger choices. In addition, we see many cases today where people fall back to unencrypted channels in order to interoperate (e.g., iMessage falling back to SMS), which would be improved by end-to-end interoperability.

Interoperability will hold back innovation #

Here too, the situation is complicated. On the one hand, it's clearly true that messaging services would be less free to innovate than if they were totally vertically integrated (although they would still retain substantial freedom). On the other hand, there would be more room for innovation on the clients themselves, something which is currently very difficult. It's worth noting that the Web is one giant mostly interoperable system which is still experiencing plenty of innovation, so I don't think it's a foregone conclusion that interoperable systems can't innovate; you just need mechanisms to manage compatibility and change.

Interoperability will make abuse worse #

It does seem likely that interoperability will make abuse worse: if you have to accept messages from basically anyone then reputation and similar systems become harder, and e-mail abuse (especially spam) is a serious problem. However, we already see abuse even in monolithic systems, so it's also clear that being closed isn't a panacea. Moreover, messaging is fundamentally different from e-mail in a number of important ways (we'll have authentication from the start, which was a huge problem in e-mail, there is much less expectation that you'll just accept messages from anyone, etc.) so it's not clear how much worse interoperability will make things.

Final Thoughts #

As the extremely long writeup above should indicate, this is far from an easy problem. We have a giant installed base of software that doesn't interoperate and changing that would be difficult even if the big players wanted to. Famously, Facebook has been trying to get Messenger and WhatsApp to interoperate in an end-to-end secure fashion for years, and it seems likely that they're going to be a lot less excited about interoperating with others. However, that's a separate question from whether it's actually technically possible to do, which, as the analysis above suggests, I think it is. With that said, this is also a much harder problem than the EU guidelines seem to contemplate: for instance, they require that basic 1-1 messaging be available within three months, and group messaging within two years. Given that the MLS standardization process is just about complete after four years, two years seems pretty aggressive, and three months seems fairly implausible.

Note that email frequently has transport encryption where messages are encrypted between users and mail servers and between mail servers, but they are generally in the clear on the mail server. ↩︎
What it looks up is the mail exchanger (MX) record. ↩︎
And Alice doesn't even need to know that much. For instance, if Gmail suddenly decided to support domains rooted in the blockchain, this would just work transparently for Alice, because only Gmail needs to know which server handles example.eth. ↩︎
Of course, users don't always upgrade instantaneously, so it's possible to have some heterogeneity, but it's typically fairly short term, especially because the service provider can force you to update to continue using the service. ↩︎
Note: The difference between "APIs" and "protocols" is largely a matter of terminology: protocols are just the rules for what goes over the network, but things that run over HTTP are often called "APIs". ↩︎
In many protocols, that pairwise key is itself changed ("ratcheted") frequently. ↩︎
As an aside, am I just the only person who thinks that the proliferation of these non-hierarchical namespaces is a huge regression? I'd much rather be ekr@rtfm.com everywhere than ekr on Github and ekr____ on Twitter. ↩︎
There are also questions about key transparency and the like, but they're largely downstream of these bigger architectural questions. ↩︎
Google chat used to offer an XMPP interface but no longer does. ↩︎

What's with the www prefix in www.example.com?

2022-03-28T00:00:00Z

You might have noticed that it's common for sites to have a domain name like www.example.com and a URL like https://www.example.com. You might wonder what the www is doing here. You're most likely loading this from a Web browser, so surely the browser knows you're on the Web. Why does it need the www prefix? The answer, like many things on the Internet, is that it was the quickest way to get to a result without having to change anything and now we're at a local minimum which is hard to change.

Protocol Separation #

In the early days of the Internet, it seemed like sites would be running a number of user-facing services (email, Web, gopher, NNTP, etc.). It quickly became apparent that even though it was technically possible to multiplex them on different TCP ports, you didn't actually want to run them all on the same machine, for several reasons.

First, you may not want them to be managed by the same person. The bigger your system gets, the more you want division of labor, and, for instance, you might not want your mail administrator to have access to your Web server.^[1] Second, you might want to use multiple machines to manage load, initially by separating each service onto its own machine and then potentially later by having multiple Web servers. Load is generally more of an issue for Web than it is for other services, principally because it's possible to get flash crowds that suddenly dramatically increase the load on your Web server. For obvious reasons, you don't want a flash crowd that slows your Web server to a crawl to also bring down your mail server, which you may be using to coordinate fixing your Web server.

Unfortunately, in those early days, the DNS had no way to say that if you had the name example.com you should connect to machine A for Web and machine B for NNTP. Recall from an earlier post that a domain name is just an index into a distributed database, with the primary value in the database being the IP address associated with the name. This means that Web and NNTP for example.com have to point to the same IP address and hence the same machine. As you have probably guessed by now, the solution is to give each service a different domain name, e.g., www.example.com for Web, nntp.example.com for NNTP, etc. This allows you to configure a separate machine for each service with its own IP address. This also allows them to be in totally different data centers or even operated by different hosting providers.

Interestingly, it was possible to say that you should deliver mail for (say) example.com to mail.mailserver.example via something called an MX record; this allowed someone else to run a mailserver on your behalf. However, there was no generic mechanism to do so for other protocols. There are now several such mechanisms, starting with the SRV record and now including the HTTPS record. However, the SRV record never got wide deployment (to the best of my knowledge, no browser supports it) and the HTTPS record is new. The problem with deploying any such record is that there are a significant number of browsers which don't support it, so if you want to steer Web traffic and other traffic to different places, you need to keep doing www.
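
To make the limitation concrete, here is a minimal sketch of the DNS as a lookup table. Everything in it is an assumption for illustration: the `records` table, the `lookup` and `addressFor` helpers, and the addresses (taken from the ranges reserved for documentation). It is a toy model, not how a real resolver works.

```javascript
// Toy model of the DNS: each name maps to a set of typed records.
// All names and addresses are made up for illustration.
const records = {
  "example.com": {
    A: ["192.0.2.10"],               // one address, shared by every service
    MX: ["mail.mailserver.example"], // mail can be delegated by name
  },
  "www.example.com": { A: ["192.0.2.20"] },  // Web server
  "nntp.example.com": { A: ["192.0.2.30"] }, // news server
  "mail.mailserver.example": { A: ["198.51.100.5"] },
};

// Look up records of a given type; returns [] if there are none.
function lookup(name, type) {
  return (records[name] && records[name][type]) || [];
}

// Where does a client connect for a given service on a given name?
function addressFor(service, domain) {
  if (service === "mail") {
    // Mail is the special case: follow the MX record to the mail server's name.
    const mx = lookup(domain, "MX");
    if (mx.length > 0) return lookup(mx[0], "A")[0];
  }
  // Every other service just uses the address record for the name it was given.
  return lookup(domain, "A")[0];
}

console.log(addressFor("web", "example.com"));     // same machine as the next line
console.log(addressFor("nntp", "example.com"));
console.log(addressFor("web", "www.example.com")); // the prefix selects another machine
console.log(addressFor("mail", "example.com"));    // the MX record steers mail elsewhere
```

In this model, the only way to send Web and NNTP traffic to different machines is to give them different names, which is exactly what the www and nntp prefixes do.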

CNAME and the Apex Zone #

Of course, at this point, there are mostly only two kinds of names that users regularly come into contact with: email addresses (e.g., ekr@example.com) and Web URLs (https://example.com). As I mentioned above, it is possible to run email and Web on different machines without the www prefix. So, why does the prefix persist?

In part this is just inertia, but it's also partly a result of another shortcoming of the DNS which is that it's not possible to have a CNAME at the apex of a zone. Suppose that I want to have my web site hosted by cdn.example. The natural way to do this is with a CNAME record, which is basically an indication that the real (canonical) name of a domain is what's in the record. So, for instance, consider the following CNAME record:

www.example.com -> www.example.com.cdn.example

This would tell anyone that if they wanted to know about www.example.com they should go look up the records for www.example.com.cdn.example. This works well because it means I don't need to know anything about how the CDN's network is laid out or what IP addresses they have for their machines. I just set up the CNAME and then the CDN can have the name resolve to whatever IP address(es) they want. This allows them, for instance, to provide different answers based on load or where clients are geographically.^[2] You can also use a CNAME to point to a service like Cedexis (now Citrix) which will steer traffic to different CDNs depending on network conditions. Unfortunately, while you can use a CNAME for www.example.com, you can't use it for example.com. The reason is that a CNAME is an all or nothing proposition: it means "look over here for every record". But example.com also needs NS records (and probably MX records) of its own, so if you CNAME example.com you are also saying "look over here for the name server for example.com", and now you've created a circular dependency: how do people look up the name server (the NS record) that they need in order to look up the CNAME?
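
The CNAME indirection described above can be sketched as a small lookup loop. This is a toy under stated assumptions: the `zone` table, the `resolve` function, the hop limit, and the addresses are all invented for illustration, and a real resolver does much more (caching, multiple servers, other record types).

```javascript
// Toy zone data; the names come from the example above, the addresses are invented.
const zone = {
  "www.example.com": { CNAME: "www.example.com.cdn.example" },
  "www.example.com.cdn.example": { A: ["203.0.113.7", "203.0.113.8"] },
};

// Resolve a name to addresses, following CNAMEs up to a small hop limit.
function resolve(name, maxHops = 8) {
  const chain = [name];
  for (let hop = 0; hop < maxHops; hop++) {
    const rec = zone[name];
    if (!rec) return { chain, addresses: [] };
    if (rec.CNAME) {
      // A CNAME means "everything about this name lives over there".
      name = rec.CNAME;
      chain.push(name);
      continue;
    }
    return { chain, addresses: rec.A || [] };
  }
  throw new Error("CNAME chain too long (possible loop)");
}

const result = resolve("www.example.com");
console.log(result.chain.join(" -> "));
console.log(result.addresses);
```

The site owner only ever writes the first entry in `zone`; the CDN is free to change the second entry whenever it likes, which is what makes this arrangement attractive.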

The result of all this is that if you want to host your Web site on a CDN and you don't want it to have a www (or some other) prefix, you have two main choices:

Host your own DNS and populate your records with the CDN's IP address (this is what I do).
Have the CDN host your DNS, so that they can then resolve the actual IP address however they please.

Neither of these is ideal. If you host your own DNS, you have more control but it's brittle because the CDN has to maintain a stable IP for your domain. If they decide to move things then your site breaks. It also means they can't do DNS-based load distribution.

It's generally a better idea in this case to have the CDN host your DNS, as then they can control how any given name resolves. Of course, if they don't also host your email, you'll need to populate the domain with MX records for your email server, but most anyone who hosts DNS will allow this. Of course, this is only a partial solution because as far as I can tell you still can't use a traffic management service to steer between CDNs. As I understand it, if you want to do that, you need to have some prefix (like www.) in front of your domain.

One way to try to split the difference here is to serve a page on example.com but then have most of your content on cdn.example.com, which can be load balanced invisibly. You can also redirect users from example.com to www.example.com, which isn't as invisible but lets you load balance even more because (1) the redirect is a short message and (2) you can tell the browser to remember the redirection, thus saving the trip to example.com in the future.
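
As a rough sketch of the redirect approach, the server at example.com only needs to produce a tiny response like the one below. The `apexRedirect` function and the one-day cache lifetime are assumptions for illustration; the 301 status code and the Location and Cache-Control headers are standard HTTP.

```javascript
// Build the response an apex server might send for any request path.
// A 301 is a permanent redirect, which browsers are allowed to remember.
function apexRedirect(requestPath) {
  return {
    status: 301,
    headers: {
      Location: "https://www.example.com" + requestPath,
      // Assumed lifetime of one day; a real site would pick its own value.
      "Cache-Control": "max-age=86400",
    },
    body: "", // nothing to render: the browser just follows Location
  };
}

console.log(apexRedirect("/posts/www.html"));
```

Because the response carries no content, the machine serving example.com does very little work per request, and browsers that remember the redirect skip it entirely on later visits.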

One more thing: because the HTTPS record is needed for Encrypted Client Hello we should expect to see browsers support it for that reason, and so there should eventually be a fair amount of HTTPS record support, though it won't be universal. Sites will then be able to use an HTTPS record to steer modern browsers (those that support the HTTPS record) to something that can be load balanced. Of course, older browsers will just go to whatever non-load balanced site example.com is served off of but that will be an increasingly small fraction, so you'll still get a fair amount of value.

Final Thoughts #

The lesson here is the same as for most features on the Internet: if you want people to deploy something, then it has to be incrementally deployable and provide value with low levels of deployment. If your solution doesn't have this, then people will find some solution that does. And that, kids, is why we have www.example.com.

Of course, having access to your mail server is often enough to get a certificate for your Web server, but we just won't talk about that. ↩︎
Though it's also reasonably common to use anycast for this purpose, in which case there will just be one IP address and BGP will be used for this kind of traffic management. ↩︎

Understanding The Web Security Model, Part III: Basic Principles and the Origin Concept

2022-03-21T00:00:00Z

Note: This is one of those posts that is going to be best read on the Web, especially if you read your email using Gmail or the like, as it will tend to mangle some of the HTML features.

This is Part III of my series on the Web security model (see parts I and II for background on how the Web works). In this part, I cover the primary unit of Web security, the origin and some of its implications.

The Web Security Guarantee #

Unlike applications or e-books, the experience of using the Web is not confined to content provided by one vendor. Instead, even if you start on one site, many of your activities on that site will take you to other sites. Consider, for instance, the experience of searching for something using Google. Once you execute the search, Google then gives you a set of links, many of which take you to another site. Google's relationship to those sites is arms-length at best: it doesn't control them and doesn't bear any responsibility for their content beyond some vague assertion that this might be something that was responsive to your search. The situation is the same for other big content platforms like Facebook and Twitter: just because you see some link there doesn't mean that the site endorses it.

The Web vs. Internet Threat Models #

RFC 3552 (self-citation alert) defines a threat model in which the attacker has complete control of the network, which means that they can read or modify any packet. In this case, it is trivial for them to look at any unencrypted traffic or impersonate any site the client is making an unencrypted connection to. Because this kind of network attack is so powerful, it renders most questions about the Web security model more or less superfluous: if the attacker can intercept your connection to the site, it doesn't much matter whether there is some way that some other site can mount a weaker attack.

However, although powerful network attackers are reasonably common—just open your browser using Airport WiFi—there are also many weaker attackers. It used to be common to talk about the Web threat model in which we assume that the attacker has their own site that they can get you to talk to but is unable to interfere with your connections to legitimate sites. Due to the complexity of the Web, there are still a number of attacks in this setting. Moreover, now that HTTPS use has become so common and most traffic is encrypted (and browsers have banned mixed content) the Internet and Web threat models have basically merged.

In order for the Web to work successfully, people have to feel comfortable visiting arbitrary Web pages, even those controlled by the attacker. It's the browser's job to mediate that interaction so that it's safe. Back in 2011, my coauthors and I described this as the "core security guarantee" of the Web: users can safely visit arbitrary web sites and execute scripts provided by those sites.

Just to reinforce this point, in this threat model the Web site is the attacker. You can come in contact with a malicious site in several ways:

An active attacker on your network can pretend to be a Web site you are trying to go to. This is less common now with the rapid increase of encrypted connections in the form of HTTPS, but it's still reasonably common for people to visit a small number of unencrypted sites.
You can be lured in some way to a malicious site, for instance by an ad campaign, phishing, or just visiting the wrong link.

In this series, we are not primarily concerned with network attacks. First, this is supposed to be prevented at a lower layer, specifically, via HTTPS (modulo phishing). Second, if you have an insecure connection to your bank, then the attacker can tamper with your requests to do whatever they want. Instead, we're primarily interested in cases where the attacker gets you to visit their site and uses that as a foothold to attack your computer or your interaction with the bank.

This leads to the following set of requirements:

A malicious site won't be able to compromise your browser or your computer.
A malicious site won't be able to see or interfere with your interaction with other sites. For instance, if you have Gmail in one tab you don't want an attacker in another tab to be able to read your emails.

This series of posts is mostly about the second category of attacks. Making networked programs secure against arbitrary input is a serious problem, but one that's not unique to Web browsers, so we can take it up at a different time.

Motivation: Cookies and Ambient Authority #

One of the problems with writing these posts serially rather than all at once is that sometimes you find there is something you wish you had explained earlier that now you can't go back and do. This is one of those times. In Part II, I explained how to use cookies to implement a shopping cart, but another of the main uses of cookies is to persist authentication. This is something you experience every time you use a Web site that uses authentication: the first time you go to the site, it detects you aren't logged in and gives you a login prompt. On subsequent visits, though, it just remembers who you are.

This works in more or less the way you would expect, shown in the figure below:

Initially, when the user goes to the site, they have no cookie. The site notices this and sends them a login page with the usual username and password prompt. The user enters their password (presumably in a Web form) and the browser sends it to the server. The server checks the password. Assuming the password is correct, the server generates a new cookie, stores it in the local authentication database along with the user identifier, and then returns a success page to the user along with the cookie. The next time the user visits the site, their browser sends along the cookie. The site can then look the cookie up in the database and if successful it knows who the user is and can present an appropriate page. In reality, this doesn't happen just on subsequent visits, but during the same visit. Whenever the user clicks on another link, or even loads an image off the site, the cookie is used to authenticate them; the password is just used to authenticate the user long enough to set the cookie.

It's important to realize that from this point on, the cookie is the only thing authenticating the user to the site. In effect, the cookie is a new password that's created by the site and just handled by the browser rather than remembered by the user. Anyone who has access to the cookie is effectively the user (the technical term here is a bearer token, which means that anyone who has a copy of the token can impersonate the user). This means that the cookie has to be (1) unguessable and (2) kept secret (this is where encryption comes in, as we'll see later).
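
The login flow above can be sketched on the server side as follows. This is a minimal sketch, not how any particular site does it: the `passwords` and `sessions` tables and the `login` and `whoIs` functions are hypothetical, the stores are in memory, and passwords are compared in the clear purely to keep the example short (a real site would store password hashes). It uses Node's built-in crypto module for randomness.

```javascript
const { randomBytes } = require("node:crypto");

// Hypothetical in-memory stores standing in for the site's databases.
const passwords = new Map([["alice", "correct horse"]]);
const sessions = new Map(); // cookie value -> user identifier

// Check the password and, on success, mint a fresh unguessable cookie.
function login(user, password) {
  if (passwords.get(user) !== password) return null;
  const cookie = randomBytes(32).toString("hex"); // 256 random bits
  sessions.set(cookie, user);
  return cookie;
}

// On every later request, the cookie alone identifies the user.
function whoIs(cookie) {
  return sessions.get(cookie) ?? null;
}

const cookie = login("alice", "correct horse");
console.log(whoIs(cookie));          // the user the cookie was issued to
console.log(whoIs("made-up-value")); // null: not a cookie the site issued
```

Note that `whoIs` never looks at a password: whoever presents the cookie value is treated as the user, which is the bearer token property described above.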

Now here's where things start to get complicated. If you remember the discussion of online advertising in Part II, cookies get sent whenever a resource is loaded, regardless of which site is doing the loading. For instance, suppose that you have a picture on a photo site which is available only to certain people who are logged into the site. If the URL isn't secret, a site can embed an <img> tag pointing to the picture and it will be shown on the site. In general, this applies to any request made by the browser, no matter how it is triggered. This property is called ambient authority.

As I've just described it, this sounds really bad: any site can just load access-controlled material off of any other site, which would obviously violate the second half of the guarantee above. And if that were the whole story it would indeed be bad. What makes this all work is a set of rules called the same-origin policy that dictate that while a site can load the content from another site and show it to the user, it can't read the content. This is a powerful tool, but in practice a very tricky one to use correctly, as we'll be exploring in some detail.

The Same-Origin Policy #

The same-origin policy (SOP) is the collective name for a large-ish set of rules about how browsers behave in cross-origin situations. These rules have gradually evolved over time. In an important 2006 paper^[1] on Web privacy, Jackson, Bortz, Boneh, and Mitchell describe it as follows (under the name of "same-origin principle"):

Only the site that stores some information in the browser may later read or modify that information.

First, however, we must define what we mean by a "site". As described in the previous two posts, any given Web page is often composed of resources from multiple servers, with each resource being retrieved via a URL. Obviously, we don't want all of these resources to be isolated from each other because we want them to work together to provide a unified experience. So, we need some concept of "the same site" that is different from just the URL. This concept is given by the origin.

Recall the structure of the URL from post I:

Risks of Including Paths in the Origin #

One interesting detail is that the path component is not part of the origin, so https://example.com/abc and https://example.com/def are in the same origin. There's an obvious reason for this, which is that Web sites frequently consist of multiple paths and you want them to share cookies and state. However, it used to be fairly common to have several people share a given server, for instance by having Alice have her home page at https://example.com/~alice/ and Bob have his site at https://example.com/~bob/. Unfortunately, this has some problematic security properties. For instance, it's possible to scope cookies to a given path prefix, but if Alice sets a cookie, Bob can read it by injecting script into the page. For more on this class of problems, see the classic paper "Beware of Finer-Grained Origins" by Adam Barth and Collin Jackson.

The origin of a piece of content retrieved by a URL is defined by the following three values:

scheme: e.g., http: or https:
host: the domain name of the server
port: the TCP or UDP port number that the server is listening on

In order for two origins to be the same, all three values must be the same.

We've covered scheme and host before, but what's a port? Internet hosts are addressable by IP address, but what if you want to run multiple services on a given machine, such as mail and Web? This is handled by having a second layer of addressing: the port, which is just a 16-bit number carried in the transport protocol. You can have a large number of different services on a server, each addressed by a separate port (the technical term here is that you are multiplexing multiple services on the same IP and the port is used to demultiplex them). Traditionally, each protocol has a fixed port number (HTTP is 80, HTTPS is 443, e-mail transmission (SMTP) is 25). However, nothing stops you from running services on other ports; you just need some way to tell the other side what port to talk to. In URLs, this is done by appending a colon and the port number.

Here are some examples of URLs and their associated origins:

URL	Scheme	Host	Port
`http://example.com`	`http`	`example.com`	`80`
`http://example.com:8080`	`http`	`example.com`	`8080`
`https://example.com`	`https`	`example.com`	`443`

Notice that in the first and last examples, the port isn't provided: HTTP has a default port value of 80 and HTTPS has a default port value of 443. In the second example, the port (8080) is explicitly provided. As a practical matter, nearly all Web traffic runs on the default port, though it's common to use other ports for development purposes.

It's important to note that the path is not part of the origin. So, for instance, these URLs all have the same origin (see MDN for some more examples, as well as examples of some edge cases):

https://example.com/index.html
https://example.com/~ekr/homepage.html
https://example.com/js/scripts.js

As I mentioned above, this allows them to work together to provide a unified experience (though see below for some special considerations for JavaScript).

In general, if two resources have the same origin, then they can share information. However, if A and B are from different origins, then their interactions are going to be fairly limited.
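
The origin comparison can be sketched with the standard URL class, which is available in browsers and in Node. The `originOf` and `sameOrigin` helpers and the `DEFAULT_PORTS` table are names made up for this sketch, and it only handles http and https; real browsers have additional rules for other schemes and for the edge cases MDN describes.

```javascript
// Default ports for the two schemes discussed here.
const DEFAULT_PORTS = { "http:": "80", "https:": "443" };

// Return the (scheme, host, port) triple for a URL string.
function originOf(urlString) {
  const url = new URL(urlString);
  return {
    scheme: url.protocol.replace(/:$/, ""),
    host: url.hostname,
    // URL.port is "" when the URL uses the scheme's default port.
    port: url.port || DEFAULT_PORTS[url.protocol],
  };
}

// Two URLs are same-origin only if all three values match; the path is ignored.
function sameOrigin(a, b) {
  const x = originOf(a);
  const y = originOf(b);
  return x.scheme === y.scheme && x.host === y.host && x.port === y.port;
}

console.log(originOf("http://example.com:8080/some/path"));
console.log(sameOrigin("https://example.com/index.html",
                       "https://example.com/js/scripts.js")); // paths differ, origin doesn't
console.log(sameOrigin("http://example.com", "https://example.com")); // schemes differ
```

In practice you would usually just compare the `origin` property of two URL objects; the triple is spelled out here only to mirror the table above.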

Reading/Writing Other Resources #

First let's look at the example I used above: a page from origin A loading an image from origin B. The SOP requires that A be able to see the content if and only if A has the same origin as B. If A and B are from different origins then A can only learn whether the image was loaded but can't see the actual content. The way you read the content of an HTML <img> tag is by drawing it on a Canvas element and then reading the data back with getImageData(). The following JavaScript snippet does that and then writes the resulting value below the image:

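function onloaded(el) {
    // Draw the image at the canvas origin so that pixel (0, 0) is the image's top-left pixel.
    let ctx = document.createElement("canvas").getContext("2d");
    ctx.drawImage(el, 0, 0);
    let pixelvalue = null;
    try {
      // getImageData() throws if a cross-origin image has been drawn on the canvas.
      let imgdata = ctx.getImageData(0, 0, 1, 1);
      // imgdata.data holds the four RGBA values for the single pixel we asked for.
      pixelvalue = "[" + Array.from(imgdata.data).join(", ") + "]";
    } catch {
      pixelvalue = "forbidden";
    }
    el.parentElement.appendChild(document.createTextNode("URL=" + el.src + " pixel=" + pixelvalue));
}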

By setting the onload property on the image element, we can arrange that this function runs whenever the image is loaded. Below you can see the results with two images, the first loaded from this site, and the second loaded cross-site.

As you can see, in both cases you can tell when the image was loaded (because the function gets called) and get some basic information like the URL (and the width and height). However, when we try to actually access the image data, the call to getImageData() only works with the same-site image, producing the pixel value [211, 196, 173, 255],^[2] but fails with the cross-site image, producing the result "forbidden". This is the same-origin policy at work. The same thing applies to other elements that you load cross-origin like this, for instance audio files or videos. It also applies if you load another Web page in an IFRAME or in another tab. If the page is same-origin, then you can access the DOM of that page, but if it's cross-origin you cannot. In addition, same-origin IFRAMEs or pages can access the original page.^[3]

Note, however, that the containing site can write to a cross-site element, or rather, it can replace it with another element. This makes sense, because even though the site can't read the element it ultimately controls the DOM that the element appears in, so it can just replace it with something else, as in the following code snippet, which just swaps the image element below between two images whenever you click:

var onclickimageindex = 0;
const images = [
    "/img/ekr.jpg",
    "https://www.rtfm.com/ekr-ud.jpg"
];
function imageonclick(el) {
    onclickimageindex++;
    el.src = images[onclickimageindex%2];
}

What About JavaScript? #

But if cross-origin resources can't access the DOM, then how is it that you can load JavaScript libraries off of other sites, which, as I mentioned, people do all the time? The answer is that when you load JavaScript into a site with a <script> tag, that JavaScript runs in the origin it was loaded by, not the origin it was loaded from. For instance, if a page loaded from https://educatedguesswork.org pulls in a script from https://example.com, that script has the same privileges as if it were loaded from https://educatedguesswork.org/ and can do anything one of those scripts can do.

It's important to recognize that an attacker who can run script in a site's security context effectively controls that site from the user's perspective. Because scripts can manipulate the DOM, they can make the user see anything they want. They can access locally stored state and can often access cookies (via the document.cookie variable). They can't directly access the user's password, but they can prompt the user to retype it and the user will likely do so; a password manager cannot protect you here because it determines what password to show based on the site's origin. Being able to run script on a site is very nearly as good as intercepting all communications between the client and the site.

Mixed Content #

Because imported JavaScript is so powerful, it's critical to ensure that the right script is loaded: an attack on imported JavaScript is nearly the same as an attack on your site. Suppose that ExampleCo serves example.com over HTTPS, but that site imports JavaScript from http://libraries.example. This situation is called mixed content (because you are mixing secure and insecure content). In this case, even though a network attacker cannot directly attack example.com, they can attack the JavaScript from http://libraries.example and through that JavaScript control how the browser renders example.com. In other words, this is barely better than having the original site served insecurely.

Mixed content used to happen quite frequently: if you wanted to upgrade your insecure site to HTTPS, you might find that some of your dependencies were insecure; the easiest thing to do was just accept the situation. Eventually, as HTTPS became more common, browsers started blocking active mixed content (like JavaScript), loading the original page but just generating a network error when it tried to load the insecure content. This obviously broke some sites which still depended on mixed content, but also protected users from attack on those sites (and in some cases, the site would still work correctly).
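
The blocking decision can be sketched as a simple predicate. This is a deliberately simplified model: the `shouldBlock` function and the `ACTIVE_KINDS` list are assumptions for illustration, and the rules real browsers apply are more detailed than this.

```javascript
// Resource kinds treated as "active" in this sketch (an assumed, partial list).
const ACTIVE_KINDS = new Set(["script", "stylesheet", "iframe"]);

// Block when a secure page tries to load an insecure active resource.
function shouldBlock(pageUrl, resourceUrl, kind) {
  const pageSecure = new URL(pageUrl).protocol === "https:";
  const resourceSecure = new URL(resourceUrl).protocol === "https:";
  return pageSecure && !resourceSecure && ACTIVE_KINDS.has(kind);
}

// The ExampleCo case from above: an HTTPS page importing script over HTTP.
console.log(shouldBlock("https://example.com/",
                        "http://libraries.example/lib.js", "script"));
// The same script fetched over HTTPS is not mixed content.
console.log(shouldBlock("https://example.com/",
                        "https://libraries.example/lib.js", "script"));
```

An insecure page loading insecure script is not blocked by this rule, because there was no secure context to protect in the first place.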

Compromised Dependencies and Subresource Integrity #

Another form of attack on cross-origin JavaScript—or really any included JavaScript—is attack on or by the site hosting the script. Suppose that your site depends on a JavaScript library like jQuery but loads it off the jQuery CDN rather than hosting it locally. If the jQuery CDN—or the jQuery distribution itself—is compromised, then the attacker can serve malicious JavaScript and subvert the user's experience of the site. This works even if the connection to the CDN is encrypted, because the problem is a compromised endpoint, not a network attacker.

The W3C has standardized a technology called Subresource Integrity (SRI) which is intended to prevent this type of attack. The idea behind SRI is that the <script> tag loading a piece of JavaScript includes a cryptographic hash of the expected result. When the browser loads the resource, it checks the hash and generates an error if it doesn't match. For instance, here is a lightly modified example from the SRI spec:

<script src="https://example.com/example-framework.js"
        integrity="sha384-Li9vy3DqF8tnTXuiaAJuML3ky+er10rcgNR/VqsVpcw+ThHmYcwiB1pbOxEbzJr7">
</script>
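For what it's worth, the integrity value is just the hash algorithm name, a dash, and the base64-encoded digest of the file's bytes. Here is a rough sketch of computing and checking one, assuming Node.js and its built-in crypto module; the function names are mine, not part of any spec.

```javascript
const crypto = require("crypto");

// Compute an integrity value of the form "<algorithm>-<base64 digest>".
function sriValue(content, algorithm = "sha384") {
  const digest = crypto.createHash(algorithm).update(content).digest("base64");
  return `${algorithm}-${digest}`;
}

// Conceptually what the browser does: recompute the hash and compare.
function matchesIntegrity(content, integrity) {
  const algorithm = integrity.split("-", 1)[0];
  return sriValue(content, algorithm) === integrity;
}
```

If the CDN serves even a slightly different file, matchesIntegrity returns false for the value the page author pinned, which is the case in which the browser refuses to run the script.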

In theory, SRI solves the problem of compromised subresources, but in practice deployment has been fairly slow. One likely reason for this is that coordination is difficult: the site author must somehow learn the hash of the JavaScript library they are loading, and it's just one more thing to go wrong. At present most sites (this site included) which depend on external JavaScript, which is a huge fraction of the Web because of advertising and tools like Google Analytics, are just dependent on the security of the external servers which host those scripts.

Cross-Origin Requests (and Cross-Site Request Forgery) #

As noted above, the SOP allows site A to make requests to site B but not read the responses. Unfortunately, this still allows for attacks. The basic problem here is the combination of cross-site requests under control of the attacker with ambient authority provided by cookies. Suppose that there is a shopping Website such as the one we described in part II. If the attacker knows that you have logged into the site and can get you to visit their site, they can force you to make purchases on the shopping site, as shown below:

The way this works is that when you visit the attacker's site, they serve you an HTML page with an element that causes the browser to make a request to the shopping site's server to buy something; that request is the same message that the browser would have sent if you were on the shopping site's page and comes along with the user's cookie (ambient authority, remember?). This all looks fine and the site just goes ahead and executes the purchase. This is called a Cross-Site Request Forgery (CSRF) attack.

It's worth mentioning a few fine points. First, why am I using an HTML form here? The reason is that many (most?) sites use the HTTP POST method for requests that are supposed to have side effects, such as buying something.^[4] Most of the HTML elements that result in a cross-origin load use the GET method, but forms allow you to use POST. You can also use JavaScript methods to make this kind of cross-origin request, but the situation is somewhat more complicated, so I'm going to get to it later when I talk about Cross-Origin Resource Sharing (CORS).

Second, it's possible to make this operation automatic and invisible to the client: even though form submission usually results in navigation events, you can put the form in a hidden IFRAME so the user doesn't notice the event. Similarly, you can use JavaScript to trigger the form submission so that it happens automatically on loading the page.

Obviously, CSRF is a serious attack, and we'd all be in trouble if it were regularly possible to mount CSRF attacks on (say) Amazon or (worse) Wells Fargo. The most basic CSRF defense is to use what's called a CSRF token. The idea is that when you access the legitimate site, it adds a random token to every HTML element corresponding to a request which would generate side effects. For instance, if it gives you a link to add something to your shopping cart, that link might have a random token at the end. Then, when your browser dereferences the link to add the item, it sends along the token; the site checks it and only takes the action if the token is correct. Because the CSRF request the attacker induces doesn't have the token, it will be rejected.

It's worth taking a moment to think about how this defense works: effectively, it's a check on ambient authority. Ordinarily, requests are authenticated just by having the cookie but because of CSRF that's not good enough; the token restores the concept of the provenance of the request. In order for it to work properly, the token has to be (1) unknown to the attacker and (2) tied to the user (presumably via the cookie). If it's not tied to the user, the attacker will just go to the site themselves, retrieve the token, and give it to the user's browser on their page.
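Here is a minimal sketch of those two properties in JavaScript, assuming Node.js. The in-memory map and the function names are hypothetical; a real site would keep this state wherever it keeps its sessions, and there are other token designs that avoid server-side storage.

```javascript
const crypto = require("crypto");

// Hypothetical in-memory store: session cookie value -> CSRF token.
const csrfTokens = new Map();

// Issue a fresh random token bound to this session (property 2: tied to the user).
function issueCsrfToken(sessionId) {
  // 32 random bytes make the token unguessable (property 1: unknown to the attacker).
  const token = crypto.randomBytes(32).toString("hex");
  csrfTokens.set(sessionId, token);
  return token;
}

// Accept the request only if the token matches the one issued for this session.
function checkCsrfToken(sessionId, token) {
  const expected = csrfTokens.get(sessionId);
  if (typeof expected !== "string" || typeof token !== "string") return false;
  const a = Buffer.from(expected);
  const b = Buffer.from(token);
  // timingSafeEqual requires equal lengths, so check that first.
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}
```

Note that a token the attacker fetches under their own session fails the check for the victim's session, which is exactly why the binding to the cookie matters.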

One very important property of CSRF tokens is that they work with every browser because they don't depend on any new browser feature. Over the years a number of such features have been introduced to make CSRF harder, but any new feature takes time to propagate throughout the entire user population. This is a general problem with Web security. When a new attack like CSRF is discovered, sites need to be able to protect themselves immediately and so defenses which don't require client side changes are strongly preferred and can't be relaxed until effectively the entire user population has upgraded to the new client-side defenses.

There is some good news on this front, however. As I noted above, CSRF is a consequence of the fact that cookies are sent both in the situation where the resource is on the same site and where the resource is on a different site. Arguably this is a misfeature in HTTP, and so one fix is to simply have cookies only apply to same-site resources. This is the idea behind SameSite cookies. When you set a cookie, you can add the SameSite attribute to say whether it can or cannot be used for cross-site resources. Recently, browsers have started to default cookies to SameSite=Lax, which is intended to prevent cookies being used in contexts which would enable CSRF. Once those browsers become ubiquitous, sites should finally be able to deprecate CSRF tokens.
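Concretely, SameSite is just one more attribute on the Set-Cookie header, with the values Strict, Lax, and None. A small sketch of building such a header value follows; the helper name and the choice of the other attributes are illustrative assumptions, not a recommendation for any particular site.

```javascript
// Build a Set-Cookie header value that carries a SameSite attribute.
// The Path, Secure, and HttpOnly attributes are illustrative defaults.
function sessionCookieHeader(name, value, sameSite = "Lax") {
  return `${name}=${value}; Path=/; Secure; HttpOnly; SameSite=${sameSite}`;
}
```

For example, sessionCookieHeader("session", "XYZ") produces session=XYZ; Path=/; Secure; HttpOnly; SameSite=Lax.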

The same-origin policy is a fairly blunt—albeit complicated—instrument. There are times when you would like to do cross-origin requests that also carry authentication and actually be able to see the data. In the next post, I'll be talking about a mechanism designed to allow that: Cross-Origin Resource Sharing (CORS).

This paper is actually quite entertaining reading, as it describes many tracking techniques we see in use today, such as bounce tracking. In addition, Section 1 starts with "The web is a never-ending source of security and privacy problems. It is an inherently untrustworthy place, and yet users not only expect to be able to browse it free from harm, they expect it to be fast, good-looking, and interactive — driving content producers to demand feature after feature, and often requiring that new long-term state be stored inside the browser client" ↩︎
The four values are R, G, B, and alpha. ↩︎
Technical note: in order for this to work, you need the two pages to have a handle to each other. This happens if page A was opened by page B with window.open() or if page B is an IFRAME on page A. ↩︎
The HTTP spec strongly discourages using GET in contexts that have this kind of user-visible side effect: "Request methods are considered "safe" if their defined semantics are essentially read-only; i.e., the client does not request, and does not expect, any state change on the origin server as a result of applying a safe method to a target resource. Likewise, reasonable use of a safe method is not expected to cause any harm, loss of property, or unusual burden on the origin server." ↩︎

Understanding The Web Security Model (Outtake): Cookies and Behavioral Advertising

2022-03-13T00:00:00Z

This post was originally part of Post II of my series on the Web Security Model but kind of broke up the flow of that post, so it got pulled out. But a blog means never having to kill your darlings, so here it is. In Post II I wrote about how Web applications use cookies for statekeeping on a single site, but it turns out to be trivial to extend that functionality to provide targeting for behavioral advertising. There's nothing new technically here, it's just a new combination of several existing elements we've already seen.

Ad Networks #

Most advertising on the Web is done by ad networks. It's of course technically possible to just sell ads on your own site, but for obvious reasons this doesn't really work unless you're a big prestige site like Google, Facebook, or the New York Times. Instead, the typical thing to do is for the publisher to work with some third party ad provider who places ads on a lot of different sites.

The technical details of the system are unbelievably complicated. It's traditional at this point to show the baffling diagram below, called the "LUMAscape", which maps out the various entities in the ad ecosystem. However, at the level we need to be concerned with, matters are fairly simple.

In order to show advertising from a given ad network, the publisher embeds an element on their site with content of the element being loaded off of the ad network's server.^[1] When the user visits the publisher's site the browser automatically loads the content from the ad network, which invisibly decides what ad to show. Recall that there's no rule that the content at a given URL has to remain constant, so the server can dynamically select the specific ad based on any information it has.

There are a variety of options for the element type. The simplest thing to do is just to use an image or an IFRAME. A fancier alternative is to first load some JavaScript off the ad network site; that JavaScript can then insert an image or IFRAME into the DOM of the page. Whatever the method, the browser ends up loading some content from the ad network. Note that I'm radically oversimplifying here; describing the ad sales process is out of scope for this post.

Determining Context #

There are a variety of potential ways for the ad network to know the context of the page. First, browsers add a header called Referer which indicates the original site (yes, it's spelled "Referer". It's a typo that we're now stuck with). Increasingly, however, browsers are sending less useful Referer headers (for privacy reasons). Another major option is to carry this data in the URL. In the simplest version, the publisher can be given a per-publisher URL. If the ad was inserted by ad network JavaScript, then that can insert the page into the URL. In any case, the ad network can generally tell what page the ad was on.

The question then becomes what ad the network should show. You could obviously show the same ad everywhere, but that's not going to do a very good job of showing interesting ads. The next most interesting thing is to show what's called a "contextual" ad, which is to say an ad that is relevant to the content of the page on which it is being shown. For instance, if you were on Runner's World you might get an ad for running shoes.

However, a lot (most?) of Web advertising isn't contextual but rather "behavioral". What this means is that it's not just based on the page the user is currently on but based on their previous behavior. That behavior is measured using cookies.

Behavioral Tracking with Cookies #

If the advertising network has contracts with multiple publishers this allows them to observe the user's behavior across those publishers. The first time that the user goes to a page served by a given ad network, that ad network sets a cookie. From then on, they get to see every site that the user goes to and can link them all up using the cookie. Based on that information, they can build up a profile of the user's behavior and use that to decide which ads to show (recall that the server can serve any image it wants, regardless of the URL). The diagram below shows an example of this process.

The user first visits sneakers.example, which embeds an image from the advertiser's site. The advertiser only knows that the user is on sneakers.example but nothing about the user so it serves a contextual ad for sneakers. However, when it returns the ad it sends a cookie. Later, the user visits recycling.example, which also embeds an image from the same advertiser. This time, when the user visits the advertiser, it sends the cookie, so the advertiser knows that (1) the user was on sneakers.example before and (2) they are on recycling.example now, so it shows the user an ad suitable for both interests: recycled sneakers.
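The ad network's side of this exchange can be sketched in a few lines of JavaScript, assuming Node.js. Everything here (the map, the function name, the shape of the return value) is hypothetical, and the "ad selection" is just a stand-in that reports which interests the network now knows about.

```javascript
const crypto = require("crypto");

// Hypothetical ad-network state: cookie value -> list of sites the user was seen on.
const profiles = new Map();

// Handle one ad request. `cookie` is undefined on the first visit;
// `site` is whatever context the network learned (Referer, URL, ...).
function serveAd(cookie, site) {
  let id = cookie;
  if (!id || !profiles.has(id)) {
    // First contact: mint an identifier that will come back on later requests.
    id = crypto.randomBytes(16).toString("hex");
    profiles.set(id, []);
  }
  const history = profiles.get(id);
  if (!history.includes(site)) history.push(site);
  // Stand-in for ad selection: report the interests available for targeting.
  return { setCookie: id, targetedAt: [...history] };
}
```

On the first request the only targeting signal is sneakers.example; once the cookie comes back from recycling.example, the profile contains both sites.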

You can also use this same basic technique for what's called retargeting. Suppose you go to a site and look at some product. If the ad network has a presence on the site (this can be an invisible element) then they can record this event and use it to target ads specifically at people interested in that product.

The Bigger Picture #

The use of cookies for behavioral advertising is basically an unintended consequence of the design of cookies, specifically, allowing them to be used in what's often called a "third party" context, in which the site you are sending the cookie to is different from the site you are on. On the one hand, this is an example of the power and extensibility of a few basic primitives: you can build a global ad network based on not much more than the ability to load third party content onto a site and attach cookies to those requests. On the other hand, the result is a system built on ubiquitous surveillance.

At the time cookies were first introduced, people did understand that there were privacy implications. However, a lot of the attention focused on first party tracking (i.e., of your behavior on a single site). The original cookie RFC has a fairly extensive discussion of privacy, but the section that most clearly addresses the third party context is kind of confusing and seems almost to be discussing what is now called cookie syncing:

A user agent should make every attempt to prevent the sharing of session information between hosts that are in different domains. Embedded or inlined objects may cause particularly severe privacy problems if they can be used to share cookies between disparate hosts. For example, a malicious server could embed cookie information for host a.com in a URI for a CGI on host b.com. User agent implementors are strongly encouraged to prevent this sort of exchange whenever possible.

My sense is that people were sort of aware of the problem but just didn't anticipate the scale of tracking that would eventually result. It's also worth noting that early browsers would often prompt users before accepting cookies, thus making this kind of tracking more difficult. Eventually, of course, every site wanted to set a zillion cookies and the permission prompts got too annoying so they were removed, only to be replaced years later by the arguably even more annoying GDPR cookie consent dialogs.

This is a theme we'll be seeing throughout this series: a lot of the early Web features were designed to solve specific problems and without much understanding of the broader implications. It took years for the security and privacy community to catch up and develop a more comprehensive understanding of the security of the Web platform, and, as with advertising, we're still dealing with the implications of those original choices.

Technically, this third party is called a supply-side platform (SSP). There are also demand-side platforms (DSPs) which serve the advertisers, plus a bunch of other stuff. ↩︎

Understanding The Web Security Model, Part II: Web Applications

2022-03-08T00:00:00Z

Note: This is one of those posts that is going to be best read on the Web, especially if you read your email using GMail or the like, as it will tend to mangle some of the HTML features.

This is Part II of my series on the Web security model. In Part I, I talked about the basic structure of the Web and how Web publishing works. However, quite early in the lifetime of the Web people started to want to do more than just publish information. In particular, they wanted to sell stuff. Of course, you could just publish your catalog on the Web and then have people email you their order, but this is obviously pretty clunky; what you want is a Web storefront (yeah, I know this is obvious now, but we're talking 1994!).

It's possible to build even fancier applications like Facebook or Slack with not much more than the primitives I introduced in the previous post; it's mostly a matter of combining them in the right way. That's the topic of this post.

How to build a Web store #

As I said, much of the initial work around Web applications was in building shopping sites. Your basic shopping site was pretty simple, with just a few functions:

Showing the catalog of items.
Adding selected items to the shopping cart.
Checking out, buying the items in the cart.

Let's go through these one at a time.

Catalog #

If you have a relatively small number of items, then you can build a catalog entirely with technologies we saw in the last post. There are two main options here:

If you have a very small number of items you can just make a static Web page that shows them.
If you have a somewhat larger number of items—especially if they go in or out of stock, or you have different prices in different regions—then you can dynamically generate the Web page.

The first option is straightforward. The way that the second option works is that you have some database that is basically a list of every item (the jargon here is stock keeping unit (SKU)), its description, maybe a picture or two, and the price or prices. Then when the user's browser requests a given catalog page, some code on your server goes through the database and renders it into an HTML page and serves it back to the browser.
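As a rough illustration of the second option, here is what that rendering code might look like in JavaScript. The row shape (sku, description, price) and the function names are assumptions made up for this sketch; a real store would pull the rows from its database and use a proper templating system.

```javascript
// Escape text before placing it into HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render a list of catalog rows (hypothetical shape: sku, description, price).
function renderCatalog(items) {
  const rows = items
    .map(
      (item) =>
        `<li>${escapeHtml(item.description)} (SKU ${escapeHtml(item.sku)}): $${item.price.toFixed(2)}</li>`
    )
    .join("\n");
  return `<html><body><ul>\n${rows}\n</ul></body></html>`;
}
```

The output is an ordinary HTML string, so the browser has no way to tell it apart from a hand-written static page.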

It's important to realize that these two methods are interchangeable from the perspective of the browser; the server can switch between static and dynamically generated pages at will. It can also cache the dynamically generated pages—that is, temporarily store the output of what was generated—and serve that back to clients, thus saving run time and computing resources.

I know I keep making this point, but it really can't be overemphasized—as long as the data sent to the client is valid HTML, the browser doesn't care how it was generated. The point of having standardized network protocols is so that you can detach the implementation on each side from the messages they send to each other. This creates important implementation flexibility and allows new functionality to be added on either end without consulting the other. Part of what makes the Web so powerful is the combination of these standardized protocols with the ability to move implementation logic onto the client via JavaScript, as we'll see below.

This is great if you are a small site, but if your store is the size of Amazon (or even the LCBO), you obviously need people to be able to search. Fortunately, HTML has a feature that makes this straightforward, the <form> element. At a high level, a form element is a container for one or more input controls (text fields, buttons, pull-down menus, etc.). The form element also has an "action" which causes the client to send the values of these elements to the server.

For instance, here is the form element that represents the subscription box at the bottom of this page:

<form class="email-form" action="https://educatedguesswork-subscribe.herokuapp.com/subscribe" method="post">
  <input class="subscribe-email" type="email" placeholder="Your e-mail address..." id="email" name="email">
  <input class="subscribe-button" type="submit" value="Subscribe"/>
</form>

Ignore the class attributes; they are just labels that are used to attach CSS styles to the form. The key thing to look at here is the action attribute on the first line. What this says is that when you "submit" the form the browser will navigate to https://educatedguesswork-subscribe.herokuapp.com/subscribe. The first input field, of type email, creates a text field that you can put your email address into. You submit by clicking on the "Subscribe" button which is generated by the second input field, of type submit.

This produces the following result, which you can actually use to subscribe to my newsletter. Take a minute to do it now.

All done? Great.

When you fill in the form and click submit, the client sends the server an HTTP request that looks like this:

POST /subscribe HTTP/1.1
Host: educatedguesswork-subscribe.herokuapp.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
[other headers deleted]

email=ekr%40rtfm.com

To orient yourself, the first line is called the "request line", the next lines are called "headers", and the stuff after the blank line is called the "body". The Host header and the second field of the first line (/subscribe) together match the URL in the action attribute of the form element defined above. The body of the submission contains the value of the form, in this case the email field and the value of ekr@rtfm.com.^[1]
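You can reproduce the body encoding yourself with the standard URLSearchParams class, which is available both in browsers and in Node.js. This short sketch shows the round trip; the variable names are just for illustration.

```javascript
// Encode form fields the way a browser does for a default POST form body.
const body = new URLSearchParams({ email: "ekr@rtfm.com" }).toString();

// The server side reverses the process to recover the field values.
const parsed = new URLSearchParams(body);
const email = parsed.get("email");
```

Here body is the string email=ekr%40rtfm.com, matching the request above (the @ is percent-encoded as %40), and email decodes back to ekr@rtfm.com.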

Even though this comes from a form submission, it's conceptually like a link click, and the result is that the browser is navigating to a new page. Therefore, the server is expected to respond with a new HTML page. As noted above, it can generate this page however it wants, but the idea is that it will do some processing on the form submission input, in this case subscribing you to the list. The response is just an HTML page indicating (hopefully) success.

It should be obvious at this point how to use an HTML form to build a search interface: you use almost exactly the same HTML as above, except with different text labels and probably input type search rather than email. The user would type the product search term in the box and click submit; the server would respond with the products that match the search term. That's all there is to it.

Statelessness #

In the early days of the Web, there was a lot of emphasis on how HTTP was stateless, which is to say that each request by the client was independent of every other request and that the protocol had no way of linking them up. This property extended down to the network layer: each request was carried over a new TCP connection, with the connection being closed after the server sent the response (in fact, closure of the connection was often used to indicate the end of the response).

Statelessness turns out to be a fairly inconvenient property for several reasons. The first is the one we are seeing here, which is that lots of things the server wants to do require creating continuity between client requests and so it was necessary to retrofit a state-keeping mechanism.

The second reason is performance: because of the way that network protocols are designed, there is a significant amount of startup overhead each time a connection is created (see slow start), so having a new connection for each request adds significant delays. Much of the history of the development of HTTP is concerned with removing the legacy of this initial decision, first by adding multiple requests on the same connection and then by adding multiplexing of multiple simultaneous requests.

Shopping Carts #

Our next job is to let the user select some products and add them to their shopping cart. Unfortunately, this presents us with a problem, which is remembering which items the user has selected. The problem is that the HTTP requests to the server don't contain any kind of user identifier, so when your browser sends a request asking to add an item to your shopping cart, how does the server know whether to add it to your cart or to my cart?

The solution to this problem that eventually emerged is what's called a "cookie". The idea behind a cookie is simple: the server sends the client a cookie in the header of one HTTP response and the client stores it. The client then sends the cookie to the server in subsequent requests. The cookie is just an opaque string to the client and the server can construct it any way it pleases, but there are two main options:

An opaque identifier for this user or session. This identifier is then used as an index into some database that stores the user's state.
An actual representation of the user's state (e.g., a list of items in its cart).

Because the cookie is opaque, the server is, of course, free to use either of these techniques or a combination of the two.

The diagram below shows an example of how cookies can be used to build a shopping cart:

In this case, the server has chosen to use a back-end database, so the cookie is just an opaque identifier (XYZ). Initially, the client contacts the server and requests the catalog. The client and the server have never talked before so the client doesn't have a cookie. The server creates a new cookie with value XYZ and stores an empty shopping cart [] in the database associated with that cookie. It then returns the catalog to the client along with the cookie.

The user browses through the catalog and selects item 1234. When they click to add it to the shopping cart, the browser sends a request to the server with the item id and the cookie. The server then uses the cookie to retrieve the shopping cart. Seeing it's empty, it adds the item to the cart and stores that in the database. Finally, it returns a confirmation to the user. The user browses the catalog some more and decides to buy item 5678. This transaction proceeds the same way, except that this time the server adds it to the already non-empty shopping cart, ending up with two items.
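The server side of this walk-through can be sketched as follows, assuming Node.js. The in-memory map stands in for the back-end database, and the function names are hypothetical; this is the opaque-identifier option, not the only way to do it.

```javascript
const crypto = require("crypto");

// Hypothetical back-end database: cookie value -> shopping cart (array of item ids).
const carts = new Map();

// Return the cookie to use for this client, creating an empty cart if needed.
function ensureSession(cookie) {
  if (cookie && carts.has(cookie)) return cookie;
  const fresh = crypto.randomBytes(16).toString("hex");
  carts.set(fresh, []);
  return fresh;
}

// Serve the catalog; a client without a cookie gets a new one.
function getCatalog(cookie) {
  return { cookie: ensureSession(cookie), page: "catalog" };
}

// Add an item to the cart identified by the cookie.
function addToCart(cookie, itemId) {
  const session = ensureSession(cookie);
  carts.get(session).push(itemId);
  return { cookie: session, cart: [...carts.get(session)] };
}
```

Calling getCatalog with no cookie, then addToCart twice with the returned cookie (items 1234 and 5678), leaves that session with a two-item cart, while any other client's cart is untouched.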

Checkout #

At this point, we have all the tools we need to do checkout. When the user presses the checkout button, the server uses the cookie to collect all the items in the shopping cart and compute the final price. It then provides a Web form which lets the user enter their name, address, payment information, etc. The user submits that form (with the cookie, of course), and the server processes the transaction. It then can clear the shopping cart (so that the user can start shopping again) and send back the confirmation page.

Client-Side Applications #

In principle you can build just about any application you want with the techniques described above. In practice, though, loading a new page whenever you want to change anything is painfully slow.^[2] It's certainly too slow to give a smooth app-like experience. Moreover, it's ugly because the page flashes as it rerenders and so it's anything but smooth. The resulting system isn't really viable for anything significantly interactive like Google Maps, Slack, etc.

Fortunately, we already have the solution: JavaScript. Recall that in Part I I said that JavaScript could change the DOM and that this would cause the page to change as well. The key thing is that unlike a page reload, small changes to the DOM mostly don't cause the entire page to rerender (only the elements that need to be updated).

Here's a simple example of what I'm talking about. The box below is a list of entries. If you enter a new entry in the box at the bottom and hit return, it will be added to the list without the page reloading.

Shopping List
Apples
Bananas

The way this works is just that I have a tiny piece of JavaScript that watches for you to hit return in the entry box and adds the value of the box into the list:

const tbodyEl = document.getElementById("entries-list");
const textboxEl = document.getElementById("list-addition-entry");
const formEl = document.getElementById("list-addition-form");

formEl.addEventListener("submit", function(event) {
    event.preventDefault();
    const row = tbodyEl.insertRow(-1);
    const cell = row.insertCell(0);
    cell.appendChild(document.createTextNode(textboxEl.value));
    textboxEl.value = "";
});

We don't need to go through this in detail, but at a high level, the first three lines select the relevant elements (the table, the textbox, and the form), and the rest of the code is a JavaScript function that retrieves the value from the textbox and adds it to the list. Attaching it to the "submit" event ensures it will run whenever the form is submitted, which is when you press return. Obviously this is a trivial example, but trivial examples are the stepping stones to real programs. Suppose we wanted to make something like Slack. The most basic version really only needs two small changes:

When you type into the window, it needs to send a message to the other people in the chat.
When someone sends you a message, it needs to receive it and add it to the list of messages.

These are both done with the same basic technique: a Web Service API.

Web Service APIs #

So far, all the examples of requests made to Web servers are for content which will then be consumed by the browser (e.g., HTML, JavaScript, etc.). A Web service API is different: it serves data that is intended to be consumed by JavaScript running in the browser.^[3] For instance, in our chat application, the server would have (minimally) two functions:

Send a message to a channel.
Receive any new messages on a given channel.

Each function requires defining a few things:

The URL (path) for the API function. It's conventional to refer to the URL, and by extension the function, as an "API endpoint".
A definition for the data that the client sends to the server (both format and semantics)
A definition for the data that the server sends to the client

For instance, here's the API that Slack uses to post a message.

The Client Side #

On the client side, the JavaScript uses the fetch API or the older XMLHttpRequest (XHR) API to talk to the server. These Web APIs let it make arbitrary (within some limits I'll cover later) HTTP requests to the server, which means that they can use the endpoints provided by the server.^[4] To continue our chat example above, whenever the user types a message into the compose window and hits enter, the JavaScript function that gets activated would use fetch to tell the server that a new message had been added to the chat. This might look something like:

POST /send-message HTTP/1.1
Host: chat-server.example.com

message=Hello World!

Obviously, this could be fancier and include a channel identifier, or, if it were a direct message, the recipient identifier, but you get the idea. Depending on the way the application was written, that same function might add the message to the local window or the server might handle this with the same code it uses for incoming messages (see below).
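On the JavaScript side, a sketch of the code that produces a request like this might look as follows. The URL and the message field are the made-up ones from the example above, and the function names are hypothetical; the only real APIs used are fetch and URLSearchParams.

```javascript
// Build the request described above. The URL and field name are the
// illustrative ones from the text, not a real service.
function buildSendRequest(text) {
  return {
    url: "https://chat-server.example.com/send-message",
    options: {
      method: "POST",
      // A URLSearchParams body is sent as application/x-www-form-urlencoded.
      body: new URLSearchParams({ message: text }),
    },
  };
}

// Send the message; resolves to true if the server reported success.
async function sendMessage(text) {
  const { url, options } = buildSendRequest(text);
  const response = await fetch(url, options);
  return response.ok;
}
```

One detail the simplified request above glosses over: on the wire the body is form-encoded, so "Hello World!" actually travels as message=Hello+World%21.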

This brings us to incoming messages. The simplest way for this to work is for the server to have an endpoint that allows the client to ask for new messages. For instance, it might look something like this:

GET /get-message?lastmessage=105 HTTP/1.1
Host: chat-server.example.com

The semantics of this request would be something like "Send me a copy of every message with a sequence number greater than 105". That way, the client can just ask for new messages without the server having to remember which ones the client already knows. And a new client can get all the messages by sending lastmessage=0 (or maybe -1, if you started counting from 0). The server would then respond with a list of new messages, which would be empty if there were no new messages. Once those messages are received, the client side JavaScript can just add them to the message window.
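The bookkeeping on both sides is small enough to sketch directly. The message shape ({ seq, text }) and the function names are assumptions for illustration; the point is that the client, not the server, remembers where it left off.

```javascript
// Server side: return every message with a sequence number greater than `last`.
// `log` is a hypothetical array of { seq, text } entries.
function messagesAfter(log, last) {
  return log.filter((m) => m.seq > last);
}

// Client side: show the new messages and remember the highest sequence number seen.
function applyMessages(state, messages) {
  for (const m of messages) {
    state.shown.push(m.text);
    state.last = Math.max(state.last, m.seq);
  }
  return state;
}
```

After applying a response, the client sends its updated state.last as the lastmessage parameter of the next request, so an empty response simply leaves the state unchanged.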

This style of application was originally known as Asynchronous JavaScript and XML (AJAX). Asynchronous because you could be using the Web application while it talked to the server. JavaScript for obvious reasons. XML because at the time most servers used XML to send messages around (XML is just a structured data format). In recent years, however, fashions have changed and increasingly people structure their data in JavaScript Object Notation (JSON) instead.^[5] "AJAJ" just doesn't have the same ring to it, though. Whatever the name, this is now the dominant style of Web application, for sites as diverse as Google Maps, Facebook, Slack, and Kayak. You still see old-style Web applications, but if you want to do something fancy (which people often do) then it's likely to have some sort of AJAX-y component.

Just to keep emphasizing this point: the only new piece of technology here is the existence of the client-side HTTP APIs. Everything else is just done server-side by adding new server-side endpoints and writing new JavaScript which the server sends to the client.

Notifications #

With that said, there is one kind of inconvenient property of this system: We've just shown how the client can find out what messages are available, but how does the client know when to ask? The obvious approach is to just poll the server constantly, but then you're adding a lot of load to the server as well as a lot of network traffic. You can also poll less frequently, like every 10 seconds or so; but while this might be fine for e-mail, it's really not fast enough for instant messaging, because it means that on average each message will be delayed by 5 seconds.
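The tradeoff is easy to put numbers on. This is a back-of-the-envelope sketch, assuming a message is equally likely to arrive at any point in the polling interval and ignoring network latency; `polling_cost` is a name invented for the example.

```python
def polling_cost(interval_s: float) -> tuple[float, float]:
    """Return (average delay in seconds, requests per client per minute)."""
    average_delay = interval_s / 2          # on average a message lands mid-interval
    requests_per_minute = 60 / interval_s   # each poll is one HTTP request
    return average_delay, requests_per_minute


slow = polling_cost(10)     # (5.0, 6.0): cheap for the server, sluggish for chat
fast = polling_cost(0.5)    # (0.25, 120.0): snappy, but 20 times the requests
```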

Paving the Cowpaths #

The story of long polling and WebSockets is a common pattern on the Web. The Web is now powerful enough that you can usually get the job done, though perhaps in a hacky and inefficient way. But people have product requirements so they do it anyway. Once some technique gets common enough, then it becomes attractive to build a better version into the platform ("paving the cowpaths") but application developers don't need to wait for that to happen. Moreover, there is usually a long period where only some browsers support the new technology, so application developers will check to see if it's available on a given browser and if so use it, and otherwise fall back to the old hack.

The fundamental problem is that HTTP requests are initiated by the client and there's no way for the server to talk back without the client saying something first. And then someone clever realized that instead of having the server respond immediately when there were no new messages, it could instead wait to respond until there were new messages. This is called a "long poll" and lets the client get the information right away, without constantly polling the server.

Long polling works, but it's not ideal. Due to various timeouts at different parts of the system, you can't have an HTTP request outstanding indefinitely, so as a practical matter the request times out after some tens of seconds and then you have to reissue it. Also, it's just kind of a hack. Back in 2011 the IETF standardized a protocol called WebSocket that provided a bidirectional channel on top of HTTP to replace long polling.^[6] This is a new (well, not so new now) API, but fundamentally it's an optimization over long polling and if WebSocket isn't available you can always fall back to long polling.
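To make the long poll concrete, here is a small in-process simulation in Python. It is only a sketch: `MessageLog` and `client_loop` are invented names, a thread stands in for the network, and the timeout is a fraction of a second rather than tens of seconds. The shape is the point: the "server" holds the request until there is something to say or the timeout fires, and the "client" just reissues the request.

```python
import threading


class MessageLog:
    """In-memory stand-in for the chat server's message store (hypothetical)."""

    def __init__(self) -> None:
        self._messages: list[str] = []
        self._changed = threading.Condition()

    def post(self, text: str) -> None:
        with self._changed:
            self._messages.append(text)
            self._changed.notify_all()    # wake up any waiting long polls

    def long_poll(self, last_seen: int, timeout_s: float) -> list[str]:
        """Block until a message after last_seen exists, or the timeout expires.

        An empty list means "timed out, ask again".
        """
        with self._changed:
            self._changed.wait_for(
                lambda: len(self._messages) > last_seen, timeout=timeout_s
            )
            return self._messages[last_seen:]


def client_loop(log: MessageLog, want: int, timeout_s: float = 0.2) -> list[str]:
    """Keep reissuing the long poll until `want` messages have arrived."""
    received: list[str] = []
    while len(received) < want:
        received.extend(log.long_poll(len(received), timeout_s))
    return received


log = MessageLog()
threading.Timer(0.05, log.post, args=["hello"]).start()   # someone else posts shortly
received = client_loop(log, want=1)                        # returns once "hello" arrives
```

Note that the client never sleeps between requests: the waiting happens on the server side, which is exactly what makes this different from ordinary polling.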

Post-Standardization #

Up to now I've been focusing on how Web applications are built, but now I want to zoom out and talk about the bigger picture.

Traditionally, client-server applications relied on standardized protocols. This means that there is some document which describes what messages the client can send the server and how the server will behave in response and vice versa. For instance, if you are reading mail on your iPhone, you are probably using a standardized protocol (likely IMAP) to talk to the server. This is why the iOS mail client can talk to any mail server; you just need to give it the address of the server and your username and password. All of the protocol machinery is built into the mail client, which knows how to send email, download it, etc. It can show any UI it wants but it needs to comply with the protocols.

The Web is also built on standardized protocols, of course: HTTP and TLS for interacting with the server, HTML and CSS for formatting the page, JavaScript and Web service APIs for application logic. These are all standardized, which is why—at least most of the time—any Web site will work on any browser. But these standards only define the application infrastructure: the actual Web application is a combination of logic on the server (however that's implemented) and logic on the client written in JavaScript. This has huge implications because it means that the application author provides both the client and the server and therefore doesn't need to coordinate with anybody but themselves. That's why the world was able to switch from applications that used XML for data transfer to JSON for data transfer without changing the Web browser at all.

When the first real interactive Web applications using AJAX came out, this was a truly revolutionary property. After years of painstaking coordination defining every detail of application protocol behavior, suddenly it was possible to quickly build a complete client/server application without talking to anyone. It had of course always been possible to define your own protocol and write a client and server that spoke it, but getting people to download your client was a huge obstacle; by contrast anybody could use your Web app just by navigating to the right place. Moreover, the Web browser included all kinds of powerful facilities (this is even more true now) that you would have had to build (or at least download) yourself.

Of course, now it's 2022, 15 years after the introduction of the iPhone. We have mobile app stores and the problem of software distribution (and in particular updating) has gotten much easier, so on mobile you can invent some proprietary protocol and roll out an app and as long as people download it, you're good to go. If you want to change the protocol, no problem, just update to a new version. The Web is like this, but even more so because users don't need to install or update software: they just get whatever the new thing is when they load your site. This lets vendors build a completely vertically integrated system that leverages the power of the Web platform but without having to standardize (or, often, even document) anything.

Obviously, this has real benefits in terms of engineering velocity, but it's also contributed to a situation in which the user experience of a site and its functionality are completely entangled, so it's hard to use (say) Facebook without the Facebook UI. If you don't like something about that UI, you're basically out of luck. And even if you did reverse engineer the server-side APIs that Facebook used and write your own client, there's no guarantee Facebook won't change those APIs tomorrow. By contrast, if you want to use a different mail client with mail that is hosted by Gmail, it's just a download away.

This isn't to say that there isn't still plenty of work going into creating standardized technologies for the Web. However, that work is primarily concentrated on creating new plumbing (e.g., TLS 1.3 or QUIC) or new Web platform features (e.g., WebRTC or WebAssembly). This all makes the Web a better platform for running applications, but the applications themselves live on top of that substrate and are largely opaque and non-interoperable.

Next Up: Origins, and the Same Origin Policy #

At this point, we've covered most of what you need to know about how the Web works in order to understand its security model (and I'll be introducing the rest as we go). In the next post, I'll be covering the basic unit of Web security: the origin.

Note that this actually says ekr%40rtfm.com. This is what's called escaping of the @-sign. It's not really necessary here but is done for consistency with cases where the address would appear in the URL, where the @-sign is forbidden. ↩︎
Not quite as slow as you might think because a lot of the images and the like on the page can be cached, but still slow. ↩︎
Obviously standalone apps can and do use these APIs, but the topic of these posts is the Web. ↩︎
Yes, I know that calling both of these APIs is confusing. I resisted calling the HTTP APIs offered by servers "APIs" and then finally gave up. ↩︎
JSON is modestly easier to work with, but like styles of jeans, data formats tend to cycle in and out of fashion. ↩︎
For the nerds here, we also have the Web Push API which consolidates channels to multiple servers. ↩︎

Understanding The Web Security Model, Part I: Web Publishing

2022-03-04T00:00:00Z

Note: This is one of those posts that is going to be best read on the Web, especially if you read your email using Gmail or the like, as it will tend to mangle some of the HTML features.

Like many pieces of technology, the Web is one of those things that people are perfectly happy to use but have absolutely no idea how it works.^[1] It's natural to think of the Web as a publishing system, and at some level it is: the Web lets people publish documents for anyone to read. But what the Web really is is a distributed computing platform that lets Web sites run code on your computer.^[2] Originally, of course, that code just rendered documents, but now it's used for everything from documents (like the one you're reading now) to text-based applications like Slack or even videoconferencing apps like Google Meet. Unsurprisingly, then, the Web has a unique security model, which is the topic of this series of (some unknown number of) posts.

I meant to start right in on security but then I realized I first needed to provide enough background of how the Web works to have the security stuff make sense. This post is the first half of that background material, covering the structure of Web sites and pages. There will be a second post that covers Web "applications". This isn't a textbook or a specification, so I don't intend to provide a complete picture; the idea here is to cover the essential elements for understanding the security model.

The URL #

Everything on the Web starts with the Uniform Resource Locator (URL), which, as Wikipedia puts it, is commonly called the "web address". Minimally, it's the thing that shows up in the address bar of your browser when you go to a Web page, but actually everything on the Web has a URL, not just web pages. For instance, most Web pages are made up of a mix of text and images and each of those images has its own URL. In fact, you can (usually) independently load each individual subcomponent of the page by right-clicking on it, like so:

What a URL really is is just the address of some thing (the technical term here is resource) on the Web. Given the URL for a thing, your browser can go to the indicated location (i.e., the Web server), load the resource, and do something with it. What that something is depends on the resource type and the context in which it's loaded, as we'll see below. For instance, if the resource is an HTML document or a PNG image, then the browser will try to display it. If it's a zip file, the browser might try to save it to your disk.

A URL (at least for the Web) has three major parts, shown in the diagram below. [Attention nitpickers: I'll get to query and fragment shortly.]

Scheme #

The first part of the URL is what's called the scheme, which indicates the protocol that the client (the browser) should use to access the resource. The Web itself has two important schemes:

http, which means to use the Hypertext Transfer Protocol (HTTP)
https, which means to use HTTP with the Transport Layer Security (TLS) secure transport protocol.

Schemes and Protocols #

In practice, the scheme doesn't refer to a single protocol but actually to a family of protocols which have roughly the same externally visible properties and can be mutually negotiated. For instance, there are three main versions of HTTP (HTTP/1.1, HTTP/2, and HTTP/3), all of which are fairly different on the wire. Similarly, there are several different versions of TLS. Finally, HTTP/3 doesn't run over TLS but actually runs over the QUIC transport protocol which uses the TLS handshake for security. All of these different protocols can be addressed with the same set of URLs, with the browser and the server automatically selecting the right protocol. This is actually an important requirement for seamlessly deploying new protocols: for instance if HTTP/2 had required a new scheme it would have taken much longer for it to be deployed, if ever, because everyone would have had to change their pages.

There are a huge number of registered schemes,^[3] but as a practical matter very few matter for the Web. When the Web was young, there were a number of different information transfer protocols and browsers used to support a number of other transports besides HTTP, such as the File Transfer Protocol (FTP) and the Network News Transfer Protocol (NNTP) and Gopher. However, as the information systems those protocols were associated with were subsumed by the Web, HTTP became the dominant protocol and those protocols were allowed to rot, and now HTTP(S) in its various versions is basically the only game in town for transferring Web pages.

There are a few other URL schemes that matter on the Web for specialized purposes, such as the mailto scheme for indicating an email address or the turn scheme for indicating relays to be used with the TURN protocol in WebRTC. These serve an important purpose, but aren't really used as part of the main structure of the Web. These schemes will often have a different structure than Web URLs, for instance mailto URLs look like mailto:ekr@example.com, but we don't need to worry about that for now.

Host #

The second piece of an HTTP/HTTPS URL is the host, which is just the name of the server hosting content. As discussed in excruciating detail in my series on DNS, this host name is resolved to an IP address via the DNS and the browser then connects to that IP address. If the browser is dereferencing an HTTPS URL, it will also expect that the server present a certificate which has the hostname in it, thus—at least in theory—demonstrating that the browser is talking to the expected server.

Path #

The final piece of the URLs shown above is the "path" component, which indicates the actual resource on the Web site which you are accessing. The structure of this component is extremely server specific. In theory, the server could just name all of its resources 1, 2, etc. but in practice, the path tends to somewhat mirror the server's directory structure, with the / separator indicating directories on the server, etc., and this is what common servers encourage.

Even for sites that are more like applications and that don't really have directories of files, it's conventional for paths to have a hierarchical structure that mirrors the underlying information hierarchy. For example GitHub URLs look like:

https://github.com/[username]/[repository-name]/

with the list of issues at

/[username]/[repository-name]/issues/

and individual issues at

/[username]/[repository-name]/issues/[issue-number].

Query and Fragment #

There are two other pieces of the URL that I didn't show above but that are important to be aware of:

"Query arguments" are a list of keyword-value pairs, e.g.,

https://example.com/foo.html?foo=bar

These are automatically appended by the Web browser when the user interacts with specific kinds of elements, such as "web forms". These will make an appearance later.

"Fragments" allow the browser to refer to individual portions of the page. For instance, the URL:

https://educatedguesswork.org/posts/web-security-model-intro1/#query-and-fragment

goes to the section you are reading now. The key thing to know about the fragment is that because it's used for intra-page navigation, it doesn't get sent to the server, but is processed solely by the client. Moreover, if you click on a fragment link on the same page (you can try it with the link above), the browser will just scroll to that point, but doesn't need to connect to the server to reload the page.
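If you want to poke at these pieces yourself, most languages have a URL parser in the standard library. Here's a short sketch using Python's `urllib.parse`; the URL is one I made up by gluing the query example and a fragment onto example.com, so don't expect it to point at anything.

```python
from urllib.parse import parse_qs, urldefrag, urlsplit

# Hypothetical URL combining the query and fragment examples above.
url = "https://example.com/foo.html?foo=bar#query-and-fragment"
parts = urlsplit(url)

scheme = parts.scheme            # "https"
host = parts.hostname            # "example.com"
path = parts.path                # "/foo.html"
query = parse_qs(parts.query)    # {"foo": ["bar"]}
fragment = parts.fragment        # "query-and-fragment"

# The fragment stays on the client; this is the URL the server is asked about.
without_fragment, _ = urldefrag(url)    # "https://example.com/foo.html?foo=bar"
```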

The Web Architecture #

The diagram below shows the overall structure of a drastically oversimplified Web application, on both the client and the server.

Even this simplified version is pretty complicated, so I'll walk through it slowly.

As you would expect from the above discussion, the process starts with the URL, whether the user enters it directly, clicks a bookmark, or clicks on a link. The browser then goes to the server and requests that URL. In nearly every case what's going to come back is a HyperText Markup Language (HTML) page.

HTML #

We don't need to go into HTML in too much detail, but at a high level, HTML is structured text. What this means is that HTML is a text file that contains extra information ("markup") that tells the browser how to interpret it. As a simple example, consider the following HTML fragment:

<h4>This is a header</h4>

This is some text with a hyperlink. <a href="https://educatedguesswork.org/">hyperlink</a>.

Just to orient yourself, HTML markup mostly consists of paired "start" and "end" markers ("tags") that indicate that the stuff in between them is associated with the tag. If you have a tag xx then the start tag will be <xx> and the end tag will be </xx> and the stuff in between will be called the "xx element". Tags can also have attributes that get attached to the start, like:

<xx attr1="abc">

which means "tag xx has attribute attr1 with the value abc".

In this example, then, the h4 markers indicate that the text inside them is a header (at header level 4) rather than body. The

<a href="https://educatedguesswork.org">

block indicates that the text inside it is a hyperlink, which just means that it's a section of text that contains the text "hyperlink" and when you click on it it navigates the browser to the page indicated by https://educatedguesswork.org. This will get rendered something like this:

This is a header

This is some text with a hyperlink. hyperlink.
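To see the start tags, end tags, and attributes the way a parser sees them, you can run the same fragment through a generic HTML parser. This sketch uses Python's standard `html.parser`; the `TagLogger` class is just a name for this example, and a real browser's parser does a great deal more (error recovery in particular).

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each start tag, end tag, and run of text as the parser sees it."""

    def __init__(self) -> None:
        super().__init__()
        self.events: list[tuple] = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))


logger = TagLogger()
logger.feed(
    "<h4>This is a header</h4>\n"
    "This is some text with a hyperlink. "
    '<a href="https://educatedguesswork.org/">hyperlink</a>.'
)
logger.close()
# logger.events now starts with ("start", "h4", {}) and includes
# ("start", "a", {"href": "https://educatedguesswork.org/"}).
```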

It's important to recognize that this markup is (mostly) semantic. Instead of telling the browser that the margins should be size whatever, you're supposed to just provide the page structure and the text of the page and leave the browser to figure out how to render it (though of course you should expect to have reasonable margins, emphasized headers, etc.). HTML does have some basic formatting stuff like bold and italics, but it's quite limited and insufficient for making the document look the way you really want; with just HTML you're mostly at the mercy of the browser's styling decisions, with results that tend to be somewhat less than satisfactory.

HTML has a whole pile of other types of markup for things like lists, tables, buttons, etc. We mostly don't need to worry about these right now. What is important, however, is that HTML can also include tags that pull in other resources from the site. For instance, you can have an <img> tag which loads an image off the site and renders it at that place in the document, as in the following fragment, which pulls in the diagram shown above. The src attribute is the place to load the image from.

<img src="/img/overall-web.svg">

Already this is pretty useful: you can use HTML to publish fairly rich documents. In fact, this was pretty much all that was in the original Web. However, it quickly became clear that people wanted to have more control over sites. In particular, they wanted more control over how things looked and they wanted to be able to add arbitrary dynamic content that ran on the client. In the Web, these needs are addressed by allowing the HTML document to use two other kinds of resources that serve these functions:

Cascading Style Sheets (CSS), which allows you to tell the browser how to render your content.
JavaScript (JS), a general purpose programming language which, among other things, allows you to manipulate the HTML and CSS of the page.

It's possible to embed the CSS and JS in the page directly, but what's more common is actually to have HTML tags which reference CSS and JS files on the server. So, what happens in practice is that the HTML loads and then as the browser parses it, it finds the tags for CSS, JS, as well as images and the like and loads them all from the server to assemble the correct page.
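The "find the tags and load what they point to" step can be sketched in a few lines. This is Python rather than anything a browser actually runs, the `SubresourceFinder` name is invented, and it only looks at three tag types. The CSS and JS file names in the sample page are hypothetical; the image path is the one from the example above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubresourceFinder(HTMLParser):
    """Collect the URLs a page asks the browser to load (a small subset of tags)."""

    # tag name -> attribute that holds the URL
    URL_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.found: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.URL_ATTRS.get(tag)
        value = dict(attrs).get(attr_name) if attr_name else None
        if value:
            # Relative references are resolved against the page's own URL.
            self.found.append(urljoin(self.page_url, value))


page = """
<link rel="stylesheet" href="/css/site.css">
<script src="/js/app.js"></script>
<img src="/img/overall-web.svg">
"""
finder = SubresourceFinder("https://educatedguesswork.org/posts/web-security-model-intro1/")
finder.feed(page)
# finder.found now holds three absolute URLs on educatedguesswork.org
```

A browser would then issue a separate request for each of those URLs.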

CSS #

As I mentioned above, originally the Web mostly had semantic markup, so you could say "this is a header" and some very limited styling ("use this font") but not "render this column with a 20 pixel margin". CSS allows you to apply styles to the content of the page. As noted above CSS can be embedded in the HTML (that's how the newsletter version of this site works) but is commonly loaded off of separate resources, with the HTML just pointing to the CSS. I don't intend to write too much about CSS; while there are security and privacy issues around CSS, most of Web security is concerned with other things.

JavaScript #

HTML and CSS are pretty powerful all on their own if what you want is a static Web site that publishes information. They also have some limited interactive capability: for instance you can have a web form where people can fill in information, click on radio buttons, etc., and even send that data to the server which can then act on it. But at the end of the day they're limited and lots of applications require a general purpose programming language. This is where JavaScript comes into the picture.

JavaScript itself is just a regular programming language at roughly the same level of abstraction as other "scripting" languages like Python or Ruby. You can use JavaScript for anything you would use those languages for, though you might not want to. What makes JavaScript special to the Web is two things: (1) browsers know how to execute it natively, which means if you send them JavaScript they will run it, whereas if you send them Python, they'll just display it to the user or try to save it on disk,^[4] and (2) the browser has special JavaScript APIs that let the JavaScript code interact with the user and the Web page.

The DOM #

HTML, CSS, and JavaScript work together to produce the experience you see on the Web via what's called the Document Object Model (DOM). The way this works is that the browser parses the HTML provided by the server into an abstract data structure that reflects the structure of the underlying HTML.^[5] The DOM is then used to generate what you see on the screen. Both CSS and JavaScript work by addressing the DOM. For instance, CSS works by providing style information for certain elements of the DOM (e.g., this paragraph) or certain types of elements ("all headers") (simplifying, remember!).

JavaScript is much more powerful. First, it can manipulate the DOM itself, by adding, removing, or changing elements. When changes are made to the DOM, the browser will rerender the page, which means that JavaScript can change what appears on the screen. This can also have other side effects: for instance if JavaScript adds a new <img> tag, that will cause the image to be loaded off the server and displayed as part of the page. One unobvious consequence of this ability is that because JavaScript is loaded into the page with HTML <script> tags, one piece of JavaScript can load new pieces of JavaScript by inserting new <script> tags; it can do the same for CSS as well of course. These turn out to be powerful but also dangerous capabilities.
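Here is a toy model of that tree, and of the kind of change JavaScript can make to it, written in Python for illustration. Everything here (`Node`, `TreeBuilder`, `outline`) is invented for the example; it only handles well-nested tags and bears no resemblance to the real DOM API beyond the basic shape.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)    # Node objects or text strings


class TreeBuilder(HTMLParser):
    """Turn well-nested HTML into a tree of Node objects."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#document")
        self._open = [self.root]    # stack of elements that are not yet closed

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self._open[-1].children.append(node)
        self._open.append(node)

    def handle_endtag(self, tag):
        if len(self._open) > 1 and self._open[-1].tag == tag:
            self._open.pop()

    def handle_data(self, data):
        if data.strip():
            self._open[-1].children.append(data.strip())


def outline(node: Node, depth: int = 0) -> list[str]:
    """Return one indented line per element, to make the tree shape visible."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        if isinstance(child, Node):
            lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed('<p>Some text with a <a href="https://educatedguesswork.org/">hyperlink</a>.</p>')
builder.close()
before = outline(builder.root)    # ["#document", "  p", "    a"]

# What a script effectively does when it adds an element to the page:
paragraph = builder.root.children[0]
paragraph.children.append(Node("img", {"src": "/img/overall-web.svg"}))
after = outline(builder.root)     # same tree, plus "    img" under the paragraph
```

In a real browser, that last change would also trigger a fetch of the image and a rerender of the page.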

In addition to the DOM APIs, the browser has lots of other APIs that let JavaScript interact with the network or the user. For example:

Perform network requests to the server using fetch()
Read from the camera and microphone using getUserMedia()
Form peer-to-peer connections with other browsers using RTCPeerConnection

One of the major ways in which the Web gets extended is by adding new APIs; obviously JavaScript can do any computation that any other language can do, but if you want to affect the outside world, then you generally need some API to do it.

The Server #

This brings us to the Web server. The most basic Web server just serves static files to the client: the client sends a URL and the server sends back the corresponding file. In the early days of the Web, the structure of the URLs as shown in the path component would mirror the structure of the server's filesystem. For instance, you might have a server which stored files in /home/server/, in which case the URL https://example.com/abc/def.html would correspond to /home/server/abc/def.html. And those files themselves would be Web pages or the other assets on them (like images). Over time, though, the world has gotten more complicated: this kind of direct mapping is still possible, but things can also be a lot fancier. In particular, instead of just serving static files the server can perform computations and return the results to the client.
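The static-file mapping is simple enough to write down. This is a sketch in Python with invented names (`DOCUMENT_ROOT`, `file_for_url`), not how any particular server does it. The refusal of ".." segments is my addition: without something like it, a URL could walk out of the directory the server meant to publish.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import unquote, urlsplit

DOCUMENT_ROOT = PurePosixPath("/home/server")    # directory from the example above


def file_for_url(url: str) -> Optional[PurePosixPath]:
    """Map a URL's path component onto a file under DOCUMENT_ROOT.

    Returns None for paths containing "..", which would escape the root.
    """
    raw_path = unquote(urlsplit(url).path)
    segments = [s for s in raw_path.split("/") if s not in ("", ".")]
    if ".." in segments:
        return None
    return DOCUMENT_ROOT.joinpath(*segments)


ordinary = file_for_url("https://example.com/abc/def.html")    # /home/server/abc/def.html
escaping = file_for_url("https://example.com/../etc/passwd")   # None
```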

The Structure of Web Servers #

As I said, the original Web servers just served whatever was on the file system to the client. But people quickly realized that they wanted to be able to have the server provide dynamic content. The original way to do this was with something called Common Gateway Interface (CGI). The way CGI worked was that you would have a special directory, by convention called /cgi-bin, and instead of serving the files in that directory, the web server would run them and send the output to the client. This wasn't that efficient, but it got the job done. You'll still see it in some places on the Web.

More recently, it's become common to invert this structure and have Web servers which handle essentially every request programmatically. For instance, the popular Express framework for Node.js lets you register individual functions to handle portions of the URL namespace. These functions can just generate content directly or can use files as a template to generate the content based on the file and some information the server has. These servers can of course handle static files, but this is done by having a special code module which then reads those static files off the disk and then serves them.
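The "register a function for a piece of the URL namespace" idea looks roughly like this. To be clear, this is a Python sketch of the general pattern with made-up names (`Router`, `route`, `handle`); it is not Express's API and it skips everything a real framework does around requests and responses.

```python
from typing import Callable

Handler = Callable[[str], str]    # takes the request path, returns the response body


class Router:
    """Send each request path to the handler with the longest matching prefix."""

    def __init__(self) -> None:
        self._routes: dict[str, Handler] = {}

    def route(self, prefix: str) -> Callable[[Handler], Handler]:
        def register(handler: Handler) -> Handler:
            self._routes[prefix] = handler
            return handler
        return register

    def handle(self, path: str) -> str:
        matches = [p for p in self._routes if path.startswith(p)]
        if not matches:
            return "404 Not Found"
        return self._routes[max(matches, key=len)](path)


app = Router()


@app.route("/hello")
def hello(path: str) -> str:
    return "<p>Hello, generated at request time</p>"


@app.route("/static/")
def static_files(path: str) -> str:
    # A real static-file module would read the named file off disk here.
    name = path[len("/static/"):]
    return f"(contents of {name})"
```

Note that static files are just another handler here, which is the inversion described above: code first, files as a special case.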

A common pattern is to serve the dynamic files off one server and static files off another server, with each being specialized for its job. This is an especially attractive pattern if the static files are big and can be served off a fast content delivery network (CDN) which is optimized for that purpose. Of course, CDNs have now started to grow some capabilities to handle dynamic content in what's called edge computing.^[6]

Obviously, the server can do any kind of computation it wants to return answers, but there are a few major common types.

Templates #

Suppose you want to send a more-or-less static page but you want to customize it slightly. For instance, you might want to put the user's username in the upper right hand corner or add the number of times someone has viewed this page. You could of course generate the whole page from scratch on your server, but an easier way to do it is with a template. Briefly, a template is a file containing HTML but with markers that allow you to fill in variables. For instance, you might have:

<h1>Page title</h1>

This page has been viewed [[num-views]] times.

The [[num-views]] means "replace this string with the value of the num-views variable".^[7] The idea here is that the server has a template processor which is configured with a set of variables, in this case the number of views. The processor reads the template, finds the template variable markers, and replaces them with the corresponding values. There are a lot of different template languages, some more fancy than others, including handlebars, nunjucks, mustache, etc.
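A bare-bones template processor for the square-bracket syntax above fits in a few lines of Python. This is a sketch, with `render` and `MARKER` as invented names. Real template languages do much more; in particular they usually escape the values they insert, which this does not.

```python
import re

MARKER = re.compile(r"\[\[([\w-]+)\]\]")    # matches [[name]] and captures the name


def render(template: str, variables: dict) -> str:
    """Replace each [[name]] marker with the value of the corresponding variable."""
    return MARKER.sub(lambda m: str(variables[m.group(1)]), template)


template = "<h1>Page title</h1>\n\nThis page has been viewed [[num-views]] times."
page = render(template, {"num-views": 42})
# page ends with "This page has been viewed 42 times."
```

A marker with no matching variable raises a KeyError here; a real processor would need to decide what to do in that case.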

Full Result Generation #

Suppose that instead most of your page is dynamic, like a news site or a search engine result page. In that case, a template doesn't really help you that much. Instead, you probably just want to have your server assemble the whole page, piece by piece (though probably from fragments of HTML stored in the server software). This is basically the dual of templates: templates are HTML (or markdown) with embedded code. Page generation is code with embedded HTML.
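By way of contrast with a template, here's what "code with embedded HTML" might look like for a search results page. Again this is a Python sketch with an invented function name and made-up data; the `escape` calls are there because the query and titles come from outside the program.

```python
from html import escape


def results_page(query: str, results: list[tuple[str, str]]) -> str:
    """Build a whole results page in code; results are (title, url) pairs."""
    parts = [f"<h1>Results for {escape(query)}</h1>", "<ul>"]
    for title, url in results:
        parts.append(f'<li><a href="{escape(url)}">{escape(title)}</a></li>')
    parts.append("</ul>")
    return "\n".join(parts)


page = results_page(
    "cats & dogs",
    [("Educated Guesswork", "https://educatedguesswork.org/")],
)
```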

It's important to recognize that the precise method that the server uses to generate the page is largely invisible to the client: it could be a static file, a template, fully programmatic, or a mix of the above, with some pieces generated one way and some another. The Web just defines the protocol (i.e., the format of the page) and leaves the implementation to generate that protocol however it wants. This is a very important feature for allowing extensibility in the future.

Non-HTML Data Types #

Most of the text in this section sort of assumes that the server will be returning HTML, but of course HTTP is an extensible protocol and so you can transmit just about any content over HTTP. And because the server can do arbitrary computations, this means that it can return the results of those computations to the client. We'll see how that's useful in the next post.

Cross-Site Content #

If you were paying close attention before, you noticed that when you load an image on a Web site, you provide a URL where the browser can find the image. The same thing is true for other kinds of content, whether it's audio, video, CSS, or JavaScript. That makes sense, after all, because all that stuff was authored separately and you don't want to have all that stuff crammed into one giant file on your server. But who says that stuff has to be on your server? The content is being addressed by a URL and that URL can point anywhere, including some totally different Web server.

Take for instance, this image of the Dogefox logo:

Here's the HTML which loaded that:

<img src="https://i.redd.it/ldcju3p3w3x11.jpg" alt="DogeFox" width=400>

As you can see, the src attribute, indicating where the image comes from, doesn't go to this site at all. It's pointing to a resource on Reddit—but I was able to just load it into my site and unless you use the browser developer tools to look deeply, you wouldn't even notice. Importantly, the way that this works is that the browser connects directly to the site indicated in the URL; it doesn't go through the original server at all (thought experiment: what happens if the server decides to change the image?).
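Telling a cross-site load from a same-site one is just a matter of resolving the reference and comparing where it points. A rough Python sketch, with `is_cross_site` as an invented name; it compares host names only, which is cruder than the notion of "origin" the security model actually uses.

```python
from urllib.parse import urljoin, urlsplit


def is_cross_site(page_url: str, resource_ref: str) -> bool:
    """True if the resource is loaded from a different host than the page."""
    resource_url = urljoin(page_url, resource_ref)    # resolve relative references
    return urlsplit(resource_url).hostname != urlsplit(page_url).hostname


page_url = "https://educatedguesswork.org/posts/web-security-model-intro1/"
reddit_image = is_cross_site(page_url, "https://i.redd.it/ldcju3p3w3x11.jpg")   # True
local_image = is_cross_site(page_url, "/img/overall-web.svg")                   # False
```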

You can do this kind of cross-site loading with pretty much anything, including video, JavaScript and CSS. This, for instance, is how you embed YouTube videos in your site (you don't want to absorb the bandwidth costs, right?). The JavaScript thing is actually incredibly common because people often want to make use of JavaScript libraries but save bandwidth by not serving them off their own server (because, as above, the browser fetches them directly from the other site). Of course, now your Web site is incorporating an arbitrary program from someone else's server, so what could possibly go wrong?

This trick isn't limited to individual files either: you can actually load a whole Web page this way, like so:

<iframe src="https://educatedguesswork.org/posts/" width=800 height=400></iframe>

This fragment pulls the archive page of this site into a frame on the page, with scroll bars and everything:

This kind of mashup of cross-site content is one of the basic functions of the Web and the source of all kinds of powerful functions, good and bad, ranging from reusing open source content, to embedded maps and YouTube videos, to Facebook like buttons and online ads (with their associated tracking). It's an incredibly powerful feature and also one whose full implications weren't really understood at the time it was introduced, leading to some exciting moments down the road.

Next Up: Web Applications #

At this point, we have the makings of a very fancy Internet-scale publishing system, complete with cool styling, mashups, and even a local programming language for producing cool effects.^[8] But as I said at the top, the Web isn't just a publishing system, and some of the most important parts of the Web (Facebook, Gmail, Google Meet, Slack) act much more like applications than they do like online publishing. But even though they have a lot more going on than say, this site, they use basically the same primitives I've introduced here, just in a number of new and interesting ways (and with a number of exciting new security problems!). In the next (hopefully shorter) part of this series, I'll talk about how those work.

Yes, I'm quoting Blackadder ↩︎
The Web actually isn't the first or only such platform; PostScript and PDF documents are actually programs that run on your printer or your computer. This provides a much more flexible system than alternative designs like sending a static image to the printer. ↩︎
The astute reader will note that the registry here talks about URI rather than URL schemes, where the I stands for Identifier. URI is the generic term with URLs being the subset of URIs which have enough information to dereference them as opposed to just uniquely identifying something. ↩︎
It is, of course, possible to run other languages on the Web by first compiling them into JavaScript and then running the JavaScript. For instance, Emscripten is a tool that does this for C/C++ code. This works but is a bit clunky. Eventually, there was so much demand for this kind of thing that people designed a special "low-level" language called WebAssembly that browsers would run alongside JavaScript and that was more appropriate as a compilation target for other languages. ↩︎
Technically, this is a set of nodes arranged in a tree structure. So, for instance, you might have the root of the tree and then paragraphs as children and within each paragraph, hyperlinks, etc. ↩︎
In the context of graphics, this cycle of specialized optimizations followed by the optimized system becoming more generalized and then the generalized system undergoing further specialized optimizations is sometimes called the wheel of reincarnation (this name due to Ivan Sutherland) ↩︎
More commonly the markers are curly braces, but if I use curly braces here, the template processor which renders this site will try to process it, so I'm using square brackets. ↩︎
Basically, Xanadu but built out of duct tape and cardboard. ↩︎

Games, constraints, and the humanly possible

2022-02-26T00:00:00Z

On Friday's Ezra Klein show, Ezra interviews philosopher C. Thi Nguyen on the topic of games. Nguyen provides an interesting definition of a game (btw, thanks to the Times for providing transcripts so I didn't have to type all this in):

What’s interesting about games for him [Bernard Suits —EKR] is that you have this thing— the finish line—but it doesn’t count unless you did it under specified constraints. It doesn’t count unless you follow a particular path, unless you did it for a marathon on your own feet instead of a bicycle or a taxi. And the fact that the activity would lose its value if you didn’t do it in the specified, inefficient, constrained way, that, for Suits, points the way to what games really are.

And the way I think of them sometimes, after Suits, is that games are constraint-constituted activities. Does that make sense? That what it is to run a race is to do it inside a certain set of constraints. Like what it is to climb a rock in rock climbing is to do it with your hands and feet and not a jetpack, or a chain, or a helicopter. So whatever is valuable about games has to be in the fact that they’re constructed struggles.

There's a lot here that's true. To take the example of the marathon, it's rarely the case that running is the most efficient way to get from point A to point B. In fact, it's not even the most efficient way allowed in marathons. Many major races have a wheelchair division and the wheelchair athletes are much faster than the runners. For instance, in the 2021 Chicago Marathon, the men's winner came through in 2:06:12 and the men's wheelchair winner came through in 1:29:07.^[1] Moreover, plenty of marathons actually start and end in the same place (and don't even get me started about 100 mile ultras run on a quarter mile track).^[2]

It's interesting that Nguyen uses the example of rock climbing, as mountaineering and rock climbing are both sports that started out much less arbitrary than they are now and gradually became more arbitrary and rule bound. Mountain climbing is perhaps the purest example here: the tallest mountains are essentially inaccessible by any means other than actually climbing them on foot. It's just barely possible to fly a helicopter to the top of Everest, but as far as I know it's been done exactly once, so as a practical matter if you want to get to the top you have to walk up.

That doesn't mean that there aren't arbitrary rules, but the interesting thing is how they have grown over time. Initially, it was just a challenge to climb Everest at all and it took about 70 years of more-or-less serious attempts before Tenzing Norgay and Edmund Hillary's first ascent in 1953. At the time, this was an incredible achievement and people took any advantage they could get including supplemental oxygen, teams of porters, etc. After a while, though, techniques developed and the mountain was better understood and so people started to find ways to make it harder, for instance by climbing without supplemental oxygen (Reinhold Messner and Peter Habeler in 1978), solo, without oxygen (Messner again in 1980), alpine style, etc. Another complication is that there are different routes up mountains, some harder than others, so it might be a challenge to do a new route even if you've gotten to the top before. At this point, just getting to the summit by any means necessary is difficult but doable by ordinary people even without large amounts of mountaineering experience (see Krakauer's Into Thin Air for more on this).

The story is similar with rock climbing: the first ascents of a number of the big wall climbs like Half Dome or El Capitan were done "aided", which means that you use your protection (back in those days, this meant bolts and pitons) for support. Here too, initially it was a challenge just to get to the top,^[3] but after a while it became clear that if you were willing to spend enough time and drill enough bolts you could get up just about anything and so people started thinking about free climbing (using ropes for safety but not support) (El Capitan's Salathe Wall by Skinner and Piana in 1988 and The Nose by Lynn Hill in 1993), or free soloing (no rope) (Alex Honnold up Freerider in 2017). Here too, this is a story of technology (primarily sticky rubber shoes and better mechanisms for attaching your protection to rocks) and better technique.

Under the definition being offered by Nguyen—and as I understand it, Suits—when the first people went up Everest it wasn't a game, but as soon as it became relatively achievable by ordinary people and the challenge became to handicap yourself by doing it without oxygen, then it became a game. This might be right, but on the other hand it seems to me that Tenzing Norgay and Edmund Hillary's first ascent in 1953 and Messner's 1980 solo ascent without oxygen are a lot more similar than they are different in a way that Nguyen's definition tends to erase. You could of course respond that the original first ascent was a game—after all, isn't Everest arbitrary?—but then I think you've just redefined almost any challenge to be a game.

I think that the common thread between all these challenges is something Nguyen hints at later, which is that games can be designed to be just difficult enough that you can do them, but only barely:

But in games, because the game designer manipulates what you want to do and the abilities and the obstacles, the game designer can create harmonious action. They can create these possibilities where you’re— what you need to do— the obstacles you face and your abilities just match perfectly.

...

And in games, for once in your life, you know exactly what you’re doing and you know exactly that you can do it. And then you have just the right amount of ability to do it.

This feels a lot closer to me as a description of the essence of the kind of challenge that mountain climbing or running a two hour marathon presents, namely that they are at the very limit of human capability. When people first tried to climb Everest or El Capitan (or the moon!), nobody knew if it was possible, so the challenge was just to do it at all. But then once it was achieved, then the limit of capability shifted and people wanted something harder, which could either mean trying something harder like K2 (or Mars!) or adding new constraints to make it harder, like climbing without oxygen.

What I'm saying is that the core experience here is doing something that is just barely possible for you. Of course at some level, "something" is arbitrary and once you've run a marathon "just barely possible" can be "do it slightly faster" but humans like things that feel like natural anchor points even if they are ultimately arbitrary, hence the appeal of the 40 minute 10K or climbing 5.12 for the amateur or the four minute mile or 2 hour marathon for the professional.^[4] I think this is also behind the appeal of climbing without oxygen, in that it feels like a clear dividing line. From this angle, the nice thing about games is that the games designer gets to set the conditions so that they are at the right level, but those arbitrary tuning parameters are buried inside the rules of the game so that finishing the game becomes a concrete anchor that people can focus on.

Of course, this is all easier said than done, especially if you want everyone to do the same task. Human capabilities vary widely and a challenge that is just barely at the limit of someone's capabilities (say running 100 miles) is easy for others. This is something that Gary Cantrell, the creator of the Barkley Marathons, talks about, namely that it's easy to make a race that's so hard that nobody can do it, but what's hard is making a race that almost nobody can do. But of course that's exactly what makes people want to attempt it.

Conversely, race walking is a sport where you have to propel yourself on your own two legs, but you're not allowed to run. This is arguably harder than running because you're walking above the speed where the most efficient thing to do would be to run (around 5mph). ↩︎
As an aside, I'm happy to do long distance multi-day backpacking trips but I don't like day hiking. If I'm going to end up the same place I started, I'd just as soon run. ↩︎
I should mention at this point that unlike Everest, you can hike to the top of Half Dome and El Capitan, though the Half Dome hike depends on a set of cables put up by the park service. ↩︎
Note how each of these is tied to some set of basically arbitrary units of time, distance, or difficulty. Of course, there are challenges that aren't tied to some arbitrary number, like bench pressing your own weight. ↩︎

Risks (or non-risks) of scanning QR codes

2022-02-20T00:00:00Z

I did not watch the Super Bowl but it seems Coinbase bought a Super Bowl ad that consisted of a QR code floating around your screen. Honestly, I find it kind of soothing—not that I own any cryptocurrency—but the Internet got upset:

Scanning an unidentified QR code that bounces across your screen during the Super Bowl is like going around at the end of a party finishing all the half empty drinks. You can do it, but you'll regret it. And you'll get a lip fungus. But for your computer. It's a whole thing.

— Evan Greer (@evan_greer) 2022-02-13

I am once again reminding you that scanning random QR codes is upsettingly close to plugging a random flash drive you found into your laptop.

Do not do the thing.

— Techni-Calli (@iwillleavenow) 2022-02-13

5 years from now, news will come out that Coinbase’s QR code was the source of the biggest data breach in US history.

— Aaron Parnass (@AaronParnass) 2022-02-13

See also this longer writeup on the topic by Iam Waqas that predates the Super Bowl, this SecureWorld post, etc.

I wasn't planning on clicking on that QR code, but I'm also rather less worried about it than others. This post explains why, but first we need to have a clear sense of what's going on. As I explained earlier, a QR code is just a way of encoding digital information. The QR reader on your device then decodes the QR code into a string of bytes and tries to figure out what to do with those bytes. Interestingly, there doesn't seem to be any really standardized meta-information telling you what the type of the data is, so typically your device tries to infer it from the first bytes. For instance, if the data starts with http:// or https:// then it's presumably a Web address (the technical term here is a URL). But it really could be any data and hopefully your device infers what it is correctly.
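To make the prefix-sniffing idea concrete, here is a minimal Python sketch. The function name `classify_qr_payload` and the prefix table are illustrative assumptions, not the logic of any particular reader; real readers recognize more formats and differ from one another.

```python
def classify_qr_payload(data: bytes) -> str:
    """Guess what a decoded QR payload is from its leading bytes."""
    # Hypothetical prefix table, for illustration only.
    prefixes = [
        (b"http://", "url"),
        (b"https://", "url"),
        (b"mailto:", "email"),
        (b"tel:", "phone"),
    ]
    lowered = data.lower()  # URI schemes are case-insensitive
    for prefix, kind in prefixes:
        if lowered.startswith(prefix):
            return kind
    # Nothing matched: treat the payload as plain text to display.
    return "text"
```

The important property is the fallback: anything the reader doesn't recognize is just shown as text rather than acted on.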

This situation presents a number of potential security risks (see here for discussion of the privacy risks).

Remote Compromise #

Probably the attack that most people have in mind when they think of the potential dangers of QR codes is that scanning one will result in your computer being compromised. From Iam Waqas in IEEE Computer:

Cybercriminals might embed malicious URLs in publicly present QR codes so that anyone who scans them gets infected by malware. At times merely visiting the website might trigger the downloading of malware silently in the background. Apart from that, they might also send phishing emails containing QR codes that again infect the user’s device with malware when scanned.

This is of course possible, but I don't think a QR code presents a particularly high risk compared to the usual risks you take. At a high level, the QR code could result in your computer being compromised in three basic ways:

(1) The QR code could take you to a Web site that attacks your browser.
(2) The QR code could take you to a Web site that prompts you to install some malicious software.
(3) The QR code reader on your computer/device could itself have a vulnerability that enables compromise.

Let's put (2) aside here for a minute, because while it's a real attack, it really belongs with phishing, which I cover below; this leaves us with attacking the QR code reader and malicious Web sites.

The QR Code Reader #

It's certainly not out of the question that the QR code reader—whether the one built into the device or the one in your browser—could have some kind of vulnerability, as bugs in image processing code are reasonably common. For example, NSO's iMessage exploit took advantage of a vulnerability in the iOS PDF reader (see this excellent writeup by Ian Beer and Samuel Groß of Google Project Zero). With that said, a vulnerability like this in the QR code reader would be pretty serious, given that people scan untrusted QR codes all the time and aren't going to stop.

This isn't to say they don't exist: this article describes what seem like some legitimate memory vulnerabilities in the Android QR code reader back in 2015.^[1] As far as I can tell, the last serious vulnerability in a QR code reader on a major device operating system was actually in the URL parser in iOS 11. This isn't good but shouldn't lead to device compromise.

Note that these comments mostly apply to the QR code reader that is built into your device or your browser. I generally would not assume that a random QR code reader app is safe to use to read arbitrary QR codes. And of course in at least one case a QR code scanner contained malware itself. However, it would be a big deal if the QR code reader built into your phone OS were insecure.

Compromise of the Browser #

This brings me to the second major avenue for remote compromise: the browser. In this case what's happening is that the QR code contains the address of some Web site and reading the QR code navigates your browser to that site, and presumably that site would then attack your computer. This situation isn't conceptually any different from you just typing in the site address yourself: the end result is you end up at a specific Web site that was indicated by the QR code.

One point that is often made in this situation is that it's hard to know what Web site you will end up at because the QR code is unreadable by humans. This is true, but, I think, largely misplaced, for three reasons:

It's common for QR code readers to show you the URL they are going to, so it's not opaque. Indeed, the iOS exploit I mentioned above was designed to circumvent that feature.
You don't need a QR code to send someone an opaque URL: You can just use a URL shortener like bit.ly.
Going to arbitrary URLs shouldn't be a problem anyway.

The first two of these reasons should be straightforward, but the last needs some unpacking. The point here is that it's the browser's job to protect you even from malicious sites (indeed, especially from malicious sites). In fact, in a paper with Lin-Shung Huang, Eric Chen, Adam Barth, and Collin Jackson, we described it as the "core security guarantee" of the Web: users can safely visit arbitrary web sites and execute scripts provided by those sites. The browser does this by isolating the content provided by the site so that it (hopefully) can't endanger your computer. Of course, browsers do have vulnerabilities that can result in remote compromise, but these are very serious defects that are worth real money: a remote compromise of a live web browser is worth $100K or more. If you have such a vulnerability, there are probably better things to do with it than hack random Super Bowl watchers, especially given that that's hardly an anonymous or stealthy way to deliver your payload.

Even if we assume that you have a zero-day like this and you're willing to waste it in an attack on basically random people, there are easier ways to accomplish that. For instance, you could serve up your attack via a Web advertising campaign; this would even let you target your victims to some extent, especially if you're willing to pay. Indeed, it's precisely because it's so easy to get a large number of people to load content from your site^[2] that it's so important that browsers be safe when run against arbitrary sites.

Phishing #

Probably the more serious risk here is phishing. As with any phishing attack, phishing via QR code relies on you thinking that you are going to a site operated by someone legitimate when it's actually operated by the attacker. How serious this attack turns out to be depends on how much you trusted the person you thought you were connecting to in the first place.

In this case, for instance, you're theoretically connecting to Coinbase and the attacker might try to prompt you for your credit card and banking information or, if you're a Coinbase customer, for your Coinbase credentials (use a password manager, people). Obviously, you need to be careful here, but again, the situation isn't any different than if the attacker had provided a short URL; in both cases you enter something opaque and you end up at a site with a domain you may or may not recognize. Or, for that matter, the attacker might send you to a domain that looks plausible but is not run by who you think it is. For example, http://coinba.se^[3] does not take you where you might expect.
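The lookalike-domain problem is easy to state in code. The sketch below checks whether a URL's host actually belongs to the domain you expected; the helper name `is_expected_site` is hypothetical, and this is only roughly the kind of matching a password manager does before offering to fill in credentials, not a description of any specific product.

```python
from urllib.parse import urlsplit

def is_expected_site(url: str, expected_domain: str) -> bool:
    """True if the URL's host is expected_domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    expected = expected_domain.lower()
    # Require an exact match or a match on a label boundary, so that
    # "coinbase.com.evil.example" and "notcoinbase.com" are both rejected.
    return host == expected or host.endswith("." + expected)
```

For instance, `is_expected_site("http://coinba.se", "coinbase.com")` is false even though the two names look related to a human in a hurry.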

One interesting recent example of QR-code based phishing attacks is phishers putting fake QR codes on parking meters. The victim thinks they are paying to park but really they are paying the scammer. This attack seems like it's slightly facilitated by QR codes but mostly it's facilitated by using your phone to pay a parking meter. It's not as if the actual site you go to pay for parking necessarily has a particularly credible looking name anyway, so it's not clear how much better the situation would be if you had to type in a URL rather than scan a QR code (though obviously it would be less convenient).

Browsers do try to protect users from this kind of attack using blocklists like Safe Browsing. This actually seems like a case where blocklist techniques are likely to be fairly effective because the time scale of attack is fairly long—the stickers take a long time to deploy and people are fooled over a period of days—which gives the blocklist provider time to detect the attack and mitigate it. By contrast, ordinary phishing attacks (e.g., by email) can use short-lived domains and so be hard to block before they do damage.

Consider the Source #

The final reason I'm not too worried about the Super Bowl ad per se is that it's expensive and easily attributable. A 30 second Super Bowl ad cost as much as $7 million, so you'd have to be a pretty dedicated attacker to use that airtime to deploy your malicious QR code. Moreover, it's hard to buy that kind of thing anonymously, so when people inevitably discover that the QR code is malicious, the attacker is likely to be looking at some pretty serious law enforcement action.

I've seen it suggested that a more interesting threat vector is reposts on YouTube and the like:

"The real risk in this situation is if someone edits the commercial and adds a malicious QR code to it, especially on social media platforms.

People will repost Super Bowl ads for weeks after the game itself, so an attacker could easily change the QR code. The ad could be reposted across social media apps and crypto forums to get people to visit a malicious webpage. That page could be a fake Coinbase login site. If this was a success, the victim could end up having their entire account drained. Attackers could also build that page to deliver a trojanized version of a crypto app."

This does seem like a potential risk, though hopefully most of the major venues for finding the Coinbase ad will actually get the right QR code. Here too, time is on your side and so even if someone does post a fake YouTube video, hopefully YouTube would be able to take it down fairly quickly.

I'm not saying that you should trust that a random QR code that claims to be for your bank actually is legitimate any more than you should trust a random email that claims to be from your bank. However, this just doesn't seem like a particularly efficient mechanism for attack delivery. The parking meter case is interesting precisely because (1) the user may have no real previous association with the service provider and so it's hard for them to know if it's legitimate and (2) the user already has an intent to pay—and is probably in a hurry—so even a very small success rate is likely to be worth the effort of going around sticking stickers on parking meters. The situation for Super Bowl ads seems pretty different.

Summary #

I'm open to being wrong here, but from what I've seen so far, I'm just not that concerned about this particular threat. However, even if you disagree with me, we have to deal with the fact that users probably aren't going to stop scanning QR codes whatever we tell them; it's up to operating system and browser vendors to make that as safe as we can and/or to offer alternatives that are safer and equally convenient.

This writeup also describes some attacks where you insert JS in the QR code and it gets executed by the client. Those attacks seem to rely on the QR code data being treated as a file:// URL and same origin to other file:// URLs, which is something that browsers are moving away from. ↩︎
Note that in the ads case they'll be loading that data in an IFRAME, but this probably won't make a difference to attack effectiveness. ↩︎
There is no HTTPS version. ↩︎

Overview of Interoperable Private Attribution

2022-02-15T00:00:00Z

Note: this post contains a bunch of LaTeX math notation rendered in MathJax, but it doesn't show up right in the newsletter version. You should mostly be able to follow along anyway except for the "Technical Details" section and the Appendix (which is part of why it's an appendix) so you may want to instead read the version on the site.

Erik Taubeneck (Meta), Ben Savage (Meta), and Martin Thomson (Mozilla) recently published a new technique for measuring the effectiveness of online ads called Interoperable Private Attribution (IPA). This has received a fair amount of attention—including some not so positive comments on Hacker News. I've written before about how to use a variant of this technology to measure vaccine doses, but I thought it would be useful to walk through how IPA works in its intended setting.

Attribution and Conversion Measurement #

For obvious reasons, advertisers and publishers want to know how effective their ads are. The basic tool for this is what's called "attribution" or "conversion measurement". Suppose I see an ad for a product on a news site and click on it, taking me to the merchant, where I subsequently make a purchase. This is called a conversion, and advertisers want to know which ads convert—and how often—and which ones do not.

At the moment, conversion measurement is mostly done with cookies, as shown in the figure below:

Let's walk through this in pieces. First, the client visits the publisher site. The publisher serves the client a Web page with an IFRAME from the advertiser^[1] (reminder: an IFRAME is an HTML element that allows one Web page to display inside another Web page, even when the two come from different sites). When the advertiser sends the page, it also sends a tracking cookie to the client, in this case 1234.

The user views the ad (an impression) and clicks through, which takes them to the merchant. In this case, they just make an immediate purchase, but they might also shop around on the site or even go away and come back later. Eventually, the user makes a purchase ("converts"). When the merchant sends the confirmation page it includes a tracking pixel (an invisible image) served off of the advertiser's site. When the browser retrieves the pixel, it sends the advertiser's cookie (1234) back to the advertiser. The cookie allows the advertiser to connect the original click and the resulting purchase, thus measuring the conversion.
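The cookie flow above boils down to a join on the cookie value. Here is a minimal Python sketch of the advertiser's side; the class and method names (`AdvertiserLog`, `serve_ad`, `serve_pixel`) and the example site names are made up for illustration and don't correspond to any real ad server.

```python
from collections import defaultdict

class AdvertiserLog:
    """Toy model of an advertiser linking impressions to purchases by cookie."""

    def __init__(self):
        self.impressions = defaultdict(list)  # cookie -> publisher sites seen
        self.conversions = []                 # (cookie, publisher, merchant)

    def serve_ad(self, cookie, publisher):
        # Called when the ad IFRAME loads: the browser sends the cookie and
        # the advertiser learns which publisher page the user is on.
        self.impressions[cookie].append(publisher)

    def serve_pixel(self, cookie, merchant):
        # Called when the tracking pixel on the confirmation page loads:
        # the same cookie comes back, so earlier impressions can be linked.
        for publisher in self.impressions.get(cookie, []):
            self.conversions.append((cookie, publisher, merchant))

log = AdvertiserLog()
log.serve_ad("1234", "news.example")
log.serve_pixel("1234", "shop.example")
```

Note that `impressions` is exactly the cross-site browsing record that makes this design a privacy problem.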

You'll note that what's technically being measured in this example is the conversion from the impression to the purchase. If you wanted to measure the click instead, there are a number of ways to do this, such as having the ad click redirect through the advertiser or having a Javascript hook that informed the advertiser of the click.

The problem with this technique is that it involves the advertiser tracking you across the Internet: it sees which Web site you are on every time it shows you an ad, and for a big ad network this can be a pretty appreciable fraction of your browsing history. This is a serious privacy problem and browsers are gradually deploying techniques to prevent this kind of tracking, such as Firefox's Enhanced Tracking Protection and Safari's Intelligent Tracking Protection. Those technologies are good for user privacy but interfere with conversion measurement. IPA is a mechanism designed to provide conversion measurement without degrading user privacy.

The Basic Idea #

The main idea behind IPA is to replace cookie-based linkage with linkage based on an anonymous identifier. Let's assume that each client $i$ has a single unique identifier $I_i$ (I'll discuss how this identifier is assigned below). This identifier can't be read directly off the client but instead has to be accessed via an API e.g., getIPAEvent() that produces an encrypted version of the identifier $E(I_i)$. The encryption is randomized so that each time the identifier is encrypted, the ciphertext is different, preventing linkage of the encrypted identifiers. To represent that, we use the notation $E(R_j, I_i)$ where $R_j$ is the randomizing value. Two encrypted values $E(R_j, I_i)$ and $E(R_{j'}, I_{i'})$ will with high probability be different unless both the identifier and the randomizer are the same. However, by use of an appropriate service they can be decrypted and matched up.
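To illustrate the $E(R_j, I_i)$ notation, here is a toy randomized encryption in Python. I'm assuming ElGamal over a tiny prime-order group purely for illustration; the paragraph above doesn't name a cipher, the parameters are far too small to be secure, and the function names are hypothetical.

```python
import secrets

P = 2039  # toy safe prime, P = 2*Q + 1 (illustration only, not secure)
Q = 1019  # prime order of the subgroup generated by G
G = 4

def keygen():
    """Return (secret_key, public_key)."""
    sk = secrets.randbelow(Q - 1) + 1
    return sk, pow(G, sk, P)

def encrypt(pk, identifier, r=None):
    """E(r, identifier): a fresh randomizer r gives a fresh ciphertext."""
    if r is None:
        r = secrets.randbelow(Q - 1) + 1
    m = pow(G, identifier, P)  # encode the identifier as a group element
    return pow(G, r, P), (m * pow(pk, r, P)) % P

def decrypt(sk, ciphertext):
    """Recover the encoded identifier (a group element)."""
    c1, c2 = ciphertext
    # c1 has order Q, so c1^(Q - sk) is the inverse of c1^sk.
    return (c2 * pow(c1, Q - sk, P)) % P

sk, pk = keygen()
first = encrypt(pk, 42, r=5)
second = encrypt(pk, 42, r=7)
```

Here `first` and `second` are different ciphertexts for the same identifier, but both decrypt to the same encoded value, which is what lets a party holding the key match them up.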

If we go back to the conversion scenario described above, but instead use IPA, it would look like this:

Everything is the same up to the point where the ad is displayed, except that along with the ad the advertiser also sends some Javascript code that calls getIPAEvent()^[2]. The browser responds by providing an encrypted version of the identifier, with random value $R_1$: $E(R_1, I_i)$. The advertiser just stores this information on a list of the impressions for this ad (note that as before we are measuring impressions).

When the user actually buys the product, the merchant calls getIPAEvent() and gets a new encrypted version of the identifier, this time with a different randomizer, $R_2$: $E(R_2, I_i)$. The merchant sends the encrypted value it receives to the advertiser. However, even though the identifiers are the same, because the randomizers are different, the encrypted values are different, thus preventing either the advertiser or the merchant from linking them. The only thing that the advertiser knows is that there has been one impression (because it saw it directly) and one purchase (because the merchant told it about it). It's important to note that this is all information that the merchant and the ad server knew already: the only secret information is the identifier and that's encrypted. In order to decrypt it and match up these events, you need to use the IPA decryption and blinding service.

The basic idea behind the service is that the advertiser (or merchant) has a set of encrypted identifiers that it sends to the service and the service returns information about the number of matches. So, for instance, you might send in 20 encrypted identifiers and get back something like:

Type	Count
Unmatched impressions	2
Unmatched purchases	3
Impression/purchase pairs	6
Two impressions/one purchase	1

Note: it's important that the IPA service only operate on batches of reports and produce aggregate reports about the batch; otherwise the advertiser could just send in small numbers of reports at a time. More on this below.
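As a rough sketch of the aggregation step, the Python below turns a batch of already-matched events into counts like the ones in the table. The function name, the event format, and the `MIN_BATCH` threshold are all assumptions made for the example; the text above gives no specific batch size.

```python
from collections import Counter

MIN_BATCH = 10  # hypothetical threshold, not a value from IPA

def aggregate_report(events, min_batch=MIN_BATCH):
    """Summarize (blinded_id, kind) events, kind in {"impression", "purchase"}.

    Returns a Counter mapping (impressions, purchases) per identifier to
    the number of identifiers with that pattern.
    """
    events = list(events)
    if len(events) < min_batch:
        # Refuse tiny batches, which would reveal individual matches.
        raise ValueError("batch too small to report on")
    per_id = {}
    for blinded_id, kind in events:
        per_id.setdefault(blinded_id, Counter())[kind] += 1
    report = Counter()
    for counts in per_id.values():
        report[(counts["impression"], counts["purchase"])] += 1
    return report

# A 20-event batch shaped like the table above.
events = (
    [(f"u{i}", "impression") for i in range(2)]
    + [(f"p{i}", "purchase") for i in range(3)]
    + [(f"m{i}", kind) for i in range(6) for kind in ("impression", "purchase")]
    + [("d0", "impression"), ("d0", "impression"), ("d0", "purchase")]
)
report = aggregate_report(events)
```

In `report`, the key `(1, 0)` counts unmatched impressions, `(0, 1)` unmatched purchases, `(1, 1)` impression/purchase pairs, and `(2, 1)` the two-impressions/one-purchase case.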

Internally, the service works by having a pair of servers which cooperate to decrypt and blind the input values. The advertiser (or merchant) sends its values to the first server, which decrypts, blinds, and shuffles them, and then passes them on to the second server, which does the same thing, as shown in the diagram below (I've used a different color for each identifier to help make it easier to follow).

In this example, the advertiser has two encrypted impressions and two encrypted purchases (it knows which are which because that information was available when the API was called, so it can just label them). One of the impressions and one of the purchases line up but it doesn't know that. It passes all of its data in a batch to the first server of the IPA service (A) which partially decrypts them, blinds them with its secret, and then passes them to server B. Server B decrypts them the rest of the way and applies its own blinding key. At this point server B has a list of blinded identifiers labeled with whether they were impressions or purchases. Because the blinding keys are constant, each time identifier $I_1$ is blinded, the blinded values are the same, and so it can match up the impression and purchase for $I_1$ (both shown in blue). However, because the values are blinded, it can't match them up to the input reports. Given this information, the server can then produce a report to the advertiser to the effect that there was one pair, one unmatched impression and one unmatched purchase.
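The two-server flow can be sketched in Python. This is a toy model under my own assumptions: ElGamal with the decryption key split between the two servers, blinding by exponentiation, and a group far too small to be secure. The class and function names are hypothetical, and `random.shuffle` stands in for a proper cryptographic shuffle.

```python
import random
import secrets

P, Q, G = 2039, 1019, 4  # toy group: G generates the order-Q subgroup mod P

def rand_exp():
    return secrets.randbelow(Q - 1) + 1

class Server:
    def __init__(self):
        self.decrypt_key = rand_exp()  # this server's share of the key
        self.blind_key = rand_exp()    # constant blinding exponent
        self.public_share = pow(G, self.decrypt_key, P)

    def process(self, batch):
        out = []
        for (c1, c2), kind in batch:
            # Strip this server's encryption layer...
            c2 = (c2 * pow(c1, Q - self.decrypt_key, P)) % P
            # ...then blind both components with the constant exponent.
            c1, c2 = pow(c1, self.blind_key, P), pow(c2, self.blind_key, P)
            out.append(((c1, c2), kind))
        random.shuffle(out)  # hide which output came from which input
        return out

def encrypt(public_key, identifier):
    r = rand_exp()
    m = pow(G, identifier, P)  # encode the identifier as a group element
    return pow(G, r, P), (m * pow(public_key, r, P)) % P

a, b = Server(), Server()
pk = (a.public_share * b.public_share) % P
batch = [
    (encrypt(pk, 1), "impression"), (encrypt(pk, 1), "purchase"),
    (encrypt(pk, 2), "impression"), (encrypt(pk, 3), "purchase"),
]
blinded = b.process(a.process(batch))

# After server B, the second component is the blinded identifier.
matches = {}
for (_, blinded_id), kind in blinded:
    matches.setdefault(blinded_id, []).append(kind)
```

With this batch, `matches` ends up with three blinded identifiers: one with both an impression and a purchase, one with only an impression, and one with only a purchase. None of the blinded values equals the original identifier encoding unless the product of the blinding keys happens to be 1.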

Multi-Device #

One of the main requirements for the design of IPA is that it allow for linking activity across multiple devices. For instance, I might see an ad on my mobile device but make the purchase on my desktop machine. Obviously, advertisers and publishers want to be able to measure the impact of their ads. With the current cookie-based system it's possible under some circumstances to associate those events. For instance, if Facebook is displaying the ad and you're logged into Facebook, then your Facebook account ID can be used to link them up. A number of the proposed private conversion measurement systems (e.g., Apple's Private Click Measurement) do not allow for this use case, which is clearly a big part of Meta's motivation for proposing IPA, as a lot of their usage is on mobile.

IPA handles this case in a straightforward fashion, via the per-client identifier. Earlier I just assumed that each client $i$ had an identifier $I_i$ but didn't say how it was assigned. If instead, we arrange that each user has the same identifier across all of their devices, then IPA just naturally links up impressions on device A and device B without any extra work.

This of course reduces to the problem of how to get a per-user identifier synchronized across devices. One obvious approach would be to have the devices synchronize it, much as browsers can sync history across devices. However, there are a number of cases where this won't work, for instance if you use Chrome on your Android device and Firefox on your desktop,^[3] or if the impression came from something other than a browser like an app or a smart TV (I'm no happier than you are about ads on my smart TV, let alone having their conversion measured).

IPA addresses this issue in a clever but counterintuitive fashion: it allows any domain (e.g., example.com or more likely facebook.com) to set a per-domain identifier (which IPA calls a "match key") that can be used by any domain. The idea here is that when you log into some system (e.g., Facebook), it sets an identifier that is tied to your account and is therefore the same across all your devices. The identifier can be used by any advertiser or merchant (via the getIPAEvent() API), no matter which domain they are on, thus preventing Facebook from being the only people who can do attribution via the Facebook account.

Key to making this work is that the identifier is write-only: nobody—including the original domain—can access it, except by using the API, which of course only produces an unlinkable, encrypted value. This prevents the identifier from being used directly for tracking, as would otherwise be the case for a world-readable value. In fact, you can't even ask whether the identifier was set, because then it would leak one bit. Of course, the original domain knows the identifier for a given user (because it generated it) and it can set a cookie on the client to remember if it set the identifier, but if the cookie is deleted, then it doesn't know either.

IPA Technical Details #

This section provides technical details on how the IPA service works. I've tried to keep them accessible, and they can be understood with high school math,^[4] but if you don't care about the details (or you already waded through this in my post on linking up vaccine doses) you can skip this section and still be fine.

Note: in ordinary integer math, given $g^a$ and $g$ it's easy to compute $a$ but we're going to be doing this in an elliptic curve where that computation is hard. Everything else is pretty much the same, but just remember that part.^[5]

The service is implemented by having a pair of servers, $A$ and $B$. Each has a Diffie-Hellman key pair, which is to say a secret value $x$ and a public value computed as $g^x$. We'll call $A$'s key pair $(a, g^a)$ and $B$'s pair $(b, g^b)$. Each server also has a secret blinding key, $K_a$ and $K_b$ respectively. These servers are operated by different entities who are trusted not to collude. However, if either server behaves correctly then you're OK. The service then publishes a combined public key $g^{a+b}$ which can be computed by multiplying the public keys: $g^a * g^b$ (if you remember your high school math!).

In order to submit an ID $I$, the sender first encrypts it. It generates a random secret $x$ and computes: $g^{x(a+b)} = {(g^{a+b})}^x$. Note that we're using the service combined public key and the sender's private value $x$, so the result is a secret from attackers who don't know either $x$ or $a+b$. It then multiplies $I$ by this value and sends the pair of values (this is just classic ElGamal Encryption, but to the key $g^{a+b}$):

$$g^x, I * g^{x(a+b)}$$

Importantly, this second term can be broken up into a part involving only $a$ and a part involving only $b$. I.e.,

$$I * g^{x(a+b)} = I * g^{xa} * g^{xb}$$

Again, this is just high school math. These values then get sent to $A$ (or $B$, it doesn't matter), who computes $g^{xa} = {(g^{x})}^a$ (recall it knows $a$). It then divides the second part by $g^{xa}$:

$$I *g^{xb} = \frac{I * \cancel{g^{xa}} * g^{xb}}{\cancel{g^{xa}}}$$

This cancels out the $g^{xa}$ term, leaving you with just a term that involves $b$, and thus the pair:

$$g^x, I * g^{xb}$$

$A$ then blinds this value, by exponentiating both values to $K_a$, giving:

$$(g^x)^{K_a}, (I * g^{xb})^{K_a}$$

We can flatten this out to give:

$$g^{x * K_a}, I^{K_a} * g^{(xb)(K_a)}$$

$A$ batches these values up with other inputs it has received, shuffles them, and sends them to $B$. $B$ takes the first term and computes $(g^{x * K_a})^b = g^{x * K_a * b} = g^{(xb)(K_a)}$. It then divides the second term by this value, to get:

$$I^{K_a} = \frac{I^{K_a} * \cancel{g^{(xb)(K_a)}}}{\cancel{g^{(xb)(K_a)}}}$$

Finally, $B$ blinds the value by taking it to the power $K_b$, thus giving us:

$$I^{(K_a)(K_b)} = (I^{K_a})^{K_b}$$

That was a lot of math, but the bottom line is that the actual identifier $I$ (e.g., the ~~SSN -- Updated 2022-02-16~~ account id) has been converted into a new blinded value, with (hopefully) the following properties:

Neither $A$ nor $B$ ever saw $I$
$A$ sees the input encrypted version but doesn't learn the blinded version.
$B$ sees the blinded version but doesn't learn the encrypted version.
You need to know $K_a$ and $K_b$ to compute the blinded version of $I$.
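If you'd rather see the whole pipeline run than follow the algebra, here is a minimal Python sketch of it. It makes some loud assumptions: it works in the integers modulo a toy prime `P` with base `G` rather than in an elliptic curve, it skips the shuffle, and the function names are mine. It is meant to mirror the equations above, not to be secure or to match the actual IPA implementation.

```python
import secrets

P = 2**127 - 1  # toy prime modulus (assumption); IPA would use an elliptic curve
G = 3           # toy base element (assumption)

def random_exponent():
    return secrets.randbelow(P - 2) + 1

def encrypt(identifier, combined_pub):
    """ElGamal-encrypt an identifier to the combined key g^(a+b)."""
    x = random_exponent()
    return pow(G, x, P), identifier * pow(combined_pub, x, P) % P

def server_a(ciphertext, a, k_a):
    """Strip the g^(xa) factor, then blind both halves with K_a."""
    c1, c2 = ciphertext
    c2 = c2 * pow(pow(c1, a, P), -1, P) % P    # leaves I * g^(xb)
    return pow(c1, k_a, P), pow(c2, k_a, P)

def server_b(ciphertext, b, k_b):
    """Strip the g^(x * K_a * b) factor, then blind with K_b."""
    c1, c2 = ciphertext
    i_ka = c2 * pow(pow(c1, b, P), -1, P) % P  # leaves I^(K_a)
    return pow(i_ka, k_b, P)                   # I^(K_a * K_b)

a, b, k_a, k_b = (random_exponent() for _ in range(4))
combined_pub = pow(G, a, P) * pow(G, b, P) % P  # g^a * g^b = g^(a+b)

identifier = 123456789  # hypothetical match key
first = encrypt(identifier, combined_pub)
second = encrypt(identifier, combined_pub)
blinded_first = server_b(server_a(first, a, k_a), b, k_b)
blinded_second = server_b(server_a(second, a, k_a), b, k_b)
```

The two encryptions of the same identifier use different random values of $x$, so they look unrelated on the way in, but both come out the far end as the same blinded value $I^{(K_a)(K_b)}$, which is what lets $B$ match them.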

Disclaimer: The IPA documents were just published recently, so I don't think they have seen enough analysis to prove they are secure. Here I'm just describing how it's supposed to work.

Privacy Properties #

The two basic privacy properties we are trying to achieve here are:

Neither the advertiser nor the merchant is able to associate a specific input report to a specific output report, even with the help of one of the servers (because you need both $K_a$ and $K_b$). This is true even if they also know the identifiers, which are not even required to be high entropy (e.g., they can be e-mail addresses).
Neither the advertiser nor the merchant is able to determine which users are represented in a given set of reports or are associated with a given piece of additional data (see below).

As far as I know, no attacks on property (1) are known (though see the above caveat about insufficient analysis) but we do know of an attack on property (2) (see the appendix). The basic situation is that the advertiser can collude with whoever issued the match keys and with one of the servers to determine if a given user is incorporated in a set of reports. However, if both servers are honest, this attack will not work. This is not the desired privacy target, which is that you only have to trust that at least one server is honest, but it's where things currently stand.

In any case, the second server learns more than the first server because it knows which reports match up with which other reports. However, it still doesn't know which ones match up to which input reports because it doesn't know $K_a$. This is still a somewhat weird asymmetry, and when we look at additional data in the next section, we'll remove it.

Importantly, the summaries that are provided to the advertiser can still leak data. For instance, suppose that the advertiser wants to know if impression A and purchase B are from the same user: it can send them in together with a bunch of fake reports which have random non-matching identifiers. If the report that comes back lists any matches, then it knows that A and B match. This is a generalized problem in any aggregate reporting system which I covered in some detail previously and there are a variety of potential defenses, including trying to ensure that data comes from "valid" clients and adding noise to the output. The IPA proposal contemplates some kind of noise injection along with budgeting for the number of queries but doesn't really include a complete design.

Although this system provides a fair degree of privacy if you trust the servers, there will of course be people who don't trust them, or just don't want to send their data on principle. One question I've seen asked is whether it will be possible to configure your software not to participate. However, from a privacy perspective, it's actually undesirable to have the API call just fail because then you have sent some information to the server that might be used to track you (as most people will not disable the API). A better approach technically is just to send an unusable report, e.g., the encryption of a randomly selected ID. This should not be possible to distinguish from a valid report without the cooperation of both servers and knowing what valid identifiers look like. Obviously, whether there is such a configuration knob depends on the software you are using.

Additional Data #

So far the system we have described just lets us count matches, but what if we want to record more than matches, for instance by measuring the total amount of money spent by customers via a given ad campaign? This turns out to be a somewhat tricky problem to solve because we need to make sure that that information doesn't turn into a mechanism for tracking reports through the system.

For instance, in the diagram above, I had the advertiser label each report as either an impression or a purchase; this is mostly fine as long as we only have those two labels because if there are a reasonable number of each you don't know much about whether a given output and a given input match up. However, if we let the advertiser attach arbitrary labels, this would obviously be a problem because then they could collude with one of the servers to track a given input through the process (this is of course the same reason you have to shuffle). Naively, suppose that the merchant adds the customer's email address to the report, then obviously if that pops out the other end then you have a real problem.

IPA doesn't contain a complete proposal for this, but does have some handwaving. The general idea is that the client, not the advertiser or merchant would attach "additional data" (the cute name for this is a "sidecar") to their report. This data would be supplied by the server which would say something like "make a report that says that this purchase was for 100 dollars". This additional data would also be multiply encrypted so that neither server could individually decrypt it, but that once it had been shuffled, the second server would get it along with the blinded identifier. Note that this additional data would not be blinded because otherwise you wouldn't be able to add up the results; it just appears unmodified in the output.

But wait, you say, if we just let the advertiser provide arbitrary data, then it can provide a user identifier of its own which will then show up in the output and we're back where we started. The proposed fix is that instead of just reporting the value directly, the client instead reports it via some secret-sharing mechanism like Prio. Of course, this means that the client actually has to submit two reports, one that is processed by server A then server B and one that is processed by server B then server A, as shown below:

As shown here, the client generates two reports, each of which contains a Prio share for the value provided by the advertiser. When the advertiser is ready, it sends one report share to Server A and one report share to Server B. In this case, I've shown reports from two clients, each with one share. As described above, each server partly decrypts its reports, shuffles, and then passes it to the other server. The other server completes the decryption, correlates the matching reports, and aggregates (e.g., adds up) the additional data.^[6] Finally, Server A sends its aggregated additional data to Server B which combines it with its aggregated additional data and sends the result back to the advertiser (see my post on Prio for more details on how this part of the process works).
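The secret-sharing part can be illustrated with plain additive shares. This sketch assumes a toy modulus and leaves out everything else (the encryption, the shuffle, the matching, and the Prio validity proofs); it is only meant to show why each server's share is useless on its own while the two aggregates still add up to the right total.

```python
import secrets

MODULUS = 2**61 - 1  # assumed field size for this sketch

def share(value):
    """Split a value into two additive shares, one per server."""
    share_a = secrets.randbelow(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b

# Hypothetical purchase amounts (in dollars) from three matched reports.
purchases = [100, 250, 40]
shares = [share(v) for v in purchases]

# Each server adds up only the shares it holds; each sum looks random by itself.
total_a = sum(s[0] for s in shares) % MODULUS
total_b = sum(s[1] for s in shares) % MODULUS

# Combining the two aggregates recovers the total, but no individual value.
total = (total_a + total_b) % MODULUS
```

Here `total` comes out to 390, the sum of the three purchases, even though neither server ever held any of the individual amounts.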

So far so good, except that I haven't specified how the additional data is encrypted. This part turns out to be somewhat tricky and the IPA authors don't have a published design for it at the moment, so this piece is still a hard hat area.

Status of IPA #

So what's the status of IPA? This has been the source of some confusion, perhaps in part because Google has implemented some of their "Privacy Sandbox" proposals in Chrome and has already done or proposed to do "origin trials" (a kind of limited access test) for them. At present, however, IPA is just a proposal. It has been submitted to the W3C Private Advertising Technology Community Group for consideration but has yet to be adopted, let alone shipped by anyone. In other words, it's a potentially interesting idea but not something that is finished or ready to standardize.

Appendix: Linear Relation Attacks #

The IPA authors describe a few known attacks on the system (though more analysis is needed). The most interesting one is what they term "linear relation" attacks. The basic idea behind this kind of attack is to use the blinding process as an oracle to determine whether a given user was in the report set.

Recall that the result of the blinding process for identity $I_i$ is $I_i^{K_a K_b}$. So if you have two identities $I_1$ and $I_2$ their blinded versions are of course $I_1^{K_a K_b}$ and $I_2^{K_a K_b}$.

These have the interesting property that:

$$(I_1^{K_a K_b})(I_2^{K_a K_b}) = (I_1 I_2)^{K_a K_b}$$

Updated 2022-02-16: oops, fixed a subscript

If the advertiser knows a user's identifier and it has the cooperation of one of the servers, it can use this fact to determine whether a given user was in a set of reports. If the target user has identifier $I_t$ it creates two fake reports $I_x$ and $I_y$ such that: $I_y = I_tI_x$. When these are blinded, the result is:

$I_x^{K_a K_b}$
$I_y^{K_a K_b} = (I_x I_t)^{K_a K_b} = (I_x^{K_a K_b})(I_t^{K_a K_b})$

And if a report from the target was included, then the reports will also include the blinded version of $I_t$, which is $I_t^{K_a K_b}$.

The colluding server then looks to see whether there is a triplet of blinded values $(B_1, B_2, B_3)$ such that $B_1 = B_2 * B_3$. If there is, then they know that $B_1$ corresponds to $I_y$ and that one of $B_2$ or $B_3$ corresponds to $I_t$.^[7] As I said above, this is a known attack and the authors are working on ideas to address it. Note also that this attack depends on knowing users' identifiers, so it can't be done by any site, but just by (or with the help of) the one issuing the identifiers.
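Here is a small Python sketch of the triplet search, under the same toy assumptions as before: integers modulo a prime `P` instead of an elliptic curve, and a single exponent `k` standing in for $K_a K_b$. The identifiers are made up, and I pick `k` coprime to $P - 1$ so that blinding is one-to-one in this toy group.

```python
import math
import secrets

P = 2**127 - 1  # toy prime modulus (assumption)

def random_blinding_key():
    """Pick an exponent coprime to P - 1 so blinding is one-to-one in this toy group."""
    while True:
        k = secrets.randbelow(P - 2) + 1
        if math.gcd(k, P - 1) == 1:
            return k

def find_triplets(blinded):
    """Return (B1, B2, B3) tuples with B1 = B2 * B3 among the blinded values."""
    present = set(blinded)
    return [
        (b2 * b3 % P, b2, b3)
        for i, b2 in enumerate(blinded)
        for b3 in blinded[i + 1:]
        if b2 * b3 % P in present
    ]

k = random_blinding_key()    # stands in for K_a * K_b
i_t = 424242                 # target's identifier (hypothetical)
i_x = 987654321              # attacker's first fake identifier
i_y = i_t * i_x % P          # attacker's second fake identifier
others = [1111, 2222, 3333]  # unrelated users (hypothetical)

with_target = [pow(i, k, P) for i in others + [i_x, i_y, i_t]]
without_target = [pow(i, k, P) for i in others + [i_x, i_y]]
```

Running `find_triplets` on `with_target` turns up exactly one triplet, whose product is the blinded $I_y$, while `without_target` produces none. That difference is the one bit the colluding server learns.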

Usually this is from an ad network of some kind, but I'm simplifying. ↩︎
The actual proposal uses different names for the impression and the purchase, but that's not necessary for this simple example. ↩︎
Yes, it's bad that sync between browsers of different manufacturers doesn't work, but that's a whole different story. ↩︎
In particular, the facts that $(g^a)(g^b) = g^{a+b}$ and $(g^a)^b = g^{ab}$. ↩︎
Yes, I know I'm using exponential notation. It's easier to follow for people not used to EC notation. ↩︎
I've omitted the discussion of the Prio proofs for simplicity. ↩︎
Note that another way to execute this is to just create a new identity that is the product of two existing identities; this lets you learn if both are in a set of reports. ↩︎

Ensuring Privacy For Age Verification

2022-02-11T00:00:00Z

The BBC reports that the UK has revived its online safety bill, which was shelved back in 2019. There has been a lot of concern about the policies embodied in this bill from organizations ranging from ISOC to Big Brother Watch but I want to focus on what's essentially a technical point, which is that it represents a threat to user privacy that we don't really know how to fix.

The bill appears to require adult (i.e., pornography) sites to verify the age of their users. This has been widely interpreted as effectively requiring the use of some kind of age verification system. Regardless of the wisdom of age verification requirements in general (see, for instance, this BBC article), it's going to be difficult to build a system which doesn't run the risk of creating a database of everyone who goes to a porn site. Given that what kind of porn people watch or whether they watch porn at all is generally considered private information, this seems fairly undesirable.

Age Verification Providers #

The basic problem here is that determining whether someone is over 18 requires learning a fair bit of information about them, generally enough to determine their identity. The UK Age Verification Providers Association lists a variety of different methods for determining age, such as government identity documents, mobile phone record, credit reference agency, credit cards, etc., most of which are directly tied to your real-world identity.^[1]

There are two major ways in which these age verification systems can work, neither of which is great:

The site itself is verifying your age, e.g., by collecting the above information and using some third party service.
The site somehow bounces/redirects/embeds some third party age verification site.

In both cases, the age verification service learns your identity and the site that you are going to (because the site has an account with the service). In the first case, the site probably also learns your identity and so can associate it with the exact pages you view rather than just the site you visit.

The general assumption by the UK government seems to be that this privacy issue will be dealt with by policy controls, i.e., by restricting use and mandating security measures. In April 2019, the British Board of Film Classification designed an Age-verification Certificate Standard for age verification providers (AVPs) which prescribes a bunch of data retention policies as well as a set of procedures for attempting to ensure that the provider's network is secure (penetration testing, cryptographic key lifetimes, monitoring requirements, etc.). This Twitter thread by well-known security guy Alec Muffett does a good job of analyzing this standard and comes to some pretty negative conclusions. I have a bigger concern, though, which is the disclosure of your identity in the first place: even if you trust that the AVP will follow its own policies, they could still be hacked (see, for instance, this 2017 Equifax breach), or their records could be subpoenaed. The bottom line is that you're placing a lot of trust in someone you have no real relationship with.^[2] A better system would be one in which nobody ever got both your identity and the fact that you were on a given site.

Anonymous Age Verification #

The good news is that we now have technical mechanisms that enable this kind of anonymous verification of people's ages. The cryptographic details are complicated (see here for a description of one such system), but the basic idea looks like this:

You go to the age verification provider and prove your age (most likely by proving your identity).
The AVP issues you an unlinkable, anonymous credential.
When you go to the porn site you provide the credential as proof of age.

This way the site knows you are of the appropriate age but doesn't learn who you are. And because the credential is unlinkable^[3] the porn site and the AVP can't collude to discover which users are which. This is all reasonably well understood cryptographic technology (see, for instance, Privacy Pass) and while it might be a bit challenging to integrate it with the Web, it's far from impossible. Unfortunately, I'm not sure how much this helps.
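To make "unlinkable" a bit more concrete, here is a toy Python sketch of the three-step flow using textbook RSA blind signatures. To be clear, this is a stand-in I chose because it fits in a few lines: it is not necessarily the construction that Privacy Pass or any actual AVP uses, the key is built from two small known primes, and there is no padding or hashing, so it is nowhere near secure.

```python
import math
import secrets

# Toy RSA key for the AVP, built from two known Mersenne primes (far too small for real use).
P_PRIME, Q_PRIME = 2**61 - 1, 2**89 - 1
N = P_PRIME * Q_PRIME
E = 65537
D = pow(E, -1, (P_PRIME - 1) * (Q_PRIME - 1))

def blind(token):
    """User side: hide the token before sending it to the AVP."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return token * pow(r, E, N) % N, r

def avp_sign(blinded_token):
    """AVP side: sign after checking the user's age; it never sees the token itself."""
    return pow(blinded_token, D, N)

def unblind(blind_signature, r):
    """User side: strip the blinding factor to get a signature on the token."""
    return blind_signature * pow(r, -1, N) % N

def site_verify(token, signature):
    """Site side: check the AVP's signature using only the public key (N, E)."""
    return pow(signature, E, N) == token

token = secrets.randbelow(N - 2) + 2             # random credential value chosen by the user
blinded_token, r = blind(token)                  # this is all the AVP sees
signature = unblind(avp_sign(blinded_token), r)  # the site sees (token, signature)
```

The AVP only ever handles `blinded_token`, and the site only ever handles `token` and `signature`. Without the user's random `r`, the two sides have nothing they can compare.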

The problem is that even if the credential which the AVP provides to the user is anonymous, the AVP still sees the user's identity at the time they prove their age to the AVP. If the main reason that people need to do age verification is to watch porn then this is a pretty strong signal of the user's behaviors, and so they still need to trust the AVP's discretion. Ironically, this is a case where privacy would be better if people had to routinely demonstrate their age. For instance, if you needed to demonstrate you were over 18 every time you bought something on Amazon or read the New York Times, or even used Facebook, then it wouldn't tell the AVP much when you signed up with it. However, if it's mostly just to access porn sites, then users don't really get to hide behind the less embarrassing use cases.

Summary #

Regardless of the wisdom from a policy perspective of this kind of age verification, it seems like a real privacy threat. I'm well aware that the privacy situation on the Web is extremely bad, but that's something that browser makers are hard at work preventing, with technologies ranging from cookie restrictions to IP address-hiding proxies, and so we're gradually moving towards a world where you don't have to trust either Web sites or the trackers embedded on them. However, requiring this kind of age verification would effectively require people to trust that the AVPs protect their privacy. This is exactly the kind of trust we usually try to avoid via technical controls, but in this case those don't seem like they will be effective, leaving users with nothing but trust.

There are some AVPs which offer face-based age estimation. While this technically doesn't involve learning your identity, I'm not sure people should be that much happier about having the AVP have their photo, and of course given the capabilities of facial recognition, it will often be possible to determine your identity anyway. In any case, the most common mechanism for providers to offer seems to be based on government documents. ↩︎
This is of course true to some extent with the porn site itself, but they don't necessarily have your name and IP addresses aren't necessarily sufficient to identify you. Plus, you could use a VPN. ↩︎
What unlinkable means in this context is that the credential that the AVP sees is different from and can't be connected to the one that is presented to the porn site. ↩︎

DNS Security, Part VII: Blockchain-based Name Systems and Transparency

2022-02-07T00:00:00Z

DNS security, I just can't quit you (see parts I, II, III, IV, V, VI). In Part VI I talked about blockchain-based name systems, but I forgot to mention one aspect: defense against surreptitious changes. For instance, suppose the attacker doesn't want to take over example.com but just wants to intercept TLS connections to it; for obvious reasons, they don't want it to be common knowledge that that's happening. One could argue that blockchain-based systems make that kind of thing harder than with conventional systems (DNS + PKI), but I don't think that's really true, for reasons laid out in this post.

The naive version of a blockchain-based DNS system mechanically and inflexibly enforces some specific policy (typically first-come-first-served). This doesn't do a good job of accommodating a number of real-world use cases such as (1) people losing their cryptographic keys or (2) people registering domain names corresponding to someone else's trademark. In the DNS, these are relatively easily handled: if you lose your DNSSEC key, you can just update it as long as you can authenticate to your registrar; if you lose your password, you can probably recover it; if someone registers your trademark, there's the UDRP. In blockchain-based systems, however, these mechanisms are not available, because everything ties mechanically back to your private key.^[1]

It's of course possible to build a flexible system which incorporates some element of discretion in these situations. The Ethereum Name Service (ENS) sort of contemplates this, though they also don't seem to have defined any real policies for how to handle these cases beyond trusting the system operators. It's not clear how this is better than the existing system of DNS governance: I know ICANN isn't particularly popular, but they do have fairly clear policies for how to handle exceptional cases (not that these cases are actually that exceptional).

The problem is that as soon as you allow this kind of discretion^[2] into the system, it undercuts the basic value proposition of having the names on the ledger: if that discretion can be exercised for legitimate reasons it can also be exercised for illegitimate reasons (e.g., to steal your domain name). The question then becomes whether it's possible to detect and contain that kind of misuse.

How to Transfer Domains #

Before we ask about how to handle these exceptional cases, we first need to look at how you handle the normal case of name transfer. As I mentioned earlier, registration is done just by storing a name/public key pair on the ledger, with the rule being that the first registrant wins. Suppose Alice has registered example.com and wants to transfer it to Bob, what now?

The obvious way to handle this is for Alice to use her key to digitally sign a record transferring the domain and insert it into the ledger. This can just be the same record that Bob would have used to register the domain if he had been first, but signed by Alice. In this case, then, what it means to own the domain is to have an unbroken chain of signatures starting from the original registrant.^[3] Note that you need to bake this rule about transfers into the system early on; otherwise, there is a risk that some relying parties (i.e., clients) won't have been updated and so won't accept the transfer, which is an obvious interoperability problem.
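As a rough sketch of the "unbroken chain" rule, here is how a relying party might replay such a ledger in Python. The record fields (`name`, `new_owner`, `signed_by`) are hypothetical, and I've abstracted away the actual signature check: `signed_by` just names the key that would have produced a valid signature, where a real system would verify it cryptographically.

```python
def replay(ledger):
    """Return {name: owner_key} after applying the records in ledger order."""
    owners = {}
    for record in ledger:
        name = record["name"]
        new_owner = record["new_owner"]
        signer = record["signed_by"]  # stand-in for a verified digital signature
        if name not in owners:
            # Registration: first come, first served, signed by the registrant.
            if signer == new_owner:
                owners[name] = new_owner
        elif signer == owners[name]:
            # Transfer: only valid if signed by the current owner.
            owners[name] = new_owner
        # Anything else (late registrations, stale or forged transfers) is ignored.
    return owners

ledger = [
    {"name": "example.com", "new_owner": "alice-key", "signed_by": "alice-key"},
    {"name": "example.com", "new_owner": "mallory-key", "signed_by": "mallory-key"},
    {"name": "example.com", "new_owner": "bob-key", "signed_by": "alice-key"},
    {"name": "example.com", "new_owner": "charlie-key", "signed_by": "alice-key"},
]
owners = replay(ledger)
```

Alice's registration wins, Mallory's later registration is ignored, the transfer to Bob is accepted, and the second transfer to Charlie is dropped because Alice no longer owns the name by the time it is processed.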

Involuntary Transfers #

From a technical perspective, involuntary transfers are just a natural extension of voluntary transfers. The way this works is that you have some set of keys which can authorize transfers for domains they don't actually own (once again, this has to be baked into the system from quite early on, at least at some level). So, if Bob holds the trademark on "Example", and Alice registers example.com then there might be some (unspecified) procedure that Bob goes through to demonstrate that he really should own example.com and if he prevails, then whoever holds those keys would create a new record on the ledger reassigning example.com to Bob's public key. I'm being vague about the details here because AFAICT none of the existing systems seem to have developed any specific procedures along these lines, so we're just talking in the abstract. Note that you can use a similar technique to handle lost keys; these aren't technically involuntary but from a technical perspective, it's basically the same thing because your key is your identity and the original key isn't being used to make the transfer.

Obviously, you can make the precise technical conditions under which a transfer is valid as complicated as you want. For instance, you can require multiple keys to sign (or use a threshold signature scheme), require multiple signatures on different days, whatever. You can even require the record to contain some description of what happens. But at the end of the day the story is the same: there's some process that takes place outside of the ledger machinery that leads some group of people to conclude that a transfer is warranted and then they effectuate the transfer on the ledger.

The key point, however, is that the transfer itself has to be recorded on the ledger in order to take effect. This makes it difficult to surreptitiously transfer a domain name, because everything that happens is public.^[4]

DNS and the WebPKI #

Let's compare this to the situation with DNS. As we saw earlier, because it's a hierarchical system, nothing stops .com from lying about who owns example.com. It can even serve correct records to some people and bogus records^[5] to others (a "split view"). The same thing is true for the WebPKI: a CA can issue a certificate for example.com to the attacker who can use it to impersonate the real owner of example.com, and it's mostly invisible to relying parties. On first glance, this looks like a real advantage for these ledger-based systems, where this misbehavior is inherently visible to relying parties and to everyone else (whether they know enough to act on it is another question). However, I don't think that's really true, because it's possible to add transparency onto these systems.

Let's start with the WebPKI piece. It's certainly true that surreptitious misissuance is possible and the purpose of Certificate Transparency (CT) is to detect just this kind of misissuance. Briefly, CT is a system of append-only ledgers designed to ensure that every valid WebPKI cert is visible on the ledger. This makes it possible to check the ledger for suspicious certificate issuance. The technical details here are a little complicated, in part because CT was created after the WebPKI was already in wide use, but as a general matter the visibility guarantees are pretty similar to those that a ledger-based name system provides.^[6] Note that one nice feature of this kind of system, unlike a ledger-based system, is that you can roll it out gradually because processing the transparency data is not required to accept the certificate.

This brings us to the question of the DNS itself. Here too, it's possible to think of adding some after-the-fact transparency mechanism to prevent parents from generating bogus data. At one point there was some interest in "CT for DNSSEC", but apparently not enough to get it off the ground. I wasn't deeply involved in that discussion, but IIRC there were concerns about log scaling and in particular about DoS attacks/spamming the logs. These are real issues but they primarily arise because of the notion that the DNS has to be free(-ish). In the existing ledger systems you just deal with this by charging people (in some cases quite a bit) to store transactions on the log. If you were willing to do that, the problem seems like it could be simplified considerably.

Detecting and Handling Misbehavior #

You may have noticed that I've sort of skipped a step here: all of these mechanisms just record every action, but that doesn't tell you what to do about it, or necessarily even how to detect it. The basic idea here is that one can scan the ledger/CT log and look for transactions which look fishy. There are a number of ways this can happen:

People can scan looking for their own names.
People can register for some service that scans looking for names for all of their clients.
You can just generally scan for suspicious-looking stuff (e.g., why did Google's name just get reassigned?)

This is probably somewhat easier for the blockchain-based systems because the exceptional cases are going to be rare and are clearly marked, so you can just ignore all the others,^[7] but it's certainly possible with a system like CT (CT calls these services "monitors"), and there have already been a number of cases where CT has detected various kinds of misbehavior, including certificates which should never have been issued.
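The "scan for your own names" case is simple enough to sketch. This assumes the log has already been fetched and parsed into entries with a `name` and a `key`; those field names and the `find_unexpected` helper are hypothetical, and a real CT monitor would of course be working with actual certificates and log APIs.

```python
def find_unexpected(log_entries, expected_keys):
    """Return log entries for watched names whose key the owner does not recognize."""
    return [
        entry
        for entry in log_entries
        if entry["name"] in expected_keys
        and entry["key"] not in expected_keys[entry["name"]]
    ]

# Hypothetical log contents and the keys the owner of example.com knows about.
log_entries = [
    {"name": "example.com", "key": "alice-key-2021"},
    {"name": "example.org", "key": "carol-key"},
    {"name": "example.com", "key": "unknown-key"},
]
expected_keys = {"example.com": {"alice-key-2021", "alice-key-2022"}}
alerts = find_unexpected(log_entries, expected_keys)
```

Only the entry with `unknown-key` is flagged. What the owner then does about it is exactly the part that happens outside the machinery.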

Summary #

None of this is to say that it's not useful to have some transparency mechanism to detect misbehavior, and I agree that it's a nice property of ledger-based systems that that's built into the system. My point here, however, is that it's not really much of an inherent advantage over our current systems because we can add transparency mechanisms to them. We already have such a mechanism built on top of the WebPKI in the form of Certificate Transparency and if we really wanted one for DNSSEC, we could almost certainly find a way to build one. More importantly, we can get these benefits incrementally: preserving the validity of all of our current names while adding transparency on top, which seems a lot easier than starting from scratch with an incompatible system.

This is actually a general problem with systems that are rooted in cryptographic keys, whether they are on the blockchain or otherwise (e.g., end-to-end encryption). It's quite common for people to lose their keys, and building a system that allows recovery from this that doesn't involve trusting someone else not to attack you is a really hard problem. ↩︎
Just to anticipate an objection, you obviously can encode some kind of complicated recovery logic into the system that might handle some of these cases via a smart contract but I'm skeptical that you can handle every case this way; the world is just too complicated. ↩︎
What happens if there are two signatures from the same registrant? This is obviously impermissible because once Alice has transferred the domain to Bob she can't also transfer it to Charlie. This is called "double spending", and is one of the primary reasons that cryptocurrency systems use ledgers. For our purposes, we can just ignore the second transfer. ↩︎
I had originally thought that it would also break the original owner's use of the domain, but upon reflection, I'm less sure. Suppose that Alice owns example.com and is DNSSEC signing her domains. If the domain is transferred to Bob, he can serve up a record that includes both Alice's keys and his own, which means that the records that Alice signs will be valid but that Bob can also sign his own records. ↩︎
What I mean by "bogus" in this case is that they haven't effected a transfer; if you checked whois it would still show the correct owner. ↩︎
The two major differences are that the ledger in CT isn't decentralized and that RPs have limited ability to verify ledger consistency (see here for more writeup on this). Not to say that I don't think these are issues, but I also think it's clearly possible to build a CT-style system that was better in these respects. ↩︎
Though of course there are also cases where someone's key is compromised/stolen which just look like normal transfers. A practical system also needs a way to deal with these. ↩︎

DNS Security, Part VI: Blockchain-based Name Systems

2022-02-04T00:00:00Z

This is Part VI of my series on DNS Security (parts I, II, III, IV, V). I thought I was done after talking about recursive to authoritative, but I then realized I wanted to cover blockchain-based name systems; these aren't strictly part of the DNS, but they're intended to fulfill a similar function, so it's worth covering them a bit.

DNS is a distributed system: name data is spread across multiple servers and resolving a given name requires asking those servers.^[1] Specifically, it is a hierarchical, federated system. In this case, federated means that different domains are controlled by different people and hierarchical means that domain example.com is subordinate to (and hence controlled by) .com, which is in turn subordinate to the root. This is easy to see if you work through the resolution process described in post I: if the root decides to lie to you about who owns a given domain, then you just get the wrong answer. This notion of trust is baked into DNSSEC, where each zone is signed by its parent: here too, any compromise of the root or of a parent domain leads to compromise of the child.

Government Takeover #

This structure has led to a fair amount of complaining about the trustworthiness of the DNS. The conspiracy theory version of this is that the root is operated by the Internet Assigned Numbers Authority (IANA), which is part of the Internet Corporation for Assigned Names and Numbers (ICANN), which is a US corporation, and so the US government will take over the root and require it to misbehave (e.g., taking over people's names, signing false records, etc.). For instance, suppose that the US government decided that the Iranian TLD (.ir) shouldn't work any more. To my knowledge that has never happened (and for reasons covered below, I think it's kind of unlikely), though it's of course possible in principle.

What has happened, however, is that various governments have simply seized people's domain names. This isn't done by leaning on ICANN, however, but rather by serving the registrar or the registry with legal process. US Immigration and Customs Enforcement (ICE) does this, as does the FBI, with the typical approach being to just replace the web site with something like this:

Note that once you've taken over the site, you own the name and can put whatever you want on it. The typical practice seems to be to put up the kind of warning label I show above, which is pretty obvious, but you could just as well build a replica of the site and continue to silently operate it (you can even get a valid TLS certificate), though this doesn't seem to be common.

A related concern is that many of the popular TLDs are actually owned by foreign countries who might not have the most friendly relationship with the jurisdiction that registrants are in. For example, .ly (as in the URL shortener bit.ly, https://bitly.com) is actually the Libyan TLD. If you have one of these domain names, you're obviously somewhat exposed to action by the parent jurisdiction.

Of course, it's somewhat of a semantic question whether this is actually an attack. Obviously, if you're the owner of airbags.com you might be unhappy about the government seizing your domain name, but it's not clear how different it is from just seizing your servers or your car; the government has plenty of processes for taking your stuff. The situation is somewhat different here in that so much of the infrastructure is in the US, and so people who don't live in the US are suddenly exposed to actions by the US government, but the situation isn't too dissimilar to what happens if you live outside the US but decide to store your money in a US bank, and of course there certainly are plenty of TLDs that are operated by non-US entities.^[2]

As I said, despite fears to the contrary, I'm not aware of any case when the US has used its control of the root to take over a name. It's not even really clear how this would work because in order to take over example.com they would first need to take over all of .com and serve all the other records besides example.com normally. This seems like a lot of work and it's not really something you could do surreptitiously, as lots of people would notice that .com suddenly had a new DNS key and was being served from a new set of servers; it's much easier to just require the registry to change their records.

Again, I want to emphasize here that most of this isn't about attacking the technical infrastructure of the DNS. Rather, it's changing actual ownership relationships in the name hierarchy, as when the government seizes your car; the DNS just reflects those ownership relationships. In other words, this is the system faithfully publishing the official data as it is designed to do.

Filtering #

Even if you don't control the TLD for a domain name, it's comparatively easy to filter the DNS if you control the network. This is not so much because of the hierarchical structure of the name system but because name resolution tends to go through a resolver that is controlled by the network. This means that if you control that resolver you can easily remove any names you don't like or (if DNSSEC is not in use) replace them with names of your own (see here).

This kind of filtering is fairly common. For instance, China's Internet filtering uses DNS blocking. It's also common practice in enterprise or school environments to block domains corresponding to material that the network operator thinks is contraband (often "adult" material). One of the impacts of encrypted DNS is to make this kind of blocking harder, especially if the device or software is configured to use an unfiltered resolver.

Name Ownership Disputes #

Finally, there are circumstances in which a domain can be involuntarily transferred from one party to another. One common case is where someone registers a domain name which corresponds to a trademark held by another entity. Suppose, for instance, that I register coca-co.la (which, incidentally, seems to be unregistered) and start some business selling soda (EKR Cola!). The Coca Cola Company might be upset about this and their recourse is ICANN's Uniform Domain-Name Dispute-Resolution Policy (UDRP), which allows them to file a complaint and potentially gain control of the name. The details are of course complicated, but here are some high points:

b. Evidence of Registration and Use in Bad Faith. For the purposes of Paragraph 4(a)(iii), the following circumstances, in particular but without limitation, if found by the Panel to be present, shall be evidence of the registration and use of a domain name in bad faith:

(i) circumstances indicating that you have registered or you have acquired the domain name primarily for the purpose of selling, renting, or otherwise transferring the domain name registration to the complainant who is the owner of the trademark or service mark or to a competitor of that complainant, for valuable consideration in excess of your documented out-of-pocket costs directly related to the domain name; or

(ii) you have registered the domain name in order to prevent the owner of the trademark or service mark from reflecting the mark in a corresponding domain name, provided that you have engaged in a pattern of such conduct; or

(iii) you have registered the domain name primarily for the purpose of disrupting the business of a competitor; or

(iv) by using the domain name, you have intentionally attempted to attract, for commercial gain, Internet users to your web site or other on-line location, by creating a likelihood of confusion with the complainant's mark as to the source, sponsorship, affiliation, or endorsement of your web site or location or of a product or service on your web site or location.

Name registration is frequently first come first served, and it's actually reasonably likely that you'd be able to register some domain name or another that was arguably infringing, as it's kind of a subjective judgment, but the UDRP allows the holder of the trademark to try to reclaim the name in these cases after the fact.^[3]

Note that here too, we're not talking about a technical process but rather a legal/policy one. The UDRP allows the trademark holder to argue that a certain domain name shouldn't have been registered and if they prevail, then the domain registration will be transferred or canceled. When that happens, the DNS gets changed to reflect the outcome of that process, but that's just publishing a decision which got made outside the DNS.

Blockchain/Ledger-Based Systems #

This brings us to the topic of alternative name systems based on ledgers, which are advertised as addressing these issues, especially censorship. Probably the two best known of these are the Ethereum Name Service and Namecoin. Here's Namecoin's description of its value proposition:

Protect free-speech rights online by making the web more resistant to censorship.

Attach identity information such as GPG and OTR keys and email, Bitcoin, and Bitmessage addresses to an identity of your choice.

Human-meaningful Tor .onion domains.

Decentralized TLS (HTTPS) certificate validation, backed by blockchain consensus.

Access websites using the .bit top-level domain

What all this means will become clear below.

How to build a blockchain-based name system #

As with everything crypto, the details are fantastically complicated, but the idea is conceptually simple:^[4] The blockchain provides a decentralized append-only ledger. I'll probably describe how this works at some future point, but for now, this means it's a data structure which:

Has a fixed order of operations
(Mostly) anybody can write to it.
You can only write to the end of it
Everyone agrees on the contents
Nobody can change anything that happened in the past^[5]

With a data structure like this, it's easy to build a simple first-come-first-served (FCFS) name system. You just write a record to the ledger consisting of (1) the name you want to register and (2) your public key. E.g.,

{
    "domain-name": "example.com",
    "public-key": {
        "kty": "EC",
        "crv": "P-256",
        "x": "...",
        "y": "...",
        "use": "enc",
        "kid": "1"
    }
}

Public key borrowed from RFC 7517.

As long as you're the first person to register a name, congratulations, you own it!^[6] Anyone can validate you own it just by looking through the entire ledger from the beginning (this may take some time) and seeing that you were the first person to register it. If someone tries to register it afterwards, then it's just ignored (whether it even makes it into the ledger or not is a detail, though an important one in practice). From this point on, things are pretty simple: once you've registered your public key you can just use it to sign ordinary DNSSEC records for your name and use DNSSEC for every name below you. Of course you also need some way to tell resolvers which authoritative server to go to to get those records, but this can be stuffed in the blockchain as well, or stuffed somewhere else and signed with your blockchain-based key.
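The validation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ledger is modelled as an in-memory list that is already in its agreed order, the field names follow the example record, and the key values are placeholder strings rather than real keys.

```python
from typing import Iterable, Optional

def first_registrant(ledger: Iterable[dict], name: str) -> Optional[str]:
    """Return the public key from the first record that registers `name`.

    Hypothetical record shape: {"domain-name": ..., "public-key": ...}.
    """
    for record in ledger:                   # scan from the beginning, in ledger order
        if record["domain-name"] == name:
            return record["public-key"]     # any later registration is ignored
    return None                             # nobody has registered the name yet

# Placeholder ledger: the third record arrives too late to matter.
ledger = [
    {"domain-name": "example.com", "public-key": "alice-key"},
    {"domain-name": "other.com", "public-key": "bob-key"},
    {"domain-name": "example.com", "public-key": "mallory-key"},
]
```

Here `first_registrant(ledger, "example.com")` returns `"alice-key"`; the cost of a lookup is a scan of the whole ledger, which is the "this may take some time" part.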

You'll notice that above I've tried to register a domain in .com but actually this is bad news: if we have two mechanisms for registering names that are uncoordinated we're going to run into situations where some people see example.com as one thing and other people as another (RFC 2826 does a good job of laying this out). In practice, people who want to build their own naming systems tend to try to locate them in as-yet-unused portions of the DNS space: for instance, Namecoin uses .bit and ENS uses .eth.^[7] The idea here is that if you have a Namecoin-capable client you look at the top label and if it's .bit you use Namecoin and otherwise you use the DNS. Of course, these names are still notionally within the DNS and so there's actually nothing stopping ICANN from deciding tomorrow to mint a .bit domain, which would cause confusion. The general idea seems to be that once you get enough usage of your new TLD, ICANN will avoid creating it because it would cause too much trouble; it remains to be seen whether this is actually true.
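The "look at the top label" rule is simple enough to sketch directly. The function and table names below are made up for illustration, and the only alternative root listed is the .bit case from the text; a real client would of course go on to actually perform the lookup.

```python
# Hypothetical table of top labels handled by an alternative name system.
ALTERNATIVE_ROOTS = {"bit": "namecoin"}

def name_system_for(hostname: str) -> str:
    """Pick the resolution mechanism from the rightmost (top) label."""
    labels = hostname.rstrip(".").lower().split(".")   # tolerate a trailing dot
    return ALTERNATIVE_ROOTS.get(labels[-1], "dns")    # everything else goes to the DNS
```

With this, `name_system_for("example.bit")` selects Namecoin and `name_system_for("example.com")` falls through to ordinary DNS. The sketch also makes the collision problem visible: the table is purely a client-side convention, so nothing in it prevents the DNS from later defining the same top label.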

It is technically possible to register a Special Use Domain Name (SUDN) that is outside of the DNS hierarchy, so one might imagine doing so for a new blockchain-based name system. The bar for this is quite high and the only top-level SUDN which has been registered for an alternative namespace is .onion (RFC 7686) for Tor's cryptographically-generated domain names. This registration was controversial at the time and in some sense sui generis because the names are cryptographically verified rather than looked up; for obvious reasons the IETF and ICANN are less excited about registering TLDs for name resolution protocols which are conceptually similar to DNS but use different technical underpinnings.

Technical Properties #

With this under our belts, let's look at the technical properties of the system. For the purposes of this discussion, I'll be assuming that the ledger behaves as advertised; there are potential attacks on the ledgers but they're not so interesting here.

The main advertised advantage for blockchain-based systems is censorship resistance. The first thing that Namecoin lists as its value proposition is "Protect free-speech rights online by making the web more resistant to censorship." Similarly, ENS advertises itself with "Launch censorship-resistant decentralized websites with ENS." The answer to the question of whether these systems are more censorship resistant is "sort of". As we saw before, there are two primary ways to censor a domain name in the DNS: (1) legally/administratively take over the domain itself and (2) block the domain name resolution process. We need to look at these independently.

Domain Takeover #

How resistant this kind of system is to domain takeover depends on the name allocation and reassignment policy. The simple first-come-first-served system I described above really is more resistant to takeover by governments or by anybody else. The ledger enforces ordering and so there's just no external mechanism to transfer a name from someone to someone else. The system of course needs a mechanism to do transfers, but that's done by having the original owner sign the transfer and that means you need the owner's private key, which the government or ICANN wouldn't have.
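To make the transfer rule concrete, here is a minimal sketch of how a client might replay the ledger to work out who owns each name. It assumes, hypothetically, that signatures have already been verified upstream, so each record just carries a "signer" field naming the key that signed it; the record layout and names are illustrative and not taken from any real system.

```python
from typing import Dict, Iterable

def current_owners(ledger: Iterable[dict]) -> Dict[str, str]:
    """Replay the ledger in order and return a name -> owner-key mapping."""
    owners: Dict[str, str] = {}
    for rec in ledger:
        name = rec["name"]
        if rec["op"] == "register":
            owners.setdefault(name, rec["signer"])    # first come, first served
        elif rec["op"] == "transfer":
            if owners.get(name) == rec["signer"]:     # only the current owner may transfer
                owners[name] = rec["new_owner"]
            # anything else is ignored, e.g. a transfer signed by a former owner
    return owners

ledger = [
    {"op": "register", "name": "example.bit", "signer": "alice"},
    {"op": "transfer", "name": "example.bit", "signer": "alice", "new_owner": "bob"},
    {"op": "transfer", "name": "example.bit", "signer": "alice", "new_owner": "charlie"},
    {"op": "register", "name": "example.bit", "signer": "mallory"},
]
```

Replaying this ledger leaves `example.bit` with `bob`: the second transfer is signed by someone who no longer owns the name, and the late registration loses to the earlier one. Nothing in the loop offers a path for a third party to move the name, which is exactly the property under discussion.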

It's far from clear that these are actually good properties to have, for two reasons. First, if you lose your signing key you have effectively lost your domain, which seems like a terrifying prospect if you're the person in charge of cisco.bit. You certainly don't want to be like that guy who had 220 million dollars locked up in a Bitcoin wallet that he's lost the password for. Second, while it may seem like a good property that nobody can take your correctly registered domain away from you, it also means that if someone registers a domain for a trademark you own then you can't take it away from them, which is obviously less desirable. Given the importance of the UDRP for the existing domain name system, I have a hard time seeing most big companies wanting to participate in that kind of a system, given the risk that they will be unable to protect their trademarks.

It's of course possible to build a system that allows for controlled involuntary transfers: you just have some group of people who can sign those transfers. It appears that this is what ENS has done, requiring four out of seven trusted people to change policies (see here for a much more negative assessment of the ENS system), but then the censorship resistance benefits come down to how much you trust those people and especially how much you trust them not to be pressured by governments.^[8] The material that ENS has published here isn't very encouraging:

The root node is presently owned by a multisig contract, with keys held by trustworthy individuals in the Ethereum community. We expect that this will be hands-off, with the root ownership only used to effect administrative changes, such as the introduction of a new TLD, or to recover from an emergency such as a critical vulnerability in a TLD registrar.

The keyholders are drawn from respected members of the community, and with the exception of Nick Johnson, founder of ENS, are unaffiliated with ENS. We ask and expect them to exercise their individual judgement acting in the interests of the ENS community, rather than rubber-stamping requests made to them by ENS developers

This kind of ad hoc decision based on people being expected to act in the best interests of the community doesn't really seem sufficient to govern a name system which supports trillions of dollars of transactions.
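For concreteness, the four-out-of-seven rule mentioned above boils down to a threshold check like the one below. This is a sketch of the policy logic only: the keyholder names are placeholders, signature verification is assumed to have happened already, and it is not a description of how the actual ENS multisig contract is implemented.

```python
from typing import Iterable, Set

# Seven placeholder keyholder identities (hypothetical).
KEYHOLDERS = {f"holder-{i}" for i in range(1, 8)}

def approved(signers: Iterable[str], keyholders: Set[str] = KEYHOLDERS,
             threshold: int = 4) -> bool:
    """True if at least `threshold` distinct recognised keyholders signed."""
    valid = set(signers) & keyholders     # drops duplicates and unknown signers
    return len(valid) >= threshold
```

The cryptography only establishes that four of the seven agreed; whether those four were exercising independent judgment or responding to pressure is not something the check can see.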

Finally, it's worth noting that none of this means that your domain can't be taken away by legal process because that could potentially be used to force you to sign the transfer. In this case the system will duly publish that transfer (as there's no real way for it to tell you signed it under duress). All the cryptographic machinery is really doing is making it hard for people who can't force you to do things to effectuate the transfer.

Filtering #

It's a bit hard to tell whether this kind of system is more resistant to filtering than ordinary DNS. At the moment, the answer is almost certainly "yes" because there is an established ecosystem devoted to filtering DNS and the blockchain-based name systems are too small to be worth filtering.

I don't think, however, that there is any real technical reason why these systems are more resistant to filtering. At the end of the day, the way these systems work is that you download a bunch of data from the ledger and then verify all the signatures. So what makes them filtering resistant is that the distribution mechanism for the blockchain data is peer to peer and also that you can layer them on top of some other system that is censorship resistant (e.g., download them from the Web or via a real anti-censorship system like Tor).

However, you can do precisely the same thing with DNS. First, if things are DNSSEC signed then they can just be passed around directly because DNSSEC chains are self-contained. And even for non-DNSSEC-signed domains, it's certainly possible to have some third party (e.g., Google public DNS) sign the data. So, as long as you have a censorship-resistant publishing mechanism—this is the hard part—DNS will be equally filtering resistant. Moreover, given that secure DNS transport mechanisms are already in common use, it seems like it's going to be a lot easier to make the DNS hard to filter than to deploy some entirely new naming system, especially given that much of the Internet will be running on DNS for years whatever new system is invented.

What about the rest? #

Let's just look quickly at the rest of the Namecoin value proposition. (I'm not trying to beat up on Namecoin here; mostly similar comments would apply to ENS or any of these systems.)

Attach identity information such as GPG and OTR keys and email, Bitcoin, and Bitmessage addresses to an identity of your choice #

This seems like a reasonable goal, but there's nothing special about a blockchain system that lets you do this. DNS already supports new record types and we've already seen how to attach cryptographic material to DNS; it's straightforward to add all of these record types as well. All you'd need is to want to do it.

Human-meaningful Tor .onion domains. #

This is kind of confusing until you read the FAQ. The situation is that .onion addresses are special because the address is actually the hash of a cryptographic key. With Namecoin you can register a pointer from a regular name to a .onion name. This is fine, but of course you can do it with DNS as well as long as the domain is DNSSEC signed.

Decentralized TLS (HTTPS) certificate validation, backed by blockchain consensus. #

There are two points here: first that you can have a TLSA record associated with your Namecoin domain. This is of course equally possible with ordinary DNS as well. The second point is just the one I made above, which is that the name registration is rooted in the blockchain not in the DNS hierarchy.

Access websites using the .bit top-level domain #

And this just means that you can use .bit instead of .com or whatever.

Summary #

At the end of the day, I don't really see much advantage to these blockchain/ledger-based systems. The primary value proposition is that they are censorship resistant. However, this property is provided by having them rigidly and mechanically enforce some policy, which seems more like a bug than a feature. Our existing name system depends on flexibility in order to function, both to save people from themselves (if they lose their key) and to save them from others (if people register your name in the DNS) and so a system that doesn't provide any discretion seems like a step backwards. It's of course possible to layer some kind of governance structure over top of such a system—this would of course have to be cryptographically reified—but that's not what we have now and at that point, it seems like you've reproduced the same discretionary properties of the DNS that motivate these systems.

Even if these systems do turn out to be technically superior, they face the same network effect challenges that we saw with TLSA: anyone can get a DNS name today and it will be acceptable to basically anyone else on the Internet. By contrast, if you register something in .bit then very few people will be able to see it, so you're most likely going to want to register a DNS name anyway, at which point the incentive to register the Namecoin name as well seems rather low.

Or, in the words of Leslie Lamport, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." ↩︎
Though many of the country code TLDs are operated by US companies. ↩︎
As an aside, one problem with minting new top level domains is that they are a new opportunity for people to register names corresponding to some large entity. Rather than go through the dispute resolution process, it's potentially easier and cheaper for the owners of famous names to just register in every new TLD. In a 2014 paper, Halvorson et al. show that the vast majority of registrations in .xxx (intended for adult content) were either for defensive (registering your own name) or speculative (hoping to sell the name) purposes, thus reflecting a windfall to the operators of .xxx of around $10 million USD. ↩︎
Full disclosure: I once participated in the design of a similar system, called Churro in the days before blockchain. It seemed like a good idea at the time. ↩︎
At least in theory. The degree to which this is true in practice is debatable, but for now let's take it as true. ↩︎
As a practical matter, this isn't quite what you want to do: things don't get added to the ledger instantaneously and so it's possible for someone to "frontrun" your domain by seeing the domain you registered and trying to register it themselves, in the hope that they will get added to the ledger first; this is easier if they are themselves part of the infrastructure of the ledger. The fix for this is to first record a commitment to the name you want to register (e.g., HMAC(K, <name>) with a randomly chosen key K) and then once that commitment has been logged, you reveal the commitment by publishing K. This prevents someone from seeing the domain you want to register before it is already in the ledger. ↩︎
ENS will also allow you to register names in the ordinary DNS space but they require you to already own the DNS name, so that's not a problem. ↩︎
As an aside, it's quite possible to build a ledger-type system on top of DNS, using something like certificate transparency. ↩︎
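The commit-then-reveal fix for front-running described in the notes above can be sketched with the standard library. This follows the HMAC(K, name) form given there; the function names are made up, and it assumes the reveal step publishes the name alongside K so that anyone can recheck the logged commitment.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

def commit(name: str) -> Tuple[bytes, bytes]:
    """Return (K, commitment), where commitment = HMAC-SHA256(K, name)."""
    key = secrets.token_bytes(32)                        # randomly chosen K, kept secret for now
    commitment = hmac.new(key, name.encode(), hashlib.sha256).digest()
    return key, commitment                               # only the commitment is logged at first

def check_reveal(commitment: bytes, key: bytes, name: str) -> bool:
    """Check a revealed (K, name) pair against a previously logged commitment."""
    expected = hmac.new(key, name.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(expected, commitment)
```

Until K is published, the logged commitment reveals nothing useful about which name is being claimed, so an observer has nothing to race against.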

Privately Measuring Vaccine Doses

2022-01-25T00:00:00Z

Note: this post contains a bunch of LaTeX math notation rendered in MathJax, but it doesn't show up right in the newsletter version.

Anyone can go to the CDC Web site and find out the status of the US COVID vaccination effort. Unfortunately, due to privacy controls in the CDC's data collection (see footnotes), this data seems to be less accurate than we would like:

To protect the privacy of vaccine recipients, CDC receives data without any personally identifiable information (de-identified data) about vaccine doses. Each record of a dose has a unique person identifier. Each jurisdiction or provider uses a unique person identifier to link records within their own systems. However, CDC cannot use the unique person identifier to identify individual people by name. If a person received doses in more than one jurisdiction or at different providers within the same jurisdiction, they could receive different unique person identifiers for different doses. CDC may not be able to link multiple unique person identifiers for different jurisdictions or providers to a single person.

These inaccuracies are made somewhat less apparent by the fact that the CDC caps ("top codes") estimates of vaccine coverage at 95% (formerly 99%), so you don't see reports where more people in an area are vaccinated than actually live in that area:

CDC has capped the percent of population coverage metrics at 95%. This cap helps address potential overestimates of vaccination coverage due to first, second, and booster doses that were not linked. Other reasons for overestimates include census denominator data not including part-time residents or potential data reporting errors.

As I understand it, the situation here is that the data reported by states is roughly accurate, as long as you don't get into people who got doses out of state, but the CDC data is less so because of these privacy measures. For instance, the CDC's data shows that 40 different states have 95% of people 65+ with at least one dose, which not only doesn't help you distinguish between California and Iowa but actually seems to be wrong for California as well. Here's a comparison of the California and Federal Data for 65+.

Metric	California	Federal
Number w/ >= 1 dose	5926681	6606265
Percent w/ >= 1 dose	90.8	95 (presumably topcoded)
Number fully vaxxed	5403586	5147954
Percent fully vaxxed	82.8	88.2

It's somewhat hard to square this data, and the percentages may just be about the size of the eligible population, but the raw numbers should at least agree. At least part of what's going on seems to be that doses are being misattributed (e.g., boosters marked down as first doses) and CDC not having ground truth doesn't help us debug. A number of commenters have been quite critical of these privacy measures and their impact on the accuracy of the data. For instance, here's political blogger Matt Yglesias:

Besides this, the stated reason for collecting such bad data is not to allow people to get illicit boosters, it’s to protect their privacy. As I wrote in “They deliberately put errors in the Census,” I am very skeptical that the privacy value of having the government do inaccurate record-keeping is high.

I suspect I'm more sensitive to privacy issues than Yglesias, but I'm also not sure this is the right tradeoff. In this case especially, the states (and of course, probably the health insurance companies) seem to have non-anonymous measurements of who got vaccinated and when, so it's not clear why it's that big a privacy increment to deny this data to the CDC. Moreover, the states can't easily get more private because they seem to be using that information to implement their vaccine passport systems. For instance in California or New York, you can just input some identifying information and download your vaccine passport; this obviously wouldn't work if this data were stored without identifiers. With that said, I can also see the argument that you don't want the federal government having this information and, unlike the states, it's using it for statistical and not operational purposes, so it's worth asking whether it's possible to improve the situation. As usual, sounds like a job for cryptography.

Anonymously Measuring Vaccination Rates #

The underlying problem here is that we want to be able to measure the rate of various kinds of vaccination in each demographic region. This seems to require that we be able to:

Associate vaccine doses with demographic information like where they were given, where the patient lives, age of the patient, etc. This allows you to measure geographic deployment rates.
Associate multiple doses given to the same person so that you don't say obviously wrong things like 200% of people in California have gotten first doses and nobody has gotten a second dose.

The first requirement is actually readily addressable with privacy preserving measurement techniques like Prio (see here for my writeup), but it doesn't do a good job of linking up multiple doses. One could imagine having a different counter for "first dose", "second dose", etc. with the states reporting each dose appropriately. However, part of the problem seems to be that the status of each dose is being inaccurately reported, both because of errors and because some people actually deliberately concealed or at least didn't disclose their vaccination status, e.g., to get an early booster.

If you didn't care about privacy, you would address this just by having each dose associated with some permanent identifier (ID) like personal name or (even better for accuracy but worse for privacy) social security number. You then would just have a list of doses, dates, and identifiers and could sort things out in the obvious fashion by grouping by identifier and then counting. But of course the problem with this is that the identifier is, well, identifying, which is what we are trying to avoid. So, what you want is a stable pseudonymous identifier (PID) derived from this information (thus allowing grouping) but that can't be reversed to give the input information (thus protecting user privacy).
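The "group by identifier and then count" step looks roughly like the sketch below. The records and identifiers are invented for illustration; the point is that the counting only needs identifiers to be stable and comparable for equality, which is why a pseudonymous PID would serve just as well as a name or SSN.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

def doses_per_person(records: Iterable[Tuple[str, str]]) -> Dict[int, int]:
    """Map 'number of doses' -> 'number of people who received that many'.

    Each record is a hypothetical (identifier, date) pair.
    """
    by_person = defaultdict(list)
    for ident, date in records:
        by_person[ident].append(date)          # group doses by identifier
    return dict(Counter(len(dates) for dates in by_person.values()))

# Made-up data: p1 has three doses, p2 one, p3 two.
records = [
    ("p1", "2021-03-01"), ("p1", "2021-03-29"), ("p2", "2021-04-02"),
    ("p1", "2021-11-15"), ("p3", "2021-05-01"), ("p3", "2021-05-29"),
]
```

If the same person shows up under two different identifiers, this count silently treats them as two people with fewer doses each, which is the kind of misattribution discussed above.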

Some things which won't work: hashes, PRFs, and OPRFs #

The obvious thing to do here is to just hash the data, but that's clearly not going to work: the cryptographic guarantees around hash functions only apply when the input space is large, but in this case, the input space will be quite small (for instance, there are only 10⁹ SSNs, so it's trivial to compute the hashes for all of them, and the space of names is not that much larger). This means that the CDC could easily make a table of all the possible identifiers and who they belong to.
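A toy version of that table-building attack is below. To keep it fast it enumerates a made-up space of five-digit IDs rather than the full nine-digit SSN space, and SHA-256 stands in for whatever hash might be chosen; the approach is identical at full scale, just with more entries.

```python
import hashlib
from typing import Dict

def pid(identifier: str) -> str:
    """Unkeyed 'pseudonym': simply the SHA-256 hash of the identifier."""
    return hashlib.sha256(identifier.encode()).hexdigest()

def build_table(digits: int) -> Dict[str, str]:
    """Precompute hash -> identifier for every zero-padded ID of `digits` digits."""
    ids = (f"{n:0{digits}d}" for n in range(10 ** digits))
    return {pid(i): i for i in ids}

table = build_table(5)   # toy space of 10^5 IDs (an assumption for speed)
```

Given any hashed identifier from this space, `table[...]` returns the original ID immediately, so the hash provides grouping but essentially no privacy.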

The next natural thing to try is some kind of keyed one-way function like a Pseudorandom Function (PRF), but the problem then becomes who can compute this function. PRFs depend on a key, and if the CDC knows the key then it's not better than a hash function. But as a practical matter, if every state has the key, then it's a stretch to think that the CDC will not get it or convince some state employee to run the PRF for them on the (again, small) set of potential names.

Recently, it's become common to throw oblivious PRFs (OPRFs) at this kind of problem. An oblivious PRF is like a PRF except that it can be computed on a blinded version of the input. This means that you can set up a server which will compute the OPRF for people without seeing the input, like so:

In this version, the state would blind the patient's name and send it to the OPRF Server, which would compute the OPRF on the blinded input and then return it. The state then unblinds the result to get the PRF on the original input. This has two important properties:

It can't be computed without the key.
The OPRF service never sees both the input and the output because they are blinded. The state of course does get the output (the PID) and sees the input.
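The blind/evaluate/unblind flow can be sketched with modular exponentiation. This is one common way to build an OPRF (hash the input into a group, raise it to a random blinding exponent, let the server raise it to its key, then remove the blinding); the parameters, function names, and the tiny 11-bit group are assumptions chosen for readability, and a real deployment would use a standardized elliptic-curve group.

```python
import hashlib
import secrets

# Toy group: P = 2*Q + 1 with P and Q both prime. The squares mod P form a
# subgroup of prime order Q. Far too small for real use.
P = 2039
Q = 1019

def hash_to_group(identifier: str) -> int:
    """Map an identifier to a non-identity element of the order-Q subgroup."""
    digest = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    base = 2 + digest % (P - 3)          # in [2, P-2], so never 0 or +/-1
    return pow(base, 2, P)               # squaring lands in the subgroup

def blind(identifier: str):
    """State side: hide the input under a random exponent r."""
    r = 1 + secrets.randbelow(Q - 1)     # r in [1, Q-1]
    return r, pow(hash_to_group(identifier), r, P)

def server_evaluate(blinded: int, key: int) -> int:
    """OPRF server side: apply the secret key to the blinded value."""
    return pow(blinded, key, P)

def unblind(evaluated: int, r: int) -> int:
    """State side: strip r, leaving hash_to_group(identifier) ** key."""
    return pow(evaluated, pow(r, -1, Q), P)

server_key = 777                          # hypothetical server secret
r, blinded = blind("123-45-6789")
pid = unblind(server_evaluate(blinded, server_key), r)
```

The server only ever sees `blinded`, which is a uniformly random group element from its point of view, while the state ends up with the same `pid` it would have gotten if the server had evaluated the function on the raw input.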

In the full system, then, health authorities, states, etc. would collect the patient's ID and ask the OPRF server to map it to the pseudonymous PID, and then send the information to the CDC. This is slightly better, but not much because the OPRF service is an oracle that lets a lot of people map the client's true identifier to the PID, and so you need to tightly control access to that service. But a lot of entities (at least states, but also maybe local health departments) are going to have access to that service, which makes the problem hard, as any of them can be used to unmask people. Moreover, it's kind of an inconvenient interface for the states because they want to just submit their data, not have some complicated mapping process that they do pre-submission.

Interoperable Private Attribution #

The underlying problem here is that we need a way to map $ID \rightarrow PID$ that can't just be used by the CDC. Otherwise, they can just try candidate $ID$ values until they get a $PID$ match. Recently, Erik Taubeneck (Meta), Ben Savage (Meta), and Martin Thomson (Mozilla) published a new multiparty computation technology called Interoperable Private Attribution (IPA). As the name suggests, it's designed for measuring conversions in online advertisements, and I may write about that later, but the basic ideas can be adapted for measuring vaccine uptake.^[1]

The general idea behind IPA is that we have a service which is kind of like an OPRF in that it takes in an encrypted identifier and outputs a blinded identifier which can't be tied back to the original input (which is essentially the same problem we are trying to solve). However, if we just emit a blinded identifier in response to an encrypted identifier, the service can be used as an oracle to compute the mapping to blinded IDs. In order to prevent that, the service actually has to take in a group of encrypted identifiers and shuffle them somehow (e.g., by emitting them in a batch). This gives you an interface more like this:

Note that this interface could take identifiers in as a batch and then shuffle the batch, or take them one at a time but then buffer them; it just has to make it hard to determine which input goes with which output. In addition, we don't want one entity (in this case the OPRF server) to be able to unmask everyone, so we need to distribute the computation over multiple servers, like so:

Note that the precise communication between servers is a bit complicated. The first server actually only partially decrypts and then blinds and passes things to the second server. Details can be found below.

There are a number of ways to use a service like this. The most obvious is simply to have each vaccine dose be a single report, and then submit them to the service and look at the output in batches. The result will just be a set of delinked, shuffled identifiers, like so:

Now you can just count how many times each identifier appears; identifiers which appear once are single doses, twice are double doses, thrice are boosted etc. If you take the data in daily batches, you can also estimate the amount of time between doses by looking at what day each identifier is reported. You can also do geographic distributions by sending each jurisdiction in separately. In the original IPA proposal, the way things work is that all the encrypted reports were sent to a "Consumer" which gets meta-information like the site the report came from. The consumer could then ask the service to aggregate only a subset of the data.^[2]
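The counting step is simple enough to show directly. This is a minimal sketch assuming the service hands back a flat list of blinded identifiers for one batch; the identifier strings and function name are made up for illustration.

```python
from collections import Counter

def dose_histogram(blinded_ids):
    """Return {number_of_doses: number_of_people} for one batch of reports."""
    reports_per_person = Counter(blinded_ids)       # blinded ID -> report count
    return dict(Counter(reports_per_person.values()))

# Hypothetical shuffled output from the service: six reports, three people.
batch = ["p7", "p2", "p7", "p9", "p2", "p7"]
hist = dose_histogram(batch)
```

Here one person appears once, one twice, and one three times, so `hist` maps each of 1, 2, and 3 doses to a single person. The analyst never needs to know who any of the blinded identifiers belongs to.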

There are two important properties of this system that might not be immediately obvious. First, the blinding and shuffling process doesn't preserve meta-information: it just emits the identifiers. If you want to learn about subsets of the data, you need to process the data in chunks (e.g., one state at a time.) The IPA authors have been working on how to carry some meta-information along with the reports, but it's a somewhat complicated problem, as the blinding process would destroy it, and they haven't published a design for this feature.

Second, if you allow the consumer to do a lot of queries of different subsets, then they can use that to extract information about the original data (see here for more). This requires you to restrict the number of different queries, or potentially to just commit in advance to what you will do (e.g., just down to counties on a daily basis). Sybil attacks in which the consumer injects fake queries are also possible, but can be prevented by having the jurisdiction sign their reports.^[3]

IPA Technical Details #

This section provides technical details. I've attempted to make them mostly accessible, and they can be understood based on high school math,^[4] but they can also be skipped if necessary. This section will not render properly in the newsletter because I use MathJax to render LaTeX. Click here to see it rendered on the site.

$$g^x, I * g^{x(a+b)}$$

Importantly, this second term can be broken up into a part involving only $a$ and a part involving only $b$. I.e.,

$$I * g^{x(a+b)} = I * g^{xa} * g^{xb}$$

$$I *g^{xb} = \frac{I * \cancel{g^{xa}} * g^{xb}}{\cancel{g^{xa}}}$$

This cancels out the $g^{xa}$ term, leaving you with just a term that involves $b$, and thus the pair:

$$g^x, I * g^{xb}$$

$A$ then blinds this value, by exponentiating both values to $K_a$, giving:

$$(g^x)^{K_a}, (I * g^{xb})^{K_a}$$

We can flatten this out to give:

$$g^{x * K_a}, I^{K_a} * g^{(xb)(K_a)}$$

$$I^{K_a} = \frac{I^{K_a} * \cancel{g^{(xb)(K_a)}}}{\cancel{g^{(xb)(K_a)}}}$$

Finally, $B$ blinds the value by taking it to the power $K_b$, thus giving us:

$$I^{(K_a)(K_b)} = (I^{K_a})^{K_b}$$

That was a lot of math, but the bottom line is that the actual identifier $I$ (e.g., the SSN) has been converted into a new blinded value, with (hopefully) the following properties:

Neither $A$ nor $B$ ever saw $I$.
$A$ sees the input encrypted version but doesn't learn the blinded version.
$B$ sees the blinded version but doesn't learn the encrypted version.
You need to know $K_a$ and $K_b$ to compute the blinded version of $I$.
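The equations above can be traced in code. This is a sketch of my reading of them, not the IPA reference implementation: it uses a toy multiplicative group in place of an elliptic curve, and the function names, key values, and hash-to-group mapping are all assumptions for illustration.

```python
import hashlib
import secrets

P, Q, G = 2039, 1019, 4     # toy group: P = 2*Q + 1, G generates the order-Q subgroup

def to_group(identifier: str) -> int:
    """Encode the identifier I as a non-identity element of the subgroup."""
    digest = int.from_bytes(hashlib.sha256(identifier.encode()).digest(), "big")
    return pow(2 + digest % (P - 3), 2, P)

def rand_exp() -> int:
    return 1 + secrets.randbelow(Q - 1)            # uniform in [1, Q-1]

def encrypt(i_elem: int, pub_a: int, pub_b: int):
    """Client: produce (g^x, I * g^{x(a+b)}) under the joint key g^a * g^b."""
    x = rand_exp()
    joint = pub_a * pub_b % P
    return pow(G, x, P), i_elem * pow(joint, x, P) % P

def server_a(ct, a: int, k_a: int):
    """A: divide out g^{xa}, then raise both halves to K_a."""
    c1, c2 = ct
    c2 = c2 * pow(pow(c1, a, P), -1, P) % P        # now I * g^{xb}
    return pow(c1, k_a, P), pow(c2, k_a, P)

def server_b(ct, b: int, k_b: int) -> int:
    """B: divide out g^{x*b*K_a}, then raise to K_b, giving I^{K_a*K_b}."""
    c1, c2 = ct
    i_ka = c2 * pow(pow(c1, b, P), -1, P) % P      # now I^{K_a}
    return pow(i_ka, k_b, P)

def run_batch(identifiers, a, b, k_a, k_b):
    """Push a batch through A then B, shuffling between and after the hops."""
    rng = secrets.SystemRandom()
    pub_a, pub_b = pow(G, a, P), pow(G, b, P)
    cts = [encrypt(to_group(i), pub_a, pub_b) for i in identifiers]
    stage_a = [server_a(ct, a, k_a) for ct in cts]
    rng.shuffle(stage_a)
    out = [server_b(ct, b, k_b) for ct in stage_a]
    rng.shuffle(out)
    return out

a, b, k_a, k_b = 101, 202, 303, 404                # hypothetical server secrets
blinded = run_batch(["111-11-1111", "222-22-2222", "111-11-1111"], a, b, k_a, k_b)
```

Two reports for the same identifier come out as the same blinded value even though their ciphertexts used different random $x$, which is what makes the later counting step possible.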

Disclaimer: The IPA documents were just published recently, so I don't think they have seen enough analysis to prove they are secure. Here I'm just describing how it's supposed to work.

Limitations #

Like any privacy preserving measurement system, this has some limitations, in particular in the area of flexibility. For instance, this will only properly attribute vaccine doses when there is an exact match on the original identifier. This will work OK if the identifier itself has a single form, like a social security number, but what if you use name and birthday? In that case, "John Smith" and "John H. Smith" will look like different people. If you had people's actual names, you could try to correct this kind of error by looking for close matches at approximately the right time, but IPA isn't "distance preserving" in that two similar inputs A and B are not likely to have blinded versions which are similar, so you can't make this kind of correction later.

Another problem is that in the form I've presented it, you're losing information like the kind of vaccine, so you can't easily ask "how many people started with J&J and then boosted with Moderna." There are some potential avenues for making this work, for instance to carry metadata along with the identifier, and that's probably possible, but making that work is more complicated than the protocol I described above.

Finally, because repeated queries can be used to determine which reports belong to which individuals, you need to limit the number of different kinds of queries you do. This is probably fine if you want to just record the number of doses of each type in a given region, but less fine if you want to do some kind of deeper research. Of course the states can do that analysis now because they have accurate data, but if you want to do national scale analysis or you want it done consistently, that's not that great an option.

Summary #

Given the fact that the states are collecting directly identifying data about vaccination, I suspect it's a bad tradeoff to conceal this data from the CDC: the privacy improvement seems modest and the effect on accuracy is real. However, if we are going to take it as a hard requirement that the CDC not learn identifying information, then we can use Privacy Preserving Measurement techniques to get substantially better accuracy than the CDC seems to be achieving today.

Full disclosure: I was an early reviewer of this design and made some comments and suggestions. ↩︎
In IPA, the service actually computes aggregates like sum or whatever, but that's probably not necessary here. ↩︎
IPA expects to use randomization to provide differential privacy, but of course this reduces accuracy. ↩︎
In particular, the facts that $(g^a)(g^b) = g^{a+b}$ and $(g^a)^b = g^{ab}$. ↩︎
Yes, I know I'm using exponential notation. It's easier to follow for people not used to EC notation. ↩︎

DNS Security, Part V: Transport security for Recursive to Authoritative DNS

2022-01-21T00:00:00Z

This is Part V of my series on DNS Security (parts I, II, III, IV). In part IV I covered DNS transport security between the client (the stub resolver) and the recursive resolver but ran out of room to talk about the recursive to authoritative link, which is the subject of this post.

Recall yet again the DNS resolution process, shown below:

For this post, we will be focusing on protecting the transactions between the recursive resolver and the authoritative servers, shown in blue in this diagram. The work on this has been happening in the IETF DNS PRIVate Exchange (dprive) Working Group. This is commonly called Authoritative DNS over TLS (ADoT), or ADoX if you want to indicate that you don't care whether the transport is DoT, DoH, or DoQ.

The Basic Setting #

Before we start looking at mechanisms, it's helpful to frame the problem correctly. We have two objectives:

Protect the confidentiality of the request. I.e., we do not want the attacker to know that the user is trying to resolve example.org.
Protect the integrity of the response. I.e., we do not want the attacker to be able to lie about the address for example.org.

As discussed before, while DNSSEC can provide integrity, it cannot provide confidentiality.

The first thing to notice is that this means we need to encrypt both the link to the authoritative for .org and the link to the authoritative for example.org because both transactions leak that the user is interested in example.org. Importantly, the privacy value of the query is limited by the number of other domains which are served by the same authoritative as example.org, because the user must be asking for one of those domains. For this reason, if we have encrypted DNS your users will get better privacy if your domain is hosted by a DNS provider that serves a lot of other domains as well. Note that there are cases in which example.org might have a lot of subdomains and you wouldn't want the attacker knowing which one is being requested, but in the most common case it's the second level domain that matters.

Second, in order to provide confidentiality for these lookups, we need to provide integrity for the identity of the server. For instance, if the attacker is able to attack the connection between the client and b2.org.afilias-nst.org, it can substitute its own server for the true authoritative server b.iana-servers.net. DNSSEC as-is does not prevent this form of attack because it doesn't sign the NS records at the parent, but only at the child; but by the time you've queried the child for them, it's too late because you've already leaked the query to the attacker. This means that the most convenient thing is if every link uses secure transport, so that you can trust the results it gives you at stage N before using them for stage N+1. In other words, you want to have secure transport all the way to the root.

As before, then, the basic problem is setting the DNS client's (in this case the recursive resolver, confusing, right?) expectations correctly. In particular, if we are going to be resistant to active attack, the recursive needs to know:

That the authoritative server will do DoX (and what protocol)
The identity to expect the authoritative server to present

If it doesn't know either of these things, then an active attacker can interfere with the connection. Specifically, if the recursive doesn't know that the authoritative server will use DoX, then the attacker can just simulate an error when the recursive tries. If it doesn't know the identity that the authoritative server will present, then the attacker can just provide its own identity and impersonate the authoritative. Unfortunately, this turns out to be quite a bit more difficult than one would like.

Root Servers #

As shown in the diagram above, the first request from the recursive resolver goes, at least notionally, to the root server. If this is to use secure transport, the only way that can work is for the recursive to be preconfigured with the information about which root servers use secure transport. There are only 13 root server names (a.root-servers.net through m.root-servers.net), so it's not at all impractical to imagine just disseminating an updated list. Note that it's not necessary for all the root servers to switch to secure transport at once (they are operated by different people), but of course if the recursive preferentially uses secure transport, then the first one to switch might get increased load. As a practical matter, it seems unlikely that we're going to get secure transport to the root immediately. It's much simpler for the recursive resolver to run a mirror of the root zone locally, as specified in RFC 8806.

Non-Root Authoritatives #

The situation with non-root authoritative servers (e.g., for .com or example.com) is more complicated, because the way you learn about those servers is from the root server, so how does the recursive learn that they accept secure transport? There is a similar problem all the way down the chain: when the parent nameserver (e.g., b2.org.afilias-nst.org) tells you about the child server for a given zone (e.g., b.iana-servers.net for the zone example.org) how do you know the properties of the child server? If you are used to the Web, there will seem to be an obvious answer: the parent nameserver should tell you. This is how things work on the Web, where there is a different URL scheme for secure transactions (https:) versus insecure transactions (http:).

However, DNS isn't the Web and there are actually two "parent servers" where this data could go. Consider the case where we are trying to resolve example.org, but the authoritative server for example.org is on example.net.^[1] In order to look up example.org the recursive resolver needs to first look up example.net so that it can then contact it. This means that there are two places where one could indicate that the connection to example.net should use secure transport. First, you could put the information in the NS records for example.org that say to contact example.net (this corresponds to the way things work on the Web). These records would be served off of the .org authoritative server, like so:

This seems natural but has the disadvantage that every domain which uses example.net as its nameserver needs to update its own records individually.^[2] A more DNS-like approach^[3] is to have the indication be in a record that gets served for the authoritative (example.net) that you get when you look up its IP address. This would be served off of the .net authoritative, like so:

The advantage of this second approach is that as soon as example.net upgrades to secure transport, everyone who uses it as a nameserver gets it, by contrast with the first approach where each domain has to configure it separately for its authoritative server.

You'll notice that I've just written "Use DoT" here, but that's handwaving, not telling you how it actually works, and in this case details really matter. Unfortunately, here is where we run into trouble. The basic problem here is updating the parent server to know that the server for the child domain supports secure transport. This is a lot more complicated than it sounds, to the point where it's more or less stalled the whole effort. The next section describes the situation in some detail, but the TL;DR is that there seem to be no good existing mechanisms for doing this, so we're left with either not doing it or with some hacks (skip ahead).

Populating the Parent Zone (Technical) #

Warning: this section is fairly technical. You can safely skip it if you don't care about the details.

Recall that DNS has a number of different resource record (RR) types, including A/AAAA for IPv4 and IPv6 addresses, etc. The information about what server to use for a given domain is contained in a nameserver (NS) record, but unfortunately that record has no place to carry other information about the server. The "right" place to put this information is in the service binding (SVCB) record, which can already be used to signal that you should use HTTPS rather than HTTP (the use case for this is cases where someone has used an http: URL but the target domain always wants you to use TLS). Unfortunately, actually populating the parent zone with SVCB turns out to be impractical, at least in the short to medium term.

There are several separate entities who have to cooperate in order to serve a domain name:

The registrant who actually operates the domain (e.g., Google for google.com).
The authoritative name server who actually serves the DNS records for the domain.
The registry which actually hosts the DNS for the parent domain. For instance Verisign operates .com.
The registrar which is responsible for actually interacting with the registrant. It is the registrar's job to populate the registry's database with NS records that point to the authoritative name server.

The registration process proceeds as shown below. Note that I've shown it in one order but the steps can sometimes happen in a different order:

First, the registrant registers (i.e., buys)^[4] the domain with the registrar. This just creates a database record that indicates they own the domain.
The registrant publishes the DNS records for the domain with the authoritative server. In this example, they just publish the IP address.
The registrant tells the registrar which authoritative server it is using.
The registrar tells the registry which authoritative server the domain is using, using the Extensible Provisioning Protocol.

At the end of the day, we end up with a situation in which:

The registry (and hence the parent domain) is publishing a record that says that example.com is hosted on the authoritative server.
The authoritative server publishes a record that actually has the address for example.com

In practice, it's reasonably common for two of these entities to be the same. For instance, big companies like Google or Facebook usually run their own authoritative servers. Another version is that many registrars operate their own authoritative servers. In some cases, a hosting provider will operate a registrar and an authoritative server (for instance, Dreamhost is the registrar, authoritative server, and web hoster for rtfm.com).

Whatever the exact configuration, the first problem is that EPP, while extensible, does not currently provide any mechanism for conveying SVCB records, so if we wanted the registrar to convey them to the registry, we would need an extension, which would take some time to deploy. For this reason, there has been a fair amount of interest in ~~hijacking~~reusing existing DNS records which are already propagated to the parent zone.

DS Glue #

Probably the most promising version of this is called "DS Glue" and uses a DS record for a fake algorithm to smuggle information about the target resolver. This is one of those hacks which sits right at the border between hideous and brilliant: because DS is already propagated to the parent, we hopefully don't need to change registries or EPP (I say "hopefully" because this depends on those elements being willing to handle the new DS record type, and it remains to be seen whether that will work properly). DS Glue has the nice property that it doesn't require DNSSEC deployment: as long as there is secure transport^[5] to the parent authoritative (in this case, for .org) and to the parent for the authoritative server's domain (in this case .net) then the records are trustworthy. If either of these connections is insecure, however, then the attacker can substitute new NS records (to point to a different authoritative server) or strip the DS glue records (thus blocking encryption).

If the transport connection to the parent for the authoritative isn't secure, but that zone is DNSSEC signed, then DS glue still works. It works less well if there isn't secure transport for the parent of the target domain because NS records aren't signed in the parent and so the recursive will get the DS glue records for the wrong authoritative.^[6]

TLSA #

The other major live proposal is to use the TLSA record to indicate that the authoritative server wants secure transport. This would be delivered in roughly the same way as the DS glue record. This has the disadvantage that it requires that the authoritative server's domain be DNSSEC signed, which then becomes an obstacle to deployment. One of the advantages of secure transport is that it can be deployed in parallel with DNSSEC and this would remove that advantage, so I'm less optimistic about this approach.

No signaling in the parent #

The alternative approach is to not signal in the parent that the authoritative server for the child zone supports secure transport. In this case, the recursive will have to discover that somehow. The most likely way is that you query for a SVCB record for the authoritative server, though I've also seen suggestions to query for a TLSA/DANE record. This would look like this:

This is secure if and only if the zone for the authoritative server is signed. If it's not signed there's nothing stopping an active attacker from just intercepting the connection to the authoritative server and responding that the authoritative doesn't support secure transport (note that it most likely can't actually establish secure transport because it will have the wrong credentials), like so:

An additional problem with this design is that it likely introduces additional latency because the recursive resolver needs to first query the authoritative server for its capabilities and only then can it ask the real question (this is one of the main reasons for signaling in the parent).

Another alternative is to signal this information in the child domain itself somewhere. This is technically possible, but the problem is that by the time you've looked up the information in the child's domain, you've already leaked to the attacker what domain you want to resolve. Of course, after that's happened you could learn that the child wanted secure transport and use it in the future, but not if the attacker attacks the connection between you and the child, so you need DNSSEC here too. Moreover, it means that every child needs to independently signal that it wants secure transport to its authoritative.

Insecurely Discovering Secure Transport #

While it may ultimately be possible to provide for a method of securely signaling the use of secure transport, it's starting to look like it's going to be very difficult to converge on something that everyone likes. In the meantime, a number of people have proposed that instead we do what's often called either unauthenticated or probing modes of secure transport. The basic idea here is that the recursive resolver would attempt secure transport to the authoritative server and then in future remember whether that worked or not.

Obviously, this kind of system isn't entirely secure against active attack, but it might be a good idea anyway for at least three reasons:

Active attack is harder than passive attack, so you've increased the attacker's costs.
If you have a way for the authoritative server to signal its commitment to supporting secure transport for some period (like HSTS for HTTP), then you can bootstrap insecure discovery into a secure mode; this requires the attacker to mount an active attack the first time you connect, which is even harder.
It helps the authoritative (and to some extent the recursive) resolvers get experience with deploying secure transport without running the risk of hard failures if something goes wrong (see more on this below).
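The probing idea, including the HSTS-style bootstrap from the second point, can be modeled as a small state machine. This is a toy sketch, not any resolver's actual behavior: `try_secure` and `try_plain` are hypothetical callables standing in for real transports, and the "pin" is an assumed commitment lifetime that the authoritative returns alongside its answer.

```python
import time
from dataclasses import dataclass

@dataclass
class ServerState:
    secure_ok: bool              # did the most recent secure attempt succeed?
    pinned_until: float = 0.0    # HSTS-style commitment expiry (epoch seconds)

class ProbingResolver:
    """Toy model of probing ("unauthenticated") secure transport.

    try_secure(server, qname) -> (answer, pin_seconds), or None on failure
    try_plain(server, qname)  -> answer
    """

    def __init__(self, try_secure, try_plain, now=time.time):
        self.try_secure = try_secure
        self.try_plain = try_plain
        self.now = now
        self.cache = {}          # server name -> ServerState

    def query(self, server, qname):
        state = self.cache.get(server)
        pinned = state is not None and state.pinned_until > self.now()
        # Probe on first contact, after a past success, or while pinned.
        if state is None or state.secure_ok or pinned:
            result = self.try_secure(server, qname)
            if result is not None:
                answer, pin_seconds = result
                self.cache[server] = ServerState(True, self.now() + pin_seconds)
                return answer
            if pinned:
                # The server committed to secure transport: fail hard
                # rather than let an attacker force a downgrade.
                raise ConnectionError(f"{server} committed to secure transport")
            self.cache[server] = ServerState(False)
        return self.try_plain(server, qname)
```

Without a pin, a failed probe silently falls back to plaintext, which is exactly the window an active attacker can exploit; with a pin, the attacker has to interfere on the very first contact. A real implementation would also need to re-probe servers that previously failed, which this sketch omits.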

Moreover, this kind of mechanism is much easier to deploy, because it doesn't involve any of the difficulties we saw above with signaling availability of secure transport prior to connection establishment, or with propagating records to other servers.

Historically I've not been that enthusiastic about this kind of insecure discovery (what's often called "opportunistic", but that word has become the subject of heated debates about its precise definition), because it's really better to have secure discovery and this seemed like a distraction from that. However, as the discussion about how to actually do the secure signaling has dragged on (and to some extent ground to a halt), I've started to think it may be better to do something than nothing.

TLSA vs. WebPKI #

Another point of contention here is how the authoritative servers should authenticate. There are two major options here: use the WebPKI like TLS on the Web, or use TLSA/DANE (see here for my writeup on this). This is an issue which raises some very strong feelings on both sides.

On the WebPKI side, the argument is roughly that we already have plenty of experience with the WebPKI and while it has its problems, it's well understood and we know we can deploy it. By contrast, TLSA/DANE requires taking an unnecessary dependency on DNSSEC. On the TLSA side, the argument is roughly that (1) the WebPKI is bad (2) WebPKI security depends on DNS, so we shouldn't make DNS security depend on the WebPKI, and (3) we should stop acting like DNSSEC isn't a requirement (and perhaps that if we make things depend on DNSSEC, it will become a requirement).

As should be clear from this long series of posts, I'm more optimistic about WebPKI, but I'm more than happy to design a system which allows either WebPKI or TLSA/DANE and let the market sort it out.^[7] As far as I can tell, this is the position of most of the people who favor WebPKI, so the two sides really are more like "WebPKI or TLSA" or "TLSA only" (see above about the implications of making DNSSEC a requirement.)

Operator Concerns #

Even assuming that we address the technical issues about when recursive resolvers initiate secure transport, actually getting deployment requires that the authoritative servers enable ADoX; unfortunately, there are serious questions about their willingness to do so. In March of 2021, the root server operators published a statement expressing concern about the use of encryption to the root servers:

Server Operators have some concerns about supporting DNS encryption for serving the root zone. It is well known that UDP has desirable performance characteristics, due to its stateless nature. Increasing the state-holding burden with the addition of connection-oriented protocols, as well as encryption data, not only reduces the performance of name servers, but also may raise new types of denial-of-service attacks.

At this time, the exact risk-reward tradeoffs for deployment of encryption to root name servers is unclear and will likely depend on which particular transport proposals gain momentum. Root Server Operators do not feel comfortable being the early adopters of authoritative DNS encryption and would like to first see increased deployment in other parts of the DNS hierarchy. Meanwhile, there are other ways to improve privacy in queries sent to root and other name servers.

As described above, it's of course theoretically possible to just do secure transport to the TLD server and not to the root (though Verisign, for instance, runs both .com and two root servers). In addition, some operators also published an Internet Draft documenting their concerns, which roughly come down to performance (due to the additional cost of encryption and doing TCP) and stability (which seems to be about whether TLS/QUIC failures will cause resolution to fail).

These concerns are actually sort of puzzling to Web people, for several reasons. First, the vast majority of Web traffic is encrypted, including key services like Google and Facebook, and once operators got past the teething pains, this doesn't seem to have created increased stability concerns. If Google goes down, it's an enormous deal, perhaps even bigger than a DNS authoritative server failure, because recursive servers cache data and so won't start failing immediately.

Second, although encryption does increase load somewhat, even 10 years ago it was a relatively small fraction of the cost of running a server. In a 2012 talk by Langley, Modadugu, and Chang they reported that SSL/TLS accounted for less than 1% of CPU load on their front-end machines, and of course both machines and TLS have gotten faster. It's true that serving DNS tends to be lighter-weight because UDP is cheap and the servers are largely stateless (though QUIC may help some here), but the overall load profile doesn't seem like a big deal. As a comparison point, all the root servers together serve on the order of 80 billion queries a day. This is equal to less than an hour of Cloudflare's query volume, so doesn't seem that impractical to protect. It's certainly possible, even likely, that it would require those operators to invest more than they have in infrastructure, but it seems far from impossible.
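As a rough sanity check on scale, here is the arithmetic on the 80 billion figure, assuming queries are spread evenly over the day (which they are not):

$$\frac{80 \times 10^{9}\ \text{queries}}{86400\ \text{s}} \approx 9.3 \times 10^{5}\ \text{queries/s}$$

Taking the "less than an hour" comparison at face value, it implies Cloudflare handles more than $80 \times 10^{9} / 3600 \approx 2.2 \times 10^{7}$ queries per second.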

Summary #

As I said above, the situation is in flux, but overall, I'm not that optimistic. This is a system with a lot of moving parts and where a number of the veto points have relatively little incentive to change their operations or, as is the case with the root operators, are actively skeptical of doing so. If we look at the situation with DNSSEC deployment, which DNS operators are relatively enthusiastic about and which still has a lot of friction points, the prospects for any kind of signaling for ADoX don't look that great. The prospects for some sort of probing/unauthenticated mode, potentially with an HSTS-style upgrade, seem a little better, but even that seems like it may be a stretch.

Really, it would probably be on ns.example.net but I'm simplifying. ↩︎
This is the situation on the Web, hence HSTS. ↩︎
This may all seem obvious to people who understand DNS, but it took me a while to work through it, so I think it might help others too. ↩︎
Or, more accurately, rents. ↩︎
And recursively from the root. ↩︎
There is one case where this still sort of works: if (1) the target zone is signed and (2) the sensitive label is one deeper than the target zone, e.g., sensitive-label.example.com and (3) the recursive first queries the target authoritative to check the NS record (NS revalidation). In that case you can still protect the sensitive label. ↩︎
This does entail more complexity, because it probably requires a way to signal which kind of credential the authoritative will use so that a recursive which only knows WebPKI or TLSA/DANE knows if it will be able to connect. ↩︎

Qualifying for prestige races (and why you won't get into Western States)

2022-01-16T00:00:00Z

It's a common pattern: a new category of race starts up and initially it's not very popular, so you can just sign up. But the race can't accommodate an infinite number of participants, and if the sport starts to get popular, you can start to hit capacity limits. If they're not too bad you can just make things first come first served, but some really popular races—especially prestige ones like the Boston Marathon or the Hawaii Ironman—are in such demand that they would just fill up instantly. Obviously, this is one way to ration entry, but it's odd to choose based on how good someone is at hitting reload on their browser and unlike COVID vaccination, it's not just a simple matter of prioritization: some people will get in and some will not. Selecting the lucky few turns out to be a somewhat complicated problem, and the three endurance sports I'm most familiar with (road running, triathlon, and ultramarathons) have all developed different solutions.

At a high level, you can select people based on two basic criteria: merit and luck. Luck is theoretically easy: run a lottery (though in practice it's usually not that simple). Merit is more complicated, for reasons I'll get into below.

Road Racing #

Road race fields are typically very large (for instance, the 2019 Boston Marathon had 30000 runners), and so only the most famous and popular races need to do anything special beyond first come first served. If you're a popular race, though, you need to do something different. Boston is by far the most prestigious marathon in the US—and probably the world—and therefore is heavily in demand, even with this big a field size. They run a relatively straightforward system: each age bracket (mostly 5-years) has a qualifying time. If you hit the qualifying standard in any certified marathon then you are eligible to apply for Boston. A similar system is used for the US Olympic trials in marathon, where there is a qualifying time tuned to generate a field of a few hundred or so.^[1]

This doesn't guarantee you entry, though: because more people hit the qualifying time than they can admit they also have a year-to-year adjustment to the qualifying time. For instance, if you are 41, your qualifying time in 2021 was 3:10, but because of the small field size this year, they had an unusually high cut-off of 7:47, meaning you had to actually run 3:02:13 to be admitted. On the other hand, fewer people applied in 2022 and everyone with the official time got in. These times are fast, but are not out of reach for reasonably good runners. Many other prestige races use a combination of lotteries and time qualification.
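The arithmetic behind that example is just the qualifying standard minus the cut-off. Here is a minimal sketch in Python; the function names are mine, not anything the race organizers publish.

```python
from datetime import timedelta


def required_time(standard: timedelta, cutoff: timedelta) -> timedelta:
    """Time you actually need to run: the standard minus that year's cut-off."""
    return standard - cutoff


def fmt(t: timedelta) -> str:
    """Format a duration as H:MM:SS."""
    total = int(t.total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


standard = timedelta(hours=3, minutes=10)   # qualifying standard from the text
cutoff = timedelta(minutes=7, seconds=47)   # cut-off from the text
needed = fmt(required_time(standard, cutoff))
print(needed)  # 3:02:13
```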

Time-based qualification works well for road racing (or track) because times are relatively consistent and depend mostly on the flatness of the course and the weather (specifically, temperature and wind).^[2] This means that most people have a fast (which is to say flat, low wind, cool) course available to them without too much effort, and so they have an opportunity to turn in a fast time. Indeed, it's quite common for races to advertise themselves as "flat^[3] fast" and perfect for Boston Qualifying. Popular places to get the "BQ", as they say, are Tunnel Hill run in November in Illinois and California International Marathon (CIM) run in December in Sacramento.^[4]

Triathlon #

The Ironman race that everyone has as their goal is the Hawaii Ironman (aka "Kona"). By contrast to road racing, triathlon courses are somewhat less standardized and there are fewer races, so that means that there's a fair amount of variation in finish times; for instance the Ironman Germany course record is 7:41 and the Ironman Lanzarote record is 8:30. This, coupled with the relatively small number of entrants in Hawaii (about 2500) means that time criteria don't work well; there will be too much uncertainty at the margin.

Instead, the way this works is that Ironman Hawaii gives each race a fixed number of "slots", which is to say the number of athletes they can send to Kona. These slots are then allocated to each (typically five year) age bracket + gender (e.g., Male 25-29). If there are (say) 5 slots in a given age group, then they go to the top athletes in that age group. If a qualifying athlete doesn't want the slot—or already has one—then it "rolls down" to the next athlete. In some cases, it's been known to happen that a slot will roll down off the end of the age group (especially in smaller age groups), and go to another age group. This structure creates a slightly odd dynamic: As with Boston qualifying, people gravitate to specific races, not on the basis of time but rather on the basis of which races appear to have "soft" winning times and thus be easier to qualify at. This can make a big difference if you are a solid but not elite age grouper who is just on the border of qualifying. I myself once flew to New Zealand to race because the previous year had had fairly slow winning times (I DNFed.)
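The roll-down rule is easy to state as code. The following is only a sketch of the logic as described above, with a hypothetical `allocate_slots` helper and made-up athletes; it is not Ironman's actual procedure.

```python
def allocate_slots(finishers, slots):
    """Hand out an age group's slots in finish order.

    finishers: list of (name, accepts) pairs, fastest first. `accepts` is
    False if the athlete declines or already holds a slot.
    Returns (athletes who take a slot, slots left over for other age groups).
    """
    taken = []
    for name, accepts in finishers:
        if len(taken) == slots:
            break
        if accepts:
            taken.append(name)  # a declined slot "rolls down" to the next finisher
    return taken, slots - len(taken)


# Second place declines, so the slot rolls down to third place.
taken, unused = allocate_slots(
    [("A", True), ("B", False), ("C", True), ("D", True)], slots=2
)
print(taken, unused)  # ['A', 'C'] 0
```

If every finisher in a small age group declines, the function reports unused slots, which corresponds to the slot rolling off the end of the age group.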

Interestingly, the Hawaii Ironman used to run a lottery in which you could pay $50 to enter, but it appears that they have stopped doing that due to a settlement with the Federal government which treats it as gambling, I think because they charged you whether you got in or not.

Ultramarathons #

Ultras tend to have even smaller field sizes than triathlons, both for logistical and historical reasons. The logistical reason is that it's hard to have a lot of people on single-track mountain trails—and of course it's hard on the trails. For instance, even the comparatively large Ultra-Trail du Mont-Blanc (UTMB), the most prestigious European long distance ultra, has a field size of only around 2300 runners. The most prestigious North American ultra, Western States, has a field size of under 400. The reason for this is that some of the event takes place in a wilderness region where races are technically forbidden, and so the race operates under a permit that keeps it to the size of the event before the wilderness was created. Other famous North American ultras like Hardrock 100 or Sonoma 50 also have relatively small field sizes.^[5]

Unlike both road racing and triathlon, ultras manage the problem of oversubscription (at least for amateurs) almost entirely by luck and not by merit. As an example, Sonoma 50 runs a simple blind lottery for all admissions, including pros. The sole exception is the previous year's winner, who gets in without being in the lottery. It doesn't matter if you're back of the pack or going for the win, you're all in the same lottery. A more common structure is to have some kind of special affordance for professionals. It's not clear to me why this system has evolved, but I suspect it's something to do with the generally less competitive ethos of trail running as well as the relative youth of the sport.

Western States #

Western States has a particularly ornate system, consisting of a set of about 100 "automatic entrants" plus a lottery with about 270 spots. The automatics are largely elites of various flavors, including:

The top 10 men and women in the previous year
6 spots for elite athletes (mostly non-Americans) from the Ultra Trail World Tour.
The top two men and women from 6 different Golden Ticket races.^[6]
Around 10 slots for race sponsors. For instance, Jim Walmsley famously won in 2020, turned down his automatic slot for 2021 because he didn't think he was going to race and then got in via his sponsor, shoe company Hoka.

If you're not good enough to run your way in or have a sponsor who will get you in (and you're not Gordy Ainsleigh who ran the course on foot back when it was just the Tevis Cup, Cowman AmooHa, or a few of the other notables), then it's the lottery for you.

The way the WS lottery works is that each year you have to "qualify" by finishing—occasionally within a certain time—one of a set of specified races. Unlike with Boston or the Hawaii Ironman, these qualifying requirements aren't set to pick out elite runners but just to weed out people who have no real chance of finishing Western. For instance, it's sufficient to finish Sean O'Brien 100K in under 16 hours. I'm not saying this is easy, but I finished under 13 hours and was well off the podium.

This all worked reasonably OK until the mid 2010s, at which point the number of applicants exceeded the number of slots by a factor of about 10 and there were people who had been waiting to get in for 5 years. In 2015, they introduced a new system in which the number of lottery tickets doubled for every year you didn't get in. With a few small modifications^[7], this is the system that exists now.

The obvious problem with this system is that it doesn't make any more slots; it just reallocates the probability of getting in from newer people to older people. This of course reduces the number of people who have been waiting a really long time, but at the cost of making it very unlikely for new people to get in. For instance, someone who entered the lottery for the first time in 2021 (for the 2022 race) had a 1.3% chance of getting in, and it's just going to get worse as long as more people want to run Western than can be accommodated via the lottery.
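To see how doubling shifts the odds, here is a rough sketch. It assumes a first-year entrant holds one ticket, so year n holds 2^(n-1), and it approximates the draw as independent picks from the full ticket pool. The pool breakdown is invented for illustration; only the 270 lottery spots and the doubling rule come from the text.

```python
def tickets(years_in_lottery):
    """Tickets held after doubling each year, assuming one ticket in year 1."""
    return 2 ** (years_in_lottery - 1)


def approx_odds(years_in_lottery, pool, draws):
    """Approximate chance of being picked at least once in `draws` picks.

    pool maps years-in-lottery -> number of applicants. Treating the picks as
    independent slightly misstates the real draw, where each winner's
    remaining tickets are removed from the pool.
    """
    total = sum(tickets(y) * n for y, n in pool.items())
    miss_one_pick = 1 - tickets(years_in_lottery) / total
    return 1 - miss_one_pick ** draws


# Hypothetical applicant pool, NOT real Western States data.
HYPOTHETICAL_POOL = {1: 3000, 2: 1500, 3: 800, 4: 400, 5: 200}
odds = {y: approx_odds(y, HYPOTHETICAL_POOL, draws=270) for y in HYPOTHETICAL_POOL}
for years, p in odds.items():
    print(f"year {years}: {tickets(years):2d} tickets, about {p:.1%}")
```

The total number of winners is fixed, so every percentage point gained by the long-waiting applicants comes out of the first-year applicants' share.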

Hardrock 100 #

Hardrock 100 has an especially goofy system, with three separate lotteries:

Category	Number of Slots
Never finished	65
Veterans (five or more finishes)	25
Everyone else	55

When you add this up, you see that more than half of the slots are given to people who have already run Hardrock, so this has precisely the opposite bias from the one Western States has (although they do use a similar doubling scheme for Never Finished, so at least it tends to reward waiting).

In practice, this has resulted in a terrible gender balance for Hardrock: because historically most of the people who have run Hardrock are men, this system just perpetuates that imbalance and will continue to do so as long as the number of first-time women doesn't massively increase. Starting in 2022, Hardrock's policy is to admit women in proportion to their fraction of the lottery pool. This won't actually bring gender balance because the number of men who enter is far greater, but it's potentially a step in the right direction. The High Lonesome 100 has gone even further and selects exactly as many women as men.

UTMB #

UTMB followed a similar path, starting with open entrance, then qualification, and finally a lottery, including a similar doubling scheme to Western States (the site suggests that they will no longer double after 2022). However they have now introduced a new change to the system in which you can collect "running stones" for participating in specific races (especially races owned by UTMB!) with each stone counting as another lottery entry. So, for instance, you get 9 stones for Thailand By UTMB. And the more races you do the more stones you collect. We should anticipate that in the future the majority of people will be selected via this mechanism, both because it's obviously a huge advantage and because the more people start using it, the more of a disadvantage you are at if you just enter the ordinary lottery. This is, of course, good for business!

The Long-term #

As I mentioned above, as long as more people want to do these races than can be accommodated, any lottery system is sort of a temporary measure, because most people won't get to do the race ever. For instance, there were over 3000 first year applicants for the 2022 Western States. It would take over 10 years just to have all of them race, in which time another 30,000 or so people would be waiting. I think it's only now that people are starting to come to terms with this and realize they are unlikely to ever get into Western States or Hardrock. Moreover, increasing the odds for people who have been waiting longer will actually have the paradoxical effect that wait times for people who get in continue to increase as the right hand side of the distribution is increasingly favored (the wait times of people who don't get in will of course always be infinite).

The graph above shows a simulation of 10 years of the Western States lottery under the (very conservative) assumption that the number of new entrants will continue to remain the same (in fact, it has been increasing for years). The area shows the distribution of wait times and the black line the mean number of years that selected runners have been in the lottery. As you can see, this means that the population of lottery winners will have been waiting longer and longer and will be getting correspondingly older. This is going to get especially weird in another 10-15 years as the pros are typically fairly young (under 40), so even more than usual you'll have two races, one for pros and one for amateurs.
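A simulation along these lines is short to write. What follows is my own minimal sketch, not the code behind the graph: it assumes a constant 3000 new entrants per year, 270 lottery spots, one ticket in year one with doubling thereafter, and that nobody ever drops out of the pool. The weighted draw without replacement uses the Efraimidis-Spirakis key trick (largest `u ** (1 / weight)` wins).

```python
import random


def simulate(years=10, new_per_year=3000, slots=270, seed=1):
    """Return the mean years-in-lottery of the winners for each simulated year."""
    rng = random.Random(seed)
    pool = []          # one entry per waiting runner: years spent in the lottery
    mean_waits = []
    for _ in range(years):
        # Everyone still waiting ages a year, then the new entrants join.
        pool = [w + 1 for w in pool] + [1] * new_per_year
        # Weighted sampling without replacement: a runner with w years in the
        # lottery has weight 2 ** (w - 1); the `slots` largest keys win.
        keyed = sorted(
            ((rng.random() ** (1.0 / 2 ** (w - 1)), i) for i, w in enumerate(pool)),
            reverse=True,
        )
        winners = {i for _, i in keyed[:slots]}
        mean_waits.append(sum(pool[i] for i in winners) / slots)
        pool = [w for i, w in enumerate(pool) if i not in winners]
    return mean_waits


waits = simulate()
for year, mean_wait in enumerate(waits, start=1):
    print(f"year {year}: winners waited {mean_wait:.2f} years on average")
```

In the first year everyone has waited exactly one year, so the mean starts at 1.0 and climbs from there as multi-year applicants accumulate tickets.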

The Bigger Picture #

At the end of the day there really isn't a great solution: there are just more people who want to do these races than can plausibly do so, so you need some way to select the lucky few. It seems like one could make an argument for either performance-based qualification (Boston and Kona) or lottery-based qualification. However, it seems to me that the doubling system used by Western States and the quota system used by Hardrock are long-term unstable, the former because it's just going to create an older and older population and the latter because it just seems unfair to favor people who have done the race 5 times over people who have never done it.

This actually went a bit wrong in 2020, when they overshot the mark for women. The women's standard for entering the trials in the marathon was 2:45 and 511 women qualified. The standard has been dropped to 2:37 for 2024. I've seen arguments that a big field was good, but obviously USATF doesn't agree. It's certainly true that the logistics are hard because each runner gets to have their own individualized nutrition at aid stations, etc. ↩︎
Temperature is actually a huge issue because running generates a lot of heat and your body has to work to get rid of it. The data is unsurprisingly pretty noisy, but the optimal temperature for running appears to be quite cold, somewhere around 5-10°C. ↩︎
Courses can be net downhill but only by a little bit. ↩︎
CIM also advertises "More porta-potties per runner at the start and along the course than any event CIM staff and board has ever seen!". This is more important than you might think. British marathon legend Paula Radcliffe famously had "bathroom issues" at the 2005 London Marathon and had to just go on the side of the course, going on to win anyway. ↩︎
I actually got into the Sonoma lottery this year and plan to toe the line. ↩︎
This works like Hawaii in that the slots roll down. ↩︎
Specifically, they no longer require you to have applied in consecutive years. ↩︎

DNS Security, Part IV: Transport security for DNS (DoT, DoH, DoQ)

2022-01-05T00:00:00Z

This is Part IV of my series on DNS Security (parts I, II, III). In this part I cover transport security for DNS.

For years most of the DNS security effort went into DNSSEC, which provides authenticity for DNS data by signing the DNS records themselves. This left two big gaps. First, DNSSEC has seen fairly low levels of deployment, leaving the majority of DNS resolutions unprotected, and most of the resolutions which do benefit from DNSSEC only do so as far as the recursive resolver. Second, DNSSEC doesn't provide confidentiality, so DNS query data, which is naturally extremely sensitive, is wholly unprotected. In this post I go into the various technologies to address these gaps.

Disclaimer: I was (am) heavily involved in the design and rollout of Firefox's DNS over HTTPS (DoH) deployment. The opinions below are mine and not Mozilla's.

Overall Situation #

Recall the DNS resolution process from Part I, shown below:

It's easiest to think of this as just consisting of four independent sets of transactions:

Client to recursive
Recursive to root
Recursive to b2.org.afilias-nst.org
Recursive to b.iana-servers.net

Each of these transactions is a request/response exchange, typically done over UDP, but sometimes over TCP.

If you want to protect this system, a natural thing to do is just to encrypt each transaction, resulting in a set of encrypted links to and from the recursive resolver. This isn't a complete solution because the recursive resolver learns what queries you are performing and unless you also do DNSSEC validation at the client, the recursive resolver can simply lie to you when it sends you its results. However, it's also a significant improvement in security and privacy because it protects the user from attacks outside the recursive resolver. Moreover, we already have plenty of experience with protecting this kind of data (just run it over TLS, or in the case of UDP, perhaps DTLS) and so it's—at least in theory—technically straightforward. In practice, however, it turns out not to be so, though for reasons that aren't really about the protocol itself.

In this post, we'll focus on the (by comparison) easier problem of protecting the client-to-recursive transaction, colored blue in the diagram above. While this is a fast evolving area, there are a number of large-scale deployments of encryption of this link. The problem of recursive-to-authoritative is essentially unsolved and is the topic of a separate post. For now, you can just assume that link is in the clear.

Server Authentication #

The basic problem here is authentication. Forming an encrypted connection is relatively easy—especially if you have a pre-made protocol like TLS^[1] to start with—but if you want security against an on-path attacker then you need to authenticate the server; otherwise the attacker can just impersonate the server and capture your queries. If they forward the queries to the server themselves and the responses back (in a so-called "man-in-the-middle attack") then this will be invisible to the client. It's generally not necessary to authenticate the client to the server because the server's response doesn't depend on the client's identity.^[2] In order to prevent this kind of attack, the client must know (1) that the server supports encrypted transport and (2) the expected identity of the server. We discuss these both below.

Note: there are three major protocols being used for secure DNS transport: DNS over TLS (DoT), DNS over HTTPS (DoH), and DNS over QUIC (DoQ). While there are important technical differences, they are irrelevant for most of the discussion below and it's conventional to refer to them collectively as DoX and to refer to old unencrypted DNS as Do53.^[3]

Securing the Stub-to-Recursive Link #

As described in Part I, endpoints typically learn about the resolver via the network, which provides them with an IP address for the resolver. This is a perfectly good identity and it's possible to securely connect to that IP address as the WebPKI supports IP addresses in certificates, but that doesn't actually help very much, for two reasons.

First, there's no way to know that the server actually supports encrypted transport. You can configure the client to just try encrypted transport and fall back to unencrypted transport if that fails, but that means that any on-path attacker can just simulate failure (e.g., by sending a TCP reset (RST)) and force you back to unencrypted transport. Second, if the attacker is on your local network, they can often interfere with that discovery process and substitute their own resolver, in which case you form an encrypted connection to the attacker, which isn't very useful.^[4]

When the IETF originally standardized secure transports for DNS—and specifically for stub to recursive—they defined the protocols themselves but mostly punted on this problem. Here's what RFC 7858, defining DNS over TLS (DoT) has to say:

This protocol provides flexibility to accommodate several different use cases. This document defines two usage profiles: (1) opportunistic privacy and (2) out-of-band key-pinned authentication that can be used to obtain stronger privacy guarantees if the client has a trusted relationship with a DNS server supporting TLS. Additional methods of authentication will be defined in a forthcoming document [TLS-DTLS-PROFILES].

This is IETF language for "we don't have a good solution to this problem, so we're going to give you some not very good options". However, when people went to actually do large-scale deployments, they had to actually do something. So far, we are seeing two main models evolve.

Same Provider Auto-Upgrade (SPAU) #

The first model, used by Chrome and Windows, is what's called Same Provider Auto-Upgrade (SPAU). The basic idea is that the client (either the browser or the OS) has a list of which recursive resolvers support secure transport. If the IP address of the configured resolver is on that list, then the client attempts to use secure transport;^[5] otherwise it just uses regular insecure DNS.
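In code, the SPAU decision is little more than a table lookup. This sketch uses a made-up upgrade table with a documentation-range address and an `example.net` endpoint; real clients ship their own lists, and nothing here reflects Chrome's or Windows's actual data or APIs.

```python
# Hypothetical upgrade table: configured Do53 resolver IP -> the same
# provider's encrypted endpoint. Entries are placeholders for illustration.
UPGRADE_TABLE = {
    "192.0.2.53": "https://doh.example.net/dns-query",
}


def choose_transport(configured_resolver_ip):
    """Pick the transport for the resolver the network handed us."""
    endpoint = UPGRADE_TABLE.get(configured_resolver_ip)
    if endpoint is not None:
        # Same provider, so filtering and other resolver policy are unchanged.
        return ("DoH", endpoint)
    # Unknown resolver: keep using plain DNS, exactly as before.
    return ("Do53", configured_resolver_ip)


print(choose_transport("192.0.2.53"))    # ('DoH', 'https://doh.example.net/dns-query')
print(choose_transport("198.51.100.7"))  # ('Do53', '198.51.100.7')
```

Note that the lookup is keyed on whatever resolver address the network supplied, which is the root of the security limitation discussed next.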

This design has two nice properties. First, it lets you quickly upgrade a lot of people because there is a fair amount of concentration in the resolver ecosystem. For instance about 15% of people use Google Public DNS, though not all of them will actually get upgraded, for reasons we'll see below. Second, it doesn't interfere with people's existing configurations: for instance if they use an enterprise resolver that does filtering or split horizon then they'll just continue using it without change. As we'll see, the converse property is a challenge with other models such as Trusted Recursive Resolver (TRR).

The main disadvantage of this design is that the level of security it offers is quite limited because (as usual) the client learns about the resolver from the local network. If that local network is malicious—or there is an attacker on it—then they can just redirect you to their own resolver and this design provides no security at all. Where it does provide security is when your local network is secure (e.g., a home network) but the uplink to the recursive resolver may be insecure. But if you don't trust the local network (e.g., you're in an airport or a coffee shop) then SPAU doesn't provide much additional security or privacy.

There are also several practical deployment problems. First, even if the real recursive resolver you are using supports secure transport, it's quite common for people's local networks to have some sort of DNS resolver endpoint in the WiFi gateway or customer access router (the technical term here is customer premises equipment (CPE)), in which case even if the upstream resolver supports secure transport, you won't get it until the CPE upgrades (which does not happen often). I've seen estimates that in some countries over 80% of people have this kind of configuration. Second, this design requires the software vendor to keep a list of recursive resolvers that support secure transport, which doesn't scale well. This mode is on by default in Chrome.

Trusted Recursive Resolver (TRR) #

Firefox uses a different model, called a Trusted Recursive Resolver (TRR). The idea here is that instead of accepting the resolver provided by the network, Firefox has a list of resolvers which have agreed to comply with strong privacy and transparency requirements. These include very short data retention periods and strict limits on how the data can be used. When possible, Firefox will automatically select one of those resolvers and securely connect to it.

This design has two main advantages when compared to SPAU. First, it works even if the local resolver is insecure or untrustworthy (e.g., in a coffee shop) because the browser picks a "known good" resolver. Second, it provides encryption even if the local resolver doesn't. However, because a TRR model often bypasses the local resolver, this creates a number of challenges, as detailed below.

Information Leakage #

There is an inherent privacy tradeoff in changing from the network's resolver to a separate resolver because the network already has a fair bit of information about your activity from observing the rest of your traffic. Specifically, the network already gets to see the IP addresses you are connecting to, which often only reflect a single site (e.g., Facebook). Even in cases where there are a lot of sites on the same IP address pool (as with some CDNs), the TLS handshake can reveal the expected server through the Server Name Indication (SNI) field^[6]. Finally, it's possible to learn about which Web site people are going to via traffic analysis of the connection.

Adding a third party resolver creates a second entity besides the network which knows about your browsing history, which creates some additional risk, even if that entity has good policies. On the other hand, these alternate mechanisms of learning about browsing history are less efficient than just collecting DNS query logs, and there is active work on closing most of these holes, so this reduces your exposure to the network at the cost of increasing your exposure to the TRR. However, unlike your local network, the TRRs are required to have strong privacy policies; by contrast, it is known that many local networks do not. Nevertheless, this isn't an ideal situation and one that is potentially addressable via proxying as discussed below.

Local Policy #

DNS is often used to apply various kinds of local—or national—policies, for instance filtering adult content, logging user behavior (e.g., for law enforcement), or providing special "internal" domain names which aren't publicly resolvable. For obvious reasons, if the client selects a different resolver from that offered by the network, that resolver may adopt different policies.

The difficult problem here is that it's hard to distinguish between situations where the user wants some sort of special policy treatment (e.g., blocking potentially malicious sites) and ones where the user doesn't but the network operator does (e.g., filtering out adult content). From a technical perspective, these both look like interference/attack by the network. Part of the value of securing DNS lookups is to protect against network attacks, and so a naive TRR deployment simply bypasses these policies, even if they were what the user wanted. Firefox in particular has some mechanisms to minimize this kind of impact, as discussed below.

Server Topology #

Most big server operators and CDNs have multiple points of presence at different places in the network. These all have the same name but different IP addresses. Because an ISP resolver knows the actual location of the client in the network topology, if it also knows something about the server's network, it can provide a server that is topologically closer to the client, theoretically providing better performance or making more efficient use of the ISP's network. However, if the client uses a centralized recursive resolver—or even one which doesn't know the ISP's topology—then this kind of optimization may not be possible.

This issue was a big concern when Firefox originally deployed the TRR model, but measurements suggest that in fact there is no real negative impact on performance from using a trusted recursive resolver. It may still be possible that there is an impact on network efficiency; but this is more of an issue for the ISP than for users.

The way that Firefox currently addresses this is to allow local networks to "steer" queries to specific TRRs. The idea here is that the local network might operate a TRR or have an arrangement with one which they share topology information with and so would prefer that clients use that. Currently, Comcast operates such a TRR and Firefox uses a DNS-based technique to determine whether such a resolver is available/preferred. Note that this doesn't allow the network to pick any resolver, just to select between TRRs. I discuss a more generalized solution below.

National Boundaries #

As Mozilla was first looking at launching Firefox with its TRR program, feedback from users indicated that many wanted to have a TRR that was in their jurisdiction (or, for many in Europe, a resolver in the EU). Another issue is that policymakers in some countries were concerned that resolvers would not comply with local regulations. Because of these concerns, Firefox has been somewhat cautious with its encrypted DNS rollout, and currently only has it on by default in North America, using Cloudflare in the US and CIRA in Canada. As of this writing, work is underway on expanding the program, though no specific plans have been announced.

Firefox Heuristics #

For the reasons discussed above, if Firefox just enabled DoX for everyone, this would cause problems for people's deployments. In order to address this, Firefox uses a set of heuristics designed to address three important cases.

Enterprise-managed devices. In many cases, an enterprise will manage a user device and install their own DNS server or make other configuration changes. If Firefox detects this, it assumes that the enterprise won't want to use a TRR and disables DoH (though the enterprise can explicitly turn it on).
Parental controls. Some ISPs offer "parental controls" services which use the DNS to filter out adult content (with the consent of the parents if not the children). Firefox tries to detect this by checking to see if certain "canary" domains (domains which don't actually correspond to adult content but are used to test filtering) are blocked and if so, disables DoH.
Local domains/Blocking. Some networks will serve domains that only resolve inside their own corporate network. If Firefox uses a TRR, then these domains fail. Firefox addresses this by falling back to Do53 if a domain is not found or if DoH just generally fails.
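Taken together, the three heuristics amount to a small decision procedure. The sketch below is a simplification under my own naming: `NetworkSignals`, `doh_lookup`, and `do53_lookup` are hypothetical stand-ins, and the real Firefox logic has more inputs than this.

```python
from dataclasses import dataclass


@dataclass
class NetworkSignals:
    enterprise_managed: bool  # evidence that the device is centrally managed
    canary_blocked: bool      # the network filters the canary domain(s)


def doh_enabled(signals):
    """Heuristics 1 and 2: turn DoH off for managed devices and filtered networks."""
    return not (signals.enterprise_managed or signals.canary_blocked)


def resolve(name, signals, doh_lookup, do53_lookup):
    """Heuristic 3: fall back to Do53 when DoH finds nothing or fails outright.

    doh_lookup and do53_lookup are caller-supplied functions that return an
    address, or None if the name does not resolve.
    """
    if doh_enabled(signals):
        try:
            answer = doh_lookup(name)
            if answer is not None:
                return answer
        except OSError:
            pass  # DoH unreachable: treat like "not found" and fall back
    return do53_lookup(name)


# An internal-only name is missing from the public resolver but resolves locally.
public_view = {"www.example.com": "192.0.2.10"}
local_view = {"www.example.com": "192.0.2.10", "intranet.example": "10.0.0.5"}
signals = NetworkSignals(enterprise_managed=False, canary_blocked=False)
print(resolve("intranet.example", signals, public_view.get, local_view.get))  # 10.0.0.5
```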

These heuristics are imperfect in two ways. First, they do not detect some cases where the user or device administrator might want DoH disabled. One important case is enterprise-owned devices where the operator doesn't remotely manage them. Unfortunately, there is no good way to detect this because any signal that is sent by the network could have been sent by an attacker. This is why Firefox requires evidence that the device is being managed before disabling DoH.

Second, they sometimes disable DoH when they shouldn't. In particular, networks can block the canary—or just block DoH generally—and cause Firefox to use Do53. This allows the network to disable encryption, which is obviously contrary to the goal of protecting the user from the network. For the moment, Mozilla has been treating this as a necessary compromise, but is monitoring the rate at which it happens and in future may make it more obvious to the user when DoH has been disabled and allow them to require secure resolution.

Local Network Discovery #

One important feature of the DoX deployments by Firefox, Chrome, and Windows is that they were something that clients could do on their own without any cooperation from the network. The reason for this is simply that it was the only way to get significant incremental deployment of a solution that addressed a real threat to user privacy. However, a number of network operators—and some governments—objected that they were losing their ability to control their networks. The result was months of extraordinarily contentious debate, both in the IETF and in the press.

At the same time, it was clear that neither the existing SPAU nor TRR approaches were ideal, even from the perspective of the browser/OS vendors:

SPAU-style approaches required a centralized list of secure transport-compatible resolvers and had no way of detecting that the local network actually had such a resolver.
TRR-style approaches just bypassed the network resolver even in cases where it might be usable (e.g., in cases where that resolver was a TRR).

After months of loud discussion, the IETF decided to charter the Adaptive DNS Discovery (ADD) Working Group to work on mechanisms to allow the client to discover resolvers and their properties without saying anything about what they would do when they found them.^[7] In principle, such a solution could be used to feed into either an SPAU solution (by saying that the local network supports an encrypted resolver) or a TRR solution (by saying that it preferred one or more TRRs), without requiring vendors to change their basic policies, even if network operators wish they would.

There's nothing particularly surprising about the approaches that the ADD WG has come up with. Roughly speaking, they allow the network to indicate (either via DHCP or via a DNS query) that an encrypted resolver is available. When the indication is over DNS, the encrypted resolver has to have a WebPKI certificate for the IP that the client would ordinarily use for Do53 resolution, although it can actually operate on a different IP address.^[8] This is a very important requirement because it prevents an attacker from advertising a totally unaffiliated encrypted resolver that just steals your queries. Unfortunately, it is also extremely limiting: it's very common for home network routers, WiFi APs, etc. to have a DNS proxy which takes DNS queries and forwards them to the ISP resolver. This proxy will usually have an unroutable IP address^[9] which it's not possible to get a certificate for, in which case the existing ADD solutions won't work for SPAU-type designs (they work fine for TRR-style designs). There is active work on trying to address this use case, but no consensus on an approach or even that one is feasible. With the DHCP-based system, you can use a standard domain name (because DHCP is where you learn about the resolver in the first place), but this still won't work well if the actual resolver is just some local router because it probably won't have a globally resolvable name. [Updated 2022-01-17. Thanks to Neil Cook for pointing out that the original text just covered the DNS version.]
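
To make the "unroutable address" problem concrete, here is a minimal sketch using Python's standard `ipaddress` module. The function name is my own, and the check is only a necessary condition: a globally routable address is not guaranteed to be able to get a certificate, but a shared local one (RFC 1918 and friends) cannot.

```python
import ipaddress


def could_have_public_certificate(resolver_ip: str) -> bool:
    """Rough check: addresses that are not globally unique can't be certified.

    Many networks reuse the same local addresses, so no CA can say
    which device "owns" 192.168.1.1.
    """
    return ipaddress.ip_address(resolver_ip).is_global


for ip in ["192.168.1.1", "10.0.0.1", "8.8.8.8"]:
    print(ip, could_have_public_certificate(ip))
```

A typical home router's DNS proxy lives at something like `192.168.1.1`, which fails this check, so the DNS-based discovery mechanism has nothing it can authenticate against.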

Transport Protocols #

We've gotten quite far without talking about the details of the various protocols, but now it's time. There are three major secure transport protocols which have been or are being standardized^[10] for DNS:

DNS over TLS (DoT). This is what you would expect, namely you open a TLS channel to the server and send DNS queries over it. There is also a DNS over DTLS, but that has gotten almost no usage and will probably be deprecated.
DNS over HTTPS (DoH). This maps DNS queries and responses onto HTTP requests and responses and runs them over HTTP over TLS.
DNS over QUIC (DoQ). This sets up a connection over the QUIC secure transport protocol and sends DNS queries over it. Note that you can also run HTTP over QUIC (HTTP/3), so it's possible to do DoH over QUIC (DoHQ?) but this is something clients can do automatically without any new standards work, because from the perspective of standards, it's just HTTP.
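
To see how similar these really are, here is a small sketch that builds one DNS query and then wraps it two ways: with the two-byte length prefix used on stream transports such as DoT, and as a DoH GET request in the style of RFC 8484. The function names and the `https://dns.example/dns-query` endpoint are placeholders of mine, and the query encoder handles only the simplest case (one question, ASCII labels).

```python
import base64
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0) -> bytes:
    """Encode a minimal DNS query: one question, recursion desired."""
    # Header: ID, flags (0x0100 = RD), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    # Root label, then QTYPE (1 = A) and QCLASS (1 = IN)
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def frame_for_stream(message: bytes) -> bytes:
    """DoT, like DNS over TCP, prefixes each message with a 2-byte length."""
    return struct.pack("!H", len(message)) + message


def doh_get_url(endpoint: str, message: bytes) -> str:
    """DoH GET: the same bytes, base64url-encoded without padding."""
    encoded = base64.urlsafe_b64encode(message).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


query = build_query("www.example.com")
print(frame_for_stream(query).hex())
print(doh_get_url("https://dns.example/dns-query", query))
```

The DNS message is byte-for-byte identical in both cases; only the wrapper changes. For `www.example.com` the `dns` parameter comes out as `AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB`, which matches the example in RFC 8484.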

Conceptually these are all very similar and indeed, it's not really clear why one needs both DoT and DoH (DoQ has better performance properties, as would DoHQ). DoT was designed before DoH—though unfinished when work on DoH started—but DoH has become more popular, largely because browsers such as Chrome and Firefox chose to deploy DoH rather than DoT (a decision made at least in part because browser vendors are comfortable with HTTP). On the other hand, DoT was designed primarily by the DNS community and is more popular there.

There has been a lot of criticism of DoH from operators who are concerned about the use of DNS transport for bypassing their network-based controls (Paul Vixie has been particularly vocal on this topic). The primary relevant technical difference from the perspective of a network operator is that DoT contains two pieces of protocol metadata that make it easier to distinguish from other kinds of TLS traffic: it typically runs over port 853 (rather than 443 as for HTTP over TLS) and has an Application Layer Protocol Negotiation (ALPN) identifier of "dot" rather than "h2". By contrast, DoH traffic just looks like HTTP traffic. The result is that it's somewhat easier to have your network block DoT traffic. However, it's not clear how long this will be true if there is a lot of blocking. The DoH servers currently commonly used by clients are also identifiable by IP and SNI so they're relatively easy to block, and if server operators want to conceal DoT, they can run it on port 443 and use ECH to conceal the ALPN. Fundamentally, these are policy not technical questions.
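
From the operator's side, the distinction boils down to something like the following toy classifier. This is purely illustrative: the function name is mine, and real middleboxes look at far more than a port and a single ALPN value.

```python
from typing import Optional


def looks_like_dot(dst_port: int, alpn: Optional[str]) -> bool:
    """Flag a TLS connection as probable DoT using visible metadata only."""
    return dst_port == 853 or alpn == "dot"


# (description, destination port, ALPN visible to the network, None if hidden)
flows = [
    ("DoT on its usual port", 853, "dot"),
    ("DoH, which looks like any other HTTPS", 443, "h2"),
    ("DoT moved to 443 with the ALPN hidden by ECH", 443, None),
]
for description, port, alpn in flows:
    print(f"{description}: flagged={looks_like_dot(port, alpn)}")
```

Only the first flow is flagged, which is the point: the metadata that makes DoT easy to block today is a deployment convention, not something the protocol forces on server operators.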

Security and Privacy Properties #

Whatever the transport protocol, at the end of the day what DoX is designed to give you is a secure channel to the resolver so you know that:

Nobody but the resolver is seeing your query to the resolver.
You are getting the result that the resolver is sending you.

How valuable this is depends in part on how much you trust the resolver: a secure channel to the resolver in your local coffee shop doesn't do you much good because you have no reason to trust that that resolver isn't lying or publishing your queries (this is a lot of the rationale for Mozilla's TRR design).

Even if you are connected to a resolver you trust, the level of security and privacy you get is limited by that resolver, especially if its queries aren't encrypted, which seems quite likely (again, see a future post). First, if that resolver isn't validating DNSSEC (or you are trying to resolve one of the majority of domains which aren't DNSSEC-signed) then a network attacker might forge responses to that resolver, which will happily pass them on. Second, an attacker who is able to observe queries by the recursive resolver may be able to infer which of them are yours by looking at timing. This form of attack is somewhat limited by the fact that recursive resolvers cache responses and so won't necessarily issue new queries to authoritative resolvers for every query, but they will probably issue some of them. It's also possible to do traffic analysis on the encrypted query stream from your machine to the recursive resolver itself based on packet size and timing.

Oblivious DoH #

Even if you are connected to a known and trusted resolver, it's still not ideal that that resolver gets to see all of your queries as well as your IP address. One way to address this is to proxy your encrypted DNS queries through a proxy which conceals your IP address from the DNS server. That way, your queries and IP address are never in the same place. Apple is already doing this with Oblivious DoH and the IETF is standardizing a system called Oblivious HTTP which can be used to proxy DoH traffic (there is no equivalent for DoT).

DoX and DNSSEC #

If your problem statement is "how do we secure the DNS", then you might think of DoX and DNSSEC as competitors, and to some extent this is true: resources being spent on DoH (and in this case it is DoH and not DoT) in endpoints are not being spent on endpoint DNSSEC. Moreover, because local networks are a powerful point of attack, a secure channel to a trusted resolver reduces the need for DNSSEC validation. In addition, to some extent DoX reduces the need for endpoint DNSSEC verification because it allows endpoints to take advantage of DNSSEC verification in the recursive resolver (assuming they trust it).

However, from another perspective, DNSSEC and DoX are complementary: DoX does something that DNSSEC does not, which is to provide confidentiality. Even if every client did DNSSEC validation, DoX would still serve an important privacy purpose; I certainly don't see clients implementing DNSSEC validation and then deciding to turn off DoX, especially given that it provides important security for the vast majority of domains which are not currently DNSSEC-signed. On the other hand, DNSSEC does something DoX does not, which is to provide end-to-end integrity.

Moreover, DoX is actually an enabling technology for DNSSEC: one of the big concerns about DNSSEC deployment is that network intermediaries will not convey DNSSEC records directly, thus creating false positive failures when DNSSEC validation fails. However, any resolver which speaks DoX is quite likely to also handle DNSSEC correctly (this can be guaranteed in a TRR system) and thus DoX has the potential to make the risk of deploying endpoint DNSSEC lower and thus perhaps modestly increase the chance of it happening.

Next Up: Recursive to Authoritative #

So far I've really focused on the endpoint perspective, but of course DNS resolution actually involves much more than the stub to recursive link. In the next post I'll address the difficult problems of encrypting the link between the recursive and authoritative servers.

Appendix: How DDR Works #

The IETF has proposed two main protocols for discovery of encrypted resolvers: Discovery of Designated Resolvers (DDR), which is DNS-based, and DHCP and Router Advertisement Options for the Discovery of Network-designated Resolvers (DNR), which uses the same mechanisms that clients use to autoconfigure themselves for a given network. From my perspective, DDR is the more interesting one because it (sometimes) works without changing customer premises equipment, a process which takes a long time.

The basic setting here is one in which the ISP has both a traditional Do53 resolver and an encrypted resolver (of any flavor, whether DoH, DoT, etc.). However, they don't control the customer premises equipment, which means that they can't change the DHCP or IPv6 RA-type configuration provided by that equipment. The way around this is that the client asks the resolver whether it has an encrypted version. The basic flow looks like this:

When the client joins the network, it is provided with the IP address of the Do53 server in a DHCP option (this assumes DHCP). This is just the normal situation without DoX. Next, the client makes a request to the Do53 server for a special domain (resolver.arpa). The Do53 server responds with the address of the DoX resolver and the client can then connect to it. There are two important points to note here.

First, the identity that the client expects the DoX server to present is the IP address that it was configured with via DHCP. Recall that the threat model here is that the attacker is able to interfere with your connection to the Do53 server—otherwise you wouldn't need encryption—and so you can't trust the new IP address you get from it. This way at worst you end up encrypting to someone who controls the IP address you were going to send your Do53 traffic to anyway. Second, this explains why DDR doesn't work if the CPE has a DNS proxy: in that case you will get the IP address of that proxy and therefore the ISP's DoX server won't have a valid certificate to use to authenticate as that server.
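
The identity check at the heart of this can be sketched in a few lines. This is a simplification under stated assumptions: the function name is mine, the certificate is represented only by the list of IP addresses in its subject alternative names, and I assume ordinary WebPKI chain validation has already succeeded.

```python
import ipaddress


def accept_designated_resolver(do53_ip: str, cert_ip_sans: list[str]) -> bool:
    """Accept the DoX server only if its certificate covers the Do53 address.

    do53_ip is the resolver address the client was configured with;
    cert_ip_sans are the IP addresses named in the DoX server's certificate.
    """
    original = ipaddress.ip_address(do53_ip)
    return any(ipaddress.ip_address(san) == original for san in cert_ip_sans)


# ISP case: the DoX server lives elsewhere but is certified for the Do53 IP.
print(accept_designated_resolver("203.0.113.53", ["203.0.113.53", "203.0.113.99"]))
# Attacker case: a forged answer points at a server certified for another IP.
print(accept_designated_resolver("203.0.113.53", ["198.51.100.7"]))
# CPE proxy case: the client only knows the router's local address.
print(accept_designated_resolver("192.168.1.1", ["203.0.113.53"]))
```

Only the first call succeeds. The third is the DNS proxy failure described above: the ISP's certificate is perfectly valid, just not for the address the client was told to use.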

As should be clear from the above, DDR is mostly useful for SPAU models, but you can also use it for steering in a TRR system.

Though actually designing such a protocol is not easy. A topic for another day. ↩︎
One exception here is outsourced cloud-based "enterprise" DNS offerings like OpenDNS (now called Umbrella), which may want to authenticate that users are actually employees before providing answers. ↩︎
Because it runs on UDP and TCP port 53. ↩︎
There are situations in which someone manually configures the resolver address for instance to bypass the network resolver, but they are comparatively infrequent. ↩︎
I'm not sure if the clients hard fail if they can't successfully connect, but in principle you could. ↩︎
Though the TLS working group is hard at work on fixing this. ↩︎
This is a pretty typical IETF "mechanism not policy" type of compromise. ↩︎
I know this feels counterintuitive, but it's actually the way that HTTPS works now. If I go to www.example.com and there is a CNAME to www.cdn.example, the client checks the certificate for www.example.com. The reasoning here is that the original identity is what the client wanted and the redirect is just some behavior by an untrusted network. ↩︎
These are drawn out of blocks designed for local use, such as those defined by RFC 1918. The key point is that these addresses will be shared and therefore cannot get certificates. ↩︎
There are also two non-standard protocols in use, DNSCrypt and DNSCurve, but for various reasons, the IETF opted to start with its existing secure transports. ↩︎

Day	Morning Range	Miles Driven	Evening Range
1	200	80	120
2	180	80	100
3	160	80	80
4	140	40	100
5	160	40	120
6	180	40	140
7	200	40	160
