Understanding The Web Security Model (Outtake): Cookies and Behavioral Advertising
Posted by ekr on 13 Mar 2022
Ad Networks #
Most advertising on the Web is done by ad networks. It's of course technically possible to just sell ads on your own site, but for obvious reasons this doesn't really work unless you're a big prestige site like Google, Facebook, or the New York Times. Instead, the typical thing to do is for the publisher to work with some third party ad provider who places ads on a lot of different sites.
The technical details of the system are unbelievably complicated. It's traditional at this point to show the baffling diagram below, called the "LUMAscape", which maps out the various entities in the ad ecosystem. However, at the level we need to be concerned with, matters are fairly simple.
In order to show advertising from a given ad network, the publisher embeds an element on their site with content of the element being loaded off of the ad network's server. When the user visits the publisher's site the browser automatically loads the content from the ad network, which invisibly decides what ad to show. Recall that there's no rule that the content at a given URL has to remain constant, so the server can dynamically select the specific ad based on any information it has.
Determining Context #
The question then becomes what ad the network should show. You could obviously show the same ad everywhere, but that's not going to do a very good job of showing interesting ads. The next most interesting thing is to show what's called a "contextual" ad, which is to say an ad that is relevant to the content of the page on which it is being shown. For instance, if you were on Runner's World you might get an ad for running shoes.
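As a rough illustration of contextual selection, here is a minimal sketch in Python. It assumes the ad network learns the embedding page from the `Referer` header (or an equivalent parameter supplied by the publisher). The hostnames, ad file names, and the `pick_contextual_ad` function are all made up for the example.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from publisher hostname to a topical ad.
CONTEXTUAL_ADS = {
    "runnersworld.example": "ad-running-shoes.png",
    "sneakers.example": "ad-sneakers.png",
}
DEFAULT_AD = "ad-generic.png"


def pick_contextual_ad(referer: Optional[str]) -> str:
    """Choose an ad using only the page the ad is embedded in."""
    if not referer:
        # No context available, so fall back to a generic ad.
        return DEFAULT_AD
    host = urlparse(referer).hostname or ""
    return CONTEXTUAL_ADS.get(host, DEFAULT_AD)
```

Note that nothing here depends on who the user is: two different people on the same page get the same ad.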
However, a lot (most?) of Web advertising isn't contextual but rather "behavioral". What this means is that it's based not just on the page the user is currently on but also on their previous behavior. That behavior is measured using cookies.
Behavioral Tracking with Cookies #
If the advertising network has contracts with multiple publishers this allows them to observe the user's behavior across those publishers. The first time that the user goes to a page served by a given ad network, that ad network sets a cookie. From then on, they get to see every site that the user goes to and can link them all up using the cookie. Based on that information, they can build up a profile of the user's behavior and use that to decide which ads to show (recall that the server can serve any image it wants, regardless of the URL). The diagram below shows an example of this process.
The user first visits sneakers.example, which embeds an image from the advertiser's site. The advertiser only knows that the user is on sneakers.example but nothing about the user, so it serves a contextual ad for sneakers. However, when it returns the ad it sends a cookie. Later, the user visits recycling.example, which also embeds an image from the same advertiser. This time, when the browser contacts the advertiser, it sends the cookie, so the advertiser knows that (1) the user was on sneakers.example before and (2) they are on recycling.example now, so it shows the user an ad suitable for both interests.
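The walkthrough above can be simulated in a few lines of Python. This is only a sketch under simplifying assumptions: the `AdNetwork` class, its in-memory `profiles` table, and the "ad for: ..." strings are invented for illustration, and the cookie is passed around as a plain argument rather than through real HTTP headers.

```python
import uuid
from typing import Dict, List, Optional, Tuple


class AdNetwork:
    def __init__(self) -> None:
        # cookie value -> publisher sites where that browser has been seen
        self.profiles: Dict[str, List[str]] = {}

    def handle_request(
        self, embedding_site: str, cookie: Optional[str]
    ) -> Tuple[str, str]:
        """Return (ad, cookie); a new cookie is minted on first contact."""
        if cookie is None or cookie not in self.profiles:
            cookie = uuid.uuid4().hex
            self.profiles[cookie] = []
        history = self.profiles[cookie]
        if embedding_site not in history:
            history.append(embedding_site)
        # Stand-in for real ad selection: name every site seen so far.
        interests = " + ".join(site.split(".")[0] for site in history)
        return f"ad for: {interests}", cookie


network = AdNetwork()
ad1, cookie = network.handle_request("sneakers.example", None)
ad2, cookie = network.handle_request("recycling.example", cookie)
print(ad1)  # ad for: sneakers
print(ad2)  # ad for: sneakers + recycling
```

The key point is that the second request is made from a different publisher's page but carries the same cookie, which is what lets the network join the two visits into one profile.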
You can also use this same basic technique for what's called retargeting. Suppose you go to a site and look at some product. If the ad network has a presence on the site (this can be an invisible element) then they can record this event and use it to target ads specifically at people interested in that product.
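A sketch of retargeting, again in Python and again with invented names: `record_view` stands in for whatever the ad network does when its invisible element on a product page is loaded, and `pick_retargeted_ad` for the later ad decision on some other site. Choosing the alphabetically first product is an arbitrary placeholder for real selection logic.

```python
from collections import defaultdict
from typing import DefaultDict, Optional, Set

# cookie value -> products that browser has looked at (hypothetical store)
viewed: DefaultDict[str, Set[str]] = defaultdict(set)


def record_view(cookie: str, product: str) -> None:
    """Called when the invisible element on a product page is loaded."""
    viewed[cookie].add(product)


def pick_retargeted_ad(cookie: str) -> Optional[str]:
    """Return an ad for a previously viewed product, or None if there is none."""
    products = viewed.get(cookie)
    if not products:
        return None
    # Arbitrary but deterministic choice among the viewed products.
    return f"ad for {sorted(products)[0]}"
```

For example, after `record_view("abc", "trail-shoes")`, a later call to `pick_retargeted_ad("abc")` on any site in the network returns an ad for that product, while a browser with an unknown cookie gets `None` and falls back to some other kind of ad.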
The Bigger Picture #
At the time cookies were first introduced, people did understand that there were privacy implications. However, a lot of the attention focused on first party tracking (i.e., of your behavior on a single site). The original cookie RFC has a fairly extensive discussion of privacy, but the section that most clearly addresses the third party context is kind of confusing and seems almost to be discussing what is now called cookie syncing:
A user agent should make every attempt to prevent the sharing of session information between hosts that are in different domains. Embedded or inlined objects may cause particularly severe privacy problems if they can be used to share cookies between disparate hosts. For example, a malicious server could embed cookie information for host a.com in a URI for a CGI on host b.com. User agent implementors are strongly encouraged to prevent this sort of exchange whenever possible.
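To make the RFC's example concrete, here is a hedged sketch of the exchange it describes, which is roughly the shape of cookie syncing: host A puts its own identifier for the browser into a URL on host B, and B links that identifier to its own cookie. The function names, the `partner_uid` parameter, and the `id_map` dictionary are assumptions made for illustration, not any real ad network's API.

```python
from typing import Dict
from urllib.parse import parse_qs, urlencode, urlparse


def build_sync_url(partner_endpoint: str, own_cookie_id: str) -> str:
    """On host A: embed A's cookie ID in a URL the browser will fetch from host B."""
    return f"{partner_endpoint}?{urlencode({'partner_uid': own_cookie_id})}"


def handle_sync(url: str, b_cookie_id: str, id_map: Dict[str, str]) -> None:
    """On host B: link the ID carried in the URL to B's own cookie for this browser."""
    a_id = parse_qs(urlparse(url).query)["partner_uid"][0]
    id_map[b_cookie_id] = a_id
```

Once B has stored the mapping, the two hosts can refer to the same browser even though neither can read the other's cookies directly.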
My sense is that people were sort of aware of the problem but just didn't anticipate the scale of tracking that would eventually result. It's also worth noting that early browsers would often prompt users before accepting cookies, thus making this kind of tracking more difficult. Eventually, of course, every site wanted to set a zillion cookies and the permission prompts got too annoying so they were removed, only to be replaced years later by the arguably even more annoying GDPR cookie consent dialogs.
This is a theme we'll be seeing throughout this series: a lot of the early Web features were designed to solve specific problems and without much understanding of the broader implications. It took years for the security and privacy community to catch up and develop a more comprehensive understanding of the security of the Web platform, and, as with advertising, we're still dealing with the implications of those original choices.
Technically, this third party is called a supply-side platform (SSP). There are also demand-side platforms (DSPs), which serve the advertisers, plus a bunch of other stuff.