Self-Driving Vehicles, Monoculture, and You

What if everyone else jumped off a bridge?

Posted by ekr on 10 Oct 2022

Warning: this post didn't come out quite as tight as I was hoping. I think there are a bunch of interesting ideas and connections to be drawn, but they don't hang together as well as I wanted. That said, I'm not quite sure how to improve things, and so I'm just going to post it as-is. The Internet has plenty of bits, after all.

Max Chafkin's article arguing that self-driving cars are failing is making the rounds, especially this amazing opening bit:

The first car woke Jennifer King at 2 a.m. with a loud, high‑pitched hum. “It sounded like a hovercraft,” she says, and that wasn’t the weird part. King lives on a dead-end street at the edge of the Presidio, a 1,500-acre park in San Francisco where through traffic isn’t a thing. Outside she saw a white Jaguar SUV backing out of her driveway. It had what looked like a giant fan on its roof—a laser sensor—and bore the logo of Google’s driverless car division, Waymo.

She was observing what looked like a glitch in the self-driving software: The car seemed to be using her property to execute a three-point turn. This would’ve been no biggie, she says, if it had happened once. But dozens of Google cars began doing the exact thing, many times, every single day.

King complained to Google that the cars were driving her nuts, but the K-turns kept coming. Sometimes a few of the SUVs would show up at the same time and form a little line, like an army of zombie driver’s-ed students. The whole thing went on for weeks until last October, when King called the local CBS affiliate and a news crew broadcast the scene. “It is kind of funny when you watch it,” the report began. “And the neighbors are certainly noticing.” Soon after, King’s driveway was hers again.

Waymo disputes that its tech failed and said in a statement that its vehicles had been “obeying the same road rules that any car is required to follow.”

Here's the thing, though: Waymo is right. It wouldn't be a big deal if just the occasional person did a K-turn in King's driveway (who among us hasn't turned around in someone's driveway?), but when everyone does it, then it's a disaster, at least for King. However, it's a little harder to pinpoint exactly what's wrong here.

There's an obvious account of this situation, which is that this is a case of AI risk, incentive alignment, and the famous paperclip optimizer. In this version of the story, Google's system for training their cars is only interested in saving time (or wear on the cars, or whatever) and doesn't take into account the externalities of the cars' behavior, so it's perfectly happy to keep people up all night with car noise if it saves a few seconds or minutes.

There certainly is some kind of alignment problem here, but I think this analysis doesn't quite capture it. As I said above, the problem isn't that any particular car does a K-turn in King's driveway, but that all of them do. Even if we ignore externalities, it's not clear that this is an optimal solution: according to the story there were cars lining up to make this turn, at which point you should be wondering if this really is the fastest way for them to accomplish their objective. This suggests another analysis, which is that this is a locally optimal approach which isn't globally optimal, even if we ignore externalities.

This shouldn't be an unfamiliar concept: there are lots of things which work at a small scale but not at a large scale. There are at least two possible failure modes that one can encounter:

This just isn't scalable at all
You need some diversity

Unsustainable Scaling #

Most people are used to systems that have unsustainable scaling. Sometimes this is due to externalities, such as with air pollution. Back when only a few people had cars, it didn't really matter that a typical internal combustion engine emitted way too much NO_x, but put enough cars on the road and you get acid rain, hence catalytic converters. The situation with CO₂ and climate change is similar: we can only dump so much into the atmosphere before whatever homeostasis there is starts to break down.

Other cases of unsustainable scaling aren't so much due to externalities as due to resource constraints. We saw that early in the COVID pandemic, where we had really effective COVID tests based on PCR but there were only a few labs that could do them. Those tests have become more standardized, but we also now have cheap lateral flow tests that scale. I understand that this is also a problem in educational interventions, which often seem to work in pilot projects with teachers who are committed to the idea but don't scale well when you need every teacher to do it.

The need for diversity #

Another possibility is that you actually do have something scalable, as long as not everyone tries to do exactly the same thing. It might be the case that there are hundreds of little hacks like this, and if only a few cars used each of them, it would be fine, so you just need diversity rather than uniformity. The common example of this is of course monoculture in crops: you actually can get very high yields this way, but you end up with a brittle system. However, there are also situations in which the whole system falls apart if you don't have some diversity.

This is a familiar concept in networking, where, like above, you often have some resource that needs to be shared between multiple agents and if they don't share nicely, everything collapses.

Avalanche Restart #

One well-known case is what's called "avalanche restart". Suppose that you have a server that is under heavy load (i.e., has a lot of clients) and then for some reason it reboots.

Of course, this is experienced by clients as a failure, and they try to reconnect. The obvious thing to do is to try to reconnect immediately and, if that fails, try again (i.e., in what's called a "tight loop"). This is locally optimal, because it lets you reconnect quickly, but globally bad: if everyone does this, what often happens is that the clients overload the server or the network that the server is on. That leads to bad service for everyone as the server tries to switch between every client, and might even cause it to reboot again (this shouldn't happen, but all software has bugs).

There are two standard techniques to address this problem:

Instead of having the clients retry immediately, have them wait a random time (e.g., between 1 and 10 seconds).
If the client fails to connect, then it increases (typically, doubles) the amount of time it waits before the next retry. This is called "exponential backoff".

Typically, these are used together, so you randomly start and then exponentially back off. The net effect is that you don't have every client trying at the same time, and the rate of clients attempting automatically adjusts until the server isn't overloaded.
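As a rough illustration of how the two techniques combine, here is a minimal Python sketch. The names `backoff_delays` and `reconnect`, the 1-to-10-second initial window, and the 300-second cap are all assumptions made for this example; they don't come from any particular client library.

```python
import random
import time


def backoff_delays(max_attempts, min_delay=1.0, initial_window=10.0,
                   factor=2.0, max_window=300.0, rng=None):
    """Yield the wait (in seconds) before each reconnection attempt.

    Each wait is drawn uniformly from [min_delay, window] (the random
    start), and the window doubles after every attempt (the exponential
    backoff), up to max_window. All defaults are illustrative.
    """
    rng = rng or random.Random()
    window = initial_window
    for _ in range(max_attempts):
        yield rng.uniform(min_delay, window)
        window = min(window * factor, max_window)


def reconnect(connect, max_attempts=8, sleep=time.sleep, rng=None):
    """Call connect() until it succeeds, waiting a randomized,
    growing delay before every attempt (including the first)."""
    for delay in backoff_delays(max_attempts, rng=rng):
        sleep(delay)
        try:
            return connect()
        except ConnectionError:
            continue  # server still unavailable; back off and retry
    raise ConnectionError(f"gave up after {max_attempts} attempts")
```

Here `connect` is a hypothetical callable that raises `ConnectionError` on failure. Passing in `sleep` and `rng` is just a convenience so the behavior can be exercised without actually waiting.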

Obviously, this isn't locally optimal: if the server has very few clients it would be better if the clients just reconnected immediately. Moreover, if everyone else is following the random start + exponential backoff approach, then it's obviously advantageous for a single client to just try to reconnect aggressively (to "defect" in the game theory jargon). But if everyone defects, then the result is that the server is over capacity and most people get terrible service. The point here is that it's better for everyone to do something slightly suboptimal but different than it is for everyone to do the same thing, even if it's locally optimal.

NiCad Battery Memory #

I had originally been intending to write about the famous Nickel Cadmium battery "memory" phenomenon. The way the story is usually told is that there was a satellite that was powered by solar panels and used NiCad batteries to store energy during periods when the panels weren't illuminated (due to the Earth being in the way of the sun). Because the orbit is very regular (and there's no weather in space), the battery was charged and discharged on a repeating regular schedule. Eventually, it started exhibiting decreased storage at the point where it would usually start being charged. However, attempts to reproduce this phenomenon seem to have had mixed results.

Network Transmission Rate Control #

A similar situation occurs with network rate control. A good example is the classic Ethernet local area network. In original Ethernet, every computer was on the same wire and so whatever you send is received by every other computer and vice versa, just like a radio network. But two computers can't transmit at the same time because they will step on each other. The question then becomes how to divide up the time.

One way to address this problem is to have defined time slices during which each node can transmit, but this requires tightly coordinated clocks and doesn't adapt well if one node wants to transmit a lot and the others want to listen. Instead, Ethernet solves this problem by having each node transmit as soon as it has something to send and no other node is transmitting, but it also detects if another node also chooses the same time to start (a "collision"). If there is a collision, each node picks a random amount of time to wait before it tries to start transmitting again.^[1] This way, the chance of a repeated collision is relatively low. Obviously, it would be better for each node to retransmit right away, but if everyone does that you will just get collisions again.

Here too, you get a more globally optimal result if everyone does something that's locally suboptimal.
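The collision handling described above can be sketched as a toy simulation. This is a deliberately simplified model, not a faithful Ethernet implementation: the function name `resolve_collision` is made up for the example, time is divided into abstract slots, every contending node re-picks a slot after each failed round, and the doubling window capped at 2^10 slots stands in for the exponential backoff component.

```python
import random


def resolve_collision(num_nodes, rng, max_exponent=10):
    """Toy model of random backoff after a collision.

    In each round every contending node picks a random slot in
    [0, 2**k - 1], where k is the round number (capped at max_exponent).
    The round succeeds if exactly one node holds the earliest slot;
    otherwise those nodes collide again and everyone re-picks from a
    window twice as large.

    Returns (winning_node, rounds_needed).
    """
    if num_nodes < 1:
        raise ValueError("need at least one node")
    attempt = 0
    while True:
        attempt += 1
        slots = 2 ** min(attempt, max_exponent)
        picks = [rng.randrange(slots) for _ in range(num_nodes)]
        earliest = min(picks)
        if picks.count(earliest) == 1:
            # Exactly one node transmits first; the others sense the
            # busy wire and defer, so the collision is resolved.
            return picks.index(earliest), attempt


if __name__ == "__main__":
    winner, rounds = resolve_collision(5, random.Random())
    print(f"node {winner} got the wire after {rounds} round(s)")
```

If every node instead retransmitted right away, which in this model corresponds to a window of exactly one slot, two or more nodes would always pick the same slot and the collision would never resolve.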

Some other potential cases #

I'm not trying to suggest that this is some brilliant insight, but nevertheless it's an effect we see surprisingly often. Some other examples of similar phenomena:

Complaints that because of Instagram everyone goes to the same places for vacation.
Heavy congestion on popular hiking and running trails because everyone wants to do Rae Lakes, JMT, etc. and they've had to institute a quota system, even though there are lots of great trails that are basically empty. Pro Tip: quotas only apply to camping, so if you can trail run it in one day you can do anything.
Congestion on "alternate" routes that avoid rush hour traffic on the major arterials. This is a similar case because it would be fine if just a few people did it, but we can't have everyone driving through downtown Palo Alto to get from 101 to 280. We see some of this organically, but I've often wondered if traffic-sensitive navigation systems like Waze and Google Maps that reroute you to alternate routes make efforts not to send everyone there.

There's also a whole game theory literature on what's called mixed strategies which is in part about how it's often better to play a mix of multiple strategies rather than a single uniform one. There's a connection here to the tragedy of the commons (and of course to Prisoner's dilemma) as well.

Coordination #

As I said, this is a pretty common problem, but it can be pretty hard to address when you have a bunch of individual agents all making their own decisions. Above, I've mostly talked about how each agent has an incentive to defect and get a locally optimal solution, even if it's not globally optimal, but even if every agent plays by the rules, it can still be very hard to design a system that produces the right result.

As a concrete example, early implementations of the TCP network protocol implemented an algorithm for controlling the transmission rate that could fail catastrophically, resulting in what's called "congestion collapse", in which the network was entirely full of traffic, but it was mostly retransmitted data and almost no real progress was being made (Van Jacobson and Karels have an approachable account of what happened and the fix). The problem of designing rate control algorithms that perform well but don't result in congestion collapse has occupied network engineers ever since. The fundamental problem here is the lack of a centralized point of view and control: instead, each agent has to make its own decision independently, and designing an efficient algorithm is hard.
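To give a flavor of what a decentralized rate control rule can look like, here is a minimal sketch of additive-increase/multiplicative-decrease (AIMD), a well-known approach in TCP congestion control. It is only an illustration under simplifying assumptions, not a reconstruction of any specific TCP implementation: the names `aimd_step` and `simulate`, the shared `capacity`, and the rule that every sender sees a loss whenever the total exceeds capacity are all made up for this example.

```python
def aimd_step(window, loss, increase=1.0, decrease=0.5, floor=1.0):
    """One AIMD update: back off sharply on loss, otherwise probe
    gently for more bandwidth."""
    if loss:
        return max(floor, window * decrease)
    return window + increase


def simulate(windows, capacity, rounds):
    """Run several senders against one shared link.

    Each sender only learns whether there was a loss in the round;
    nobody has a global view of the other senders' rates.
    """
    windows = list(windows)
    for _ in range(rounds):
        loss = sum(windows) > capacity  # simplification: all see the loss
        windows = [aimd_step(w, loss) for w in windows]
    return windows


if __name__ == "__main__":
    # Two senders starting from very unequal rates.
    print(simulate([1.0, 40.0], capacity=50.0, rounds=200))
```

In this toy model each multiplicative decrease halves the gap between the two senders while additive increases leave it unchanged, so the senders drift toward an equal share without ever coordinating directly.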

This is actually the part I find a bit puzzling about the whole Waymo thing: surely the Waymo engineers know about this general phenomenon and they do have an overall view of what's happening, so it would be natural to put in some sort of throttling system so that not every car tries the same hack at once, or even to detect congestion in real time. Do they not have a system like this? Is this still the optimal algorithm in terms of car time, even though it's annoying for homeowners? Something else? Waymo people, my DMs are open!

[1] There is also an exponential backoff component here in case of another collision.

Educated Guesswork