DNS Security, Part V: Transport security for Recursive to Authoritative DNS

Posted by ekr on 21 Jan 2022

This is Part V of my series on DNS Security (parts I, II, III, IV). In part IV I covered DNS transport security between the client (the stub resolver) and the recursive resolver but ran out of room to talk about the recursive to authoritative link, which is the subject of this post.

Recall yet again the DNS resolution process, shown below:

DNS resolution process

For this post, we will be focusing on protecting the transactions between the recursive resolver and the authoritative servers, shown in blue in this diagram. The work on this has been happening in the IETF DNS PRIVate Exchange (dprive) Working Group. This is commonly called Authoritative DNS over TLS (ADoT), or ADoX if you want to indicate that you don't care whether the transport is DoT, DoH, or DoQ.

The Basic Setting #

Before we start looking at mechanisms, it's helpful to frame the problem correctly. We have two objectives:

Protect the confidentiality of the request. I.e., we do not want the attacker to know that the user is trying to resolve example.org.
Protect the integrity of the response. I.e., we do not want the attacker to be able to lie about the address for example.org.

As discussed before, while DNSSEC can provide integrity, it cannot provide confidentiality.

The first thing to notice is that this means we need to encrypt both the link to the authoritative for .org and the link to the authoritative for example.org because both transactions leak that the user is interested in example.org. Importantly, the privacy value of the query is limited by the number of other domains which are served by the same authoritative as example.org, because the user must be asking for one of those domains. For this reason, if we have encrypted DNS, your users will get better privacy if your domain is hosted by a DNS provider that serves a lot of other domains as well. Note that there are cases in which example.org might have a lot of subdomains and you wouldn't want the attacker knowing which one is being requested, but in the most common case it's the second level domain that matters.
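To put a rough number on that anonymity-set argument, here is a minimal sketch. The hosting map and server names are invented for illustration, and the calculation assumes every hosted zone is equally likely to be the one queried, so treat the result as an upper bound on the attacker's uncertainty rather than a measurement.

```python
import math

# Hypothetical map from an authoritative server to the zones it hosts.
HOSTED_ZONES = {
    "ns.bigprovider.example": [f"site{i}.example" for i in range(1024)],
    "ns.selfhosted.example": ["example.org"],
}

def anonymity_bits(server: str) -> float:
    """Upper bound (in bits) on what an observer of the encrypted link
    still does not know about which zone was queried."""
    return math.log2(len(HOSTED_ZONES[server]))

for server in HOSTED_ZONES:
    print(server, anonymity_bits(server))
```

With these made-up numbers the shared provider hides the query among 1024 zones (10 bits), while the self-hosted server hides nothing: anyone who sees a connection to it knows the zone.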

Second, in order to provide confidentiality for these lookups, we need to provide integrity for the identity of the server. For instance, if the attacker is able to attack the connection between the client and b2.org.afilias-nst.org, it can substitute its own server for the true authoritative server b.iana-servers.net. DNSSEC as-is does not prevent this form of attack because it doesn't sign the NS records at the parent, but only at the child; but by the time you've queried the child for them, it's too late because you've already leaked the query to the attacker. This means that the most convenient thing is if every link uses secure transport, so that you can trust the results it gives you at stage N before using them for stage N+1. In other words, you want to have secure transport all the way to the root.

As before, then, the basic problem is setting the DNS client's (in this case the recursive resolver, confusing, right?) expectations correctly. In particular, if we are going to be resistant to active attack, the recursive needs to know:

That the authoritative server will do DoX (and what protocol)
The identity to expect the authoritative server to present

If it doesn't know either of these things, then an active attacker can interfere with the connection. Specifically, if the recursive doesn't know that the authoritative server will use DoX, then the attacker can just simulate an error when the recursive tries. If it doesn't know the identity that the authoritative server will present, then the attacker can just provide its own identity and impersonate the authoritative. Unfortunately, this turns out to be quite a bit more difficult than one would like.
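The two requirements can be written down as a tiny model. This is only a sketch of the reasoning in this section: the `Expectation` type and the outcome strings are names I made up, not anything from a real resolver.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Expectation:
    """What the recursive knows about an authoritative before connecting."""
    protocol: Optional[str]           # e.g. "dot", "doh", "doq"; None if unknown
    expected_identity: Optional[str]  # identity the server must present; None if unknown

def outcome_under_active_attack(exp: Expectation) -> str:
    if exp.protocol is None:
        # The attacker simulates an error and the recursive falls back to cleartext.
        return "downgrade"
    if exp.expected_identity is None:
        # The attacker completes the handshake with its own identity.
        return "impersonation"
    # Both are known: the attacker can block the connection but not read or forge it.
    return "protected"

print(outcome_under_active_attack(Expectation(None, None)))
print(outcome_under_active_attack(Expectation("dot", None)))
print(outcome_under_active_attack(Expectation("dot", "ns.example.net")))
```

Note that "protected" here still allows denial of service; the point is only that the attacker cannot silently read or alter the exchange.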

Root Servers #

As shown in the diagram above, the first request from the recursive resolver goes, at least notionally, to the root server. If this is to use secure transport, the only way that can work is for the recursive to be preconfigured with the information about which root servers use secure transport. There are only 13 root server names (a.root-servers.net through m.root-servers.net), so it's not at all impractical to imagine just disseminating an updated list. Note that it's not necessary for all the root servers to switch to secure transport at once (they are operated by different people), but of course if the recursive preferentially uses secure transport, then the first one to switch might get increased load. As a practical matter, it seems unlikely that we're going to get secure transport to the root immediately. It's much simpler for the recursive resolver to run a mirror of the root zone locally, as specified in RFC 8806.
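The load concern is easy to see in a toy model. This sketch assumes a recursive that spreads queries evenly over whatever pool it prefers; real resolvers pick root servers in more sophisticated ways, and the function names and the choice of which letter switches first are purely illustrative.

```python
import string
from collections import Counter

ROOT_LETTERS = list(string.ascii_lowercase[:13])  # the 13 root names, "a" through "m"

def choose_root(query_id: int, secure_letters: set) -> str:
    """Prefer roots preconfigured as supporting secure transport;
    if none are known, spread over all 13."""
    pool = sorted(secure_letters) or ROOT_LETTERS
    return pool[query_id % len(pool)]

def load(secure_letters: set, queries: int = 1300) -> Counter:
    """Count how many of `queries` land on each root letter."""
    return Counter(choose_root(q, secure_letters) for q in range(queries))

print(load(set()))   # no secure roots known: load is spread evenly
print(load({"k"}))   # hypothetical first mover: it receives everything
```

In this model the first root to switch absorbs all the traffic from recursives that prefer secure transport, which is exactly the incentive problem described above.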

Non-Root Authoritatives #

The situation with non-root resolvers (e.g., for .com or example.com) is more complicated, because the way you learn about those resolvers is from the root resolver, so how does the recursive learn that they accept secure transport? There is a similar problem all the way down the chain: when the parent nameserver (e.g., b2.org.afilias-nst.org) tells you about the child resolver for a given zone (e.g., b.iana-servers.net for the zone example.org) how do you know the properties of the child resolver? If you are used to the Web, there will seem to be an obvious answer: the parent nameserver should tell you. This is how things work on the Web, where there is a different URL scheme for secure transactions (https:) versus insecure transactions (http:).

However, DNS isn't the Web and there are actually two "parent servers" where this data could go. Consider the case where we are trying to resolve example.org, but the authoritative server for example.org is on example.net.^[1] In order to look up example.org the recursive resolver needs to first look up example.net so that it can then contact it. This means that there are two places where one could indicate that the connection to example.net should use secure transport. First, you could put the information in the NS records for example.org that say to contact example.net (this corresponds to the way things work on the Web). These records would be served off of the .org authoritative server, like so:

Indicator to use DoT at the target's parent

This seems natural but has the disadvantage that every domain which uses example.net as its nameserver needs to update its own records individually.^[2] A more DNS-like approach^[3] is to have the indication be in a record that gets served for the authoritative (example.net) that you get when you look up its IP address. This would be served off of the .net authoritative, like so:

Indicator to use DoT at the resolver's parent

The advantage of this second approach is that as soon as example.net upgrades to secure transport, everyone who uses it as a nameserver gets it, by contrast with the first approach where each domain has to configure it separately for its authoritative server.
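The difference in deployment cost can be sketched with a toy delegation table. Everything here (the zone names other than example.org and example.net, and both function names) is made up for illustration.

```python
# Hypothetical delegations: zone -> the authoritative server it uses.
DELEGATIONS = {
    "example.org": "example.net",
    "example.com": "example.net",
    "another.example": "example.net",
    "unrelated.example": "ns.other.example",
}

def updates_if_signaled_at_target_parent(server: str) -> list:
    """Approach 1: the indicator rides on each customer zone's NS records,
    so every zone that uses `server` must change its own delegation."""
    return [zone for zone, ns in DELEGATIONS.items() if ns == server]

def updates_if_signaled_at_server_parent(server: str) -> list:
    """Approach 2: the indicator is published once, alongside the server's own name."""
    return [server]

print(updates_if_signaled_at_target_parent("example.net"))
print(updates_if_signaled_at_server_parent("example.net"))
```

With three customer zones the first approach needs three separate updates and the second needs one. For a large DNS provider the first number would be the size of its entire customer base.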

You'll notice that I've just written "Use DoT" here, but that's handwaving, not telling you how it actually works, and in this case details really matter. Unfortunately, here is where we run into trouble. The basic problem here is updating the parent server to know that the server for the child domain supports secure transport. This is a lot more complicated than it sounds, to the point where it's more or less stalled the whole effort. The next section describes the situation in some detail, but the TL;DR is that there seem to be no good existing mechanisms for doing this, so we're left with either not doing it or with some hacks (skip ahead).

Populating the Parent Zone (Technical) #

Warning: this section is fairly technical. You can safely skip it if you don't care about the details.

Recall that DNS has a number of different resource record (RR) types, including A/AAAA for IPv4 and IPv6 addresses, etc. The information about what server to use for a given domain is contained in a nameserver (NS) record, but unfortunately that record has no place to carry other information about the server. The "right" place to put this information is in the service binding (SVCB) record, which can already be used to signal that you should use HTTPS rather than HTTP (the use case for this is cases where someone has used an http: URL but the target domain always wants you to use TLS). Unfortunately, actually populating the parent zone with SVCB turns out to be impractical, at least in the short to medium term.

There are several separate entities who have to cooperate in order to serve a domain name:

The registrant who actually operates the domain (e.g., Google for google.com).
The authoritative name server who actually serves the DNS records for the domain.
The registry which actually hosts the DNS for the parent domain. For instance Verisign operates .com.
The registrar which is responsible for actually interacting with the registrant. It is the registrar's job to populate the registry's database with NS records that point to the authoritative name server.

The registration process proceeds as shown below. Note that I've shown it in one order but the steps can sometimes happen in a different order:

First, the registrant registers (i.e., buys)^[4] the domain with the registrar. This just creates a database record that indicates they own the domain.
The registrant publishes the DNS records for the domain with the authoritative server. In this example, they just publish the IP address.
The registrant tells the registrar which authoritative server it is using.
The registrar tells the registry which authoritative server the domain is using, using the Extensible Provisioning Protocol.

At the end of the day, we end up with a situation in which:

The registry (and hence the parent domain) is publishing a record that says that example.com is hosted on the authoritative server.
The authoritative server publishes a record that actually has the address for example.com.
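The four steps and the resulting state can be modeled in a few lines. This is a sketch of who holds which data, not of the real protocols: the class names, the server name `ns.hosting.example`, and the documentation address `192.0.2.1` are all assumptions, and the `set_nameserver` method merely stands in for the EPP exchange.

```python
from dataclasses import dataclass, field

@dataclass
class Registry:
    ns_records: dict = field(default_factory=dict)  # domain -> authoritative server name

@dataclass
class Authoritative:
    name: str
    a_records: dict = field(default_factory=dict)   # domain -> IP address

@dataclass
class Registrar:
    registry: Registry
    owners: dict = field(default_factory=dict)      # domain -> registrant

    def register(self, registrant: str, domain: str) -> None:
        self.owners[domain] = registrant            # step 1: just a database record

    def set_nameserver(self, registrant: str, domain: str, server: str) -> None:
        if self.owners.get(domain) != registrant:
            raise PermissionError("not the registrant of this domain")
        self.registry.ns_records[domain] = server   # step 4: stand-in for the EPP update

registry = Registry()
registrar = Registrar(registry)
authoritative = Authoritative("ns.hosting.example")

registrar.register("alice", "example.com")                             # step 1
authoritative.a_records["example.com"] = "192.0.2.1"                   # step 2
registrar.set_nameserver("alice", "example.com", authoritative.name)   # steps 3 and 4

print(registry.ns_records)      # what the parent zone publishes
print(authoritative.a_records)  # what the authoritative server publishes
```

Notice that the only thing the registry ever stores here is a server name, with no room for anything about transport.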

In practice, it's reasonably common for two of these entities to be the same. For instance, big companies like Google or Facebook usually run their own authoritative servers. Another version is that many registrars operate their own authoritative servers. In some cases, a hosting provider will operate a registrar and an authoritative server (for instance, Dreamhost is the registrar, authoritative server, and web hoster for rtfm.com).

Whatever the exact configuration, the first problem is that EPP, while extensible, does not currently provide any mechanism for conveying SVCB records, so if we wanted the registrar to convey them to the registry, we would need an extension, which would take some time to deploy. For this reason, there has been a fair amount of interest in ~~hijacking~~reusing existing DNS records which are already propagated to the parent zone.

DS Glue #

Probably the most promising version of this is called "DS Glue" and uses a DS record for a fake algorithm to smuggle information about the target resolver. This is one of those hacks which sits right at the border between hideous and brilliant: because DS is already propagated to the parent, we hopefully don't need to change registries or EPP. (I say "hopefully" because this depends on those elements being willing to handle the new DS record type, and it remains to be seen whether that will work properly.) DS Glue has the nice property that it doesn't require DNSSEC deployment: as long as there is secure transport^[5] to the parent authoritative (in this case, for .org) and to the parent for the authoritative server's domain (in this case .net) then the records are trustworthy. If either of these connections is insecure, however, then the attacker can substitute new NS records (to point to a different authoritative server) or strip the DS glue records (thus blocking encryption).

If the transport connection to the parent for the authoritative isn't secure, but that zone is DNSSEC signed, then DS glue still works. It works less well if there isn't secure transport for the parent of the target domain because NS records aren't signed in the parent and so the recursive will get the DS glue records for the wrong authoritative.^[6]
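The conditions in the last two paragraphs reduce to a small truth table. The following is my own encoding of that reasoning, with invented parameter names; it deliberately ignores the NS-revalidation corner case from the footnote.

```python
def ds_glue_trustworthy(target_parent_secure: bool,
                        server_parent_secure: bool,
                        server_parent_signed: bool) -> bool:
    """Can the recursive trust the DS glue signal for the authoritative server?

    target_parent_secure: secure transport to the parent of the target zone (.org here)
    server_parent_secure: secure transport to the parent of the server's zone (.net here)
    server_parent_signed: the server's parent zone is DNSSEC signed
    """
    # NS records are not signed in the parent, so only transport security
    # stops the attacker from swapping in a different authoritative server.
    ns_records_ok = target_parent_secure
    # The DS glue itself is protected by either transport security or DNSSEC.
    glue_ok = server_parent_secure or server_parent_signed
    return ns_records_ok and glue_ok

for args in [(True, True, False), (True, False, True),
             (True, False, False), (False, True, True)]:
    print(args, ds_glue_trustworthy(*args))
```

The asymmetry is the interesting part: DNSSEC can substitute for secure transport on the server's side of the tree, but not on the target's side.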

TLSA #

The other major live proposal is to use the TLSA record to indicate that the authoritative server wants secure transport. This would be delivered in roughly the same way as the DS glue record. This has the disadvantage that it requires that the authoritative server's domain be DNSSEC signed, which then becomes an obstacle to deployment. One of the advantages of secure transport is that it can be deployed in parallel with DNSSEC and this would remove that advantage, so I'm less optimistic about this approach.

No signaling in the parent #

The alternative approach is to not signal in the parent that the authoritative server for the child zone supports secure transport. In this case, the recursive will have to discover that somehow. The most likely way is that you query for a SVCB record for the authoritative server, though I've also seen suggestions to query for a TLSA/DANE record. This would look like this:

This is secure if and only if the zone for the authoritative server is signed. If it's not signed there's nothing stopping an active attacker from just intercepting the connection to the authoritative server and responding that the authoritative doesn't support secure transport (note that it most likely can't actually establish secure transport because it will have the wrong credentials), like so:

Downgrade attack on resolver status via SVCB

An additional problem with this design is that it likely introduces additional latency because the recursive resolver needs to first query the authoritative server for its capabilities and only then can it ask the real question (this is one of the main reasons for signaling in the parent).
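Both costs of the no-signaling design, the downgrade exposure and the extra lookup, fit in a short sketch. It assumes the authoritative server really does support secure transport and that nothing is cached; the function names and outcome strings are illustrative only.

```python
def discover_transport(server_zone_signed: bool, attacker_on_path: bool) -> str:
    """Outcome of querying for the server's SVCB record before the real question."""
    if not attacker_on_path:
        return "encrypted"            # the honest answer advertises secure transport
    if server_zone_signed:
        # A forged "no secure transport" answer fails DNSSEC validation.
        return "forgery detected"
    # Nothing distinguishes the forged answer from a genuine one.
    return "downgraded to cleartext"

def round_trips_after_referral(signal_in_parent: bool) -> int:
    """Queries needed once the recursive has the referral in hand."""
    # With parent signaling the referral already carries the hint; without it,
    # a capability lookup has to happen before the real query.
    return 1 if signal_in_parent else 2

print(discover_transport(server_zone_signed=False, attacker_on_path=True))
print(discover_transport(server_zone_signed=True, attacker_on_path=True))
print(round_trips_after_referral(False), round_trips_after_referral(True))
```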

Another alternative is to signal this information in the child domain itself somewhere. This is technically possible, but the problem is that by the time you've looked up the information in the child's domain, you've already leaked to the attacker what domain you want to resolve. Of course, after that's happened you could learn that the child wanted secure transport and use it in the future, but not if the attacker attacks the connection between you and the child, so you need DNSSEC here too. Moreover, it means that every child needs to independently signal that it wants secure transport to its authoritative.

Insecurely Discovering Secure Transport #

While it may ultimately be possible to provide for a method of securely signaling the use of secure transport, it's starting to look like it's going to be very difficult to converge on something that everyone likes. In the meantime, a number of people have proposed that instead we do what's often called either unauthenticated or probing modes of secure transport. The basic idea here is that the recursive resolver would attempt secure transport to the authoritative resolver and then in future remember whether that worked or not.

Obviously, this kind of system isn't entirely secure against active attack, but it might be a good idea anyway for at least three reasons:

Active attack is harder than passive attack, so you've increased the attacker's costs.
If you have a way for the authoritative server to signal its commitment to supporting secure transport for some period (like HSTS for HTTP), then you can bootstrap insecure discovery into a secure mode; this requires the attacker to mount an active attack the first time you connect, which is even harder.
It helps the authoritative (and to some extent the recursive) resolvers get experience with deploying secure transport without running the risk of hard failures if something goes wrong (see more on this below).

Moreover, this kind of mechanism is much easier to deploy, because it doesn't involve any of the difficulties we saw above with signaling availability of secure transport prior to connection establishment, or with propagating records to other servers.
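Here is a minimal sketch of probing with an HSTS-style commitment. No such commitment mechanism is specified in this post, so `pin_seconds`, `ProbeResult`, and `ProbingCache` are all hypothetical; for simplicity the sketch re-probes on every call and only remembers the pin.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ProbeResult:
    ok: bool              # did the secure transport attempt succeed?
    pin_seconds: int = 0  # hypothetical HSTS-like commitment sent by the server

class ProbingCache:
    def __init__(self, probe: Callable[[str], ProbeResult],
                 now: Callable[[], float] = time.monotonic):
        self._probe = probe
        self._now = now
        self._pinned_until: Dict[str, float] = {}

    def transport_for(self, server: str) -> str:
        pinned = self._pinned_until.get(server, 0.0) > self._now()
        result = self._probe(server)
        if result.ok:
            if result.pin_seconds > 0:
                self._pinned_until[server] = self._now() + result.pin_seconds
            return "encrypted"
        if pinned:
            # The server committed to secure transport, so refuse to fall back.
            return "fail"
        return "cleartext"  # unauthenticated mode: fall back silently

# Usage with a fake clock and a scripted probe.
clock = [0.0]
answers = {"ns.example.net": ProbeResult(True, 3600)}
cache = ProbingCache(lambda s: answers[s], now=lambda: clock[0])
print(cache.transport_for("ns.example.net"))  # first contact succeeds and pins
answers["ns.example.net"] = ProbeResult(False)
print(cache.transport_for("ns.example.net"))  # attack during the pin: hard failure
```

The first contact is still unprotected, which is the trade-off described above: the attacker has to be present the very first time, and after that a downgrade becomes a visible failure rather than a silent fallback.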

Historically I've not been that enthusiastic about this kind of insecure discovery (what's often called "opportunistic", but that word has become the subject of heated debates about its precise definition), because it's really better to have secure discovery and this seemed like a distraction from that. However, as the discussion about how to actually do the secure signaling has dragged on (and to some extent ground to a halt), I've started to think it may be better to do something than nothing.

TLSA vs. WebPKI #

Another point of contention here is how the authoritative servers should authenticate. There are two major options here, use the WebPKI like TLS on the Web, or use TLSA/DANE (see here for my writeup on this.) This is an issue which raises some very strong feelings on both sides.

On the WebPKI side, the argument is roughly that we already have plenty of experience with the WebPKI and while it has its problems, it's well understood and we know we can deploy it. By contrast, TLSA/DANE requires taking an unnecessary dependency on DNSSEC. On the TLSA side, the argument is roughly that (1) the WebPKI is bad, (2) WebPKI security depends on DNS, so we shouldn't make DNS security depend on the WebPKI, and (3) we should stop acting like DNSSEC isn't a requirement (and perhaps that if we make things depend on DNSSEC, it will become a requirement).

As should be clear from this long series of posts, I'm more optimistic about WebPKI, but I'm more than happy to design a system which allows either WebPKI or TLSA/DANE and let the market sort it out.^[7] As far as I can tell, this is the position of most of the people who favor WebPKI, so the two sides really are more like "WebPKI or TLSA" or "TLSA only" (see above about the implications of making DNSSEC a requirement.)

Operator Concerns #

Even assuming that we address the technical issues about when recursive resolvers initiate secure transport, actually getting deployment requires that the authoritative servers enable ADoX; unfortunately, there are serious questions about their willingness to do so. In March of 2021, the root server operators published a statement expressing concern about the use of encryption to the root servers:

Server Operators have some concerns about supporting DNS encryption for serving the root zone. It is well known that UDP has desirable performance characteristics, due to its stateless nature. Increasing the state-holding burden with the addition of connection-oriented protocols, as well as encryption data, not only reduces the performance of name servers, but also may raise new types of denial-of-service attacks.

At this time, the exact risk-reward tradeoffs for deployment of encryption to root name servers is unclear and will likely depend on which particular transport proposals gain momentum. Root Server Operators do not feel comfortable being the early adopters of authoritative DNS encryption and would like to first see increased deployment in other parts of the DNS hierarchy. Meanwhile, there are other ways to improve privacy in queries sent to root and other name servers.

As described above, it's of course theoretically possible to just do secure transport to the TLD server and not to the root (though Verisign, for instance, runs both .com and two root servers). In addition, some operators also published an Internet Draft documenting their concerns, which roughly come down to performance (due to the additional cost of encryption and doing TCP) and stability (which seems to be about whether TLS/QUIC failures will cause resolution to fail).

These concerns are actually sort of puzzling to Web people, for several reasons. First, the vast majority of Web traffic is encrypted, including key services like Google and Facebook, and once operators got past the teething pains, this doesn't seem to have created increased stability concerns. If Google goes down, it's an enormous deal, perhaps even bigger than a DNS authoritative server failure, because recursive servers cache data and so won't start failing immediately.

Second, although encryption does increase load somewhat, even 10 years ago it was a relatively small fraction of the cost of running a server. In a 2012 talk by Langley, Modadugu, and Chang they reported that SSL/TLS accounted for less than 1% of CPU load on their front-end machines, and of course both machines and TLS have gotten faster. It's true that serving DNS tends to be lighter-weight because UDP is cheap and the servers are largely stateless (though QUIC may help some here), but the overall load profile doesn't seem like a big deal. As a comparison point, all the root servers together serve on the order of 80 billion queries a day. This is equal to less than an hour of Cloudflare's query volume, so it doesn't seem that impractical to protect. It's certainly possible, even likely, that it would require those operators to invest more than they have in infrastructure, but it seems far from impossible.
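For a sense of scale, the root figure converts to a per-second rate as follows. The only input is the 80 billion queries per day quoted above; the Cloudflare number is just the lower bound implied by the "less than an hour" comparison, not an independent measurement, and both are averages rather than peaks.

```python
ROOT_QUERIES_PER_DAY = 80_000_000_000  # order-of-magnitude figure from the text

SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600

# Average rate across all root servers combined.
root_qps = ROOT_QUERIES_PER_DAY / SECONDS_PER_DAY

# If a day of root traffic is less than an hour of Cloudflare's traffic,
# Cloudflare must average more than this many queries per second.
implied_cloudflare_qps_floor = ROOT_QUERIES_PER_DAY / SECONDS_PER_HOUR

print(f"root servers, combined: about {root_qps:,.0f} queries/s")
print(f"implied Cloudflare floor: about {implied_cloudflare_qps_floor:,.0f} queries/s")
```

That works out to a bit under a million queries per second across the whole root system, against an implied floor of roughly 22 million per second for the comparison point, a factor of 24.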

Summary #

As I said above, the situation is in flux, but overall, I'm not that optimistic. This is a system with a lot of moving parts and where a number of the veto points have relatively little incentive to change their operations or, as is the case with the root operators, are actively skeptical of doing so. If we look at the situation with DNSSEC deployment, which DNS operators are relatively enthusiastic about and which still has a lot of friction points, the prospects for any kind of signaling for ADoX don't look that great. The prospects for some sort of probing/unauthenticated mode, potentially with an HSTS-style upgrade, seem a little better, but even that seems like it may be a stretch.

[1] Really, it would probably be on ns.example.net but I'm simplifying.
[2] This is the situation on the Web, hence HSTS.
[3] This may all seem obvious to people who understand DNS, but it took me a while to work through it, so I think it might help others too.
[4] Or, more accurately, rents.
[5] And recursively from the root.
[6] There is one case where this still sort of works: if (1) the target zone is signed and (2) the sensitive label is one deeper than the target zone, e.g., sensitive-label.example.com and (3) the recursive first queries the target authoritative to check the NS record (NS revalidation). In that case you can still protect the sensitive label.
[7] This does entail more complexity, because it probably requires a way to signal which kind of credential the authoritative will use so that a recursive which only knows WebPKI or TLSA/DANE knows if it will be able to connect.

Educated Guesswork