Understanding The Web Security Model, Part I: Web Publishing
Posted by ekr on 04 Mar 2022
Note: This is one of those posts that is going to be best read on the Web, especially if you read your email using GMail or the like, as it will tend to mangle some of the HTML features.
Like many pieces of technology, the Web is one of those things that people are perfectly happy to use but have absolutely no idea how it works.[1] It's natural to think of the Web as a publishing system, and at some level it is: the Web lets people publish documents for anyone to read. But what the Web really is is a distributed computing platform that lets Web sites run code on your computer.[2] Originally, of course, that code just rendered documents, but now it's used for everything from documents (like the one you're reading now) to text-based applications like Slack or even videoconferencing apps like Google Meet. Unsurprisingly, then, the Web has a unique security model, which is the topic of this series of (some unknown number of) posts.
I meant to start right in on security but then I realized I first needed to provide enough background on how the Web works for the security stuff to make sense. This post is the first half of that background material, covering the structure of Web sites and pages. There will be a second post that covers Web "applications". This isn't a textbook or a specification, so I don't intend to provide a complete picture; the idea here is to cover the essential elements for understanding the security model.
The URL #
Everything on the Web starts with the Uniform Resource Locator (URL), which, as Wikipedia puts it, is commonly called the "web address". Minimally, it's the thing that shows up in the address bar of your browser when you go to a Web page, but actually everything on the Web has a URL, not just web pages. For instance, most Web pages are made up of a mix of text and images and each of those images has its own URL. In fact, you can (usually) independently load each individual subcomponent of the page by right-clicking on it, like so:
What a URL really is is just the address of some thing (the technical term here is resource) on the Web. Given the URL for a thing, your browser can go to the indicated location (i.e., the Web server), load the resource, and do something with it. What that something is depends on the resource type and the context in which it's loaded, as we'll see below. For instance, if the resource is an HTML document or a PNG image, then the browser will try to display it. If it's a zip file, the browser might try to save it to your disk.
A URL (at least for the Web) has three major parts, shown in the diagram below. [Attention nitpickers: I'll get to query and fragment shortly.]
Scheme #
The first part of the URL is what's called the scheme, which indicates the protocol that the client (the browser) should use to access the resource. The Web itself has two important schemes:
- http, which means to use the Hypertext Transfer Protocol (HTTP)
- https, which means to use HTTP with the Transport Layer Security (TLS) secure transport protocol.
Schemes and Protocols #
In practice, the scheme doesn't refer to a single protocol but actually to a family of protocols which have roughly the same externally visible properties and can be mutually negotiated. For instance, there are three main versions of HTTP (HTTP/1.1, HTTP/2, and HTTP/3), all of which are fairly different on the wire. Similarly, there are several different versions of TLS. Finally, HTTP/3 doesn't run over TLS but actually runs over the QUIC transport protocol which uses the TLS handshake for security. All of these different protocols can be addressed with the same set of URLs, with the browser and the server automatically selecting the right protocol. This is actually an important requirement for seamlessly deploying new protocols: for instance if HTTP/2 had required a new scheme it would have taken much longer for it to be deployed, if ever, because everyone would have had to change their pages.
There are a huge number of registered schemes,[3] but as a practical matter very few matter for the Web. When the Web was young, there were a number of different information transfer protocols and browsers used to support a number of other transports besides HTTP, such as the File Transfer Protocol (FTP) and the Network News Transfer Protocol (NNTP) and Gopher. However, as the information systems those protocols were associated with were subsumed by the Web, HTTP became the dominant protocol and those protocols were allowed to rot, and now HTTP(S) in its various versions is basically the only game in town for transferring Web pages.
There are a few other URL schemes that matter on the Web for specialized purposes, such as the mailto scheme for indicating an email address or the turn scheme for indicating relays to be used with the TURN protocol in WebRTC. These serve an important purpose, but aren't really used as part of the main structure of the Web. These schemes will often have a different structure than Web URLs (for instance, mailto URLs look like mailto:ekr@example.com), but we don't need to worry about that for now.
Host #
The second piece of an HTTP/HTTPS URL is the host, which is just the name of the server hosting content. As discussed in excruciating detail in my series on DNS, this host name is resolved to an IP address via the DNS and the browser then connects to that IP address. If the browser is dereferencing an HTTPS URL, it will also expect that the server present a certificate which has the hostname in it, thus—at least in theory—demonstrating that the browser is talking to the expected server.
Path #
The final piece of the URLs shown above is the "path" component, which indicates the actual resource on the Web site which you are accessing. The structure of this component is extremely server-specific. In theory, the server could just name all of its resources 1, 2, etc., but in practice the path tends to somewhat mirror the server's directory structure, with the / separator indicating directories on the server, and this is what common servers encourage.
Even for sites that are more like applications and that don't really have directories of files, it's conventional for paths to have a hierarchical structure that mirrors the underlying information hierarchy. For example GitHub URLs look like:
https://github.com/[username]/[repository-name]/
with the list of issues at /[username]/[repository-name]/issues/ and individual issues at /[username]/[repository-name]/issues/[issue-number].
Query and Fragment #
There are two other pieces of the URL that I didn't show above but that are important to be aware of:
"Query arguments" are a list of keyword-value pairs, e.g.,
https://example.com/foo.html?foo=bar
These are automatically appended by the Web browser when the user interacts with specific kinds of elements, such as "web forms". These will make an appearance later.
"Fragments" allow the browser to refer to individual portions of the page. For instance, the URL:
https://educatedguesswork.org/posts/web-security-model-intro1/#query-and-fragment
goes to the section you are reading now. The key thing to know about the fragment is that because it's used for intra-page navigation, it doesn't get sent to the server, but is processed solely by the client. Moreover, if you click on a fragment link on the same page (you can try it with the link above), the browser will just scroll to that point, but doesn't need to connect to the server to reload the page.
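To make the pieces of a URL concrete, here is a small sketch using the standard URL class that browsers (and Node.js) provide. The example URL is made up for illustration, and the property names (protocol, hostname, and so on) are the API's names rather than the terms used in this post.

```javascript
// A made-up URL that contains every component discussed above.
const url = new URL("https://example.com/abc/def.html?foo=bar#some-section");

const parts = {
  scheme: url.protocol.replace(/:$/, ""), // the API keeps the trailing ":"
  host: url.hostname,
  path: url.pathname,
  query: Object.fromEntries(url.searchParams), // keyword-value pairs
  fragment: url.hash.replace(/^#/, ""), // the API keeps the leading "#"
};
console.log(parts);

// What the browser asks the server for: the path and the query, but not
// the fragment, which is handled entirely on the client.
const requestTarget = url.pathname + url.search;
console.log(requestTarget); // "/abc/def.html?foo=bar"
```

Note that the fragment never appears in requestTarget, which matches the point above that it is not sent to the server.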
The Web Architecture #
The diagram below shows the overall structure of a drastically oversimplified Web application, on both the client and the server.
Even this simplified version is pretty complicated, so I'll walk through it slowly.
As you would expect from the above discussion, the process starts with the URL, whether the user enters it directly, clicks a bookmark, or clicks on a link. The browser then goes to the server and requests that URL. In nearly every case what's going to come back is a HyperText Markup Language (HTML) page.
HTML #
We don't need to go into HTML in too much detail, but at a high level, HTML is structured text. What this means is that HTML is a text file that contains extra information ("markup") that tells the browser how to interpret it. As a simple example, consider the following HTML fragment:
<h4>This is a header</h4>
This is some text with a hyperlink. <a href="https://educatedguesswork.org/">hyperlink</a>.
Just to orient yourself, HTML markup mostly consists of paired "start" and "end" markers ("tags") that indicate that the stuff in between them is associated with the tag. If you have a tag xx then the start tag will be <xx> and the end tag will be </xx>, and the two tags together with the stuff in between will be called the "xx element". Tags can also have attributes that get attached to the start, like:
<xx attr1="abc">
which means "tag xx has attribute attr1 with the value abc".
In this example, then, the h4 markers indicate that the text inside them is a header (at header level 4) rather than body. The <a href="https://educatedguesswork.org/"> block indicates that the text inside it is a hyperlink, which just means that it's a section of text (here, the word "hyperlink") that, when you click on it, navigates the browser to the page indicated by https://educatedguesswork.org/. This will get rendered something like this:
This is a header
This is some text with a hyperlink. hyperlink.
It's important to recognize that this markup is (mostly) semantic. Instead of telling the browser that the margins should be size whatever, you're supposed to just provide the page structure and the text of the page and leave the browser to figure out how to render it (though of course you should expect to have reasonable margins, emphasized headers, etc.). HTML does have some basic formatting stuff like bold and italics, but it's quite limited and insufficient for making the document look the way you really want; with just HTML you're mostly at the mercy of the browser's styling decisions, with results that tend to be somewhat less than satisfactory.
HTML has a whole pile of other types of markup for things like lists, tables, buttons, etc. We mostly don't need to worry about these right now. What is important, however, is that HTML can also include tags that pull in other resources from the site. For instance, you can have an <img> tag which loads an image off the site and renders it at that place in the document, as in the following fragment, which pulls in the diagram shown above. The src attribute is the place to load the image from.
<img src="/img/overall-web.svg">
Already this is pretty useful: you can use HTML to publish fairly rich documents. In fact, this was pretty much all that was in the original Web. However, it quickly became clear that people wanted to have more control over sites. In particular, they wanted more control over how things looked and they wanted to be able to add arbitrary dynamic content that ran on the client. In the Web, these needs are addressed by allowing the HTML document to use two other kinds of resources that serve these functions:
- Cascading Style Sheets (CSS), which allows you to tell the browser how to render your content.
- JavaScript (JS), a general purpose programming language which, among other things, allows you to manipulate the HTML and CSS of the page.
It's possible to embed the CSS and JS in the page directly, but what's more common is actually to have HTML tags which reference CSS and JS files on the server. So, what happens in practice is that the HTML loads and then as the browser parses it, it finds the tags for CSS, JS, as well as images and the like and loads them all from the server to assemble the correct page.
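As a rough illustration of that last step, the sketch below scans a small piece of HTML for tags that reference other resources and works out the full URL the browser would request for each. The page URL, the CSS and JS file names, and the findSubresources helper are all made up for this example, and a real browser uses a full HTML parser rather than a regular expression.

```javascript
// Hypothetical page URL and HTML, for illustration only.
const pageUrl = "https://example.com/posts/intro/";
const html = `
  <link rel="stylesheet" href="/css/site.css">
  <script src="app.js"></script>
  <img src="/img/overall-web.svg">
`;

// Very rough: pull src/href values out of link, script, and img tags.
function findSubresources(htmlText, baseUrl) {
  const found = [];
  const pattern = /<(link|script|img)\b[^>]*?\b(?:src|href)="([^"]*)"/g;
  for (const match of htmlText.matchAll(pattern)) {
    // Relative references are resolved against the page's own URL.
    found.push({ tag: match[1], url: new URL(match[2], baseUrl).href });
  }
  return found;
}

const subresources = findSubresources(html, pageUrl);
console.log(subresources);
```

The interesting part is the resolution step: a reference like app.js is interpreted relative to the page that contains it, while /css/site.css is interpreted relative to the root of the same server.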
CSS #
As I mentioned above, originally the Web mostly had semantic markup, so you could say "this is a header" and apply some very limited styling ("use this font") but not "render this column with a 20 pixel margin". CSS allows you to apply styles to the content of the page. As noted above, CSS can be embedded in the HTML (that's how the newsletter version of this site works) but is commonly loaded off of separate resources, with the HTML just pointing to the CSS. I don't intend to write too much about CSS; while there are security and privacy issues around CSS, most of Web security is concerned with other things.
JavaScript #
HTML and CSS are pretty powerful all on their own if what you want is a static Web site that publishes information. They also have some limited interactive capability: for instance you can have a web form where people can fill in information, click on radio buttons, etc., and even send that data to the server which can then act on it. But at the end of the day they're limited and lots of applications require a general purpose programming language. This is where JavaScript comes into the picture.
JavaScript itself is just a regular programming language at roughly the same level of abstraction as other "scripting" languages like Python or Ruby. You can use JavaScript for anything you would use those languages for, though you might not want to. What makes JavaScript special to the Web is two things: (1) browsers know how to execute it natively, which means if you send them JavaScript they will run it, whereas if you send them Python, they'll just display it to the user or try to save it on disk;[4] and (2) the browser has special JavaScript APIs that let the JavaScript code interact with the user and the Web page.
The DOM #
HTML, CSS, and JavaScript work together to produce the experience you see on the Web via what's called the Document Object Model (DOM). The way this works is that the browser parses the HTML provided by the server into an abstract data structure that reflects the structure of the underlying HTML.[5] The DOM is then used to generate what you see on the screen. Both CSS and JavaScript work by addressing the DOM. For instance, CSS works by providing style information for certain elements of the DOM (e.g., this paragraph) or certain types of elements ("all headers") (simplifying, remember!).
JavaScript is much more powerful. First, it can manipulate the DOM itself, by adding, removing, or changing elements. When changes are made to the DOM, the browser will rerender the page, which means that JavaScript can change what appears on the screen. This can also have other side effects: for instance if JavaScript adds a new <img> tag, that will cause the image to be loaded off the server and displayed as part of the page. One unobvious consequence of this ability is that, because JavaScript is loaded into the page with HTML <script> tags, one piece of JavaScript can load new pieces of JavaScript by inserting new <script> tags; it can do the same for CSS as well, of course. These turn out to be powerful but also dangerous capabilities.
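The following is a toy model of that idea which runs outside a browser. It is not the browser's DOM API (in a real page you would use calls such as document.createElement and appendChild); the element and resourcesToLoad helpers and the script file names are invented here just to show how adding nodes to the tree changes what gets loaded.

```javascript
// A toy stand-in for the DOM: each node has a tag, attributes, and children.
function element(tag, attrs = {}, children = []) {
  return { tag, attrs, children };
}

// Walk the tree and list every resource the "browser" would need to load.
function resourcesToLoad(node, out = []) {
  if (typeof node === "string") return out; // text nodes load nothing
  if ((node.tag === "img" || node.tag === "script") && node.attrs.src) {
    out.push(node.attrs.src);
  }
  for (const child of node.children) resourcesToLoad(child, out);
  return out;
}

const body = element("body", {}, [
  element("p", {}, ["Some text"]),
  element("script", { src: "/js/first.js" }),
]);
const before = resourcesToLoad(body);

// What a script might do: append a new <img> and a second <script>.
body.children.push(element("img", { src: "/img/overall-web.svg" }));
body.children.push(element("script", { src: "/js/second.js" }));
const after = resourcesToLoad(body);

console.log(before); // only the first script
console.log(after); // the first script, the image, and the second script
```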
In addition to the DOM, the browser has lots of other APIs that let JavaScript interact with the network or the user. For example:
- Perform network requests to the server using fetch()
- Read from the camera and microphone using getUserMedia()
- Form peer-to-peer connections with other browsers using RTCPeerConnection
One of the major ways in which the Web gets extended is by adding new APIs; obviously JavaScript can do any computation that any other language can do, but if you want to affect the outside world, then you generally need some API to do it.
The Server #
This brings us to the Web server.
The most basic Web server just serves static files to the client: the client sends a URL and the server sends back the corresponding file. In the early days of the Web, the structure of the URLs as shown in the path component would mirror the structure of the server's filesystem. For instance, you might have a server which stored files in /home/server/, in which case the URL https://example.com/abc/def.html would correspond to /home/server/abc/def.html. And those files themselves would be Web pages or the other assets on them (like images).
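The mapping from URL to file can be sketched in a few lines. This is a simplified illustration, not how any particular server does it: the urlToFilePath helper is hypothetical, the document root is the one from the example above, and a real server would also handle things like missing files, directory indexes, and symbolic links.

```javascript
// Document root taken from the example above.
const DOCUMENT_ROOT = "/home/server";

// Map a requested URL to a file path under the document root,
// or return null if the request looks unsafe or malformed.
function urlToFilePath(requestUrl) {
  // The URL parser resolves "." and ".." segments, so the resulting
  // pathname cannot climb above "/".
  const { pathname } = new URL(requestUrl);
  let decoded;
  try {
    // Percent-escapes (such as "%20") are decoded to get the on-disk name.
    decoded = decodeURIComponent(pathname);
  } catch (err) {
    return null; // malformed percent-escape
  }
  // Refuse anything that contains a ".." segment after decoding (for
  // example via an encoded slash, "%2F") rather than trying to repair it.
  if (decoded.split("/").includes("..")) return null;
  return DOCUMENT_ROOT + decoded;
}

console.log(urlToFilePath("https://example.com/abc/def.html"));
// "/home/server/abc/def.html"
```

The checks are there because a server that naively glues the path onto its document root can be tricked into serving files from outside that directory.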
But of course, over time, the world has gotten more complicated. Serving static files is still possible, but things can also be a lot fancier. In particular, instead of just serving static files, the server can perform computations and return the results to the client.
The Structure of Web Servers #
As I said, the original Web servers just served whatever was on the file system to the client. But people quickly realized that they wanted to be able to have the server provide dynamic content. The original way to do this was with something called Common Gateway Interface (CGI). The way CGI worked was that you would have a special directory, by convention called /cgi-bin, and instead of serving the files in that directory, the web server would run them and send the output to the client. This wasn't that efficient, but it got the job done. You'll still see it in some places on the Web.
More recently, it's become common to invert this structure and have Web servers which handle essentially every request programmatically. For instance, the popular Express framework for Node.js lets you register individual functions to handle portions of the URL namespace. These functions can just generate content directly or can use files as a template to generate the content based on the file and some information the server has. These servers can of course handle static files, but this is done by having a special code module which then reads those static files off the disk and then serves them.
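The general shape of "register a function for a portion of the URL namespace" can be sketched without any framework. The register and handle functions and the /users/ paths below are hypothetical and are not Express's actual API; they only mimic the idea of a table of patterns and handlers.

```javascript
// A hypothetical route table: each entry pairs a path pattern with a
// handler function that generates the response body.
const routes = [];
function register(pattern, handler) {
  routes.push({ pattern, handler });
}

// The patterns only accept letters, digits, "_" and "-" in names, so the
// matched text is safe to drop straight into the generated HTML.
register(/^\/users\/([\w-]+)\/$/, (name) => `<h1>Profile for ${name}</h1>`);
register(/^\/users\/([\w-]+)\/issues\/(\d+)$/, (name, id) =>
  `<h1>Issue ${id} for ${name}</h1>`);

// Find the first route whose pattern matches and run its handler.
function handle(path) {
  for (const { pattern, handler } of routes) {
    const match = pattern.exec(path);
    if (match) return { status: 200, body: handler(...match.slice(1)) };
  }
  return { status: 404, body: "<h1>Not found</h1>" };
}

console.log(handle("/users/ekr/issues/7"));
console.log(handle("/no/such/page"));
```

Serving static files in this style would just be one more handler that reads a file off the disk and returns its contents.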
A common pattern is to serve the dynamic files off one server and static files off another server, with each being specialized for its job. This is an especially attractive pattern if the static files are big and can be served off a fast content delivery network (CDN) which is optimized for that purpose. Of course, CDNs have now started to grow some capabilities to handle dynamic content in what's called edge computing.[6]
Obviously, the server can do any kind of computation it wants to return answers, but there are a few major common types.
Templates #
Suppose you want to send a more-or-less static page but you want to customize it slightly. For instance, you might want to put the user's username in the upper right hand corner or add the number of times someone has viewed this page. You could of course generate the whole page from scratch on your server, but an easier way to do it is with a template. Briefly, a template is a file containing HTML but with markers that allow you to fill in variables. For instance, you might have:
<h1>Page title</h1>
This page has been viewed [[num-views]] times.
The [[num-views]] means "replace this string with the value of the num-views variable".[7] The idea here is that the server has a template processor which is configured with a set of variables, in this case the number of views. The processor reads the template, finds the template variable markers, and replaces them with the corresponding values. There are a lot of different template languages, some more fancy than others, including handlebars, nunjucks, mustache, etc.
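A bare-bones template processor for the square-bracket syntax used above might look something like the following. The renderTemplate function is a made-up minimal sketch, not how any of the template languages just mentioned work; real ones typically also escape the substituted values and support loops, conditionals, and so on.

```javascript
// Replace each [[name]] marker with the matching variable's value.
function renderTemplate(template, variables) {
  return template.replace(/\[\[([\w-]+)\]\]/g, (marker, name) =>
    // Leave unknown markers untouched so mistakes are easy to spot.
    Object.prototype.hasOwnProperty.call(variables, name)
      ? String(variables[name])
      : marker);
}

const template = `<h1>Page title</h1>
This page has been viewed [[num-views]] times.`;

// The value 42 is an arbitrary stand-in for a real view counter.
const page = renderTemplate(template, { "num-views": 42 });
console.log(page);
```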
Full Result Generation #
Suppose that instead most of your page is dynamic, like a news site or a search engine result page. In that case, a template doesn't really help you that much. Instead, you probably just want to have your server assemble the whole page, piece by piece (though probably from fragments of HTML stored in the server software). This is basically the dual of templates: templates are HTML (or markdown) with embedded code. Page generation is code with embedded HTML.
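For contrast with templates, here is a sketch of the "code with embedded HTML" approach for something like a search results page. The function names and the sample data are invented for illustration; the one detail worth noticing is that text coming from outside (the query, the titles) is escaped before it is placed into the markup.

```javascript
// Escape characters that have special meaning in HTML so that the text
// is displayed as text rather than interpreted as markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Assemble a results page piece by piece from HTML fragments in the code.
function renderResultsPage(query, results) {
  const items = results
    .map((r) => `<li><a href="${escapeHtml(r.url)}">${escapeHtml(r.title)}</a></li>`)
    .join("\n");
  return `<h1>Results for ${escapeHtml(query)}</h1>\n<ul>\n${items}\n</ul>`;
}

// Hypothetical data standing in for whatever the server computed.
const resultsPage = renderResultsPage("tea & <b>biscuits</b>", [
  { title: "Educated Guesswork", url: "https://educatedguesswork.org/" },
]);
console.log(resultsPage);
```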
It's important to recognize that the precise method that the server uses to generate the page is largely invisible to the client: it could be a static file, a template, fully programmatic, or a mix of the above, with some pieces generated one way and some another. The Web just defines the protocol (i.e., the format of the page) and leaves the implementation to generate that protocol however it wants. This is a very important feature for allowing extensibility in the future.
Non-HTML Data Types #
Most of the text in this section sort of assumes that the server will be returning HTML, but of course HTTP is an extensible protocol and so you can transmit just about any content over HTTP. And because the server can do arbitrary computations, this means that it can return the results of those computations to the client. We'll see how that's useful in the next post.
Cross-Site Content #
If you were paying close attention before, you noticed that when you load an image on a Web site, you provide a URL where the browser can find the image. The same thing is true for other kinds of content, whether it's audio, video, CSS, or JavaScript. That makes sense, after all, because all that stuff was authored separately and you don't want to have it all crammed into one giant file on your server. But who says that stuff has to be on your server? The content is being addressed by a URL and that URL can point anywhere, including some totally different Web server.
Take for instance, this image of the Dogefox logo:
Here's the HTML which loaded that:
<img src="https://i.redd.it/ldcju3p3w3x11.jpg" alt="DogeFox" width=400>
As you can see, the src attribute, indicating where the image comes from, doesn't go to this site at all. It's pointing to a resource on Reddit, but I was able to just load it into my site and unless you use the browser developer tools to look deeply, you wouldn't even notice. Importantly, the way that this works is that the browser connects directly to the site indicated in the URL; it doesn't go through the original server at all (thought experiment: what happens if the other server decides to change the image?).
You can do this kind of cross-site loading with pretty much anything, including video, JavaScript and CSS. This, for instance, is how you embed YouTube videos in your site (you don't want to absorb the bandwidth costs, right?). The JavaScript thing is actually incredibly common because people often want to make use of JavaScript libraries but save bandwidth by not serving them off their own server (because, as above, the browser fetches them directly from wherever they are hosted). Of course, now your Web site is incorporating an arbitrary program from someone else's server, so what could possibly go wrong?
This trick isn't limited to individual files either: you can actually load a whole Web page this way, like so:
<iframe src="https://educatedguesswork.org/posts/" width=800 height=400></iframe>
This fragment pulls the archive page of this site into a frame on the page, with scroll bars and everything:
This kind of mashup of cross-site content is one of the basic functions of the Web and the source of all kinds of powerful functions, good and bad, ranging from reusing open source content, to embedded maps and YouTube videos, to Facebook like buttons and online ads (with their associated tracking). It's an incredibly powerful feature and also one whose full implications weren't really understood at the time it was introduced, leading to some exciting moments down the road.
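To see which server the browser actually talks to for each piece of a page, you can resolve every reference against the page's URL and compare host names. The sketch below uses the image URLs from this post; the classifyResources helper is made up, and "different host name" is a deliberately crude stand-in for "cross-site".

```javascript
// Report which server the browser would contact for each resource and
// whether that server differs from the one that served the page.
function classifyResources(pageUrl, resourceUrls) {
  const page = new URL(pageUrl);
  return resourceUrls.map((ref) => {
    const resolved = new URL(ref, page); // relative refs use the page URL
    return {
      url: resolved.href,
      host: resolved.host,
      crossSite: resolved.host !== page.host,
    };
  });
}

const classified = classifyResources(
  "https://educatedguesswork.org/posts/web-security-model-intro1/",
  ["/img/overall-web.svg", "https://i.redd.it/ldcju3p3w3x11.jpg"]
);
console.log(classified);
```

The first image comes from the same server as the page, while the second is fetched straight from i.redd.it without the page's server ever being involved.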
Next Up: Web Applications #
At this point, we have the makings of a very fancy Internet-scale publishing system, complete with cool styling, mashups, and even a local programming language for producing cool effects.[8] But as I said at the top, the Web isn't just a publishing system, and some of the most important parts of the Web (Facebook, Gmail, Google Meet, Slack) act much more like applications than they do like online publishing. But even though they have a lot more going on than, say, this site, they use basically the same primitives I've introduced here, just in a number of new and interesting ways (and with a number of exciting new security problems!). In the next (hopefully shorter) part of this series, I'll talk about how those work.
[1] Yes, I'm quoting Blackadder.
[2] The Web actually isn't the first or only such platform; PostScript and PDF documents are actually programs that run on your printer or your computer. This provides a much more flexible system than alternative designs like sending a static image to the printer.
[3] The astute reader will note that the registry here talks about URI rather than URL schemes, where the I stands for Identifier. URI is the generic term with URLs being the subset of URIs which have enough information to dereference them as opposed to just uniquely identifying something.
[4] It is, of course, possible to run other languages on the Web by first compiling them into JavaScript and then running the JavaScript. For instance, Emscripten is a tool that does this for C/C++ code. This works but is a bit clunky. Eventually, there was so much demand for this kind of thing that people designed a special "low-level" language called WebAssembly that browsers would run alongside JavaScript and that was more appropriate as a compilation target for other languages.
[5] Technically, this is a set of nodes arranged in a tree structure. So, for instance, you might have the root of the tree and then paragraphs as children and within each paragraph, hyperlinks, etc.
[6] In the context of graphics, this cycle of specialized optimizations followed by the optimized system becoming more generalized and then the generalized system undergoing further specialized optimizations is sometimes called the wheel of reincarnation (this name due to Ivan Sutherland).
[7] More commonly the markers are curly braces, but if I use curly braces here, the template processor which renders this site will try to process it, so I'm using square brackets.
[8] Basically, Xanadu but built out of duct tape and cardboard.