Educated Guesswork

Understanding Memory Management, Part 6: Basic Garbage Collection

Who's going to clean up this garbage (Spiderman)

This is the sixth post in my multipart series on memory management. You will probably want to go back and read Part I, which covers C, parts II and III, which cover C++, and parts IV and V which cover Rust. C++ RAII and Rust do a lot to simplify memory management but still force you to constantly think about how you are using memory (this is even more true with Rust). It's natural to ask why we have to do all this work and why the computer can't just figure things out. The answer is that it can—at a cost—which brings us to the other major approach to handling memory: automatic memory management, aka garbage collection.[1] In this post, I want to introduce the basic ideas behind garbage collection and describe the main algorithms which provide the basis for modern GC systems.

What behavior do we want? #

In C and C++, the programmer is responsible for two major memory management tasks:

  • Allocating new memory on the heap when it's needed
  • Returning unused memory back to the heap so it can be re-allocated later.

In C, these operations are both completely manual, using malloc() and free(). C++ provides manual memory management with new and delete, but also provides various mechanisms to automatically allocate and deallocate memory, such as container classes, RAII, and smart pointers; the programmer is still frequently responsible for explicitly allocating objects and has to understand their lifetimes. Similarly, Rust requires you to explicitly manage memory, but protects you from errors when you fail to do so.

In a wide variety of languages, ranging from Go to JavaScript, the system handles all of these operations completely automatically. The programmer just creates variables and objects and the language takes care of allocating memory as required and de-allocating it at an appropriate time. To see this in action, consider the following trivial C function and its JavaScript equivalent:

C #

void foo() {
  int a = 1;
  int b[2] = {1};
  size_t b_len = 1;
  b[b_len++] = 2;
}

JS #

function foo() {
  let a = 1
  let b = [1]
  b.push(2)
}

Both versions of this code do the same thing. First, they create two variables.

a
: An integer set to one
b
: A list consisting of the single integer 1

We then push another element onto b to make the list [1, 2].

The first line of the function is basically the same in both languages. Take a look at the second line, however, where we create b. This C code actually makes two memory-related decisions:

  1. It puts the memory on the stack (because we didn't call malloc())
  2. It allocates a fixed-size array of size 2.

Note that even though we only add one value (1) to b, the array is still of size 2. This is why we need a separate length field b_len to keep track of how many elements are actually in b. The syntax {1} tells C to initialize the array with 1 and then as many zeros as are required to fill the rest of the array. Except for the initialization, this should all be pretty familiar from part I.

Now compare the corresponding JS code, which looks fairly similar, but that's just because JS imitates C syntax. Just like in C, we make a local variable b that can hold a list of things, but we never explicitly tell JS whether it should be on the stack or the heap or how big it should be. So, what are the answers to those questions? Who knows? Who cares? None of your business! The JS engine will do whatever it thinks best, and may not even do the same thing every time. Specifically:

  • The JS engine will automatically grow b whenever you add new elements. In this respect it's like the C++ vector container.

  • Local variables can be stored on either the stack or the heap. In fact, they can be stored in one place and then moved to the other when conditions change; it's totally up to the JS engine.

Whatever the JS engine does, it's essentially transparent to the programmer, who just writes code and lets the language worry about it. The same thing applies to many other languages, like Python, Lisp, etc.
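To see the automatic growth in action, here's a tiny sketch in ordinary JS (nothing Memo-specific): you just keep pushing, and any reallocation happens invisibly inside the engine.

```javascript
// A JS array grows transparently as you push onto it; the engine
// handles any resizing or reallocation behind the scenes.
let b = [1];
for (let i = 2; i <= 1000; i++) {
  b.push(i);
}
console.log(b.length); // 1000
```

At no point did we say how much storage to reserve, or where; that's entirely the engine's business.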

In these languages, just as you don't have to worry about when you are allocating memory, you also don't have to worry about de-allocating it; the language automatically detects when you aren't using memory and de-allocates it, in a process called "garbage collection" (commonly, GC). You just write code![2]

Memo: A Tiny Language #

In this post, we'll be taking a slightly different approach than with previous posts. Because the internals of garbage collection are (mostly) invisible and real-world garbage collectors are very complicated, working with a real language isn't that useful for explanatory purposes. Instead, I've designed and implemented a tiny language called Memo (for "memory demo") that lets us look at the impact of various garbage collection approaches without being distracted by a lot of language mechanics.

Memo is deliberately not Turing complete—it has no conditionals or loops—and only contains a small number of operations for manipulating objects.

Data Types #

Memo comes with three native types. The first two of these are familiar:

  • Integers representing bare numeric values.
  • Pointers to objects in memory (i.e., the address of those objects).

The only other data type in Memo is the Tuple, which represents an ordered set of values, each of which can be either an integer or a pointer. Tuples are wrapped in parentheses, as in (1 2 3). Unlike Python lists or JS arrays—but like Rust or Python tuples—you can't extend tuples, so a tuple of length X can't be turned into a tuple of length Y, though of course you can create a new tuple of the right size.[3]

This isn't a particularly rich type system, but it's actually sufficient to construct a variety of data structures, as anyone who has worked with Lisp will recognize. For example, you can make a list of integers out of 2-valued tuple objects, with the first value of each tuple being an integer and the second value being a pointer to the next tuple.
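As a sketch of that idea in plain JavaScript—with object references standing in for Memo pointers, and the cons/toArray names being my own invention:

```javascript
// Sketch: a Lisp-style linked list built from 2-element "tuples".
// Each cell holds a value and a reference to the next cell;
// null marks the end of the list.
function cons(value, next) {
  return { value, next }; // stands in for the 2-tuple (value next)
}

// The list (1 2 3) as a chain of cells.
const list = cons(1, cons(2, cons(3, null)));

// Walking the chain recovers the values in order.
function toArray(cell) {
  const out = [];
  for (let c = cell; c !== null; c = c.next) {
    out.push(c.value);
  }
  return out;
}

console.log(toArray(list)); // [ 1, 2, 3 ]
```

In Memo itself you'd write the same structure as nested tuples, e.g. (1 (2 (3 null))).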

Variable Naming and Addressing #

Like any language, Memo has named variables. All variables are global and variables are automatically created upon assignment without the need for let or var (simple, remember?). It's not legal to read variables which haven't been assigned to yet.

Variables in Memo are named in the usual fashion, as strings of letters and numbers starting with a letter, e.g., tmp1.

So, for instance, the following code:

a = 20

Creates a variable a and assigns the value 20.

You create a tuple in the way you expect:

a = (1 2 3)

This code automatically allocates a tuple of size 3 on the heap and assigns the address to a. It's also legal to have tuples contain other tuples, which really just means that one of the elements of the tuple is the address of another tuple.

a = (1 2 3 (4 5))

Variables themselves aren't typed, so it's perfectly legal to do:

a = (1 2)      # a is a pointer to the tuple (1 2)
a = 20 # a is the integer 20

In order to make this work, internally, Memo keeps track of whether a given value is a pointer or an integer (see below). This should be familiar if you've used other dynamically typed languages like Python or JavaScript. The inner values in a tuple are addressed with the notation a.0, a.1, and so on.

Variables never go out of scope, but you can assign them to null, which has most of the same effect, except that they're still floating around in the namespace.

Demo #

The whole implementation of Memo is in JavaScript, so you can run it directly in your browser (which is why I did it). I've embedded a "read-eval-print-loop" (REPL) window so you can play with it:

A simple Memo REPL

Note that all of the operations we've looked at so far are "expressions" which is to say that they return the result of the operation, which then gets printed in the console (this is the "print" part of the REPL). For instance, if we set a=20 and then type a we will get Integer(20). This is how you get the value of an object, because there's no print() function.

It should be perfectly safe to type anything in this window: it doesn't interact with anything in the rest of this post or on your computer.[4] If you violate Memo's syntax, it will just complain to you and you can enter a new instruction.

The Allocator Interface #

Next we need to look at how memory allocation works in Memo, because we're going to need that to understand how the GC system works.

The Heap #

Any memory allocator needs a pool of memory blocks (the heap) to implement from. JS doesn't really allow for raw memory access, so instead we are going to emulate the heap as a single big JS ArrayBuffer—which is just JS's way of conveniently handling an array of bytes—encapsulated in a Memory object. Internally, we create the heap like so:

let memory = new Memory(10000); // Make a heap of size 10000 bytes

Addresses in the heap are just integers in the range [0, heapSize), so we can read and write with obvious-looking interfaces:

let byte = memory.readUInt8(1);    // Read a byte from address 1.
let word = memory.readUInt32(32);  // Read a 32-bit word from address 32.
memory.writeUInt8(1, byte + 1);    // Write byte + 1 to address 1.

And so on. There aren't any rules about aligned access in this implementation, so you can read a 32-bit word from address 3 or whatever. Similarly, you can read the pieces of a word byte by byte and then put it back together into a word. Many real systems actually do have alignment requirements.
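The post doesn't show the Memory class itself, but here is a minimal sketch of what such a wrapper might look like. The method names match the interface shown above; the internals—a DataView over the ArrayBuffer, little-endian words—are my assumptions, not necessarily what Memo actually does.

```javascript
// Sketch of a Memory wrapper over an ArrayBuffer. DataView happens to
// permit unaligned access, which matches the no-alignment-rules behavior
// described in the text.
class Memory {
  constructor(size) {
    this._view = new DataView(new ArrayBuffer(size));
  }
  readUInt8(addr) {
    return this._view.getUint8(addr);
  }
  writeUInt8(addr, value) {
    this._view.setUint8(addr, value);
  }
  readUInt32(addr) {
    return this._view.getUint32(addr, true); // little-endian (assumed)
  }
  writeUInt32(addr, value) {
    this._view.setUint32(addr, value, true);
  }
}

const memory = new Memory(10000);
memory.writeUInt32(3, 0xdeadbeef);            // unaligned write is fine
console.log(memory.readUInt32(3).toString(16)); // "deadbeef"
```

You could equally read those four bytes back one at a time with readUInt8 and reassemble the word, which is the byte-by-byte access the text mentions.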

Allocation #

Objects are allocated using the Allocator class, which has a simple interface. For instance, this code allocates an object of type Simple and returns its address (i.e., the index of the first byte of the object on the heap).

addr1 = allocator.allocate("Simple");

As a practical matter, the only allocated type in Memo is the tuple, but when I wrote this code originally I expected to have more than one kind of type (e.g., structures with named members) and so we have a more flexible system in which you can have objects of arbitrary types, which is more than we need for Memo. In Memo, however, each size tuple is its own type, automatically named something like (5) for a tuple of length 5.

This is all hidden from the programmer, with the consequence that when you write something like a = (1 2 3) internally the language runtime does:

allocator.allocate("(3)");

and then assigns the internal values.

Object Layout #

The diagram below shows the situation after running a trivial program which is shown below. This program allocates five tuples:

  • The tuple (1 2 3) which is assigned to the global variable a.
  • The tuple (3 4) which is assigned to the first element in a.
  • The tuple (8 9) which is at memory address 44 but is not assigned to any global variable.
  • The tuple (5 6 7 Pointer(44)) which points to the previous tuple.
  • The tuple () which is assigned to the global variable c.

You can use the "Previous" and "Next" buttons to step through this program line by line and see how the various tuples get made. The highlighted line of code is the one that is about to run (just like with a normal debugger) rather than the one that has just run.

[Interactive widget: heap layout after running the program below — tuples at addresses 16, 32, 44, 56, and 76, with roots a → 16, b → 56, and c → 76.]
1: a = (1 2 3)
2: a.0 = (3 4)
3: b = (5 6 7 (8 9))
4: c = ()
Memo memory layout

The global variables are shown in the top row. These would typically live on neither the stack nor the heap, but of course they are somewhere in memory, so I'm just showing them here for convenience. Memory address 0 starts at the left of the gray box marked "Reserved", so the first allocatable address is 16. This is actually reasonably realistic because the allocator typically needs to reserve some space for its own bookkeeping (e.g., the address of the last allocated block). In my implementation, these values are just stored in JS variables outside of the allocated memory region, and I decided to block off the first 16 octets for a more prosaic reason: I wanted the value of the null pointer (the one that doesn't point anywhere) to be 0, and for that to work 0 cannot be a valid address for a real object. Note that there is actually nothing that requires the null pointer to have a memory representation of all 0 bits, even in C, but it's convenient and common.

Each pointer is drawn with an arrow that shows the object it points to. You'll notice that each object starts out with a single 4-byte word which is where I store the type of the object (in this case, just the number of elements in the tuple). This word is also used for some other metadata, as we'll see later. The result is that even an empty tuple like () consumes a minimum of 4 bytes of storage. What these words currently show is the address of the object (e.g., @16) and the length of the tuple in parentheses (e.g., (3) for a tuple of length 3). Note: this is the representation with a specific type of garbage collector (mark-sweep). Other garbage collectors will have slightly different layouts, as seen below.

This is a very simple allocator, in which each object is allocated directly after the end of the previous object, with the result that all the objects are contiguous. This is what's called a "bump allocator" and is very easy to implement because we just need to store one piece of extra data, the address of the next object to be allocated, which is right after the end of the last object that was allocated. When you allocate an object of size S you then just "bump" this value up by S. Here is the actual code for our bump allocator, which fits neatly in your head.

bump_allocate(size) {
  let address;

  if (this._end + size > this._context.heapSize()) {
    throw new Error("Out of memory");
  }
  address = this._end;
  this._end += size;

  return address;
}

One interesting thing to note is that the tuple (8 9) appears in memory before the tuple (5 6 7 Pointer(44)) which points to it, despite the tuples being introduced in the opposite order, as in:

b = (5 6 7 (8 9))

What's going on here is that Memo's interpreter works from the bottom up, which means that it needs to first allocate the memory for (8 9) and then it can stuff the address in the other newly-created tuple. This is a common design for this kind of simple parser.

Reading Objects in Memory #

Importantly, everything we have stored on the heap is self-describing, which allows us to decode the contents of the heap without ancillary storage, as long as we know the address of the first allocated object in memory, in this case 16. This works because each object has a common prefix in the first word, as noted above:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|M|F|  RESERVED |                TypeId (24 bits)               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Here we have a 32-bit (1 word on 32-bit processors) prefix which starts with a flags byte. The first two bits in this byte are assigned to a M (for marked) and F (for free) flag; we'll cover these later. The rest are reserved for future use.

The rest of the prefix contains the TypeId, which is an identifier for the type of the object. This identifier can be implemented in a number of ways, including:

  • A pointer to a type description.
  • An index into a table of type descriptions.

In either case, the type description will store the number of pointers in the object and their memory locations within the object. Importantly, every instance of an object needs to be laid out the same way, so that once you have the object pointer and the type, you know everything else about every element in the type.

The object decoding process for an object at address addr proceeds as follows:

  1. Read the first 32-bit word and use that information to extract the type ID, which, as above, is stored in the low order 24 bits.
  2. Look up the type using the type ID and from there determine the number of elements in the object.
  3. Iterate over the elements and decode them. As noted above, elements are always of type pointer or type integer, but any given element can be either and element types can change. In order to address this we steal a bit from the top of each element to use for the type (this is often called "coloring" the pointer). For pointers, the high bit is 0 and for integers, the 0x80000000 bit is set. This allows you to immediately see whether a given element is a pointer or an integer, but at the cost that you can only express values up to 2³¹.

Because all elements are the same size, the type also tells us the size of the object and so we can skip over to the next object, which, as noted above, starts immediately after the current object (we'll get to holes created by fragmentation later).
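The coloring scheme in step 3 can be sketched in plain JS. The INT_BIT name and the helper functions are my own; the bit assignment—high bit set for integers, clear for pointers—is as described above. The `>>> 0` dance is needed because of how JS treats the sign bit, which is the subject of the aside on binary flags.

```javascript
// Sketch of element "coloring": the top bit of each 32-bit element
// distinguishes integers (bit set) from pointers (bit clear).
const INT_BIT = 0x80000000;

function encodeInteger(n) {
  // Set the tag bit; >>> 0 forces an unsigned 32-bit result.
  return (n | INT_BIT) >>> 0;
}

function encodePointer(addr) {
  return addr; // valid heap addresses already have the high bit clear
}

function isInteger(element) {
  return ((element & INT_BIT) >>> 0) === INT_BIT;
}

function decodeInteger(element) {
  return element & ~INT_BIT; // clear the tag bit, leaving the value
}

const e = encodeInteger(42);
console.log(isInteger(e));                 // true
console.log(decodeInteger(e));             // 42
console.log(isInteger(encodePointer(16))); // false
```

The cost, as the text says, is that integers are limited to 31 bits of value; the 32nd bit is spent on the tag.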

Aside: Dealing with binary flags in JS #

As an aside, it's a giant pain dealing with binary flags in JS because there's really only one number type (float) and JS has decided that the literal 0x80000000 is a positive number, but bitwise operators work on signed 32-bit integers, where the same bit pattern is a negative number. As a result you get this kind of thing.

> flag = 0x80000000
2147483648
> a = 3
3
> b = a | flag
-2147483645
> (b & flag) == flag
false

LOLWAT?

I'm not a JS wizard but according to Gemini the fix is to add a bunch of 0-sized unsigned right shifts, like so:

> b = (a | flag) >>> 0
2147483651
> (b & flag) >>> 0
2147483648
> ((b & flag) >>> 0) == flag
true

So when you see these scattered all over the code you know why.

What is garbage? #

After that warmup, we're now ready to talk about garbage collection. As already stated, there's no way to explicitly tell Memo that we're not using a piece of memory, but we don't want to just have the amount of memory we use grow monotonically, so we need some way for Memo to reclaim that memory when it's no longer in use, hence garbage collection.

Conceptually, we have three kinds of memory:

  1. Un-allocated (or freed/de-allocated) memory
  2. Memory which has been allocated and is in use
  3. Memory which is allocated but is not in use ("garbage")

In C and C++, the allocator (malloc()/free()) knows what memory has been allocated and what has not but it does not know which allocated memory is in use and which is not; it leaves that responsibility to the programmer, who must free memory when it is no longer in use. Automatic memory management requires a mechanism to identify which allocated memory is garbage and collect it.

Defining "in use" is a somewhat tricky proposition: if I allocate some memory at time T and just keep it around until the program ends, is it in use?[5] It's entirely possible that under other circumstances I might have used it. For instance, suppose my browser loads some math fonts and then I never go to a page that renders math. But I might have, and obviously both I and the programmer would be unhappy if the language decided that the fonts would never be needed and just deallocated them, with the program crashing when I went to a site that used math!

Pretty much every system I am familiar with uses unreachability as the definition of garbage. Specifically, the assumption is that there are a set of "root" pointers which aren't themselves on the heap such as local variables (on the stack) or global variables. A piece of memory is defined as in use if you can reach it by following pointers from one of those root variables, e.g., root → B → C → D. If it's not reachable from one of the roots, then it's not in use. Because the language already knows which data is allocated and which isn't, it can now distinguish all three types of memory:

  1. Un-allocated (or freed/de-allocated) memory
  2. In-use memory is allocated and reachable
  3. Garbage is memory that is allocated but unreachable

Note that this excludes some data which is morally garbage in the sense that the programmer knows they will never use it, but the language has no way of knowing that. The reachability definition defines garbage as memory which the language can prove the program can't use because there's no way to reference it. The figure below provides a simple example: allocations A–G are all reachable from either root1 or root2. Allocations H–K are unreachable and are therefore garbage.

In-use and garbage memory

Some in-use memory and some garbage
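The reachability rule can be sketched as a simple graph traversal in plain JavaScript, with ordinary objects standing in for heap allocations (all names here are invented for illustration):

```javascript
// An object is "in use" iff it can be reached by following pointers
// from some root. Plain JS objects stand in for heap objects; each has
// a `pointers` array of outgoing references.
function reachable(roots) {
  const seen = new Set();
  const queue = [...roots];
  while (queue.length > 0) {
    const obj = queue.pop();
    if (obj === null || seen.has(obj)) continue;
    seen.add(obj);
    queue.push(...obj.pointers); // follow every outgoing pointer
  }
  return seen;
}

// Tiny heap: root1 -> B -> C, while D points at C but nothing points
// at D, so D is unreachable and therefore garbage.
const C = { name: "C", pointers: [] };
const B = { name: "B", pointers: [C] };
const D = { name: "D", pointers: [C] };
const root1 = { name: "root1", pointers: [B] };

const live = reachable([root1]);
console.log(live.has(B), live.has(C)); // true true
console.log(live.has(D));              // false
```

Notice that D points *at* live data but is still garbage: reachability only counts pointers flowing out from the roots, not pointers into the live set.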

Now that we understand what garbage is, let's take a look at how to collect it.

Reference Counting #

As noted above, the simplest form of garbage collection is reference counting, which we already saw in part III. Reference counting works similarly in a garbage collected language as it does in C++, except that all of the machinery is hidden under the hood:

  • Every pointer is a reference counted pointer. This can be done with an intrusive pointer style design because we're starting from scratch and so can just insist that every object have an embedded reference count.

  • It's not possible to unbox pointers, so you never have to worry about any aliasing issues such as a raw pointer and a shared pointer pointing to the same object.

The language runtime automatically takes care of incrementing and decrementing reference counts as appropriate, and freeing objects when the reference count goes to zero. After that, things just work without you having to think about it.
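Here is a sketch of the bookkeeping the runtime does on every assignment, using plain JS objects as a stand-in for Memo's real internals (the object shape and helper names are my invention):

```javascript
// On every pointer assignment the runtime bumps the new target's count,
// drops the old target's, and frees any object whose count hits zero.
function incRef(obj) {
  if (obj !== null) obj.refCt++;
}

function decRef(obj) {
  if (obj === null) return;
  obj.refCt--;
  if (obj.refCt === 0) {
    obj.freed = true;
    // Freeing an object releases its references to its children.
    for (const child of obj.pointers) decRef(child);
  }
}

function assign(slot, name, newObj) {
  incRef(newObj);     // increment first, in case old === new
  decRef(slot[name]);
  slot[name] = newObj;
}

// Model a = (P 2 3) where P points at another tuple.
const inner = { refCt: 0, freed: false, pointers: [] };
const outer = { refCt: 0, freed: false, pointers: [inner] };
incRef(inner); // outer's first element references inner

const globals = { a: null };
assign(globals, "a", outer); // a -> outer
assign(globals, "a", null);  // a = null: both objects become garbage
console.log(outer.freed, inner.freed); // true true
```

The increment-before-decrement order in assign is the classic defensive move: if the old and new targets are the same object, decrementing first could free it out from under you.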

The following widget shows reference counting in action with a simple memo program. You can use the previous and next buttons to step through the program one line at a time.

1: a = (1 2 3)
2: a.0 = (4 5 6)
3: b = (7 8 (9 10 11))
4: a = null
Freeing memory when refct goes to zero

In the first three lines we build up a structure with four tuples:

  • The tuple (P1 2 3) pointed to by a
  • The tuple (4 5 6) pointed to by a.0 (P1).
  • The tuple (7 8 P2) pointed to by b.
  • The tuple (9 10 11) pointed to by b.2 (P2).

Note that in Memo it's not possible to create an object that isn't pointed to by anything, so we can ignore that case. In line 4, we set a to null, which turns the object it points to into garbage, as well as the (4 5 6) tuple that that object points to. Once you execute line 4, both objects will be freed, leaving only the (7 8 P2) tuple pointed to by b and the (9 10 11) that P2 points to.

Note that the memory layout is slightly different than in the previous example in that we have an extra RefCt field between the type word and the elements. As suggested by the name, this field stores the reference count value. We have 32 bits for the reference count, which means that we can have up to 2³² references, which should be enough given that we can only have 2³¹ objects (because our pointers are 31 bits long). The result, however, is that when we use reference counting, we consume more memory per object than with some other kinds of garbage collection. This kind of efficiency concern can be a big deal in some systems, but not in a toy implementation like ours.

The key thing to notice is that what makes automatic memory management straightforward is that the system doesn't give you a choice. C and C++ were originally built with manual memory management and so every attempt to add automatic memory management has to contend with the old semantics. If you just build a language with automatic memory management from the ground up, things are a lot simpler.

Freed Memory #

I said above that when the reference count goes to zero, we free an object. What does that mean in practice? We've got one big contiguous memory region, so it's not like we can return the memory associated with a single object. Instead, freeing an object is a bookkeeping operation in which we note that that region is no longer in use. We do this by setting the F (for free) bit in the first word. Of course, even though the object is not in use, we still need to know how big it is. We address this by using the rest of the first word to store the object size.[6] Because even unused memory regions are structured this way, we can understand the entire memory layout just by starting at the bottom of the heap and working forward one object at a time until we get to the top. For instance, here is a simple function which counts all the live objects:

function ct() {
  let counter = 0;
  let scan = this._start;

  while (scan < this._end) {
    const flags = ObjectManager.getFlags(this._context, scan);
    const len = ObjectManager.getSize(this._context, scan);
    scan += len;

    if ((flags & FREE_BIT) === 0) {
      counter++;
    }
  }

  return counter;
}

The only extra information you need is the address of the start of the heap (the start of the lowest allocated object) and the end of the heap (the last allocated byte). From there, you can just scan the entire heap, stopping when you get to the end.

Circular References #

Unfortunately, reference counting has a number of disadvantages that prevent most languages from using it as the sole form of garbage collection. The most important of these, as discussed in part III, is that it deals badly with circular references, as shown in the example below:

1: a = (1 (2 null))
2: a.1.1 = a
3: a = null
Circular references

As expected, when we have two objects which point at each other, neither will be freed even if we delete the reference from global variable a.
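You can model why neither count ever reaches zero with a few hand-maintained counters (this is a model of the bookkeeping, not Memo's actual implementation):

```javascript
// Why reference counting leaks cycles: once two objects point at each
// other, each holds the other's count above zero even after the last
// outside reference is gone.
const x = { refCt: 0 };
const y = { refCt: 0 };

x.next = y; y.refCt++; // x -> y
y.next = x; x.refCt++; // y -> x (the cycle)

let a = x; x.refCt++;  // the root reference from global variable a

a = null; x.refCt--;   // line 3: drop the root

// Neither count is zero, so a refcounting collector frees neither
// object, even though nothing outside the cycle can reach x or y.
console.log(x.refCt, y.refCt); // 1 1
```

The counts are "correct"—each object really is referenced once—but the only remaining references come from inside the garbage itself.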

In part III we showed how to break reference cycles using weak pointers, but weak pointers require that the programmer explicitly tag some references as weak and some as strong, which undercuts the "it just works" value proposition of the garbage collector. Some languages do have support for weak pointers, but you really don't want the programmer to have to pay attention to this every time they create a reference cycle, which happens all the time. Less important, but still relevant, is that there is performance overhead from constantly having to increment and decrement the reference count of objects whenever you pass them around.

For these reasons, most garbage collected languages use another form of garbage collection, either on its own or in combination with reference counting. This nearly always means one of a broad class of algorithms called "tracing garbage collection".

Tracing Garbage Collection #

The basic idea behind a tracing garbage collector is to start from the root pointers and follow each pointer until you've enumerated every reachable object. Every other allocated object is garbage and can be freed. This is a simple idea, but doing it well is hard.

Mark-Sweep #

Let's start with the most elementary[7] type of tracing garbage collector: mark-sweep. A mark-sweep collector proceeds in two passes:

  1. Trace all the objects from the roots, recording which objects are reachable.
  2. Scan over the entire heap, examining each object and freeing those which were not recorded as reachable.

Marking #

The first problem is how to trace all the reachable objects. Conceptually this is just a standard graph traversal problem, where you want to start at the roots and touch every node connected by an edge. You may be familiar with algorithms for traversing trees, and this is a similar problem with two additional complications:

  1. You don't just start from the single root of the tree but from multiple roots, which may point to some of the same nodes.

  2. This isn't necessarily an acyclic graph, in that you can have reference cycles; recall that this is why we can't just use reference counting.

Nevertheless, this isn't particularly complicated. Here's a simplified version of the marking algorithm from our code (I've removed some of the JavaScript generator machinery that we use to step through the GC one piece at a time.)

function *mark_incremental(roots) {
  this._work_queue = [];

  for (let root of roots) {
    if (!isPointer(root) || root == NULL_POINTER) {
      continue;
    }
    ObjectManager.mark(this._context, root);
    this._work_queue.push(root);
  }

  while (this._work_queue.length > 0) {
    // Pop first, then process
    const current = this._work_queue.pop();
    const num_ptrs = ObjectManager.getNumValues(this._context, current);

    for (let i = 0; i < num_ptrs; i++) {
      const ptr = ObjectManager.getValue(this._context, current, i);

      if (isPointer(ptr) && ptr !== NULL_POINTER) {
        const flags = ObjectManager.getFlags(this._context, ptr);

        if (!(flags & MARK_BIT)) {
          ObjectManager.setFlags(this._context, ptr, flags | MARK_BIT);
          this._work_queue.push(ptr);
        }
      }
    }
  }
}

The basic logic is that every time we encounter a pointer to a new object we add it to the work_queue. Initially the queue is populated by the pointers stored in the roots (global variables), but then as we chase them we encounter new pointers stored in objects on the heap, which are themselves added to the work queue. We continue to pop objects off the work queue until the work queue is empty, at which point the marking process is done.

This design needs some way to know which objects we have already seen before. Otherwise, if we have a loop where A points to B and B points to A we'll just go around that loop indefinitely. Unlike the pointer/integer distinction, this "seen" information cannot be stored in the pointers themselves because you might have two pointers to the same object, and if you first reach the object via pointer A you want to know that it was marked when you reach it again via pointer B; instead, it has to be stored along with the object, just as the reference count was. In this case, we have plenty of space in the type word, so we use the M bit to store a "marked" value, which indicates that the object has already been seen. When we encounter an object, we only add it to the work queue if the "marked" bit is clear (i.e., 0).

Sweeping #

Once we've completed the marking phase, we move on to the sweeping phase. As described above, we can just move through memory from the bottom of the heap one object at a time, using the object type field to know the size of an object and thus where one object ends and another begins.

When we encounter a new object, we first check the free bit. If that is set, the object is free and we move on to the next object. This can happen if the object was freed in a previous pass. We then check the mark bit.[8]

  • If the mark bit is set, we clear it so that it can be marked in a future GC pass.
  • If the mark bit is clear, we set the free bit.

We then move on to the next object, continuing until we get to the end of the heap. Here's a slightly cleaned up version of the JS code Memo uses for mark-sweep.

*gc_incremental(roots) {
  // Mark.
  this.mark_incremental(roots);

  let scan = this._start;

  while (scan < this._end) {
    const flags = ObjectManager.getFlags(this._context, scan);
    const len = ObjectManager.getSize(this._context, scan);

    // This is already free.
    if (flags & FREE_BIT) {
      scan += len;
      continue;
    }

    const nextscan = scan + len;

    // This is not marked, so free it.
    if ((flags & MARK_BIT) === 0) {
      ObjectManager.setFlags(this._context, scan, FREE_BIT);
      this._free_bytes += len;
    } else {
      ObjectManager.unmark(this._context, scan);
    }

    // Skip to the next entry.
    scan = nextscan;
  }
}

You can use the following widget to step through the entire mark-sweep process. This is the same code as we saw above with reference counting with one small change which I'll get to shortly. As before, you can step through the code and watch how memory changes.

1: a = (1 2 3)
2: a.0 = (4 5 6)
3: b = (7 8 (9 10 11))
4: a = null
5: #gc
Mark sweep

The first thing to notice is that after line 4 executes, we have the same two pieces of garbage (4 5 6) and (Pointer(32) 2 3), but unlike with reference counting they haven't been freed, but instead are just lurking around. This is because unlike reference counting, tracing garbage collectors don't free memory as soon as it becomes garbage; instead you have to explicitly run the garbage collection algorithm. The objects are still unreachable, so there's no way for them to be accessed, but they're still there taking up space:

[Heap after line 4, before GC runs: @16: (Pointer(32) 2 3); @32: (4 5 6); @48: (9 10 11); @64: (7 8 Pointer(48)); roots: a = null, b = Pointer(64)]

In order to actually garbage collect these objects, we need to run the mark-sweep algorithm. Ordinarily this is something that the system would do automatically, but to make things simple I've added a pseudo-instruction that invokes the garbage collector in the form of #gc. I say this is a pseudo-instruction because from the perspective of Memo it's a comment; instead the widget notices that you've asked for GC and runs the garbage collector externally.[9]

Once we get to the GC instruction, you can then step through the GC algorithm one step at a time. The orange "scan" pointer shows which object we are examining now, and at appropriate times the "work queue" will be shown. For instance, here is the situation when the algorithm is examining the object at 64 and has just marked the object at 48 ((9 10 11)) and added it to the work queue.

[Mark-sweep marking phase: scan is at 64; the objects at 48 and 64 are marked; work queue: Pointer(48)]

Note that the marking phase doesn't proceed in any particular order through memory, because it's just tracing out the graph of object relationships. That's why we look at 64 first (because it's pointed to by b) and then 48 (because it's pointed to by 64). By contrast, the sweeping phase proceeds linearly through memory. Below, you can see partway through the sweeping phase, after we have freed 16 and right before we free 32.

[Mark-sweep sweeping phase: the object at 16 has been freed; scan is at 32, which is about to be freed; the marked objects at 48 and 64 will be retained]

At the end of the sweep process, all of the unreachable objects will be freed, leaving only the reachable objects, just as with reference counting.

[Mark-sweep complete, with holes at 16 and 32: @48: (9 10 11); @64: (7 8 Pointer(48)); roots: a = null, b = Pointer(64)]

Reclaiming Memory #

Now consider what happens if we do a new allocation as in:

c = (12 13 14)

At this point, there are two things that can happen:

  1. We can continue to bump allocate, and put the new object above the last object in memory, in this case at address 80.

  2. We can reuse one of the regions we've freed, in this case probably at address 16.

There's a tradeoff here in that bump allocation is generally faster—you just need to increment one pointer—but eventually we have to start reusing free memory or there was no point in garbage collecting at all. The basic challenge is fragmentation: once the program has run for a while you end up with a lot of "holes", which is to say small free regions interspersed with allocated regions, and you can end up with plenty of total free memory but no single region big enough for a new allocation. Reusing aggressively conserves the open region at the top of the heap for big allocations, but at the cost of having to search for a region that will fit each new allocation rather than just incrementing the next pointer. Your allocation strategy needs to strike a compromise between these two.

Memo's allocator uses a fairly simple compromise strategy: it bump allocates until either:

  1. The top of the heap is about halfway through the heap size.
  2. About half of the region that has been used is holes.

After that it tries to reuse free space; this means that you get to use the fast allocator when there is no memory pressure, but you start trying to reuse while you still have plenty of room for allocations that don't fit into any existing hole. Memo will coalesce adjacent free regions as necessary, so multiple small holes can be merged into a single bigger hole.
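As a rough illustration, here is a toy version of this kind of policy (a sketch with hypothetical names, not Memo's actual allocator): bump allocate while there is headroom, then fall back to first-fit reuse of holes, coalescing adjacent holes on free.

```javascript
// Toy allocator sketch (hypothetical names, not Memo's actual code).
// Bump allocate while the heap is under ~50% used; after that, try
// first-fit reuse of holes, falling back to bump allocation.
class ToyAllocator {
  constructor(size) {
    this.size = size;
    this.top = 0;    // bump pointer: next never-used address
    this.holes = []; // free regions as { addr, len }, kept in address order
  }

  alloc(len) {
    // Fast path: bump allocate while there is no memory pressure.
    if (this.top < this.size / 2 && this.top + len <= this.size) {
      const addr = this.top;
      this.top += len;
      return addr;
    }
    // First-fit: take the first hole big enough, shrinking it in place.
    for (let i = 0; i < this.holes.length; i++) {
      const hole = this.holes[i];
      if (hole.len >= len) {
        const addr = hole.addr;
        hole.addr += len;
        hole.len -= len;
        if (hole.len === 0) this.holes.splice(i, 1);
        return addr;
      }
    }
    // No hole fits: bump allocate if there is still headroom.
    if (this.top + len <= this.size) {
      const addr = this.top;
      this.top += len;
      return addr;
    }
    return null; // out of memory: time to garbage collect
  }

  free(addr, len) {
    // Insert the hole in address order, then coalesce adjacent holes.
    this.holes.push({ addr, len });
    this.holes.sort((a, b) => a.addr - b.addr);
    for (let i = 0; i + 1 < this.holes.length; ) {
      if (this.holes[i].addr + this.holes[i].len === this.holes[i + 1].addr) {
        this.holes[i].len += this.holes[i + 1].len;
        this.holes.splice(i + 1, 1);
      } else {
        i++;
      }
    }
  }
}
```

With a policy like this, a run of adjacent frees collapses into one large hole, and allocations stay on the fast path until the heap starts to fill up.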

There are a lot of fancier strategies one can use, especially for figuring out which hole to put new allocations into. For instance, you can keep a table of the free regions of a given size, or "bucket" allocations into a small number of sizes so that it's easier to find an appropriate location (at the cost of wasting space when an allocation is just over the size of one bucket and a lot smaller than the next biggest bucket). None of these eliminate fragmentation, but they can reduce it. Whatever strategy you use, fragmentation is just something you have to live with under mark-sweep or reference counting.

The reason we have fragmentation is that we don't get to choose which objects will be freed. Suppose we have three objects at 16, 32, and 48 of size 16. If we then free the objects at 16 and 48, we now have 32 bytes worth of free memory, but we can't allocate a 32-byte object because that memory is discontinuous; instead we need to use the bump allocator. Eventually this process results in lots of fragmentation. But what if we could instead slide the object at 32 over to 16, leaving the whole 32–64 region free? This isn't possible in C-like languages where the pointers are directly exposed to the programmer, but if you don't let the programmer look at pointers, you have a lot more freedom to operate.
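The scenario above can be written out directly (a toy illustration, not Memo code): the total free space is enough for the new object, but no single hole is.

```javascript
// Fragmentation in miniature: free 16-byte regions at 16 and 48 give 32
// bytes of total free space, but no single hole can hold a 32-byte object
// because the two regions are not adjacent.
const holes = [
  { addr: 16, len: 16 },
  { addr: 48, len: 16 },
];

const totalFree = holes.reduce((sum, h) => sum + h.len, 0); // 32 bytes free
const fits32 = holes.some((h) => h.len >= 32);              // but no hole fits

// If we could slide the live object at 32 down to 16, the two holes would
// become one contiguous 32-byte region; that is exactly what compaction does.
```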

Mark-Compact #

The next garbage collector we'll be looking at is what's called mark-compact. As the name suggests, a mark-compact collector starts with a marking phase just like mark-sweep, but once marking is complete, instead of just leaving holes it slides every object as far towards the left (low memory) as possible, eliminating all the holes.

The following two diagrams show the situation before and after the GC pass.

[Before GC with mark-compact: @16: (Pointer(36) 2 3); @36: (4 5 6); @56: (9 10 11); @76: (7 8 Pointer(56)); roots: a = null, b = Pointer(76)]
[After GC with mark-compact: @16: (9 10 11); @36: (7 8 Pointer(16)); roots: a = null, b = Pointer(36)]

As you can see, the tuples (9 10 11) and (7 8 Pointer) have moved from their original positions at 56 and 76 to 16 and 36 respectively. As a result, all the allocated memory is now contiguous and so you can just bump allocate all the time.

This seems great because allocation is now super fast, but the cost is complexity in the GC phase. Specifically, we need to rewrite every pointer—or at least every pointer which points to an object we are keeping—to point to the location where the object will eventually end up. There are a number of algorithms for making this work, but Memo uses the relatively simple "Lisp 2" algorithm,[10] which works in three passes.

  1. Scan through memory one object at a time computing the eventual location of each live object. This effectively simulates bump allocation because each live object will just be right after the previous one. This information is stored in a new "Moved" field in each object, which is now one word larger than with mark-sweep (but the same size as in reference counting, in our implementation). Note that we need a separate word here because the object has to remain intact until we copy it.

  2. Scan through memory one object at a time. For each pointer in each object, go to the object it points to, read its "Moved" value, and rewrite the pointer with that value (see the diagram below).

  3. Scan through memory one object at a time, copying each live object to its new location.

The figure below shows an early part of phase 1, in which the moved pointer for the object at 56 has been set to its new location at 16.

[mark-compact with the Moved value set: the objects at 56 and 76 are marked; the Moved field of the object at 56 has been set to its eventual location, 16]

A simplified version of Memo's code for this is below.

  function gc_incremental(roots) {
    // First mark.
    this.mark_incremental(roots);

    let free_ptr = this._start;

    // Step 1. Set the future addresses for each object
    // we are retaining.
    for (let scan = this._start; scan < this._end; ) {
      const size = ObjectManager.getSize(this._context, scan);
      if (ObjectManager.isMarked(this._context, scan)) {
        ObjectManager.setXword(this._context, scan, free_ptr);
        free_ptr += size;
      }
      scan += size;
    }

    // Step 2. Update references for each marked object.
    for (let scan = this._start; scan < this._end; ) {
      const size = ObjectManager.getSize(this._context, scan);
      if (ObjectManager.isMarked(this._context, scan)) {
        const num_ptrs = ObjectManager.getNumValues(this._context, scan);

        // Iterate over all the pointers and update each one to whatever
        // is in the moved field of the pointed-at object.
        for (let i = 0; i < num_ptrs; i++) {
          const ptr = ObjectManager.getValue(this._context, scan, i);
          if (!isPointer(ptr) || ptr === NULL_POINTER) {
            continue;
          }

          const new_ptr = ObjectManager.getXword(this._context, ptr);
          ObjectManager.setValue(this._context, scan, i, new_ptr);
        }
      }
      scan += size;
    }

    // Update the roots.
    let new_roots = [];

    for (let root of roots) {
      new_roots.push(ObjectManager.getXword(this._context, root));
    }

    // Step 3. Move all objects into their expected locations.
    let end = this._start;
    for (let scan = this._start; scan < this._end; ) {
      const size = ObjectManager.getSize(this._context, scan);
      if (ObjectManager.isMarked(this._context, scan)) {
        const target = ObjectManager.getXword(this._context, scan);
        Memory.memmove(this._heap, target, this._heap, scan, size);
        // Unmark the new copy so it can be GCed later.
        ObjectManager.setXword(this._context, target, NULL_POINTER);
        ObjectManager.unmark(this._context, target);
        end += size;
      }
      scan += size;
    }
    this._end = end;
    return new_roots;
  }

The widget below will let you watch the mark-compact process in action.

[Interactive widget: mark-compact, stepping through the same program as before]

Mark-compact collectors minimize fragmentation, but at a modest cost in memory overhead (due to the "Moved" field) and a performance cost from multiple passes over memory. There are fancier two-pass mark-compact algorithms (one mark pass, one compaction pass) that use ancillary storage for the forwarding addresses (see § 3.4 of the Garbage Collection Handbook for one such example). If you're willing to really go wild with memory consumption, however, you can have an even simpler GC phase. This is the idea behind a copying (also called "semispace") collector.

Copying Garbage Collectors #

The idea behind a copying collector is that you have two heaps, A and B.[11] Each of these heaps is about the same size as your normal heap, so this consumes twice as much memory.[12] You initially allocate your objects in A using a standard bump allocator, and then when it is time to perform garbage collection you copy all the live objects into B and abandon anything left in A. Because B is compact, you can continue to use a bump allocator, and then when you GC, you copy from B into A, and so on. This can all be done in a single pass because you're using the source heap as temporary storage while you copy into the destination heap.
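Before looking at Memo's code, here is a minimal skeleton of the two-heap structure (a sketch with hypothetical names; the copying itself is elided):

```javascript
// Toy semispace skeleton (a sketch, not Memo's implementation): two
// equal-sized heaps; we bump allocate in the active one, and flip()
// swaps the roles at GC time.
class Semispace {
  constructor(size) {
    this.heapA = new Uint8Array(size);
    this.heapB = new Uint8Array(size);
    this.active = this.heapA;   // where new allocations go
    this.top = 0;               // bump pointer into the active heap
  }

  alloc(len) {
    if (this.top + len > this.active.length) return null; // time to GC
    const addr = this.top;
    this.top += len;
    return addr;
  }

  flip() {
    // The old active heap becomes from-space; survivors get copied into
    // the new active heap (copying omitted here), and the bump pointer
    // resets so allocation stays a single pointer increment.
    this.active = this.active === this.heapA ? this.heapB : this.heapA;
    this.top = 0;
  }
}
```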

Memo's copying algorithm is shown below, but it's helpful to walk through it.

  process_ptr(address) {
    if (ObjectManager.isMarked(this._context, address, this.#from_heap)) {
      // The first word is overloaded for the forwarding address.
      return [
        ObjectManager.readHeaderWord(this.#from_heap, address) & ~MARK_BIT,
        false,
      ];
    }

    // Not moved yet.
    const size = ObjectManager.getSize(this._context, address, this.#from_heap);
    const new_address = this._end;
    Memory.memmove(this._heap, new_address, this.#from_heap, address, size);
    this._end += size;

    // Now overwrite the first word to point to the new location.
    ObjectManager.writeHeaderWord(this.#from_heap, address, new_address);
    ObjectManager.mark(this._context, address, this.#from_heap);
    this.#movedList[address] = { labels: [["Moved", new_address]], size };
    return [new_address, true];
  }

  *gc_incremental(roots) {
    this.#movedList = {};
    this.#inGc = true;
    this.flip();

    this._work_queue = [];
    let new_roots = [];
    // First process the roots.
    for (let root of roots) {
      if (!isPointer(root) || root === NULL_POINTER) {
        continue;
      }
      let [address, todo] = this.process_ptr(root);
      new_roots.push(address);
      if (todo) {
        this._work_queue.push(address);
      }
    }

    // Now trace through all objects, copying as we go.
    while (this._work_queue.length) {
      const current = this._work_queue.pop();

      const num_ptrs = ObjectManager.getNumValues(this._context, current);
      for (let i = 0; i < num_ptrs; i++) {
        const pointer = ObjectManager.getValue(this._context, current, i);
        if (!isPointer(pointer) || pointer === NULL_POINTER) {
          continue;
        }
        let [address, todo] = this.process_ptr(pointer);
        if (todo) {
          this._work_queue.push(address);
        }
        ObjectManager.setValue(this._context, current, i, address);
      }
    }

    this.#inGc = false;
    return new_roots;
  }

Like mark-sweep and mark-compact, a copying GC works by tracing objects from the roots. Every time you encounter a pointer p for the first time you do the following (this is mostly in process_ptr()):

  1. Copy the object into the destination address space ("to-space"). Call the new address n.
  2. Overwrite the first word (at p) with the new address (n). This makes the object invalid, because valid objects have the type in the low-order three bytes of the first word, but this is safe because you have already copied the object so you can use the original (in "from-space") as scratch space. We don't need a separate word in each object like we do for mark-compact.
  3. Set the mark bit in the first word (at p) so you can tell you have seen it.[13]
  4. Add the object to the work queue.

Every time you pull an object off the work queue (by definition, all these objects are already copied) you look at each pointer p. If the pointer p is new, you copy the object (as above) and remember n. Otherwise, you just look at the type field to get n. You then overwrite the pointer with n, leaving this object with correct pointers. Once you have finished with the work queue, you have (1) copied every live object and (2) updated all their pointers and you are done. You can then abandon the source heap, which will become the destination heap the next time around.

This is a simple algorithm but can be a bit confusing, so it's helpful to go through this step by step.

[Copying the roots: the object at from:64, (7 8 Pointer(48)), has been copied to to:16; from:64 now holds the tombstone Moved 16; work queue: Pointer(16)]

The above figure shows the result of the first GC step, where we have processed the object pointed at by the first root, which was at address 64. As this was the first object processed (the only one pointed to by the roots), it got copied to the lowest address in the other half of the heap (to-space). Note that it was copied as-is, which means that its internal pointers all still refer to objects that have not been copied yet (i.e., they point into from-space). This will have to be patched up later, which is why this object had to be added to the work queue. We used the original copy of the object (in from-space) to store a tombstone indicating where the object was moved to. The rest of the object has the original contents, but those will never be examined and could in principle be invalid.

Now that we've exhausted the roots, we move to process the work queue, which means processing the object at to:16. We iterate through all the pointers in that object, copying the objects into to-space (again, as-is), and then patch up the pointer in to:16 to match the new location, as shown below:

[Patching up pointers: (9 10 11) has been copied from from:48 to to:32, and the pointer in to:16 has been updated from 48 to 32; work queue: Pointer(32)]

Now, the work queue has the object we just copied, which is stored at to:32. Next we scan through that object looking for pointers, but there aren't any, so once we've scanned it the GC process is complete, and we can just abandon from-space and all its objects.

The widget below will let you walk through this all one step at a time if you want.

[Interactive widget: copying GC, stepping through the same program as before]

One thing to notice here is that unlike mark-compact, which just slides all the allocations to the left, a copying GC does not necessarily preserve the relative order of allocations on the heap. For example, in the line

b = (7 8 (9 10 11))

the internal tuple is allocated first (at lower memory) and the external tuple second (at higher memory). However, when we trace from the roots, we encounter the external tuple first and so it gets copied first, ending up at lower memory, as seen below.

[After GC: @16: (7 8 Pointer(32)); @32: (9 10 11)]

This won't have a correctness impact, but may have a performance impact depending on the original layout and memory access patterns. Note that on a second copy, this order will be preserved, because we access the outer tuple first.
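You can see the reordering in a toy trace (hypothetical structures, not Memo's heap format): the inner tuple is allocated at the lower address, but the trace from the root reaches the outer tuple first, so the outer tuple is copied first.

```javascript
// Allocation order vs. copy order in a copying collector.
const heap = {
  16: { values: [9, 10, 11] },          // inner tuple, allocated first
  32: { values: [7, 8, { ptr: 16 }] },  // outer tuple, allocated second
};
const root = { ptr: 32 };

// Work-list trace from the root, recording the order objects are reached,
// which is the order they would be copied into to-space.
const copyOrder = [];
const queue = [root.ptr];
const seen = new Set();
while (queue.length) {
  const addr = queue.shift();
  if (seen.has(addr)) continue;
  seen.add(addr);
  copyOrder.push(addr);
  for (const v of heap[addr].values) {
    if (typeof v === "object") queue.push(v.ptr);
  }
}
// copyOrder is [32, 16]: the outer tuple is copied first even though it
// was allocated second.
```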

Next Up: Advanced Garbage Collection #

The algorithms described in this post are the foundation of basically every modern GC, but I've only described them in their simplest form. In the next post, I'll be covering some of the complexities in making GC deployable, especially for a high performance interactive system (e.g., a Web browser). Importantly, all of these complexities are (nearly) completely hidden from the programmer, because they mostly have performance impacts in terms of when and how fast the GC runs. This allows the language implementor to improve the GC in their runtime without the language user having to do anything to get the benefits of the new implementation. This isn't to say that they aren't important, however: GC can have a huge impact on the performance of a system.


  1. There's also a minor approach where you can't do any memory allocation, like in old school FORTRAN, but we can ignore that. ↩︎

  2. Mostly. Note that this doesn't mean you don't need to think about whether you are doing deep or shallow copies because they have different programming semantics. You don't have to worry about whether there is memory being allocated, though. ↩︎

  3. Some languages have nice idioms for this, like the JS ... spread operator, but Memo is deliberately minimalist, and so there's not even a way to do this generically without knowing the length. However you can use Lisp-style lists. ↩︎

  4. Don't trust me, trust the Web security model. ↩︎

  5. Occasionally you'll see someone propose a system for avoiding memory leaks that comes down to just keeping a pointer to all allocated memory, with the result that that data is morally leaked but not formally leaked. I can never tell if these people are serious. ↩︎

  6. I went back and forth on whether to just keep the type field, which also carries the size, but eventually decided it was better to store the size. The reason for this is slightly subtle: when we get to mark-compact later, it is possible to temporarily have holes which are smaller than any valid object (because the smallest object will be two words and the hole can be one word). This isn't an issue for the garbage collector itself, which doesn't need to skip over them, but it messes up the code I'm using to draw the heap. Storing the length in the first word always works and doesn't have this problem. ↩︎

  7. Or, as I once heard Frank Jackson, who had worked extensively on the Smalltalk 80 garbage collector, call it: "the second lamest form of garbage collection". ↩︎

  8. Note that it's not possible for the mark bit to be set on a freed object because otherwise it wouldn't have been freed. ↩︎

  9. I could have added a GC instruction to Memo, but I didn't for two reasons. First, this would be unusual because GC is usually automatic. Second, I wanted to let you step through the GC one operation at a time and that wouldn't work if the instruction were processed by the Memo interpreter directly. ↩︎

  10. See § 3.2 of the Garbage Collection Handbook. ↩︎

  11. In our implementation, we just have two heaps that share the same address space starting at 0, and internally I keep track of which heap is in use. This is fine because the addresses are just indexes into a table. In a system which was closer to the metal, you might instead tag the addresses using the higher order bits, as we have been doing so far. ↩︎

  12. The other way of looking at it is that you have one heap which is split into two "semispaces". ↩︎

  13. This is all a bit fiddly because in other contexts the same bit (0x80000000) means that the field is an integer rather than a pointer, but in this case we know it's a pointer so we can overload the meaning. ↩︎