Understanding Memory Management, Part 7: Advanced Garbage Collection
Posted by ekr on 27 Jul 2025
This is the seventh and final (phew!) post in my multipart series on memory management. You may want to go back and read Part I, which covers C, Parts II and III, which cover C++, and Parts IV and V, which cover Rust. And if you haven't read it, go read Part VI, which introduces the basic mechanisms of garbage collection. In this post, I want to touch on some of what you need to do to deploy garbage collection in a production system, as well as some of the reasons that systems designers might decide to avoid GC.
Garbage Collection Latency #
The biggest concern that engineers typically have with garbage collected systems is the cost of the garbage collector itself. Because these algorithms—other than reference counting, of course—require scanning all allocated memory, often multiple times, they can be quite expensive. This isn't just a matter of total program runtime, but also of latency. The basic GC algorithms I showed in Part VI are what's called "stop-the-world" garbage collectors, which means that the entire program has to wait for the GC to finish. This might not be a big deal if you're processing some data in Python—what with buffering, context switching, etc. you may not even notice the latency—but the situation is totally different in an interactive application: when the program is garbage collecting it's not responding to your input; in this context even very small amounts of GC lag—or any other lag, for that matter—can be quite noticeable.
Garbage Collection Timing #
The first thing to do to control GC latency is to be fairly careful about when you actually garbage collect. Which strategy you follow isn't that big a deal in a non-interactive program, because unless you do something severely wrong, you'll probably have to do approximately the same total amount of GC anyway, but if you GC at the wrong time in an interactive program then users will get annoyed because suddenly their program just stops responding and they have to sit and wait for it to come back. This is obviously quite annoying! Back when I worked on Firefox, we used to call this kind of stuff "jank".[1]
You might think that you want to put off GC as long as possible, for instance until you actually are unable to allocate any more memory without garbage collecting. This generally isn't the best idea: you may run out of memory right at the time when the user is trying to do something, which creates exactly the janky user experience that you were trying to avoid by deferring GC. Moreover, because GC algorithms run more slowly when there is a lot of memory in use (counting garbage here as "in use"), if you put things off too long, the latency may actually be quite bad. On the other hand, you don't want to GC too frequently because GCing is expensive.
There are a number of more sophisticated approaches for scheduling the GC. For example, in an interactive program you can look for when the program appears to have been idle (i.e., no computation or user input) for a given period of time, as is done in GCMH. V8's Orinoco uses a fancier version of this where it schedules GC during the idle period between when it has rendered a frame and when it needs to render the next one. Part of why this works is that they do only part of the GC each time, as described below. Modern GCs may also use multiple triggers for garbage collection. For instance, the Java ZGC collector uses memory pressure, periodic collection, and allocation rate as triggers.
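To make the idle-scheduling idea concrete, here's a minimal sketch in JS. Everything here is hypothetical: collectGarbage() is a stand-in for a real GC pass, and the thresholds are invented, not taken from GCMH or Orinoco.

// Minimal sketch of idle-triggered GC scheduling (hypothetical API).
function collectGarbage() {
  // Stand-in for running (part of) a real GC pass.
}

const IDLE_THRESHOLD_MS = 100; // invented tuning knob
let lastActivity = Date.now();

// Call this from input handlers, frame callbacks, etc.
function noteActivity() {
  lastActivity = Date.now();
}

// Poll periodically and collect only once the program has been idle a while.
setInterval(() => {
  if (Date.now() - lastActivity > IDLE_THRESHOLD_MS) {
    collectGarbage();
    lastActivity = Date.now(); // don't immediately re-trigger
  }
}, 50);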
Good GC scheduling will only take you so far with a stop-the-world GC; you're always going to take a pause and there's some chance that it will be at an inconvenient time. If you really want to minimize latency, you need to find a way to avoid having an invocation of the GC stall the entire program. The remainder of this section describes a number of approaches.
Generational Garbage Collection #
One of the most common mechanisms is to just garbage collect some of the objects you've allocated, typically the most recently allocated ones. To see why this makes sense, consider the following simple JS program:
Code #
const fs = require("fs");
const WORD_COUNTS = {};
let LINE_NUMBER = 0;
let MAX_WORDS = 0;
const fileContent = fs.readFileSync("count-words.in", "utf-8");
let lines = fileContent.split("\n");
// Remove trailing empty line if file ends with newline
if (lines.length > 0 && lines[lines.length - 1] === "") {
  lines.pop();
}
for (const line of lines) {
  const words = line.split(/\s+/).filter((word) => word.length > 0);
  const count = words.length;
  if (!(count in WORD_COUNTS)) {
    WORD_COUNTS[count] = [];
  }
  WORD_COUNTS[count].push(String(LINE_NUMBER));
  LINE_NUMBER += 1;
  MAX_WORDS = Math.max(MAX_WORDS, count);
}
for (let count = 0; count <= MAX_WORDS; count++) {
  const countString = WORD_COUNTS[count] ? WORD_COUNTS[count].join(" ") : "";
  console.log(`${count}: ${countString}`);
}
Output #
0: 1 5 10 16 20
1: 13
2: 0 19
3: 2 3 4 6 8 9 12 14
4: 7 15 18
5: 11 17
This program reads a file line by line and then builds a structure containing the line numbers of the lines with each number of words per line.
The thing to notice is that there are two kinds of object allocations here:

- The table of word lengths (`WORD_COUNTS` and the individual lists inside it)
- The line read from the file[2] stored in `line`, and the list of words in each line created at the top of the loop by `words = line.split(/\s+/)`[3]
Real Programs Don't Call Free #
When I worked with Allan Schiffman he used to say (IIRC, quoting Stanford Professor Dave Cheriton), "real programs don't call free". The idea was that there were two kinds of programs:
- Short-running programs like compilers, where returning memory didn't help that much and it was easier to just allocate and never free, and let the operating system clean up on program exit.
- Long-running programs, which had to be careful about their memory consumption and which therefore couldn't really trust the system `malloc()` and instead had to implement their own custom allocators.
Of course, computers have gotten a lot bigger since then and `malloc` has gotten a lot better, so I suspect people are a lot more willing to trust the allocator rather than rolling their own. On the other hand, the whole point of garbage collection is that you don't have to call free.
The overall `WORD_COUNTS` table lasts for the entire lifetime of the program, but the `line` and `words` variables only live for one turn of the loop. This turns out to be a common access pattern, especially for long-lived programs: you have a lot of both short-lived and long-lived allocations. But this means that when you garbage collect, you spend a lot of time examining (tracing) objects which will not be freed.
For example, imagine we modified the program to request garbage collection on every turn of the loop (this is plainly unnecessary in this program, but bear with me).[4] With each turn we would free `line` and `words`, but we would also have to examine the (ever-increasing) `WORD_COUNTS` structure, which is just wasted effort.
If memory allocation and freeing patterns were random (technically: distributed according to a Poisson process), then we would basically just have to live with this waste. In fact, however, allocation and freeing display patterns, specifically that it's common for objects which were just allocated to be freed quickly. For example, if the user clicks on some button, the various click handlers for the button, dialog boxes, etc. may get instantiated to handle the UI gesture and then torn down as soon as the program has finished processing the gesture. This observation is often summed up in what's called the Generational Hypothesis:
the most recently created objects are also those most likely to become unreachable quickly
We can exploit this observation to improve GC performance using what's called a "generational garbage collector".
The intuition behind a generational GC is that we segregate objects into "generations" and then garbage collect the younger generations more frequently. As objects age, we move them into older generations, which are garbage collected less frequently. As an example, I have implemented a simple GC, with only two generations. It behaves as follows:
- The heap is divided into two regions: the "nursery", used for newly created objects, and the rest of the heap, used for older objects. The nursery starts at address `0` as usual, and the rest of the heap starts at address `5000`.
- When an object is initially created, it is allocated out of the nursery.
- We have two kinds of garbage collection: a minor GC, which only examines objects in the nursery and doesn't clean up objects in the rest of the heap, and a major GC, which garbage collects the whole heap. Generally, we would do a minor GC fairly frequently and a major GC less often.
- After we have done either kind of GC, any objects in the nursery will be promoted to the rest of the heap (technical term: tenured).
Let's walk through this piece-by-piece. First, we'll just allocate two small objects.
The situation here is just the same as with the other GCs we've seen: the objects get created at the top of the heap at address `16`. The only difference is that we now have labels for `Nursery` and `Tenured`. For presentation purposes I've drawn these as adjacent, but of course these are both large regions; I'm just eliding the big empty space between the last of the allocations and the rest of the region.
Now let's make `b` garbage and then do a GC on line 4. Note that I've asked for a minor GC with the new `#gc0` pseudoinstruction, but as there are no tenured objects the eventual result is much the same either way.
After this GC pass, object `b` has been collected and the object pointed to by `a` has been tenured and moved to address `5000`.

If we now create a new object and assign it to `b`, it will be allocated in the nursery, as expected.
Now let's make `a` garbage, and do a minor GC.

Now we've tenured `b`, but the object at `5000` previously pointed to by `a` is still there; it's just garbage. This is what we expected, because a minor GC doesn't try to clean up any of the tenured objects; it just lets them sit there even if they're garbage.

If we now ask for a major GC, we'll clean up the whole heap and finally reclaim that object.
The advantages of this design should be obvious, assuming the generational hypothesis is true for our program: we can clean up most of the garbage by just examining the nursery without having to look at any tenured objects. Generational-style GCs are very common, especially in systems designed for interactive programs. Many modern GCs, such as those used by V8, SpiderMonkey, or Java, use generation scavenging.
Design Choices #
A generational GC combines elements of a number of the garbage collection systems we've seen already.
Allocation #
In our implementation, we just use a simple bump allocator, always allocating out of the nursery. Because every GC pass always cleans out the entire nursery, promoting live objects and discarding dead ones, there are never any holes in the nursery that previously contained live objects, so there's no point in trying to reuse space—there's nothing to reuse. In a fancier GC (see below), we might have a different situation, however.
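To make this concrete, here's a sketch of bump allocation out of the nursery. The `{start, end, top}` fields anticipate the `_generations` bookkeeping described under "Regions" below; the function itself is my illustration, not the widget's actual code.

// Sketch of bump allocation (not the widget's implementation).
// `end` is the next free address; `top` is the end of the region.
function bumpAllocate(region, size) {
  if (region.end + size > region.top) {
    throw new Error("nursery full: time for a minor GC");
  }
  const address = region.end; // the new object starts here
  region.end += size; // bump the pointer past it
  return address;
}

const nursery = { start: 0, end: 0, top: 5000 };
console.log(bumpAllocate(nursery, 16)); // 0
console.log(bumpAllocate(nursery, 16)); // 16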
Minor GC #
In the specific design I've implemented, the minor GC is effectively a copying-style collector, but instead of copying between two equivalent semispaces, we copy from the nursery right into the tenured region of the heap. Importantly, unlike the copying collector we showed in Part VI, the destination region is probably not empty, because it will contain whatever objects have already been tenured. Just as with the copying GC, we can abandon all the objects in the nursery after the GC.
We're able to get away with this simple strategy because we immediately promote objects on every GC pass. However, if we waited multiple passes, then the situation would be different. For instance, if you promoted objects after two GC passes—note that this requires bookkeeping—then you might have objects which needed to be kept around in the nursery after a GC pass. This also has implications for allocation: if the minor GC uses mark-sweep, then we might have holes that could then be filled rather than just bump allocating (though it's still very attractive to bump allocate).
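As a sketch of what that bookkeeping might look like, each object could carry an age counter that the minor GC increments, with promotion happening only once the object has survived some number of passes. The object-list representation here is my simplification, not how Orinoco or the widget actually works.

// Sketch of promote-after-N-survivals bookkeeping (hypothetical structure).
const PROMOTE_AFTER = 2;
const nursery = [];
const tenured = [];

function minorGC(liveObjects) {
  nursery.length = 0;
  for (const obj of liveObjects) {
    obj.age += 1; // this object survived another minor GC
    if (obj.age >= PROMOTE_AFTER) {
      tenured.push(obj); // promoted (tenured)
    } else {
      nursery.push(obj); // stays in the nursery, one pass older
    }
  }
}

const a = { name: "a", age: 0 };
minorGC([a]); // first survival: still in the nursery
minorGC([a]); // second survival: promoted
console.log(tenured.map((o) => o.name)); // ["a"]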
V8's Orinoco uses an interesting design which has a copying minor GC and then promotes objects after they have survived two passes:
Orinoco's minor GC (generation scavenger). From Google.
There are a few subtle implementation details, which we'll get to below.
Major GC #
The major GC is more or less the same as the stop-the-world GCs we saw in Part VI, except that it has to scan all the regions in sequence (in this case, just the nursery and tenured regions). However, as a practical matter you probably want the major GC to be either a mark-compact or copying collector, because otherwise you end up with holes in the tenured region which can't be filled in during the normal allocation process. You could in principle fill them in to some extent when you promote objects from the nursery, but that makes GC much more expensive because you have to try to fit each object being promoted somewhere rather than just bump allocating. I use mark-compact because otherwise you need to allocate a large block of memory for the other semispace in order to optimize the infrequent major GC.
Implementation Notes #
In this section we look a little more closely at the implementation of our generational GC. Much of this will be familiar from Part VI, but there are some details that I've had to change.
Regions #
First, we need to keep track of which regions of the heap correspond to each generation. This is done by keeping a `_generations` list (initialization sketched below), with each entry containing:
- `start`: The first address that can be allocated at.
- `end`: The end of the allocated region (one byte past the end of the last object), and hence the next location we'll be allocating at.
- `top`: The end of the region itself (again, one byte past it).
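Given the addresses used in the walkthrough above (nursery at `0`, tenured region at `5000`), the initialization might look something like this sketch; the overall heap size of `10000` is my assumption, not necessarily the widget's actual value.

// Sketch of the _generations bookkeeping for our two-generation heap.
const _generations = [
  { start: 0, end: 0, top: 5000 }, // generation 0: the nursery
  { start: 5000, end: 5000, top: 10000 }, // generation 1: tenured objects
];
const _gen0 = _generations[0]; // convenience aliases (see below)
const _gen1 = _generations[1];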
For convenience, we also have the variables `_gen0` and `_gen1`, which point directly to the generation objects for the nursery and tenured objects respectively. We can then use the `_generations` variable to see which generation a given pointer points to, in the obvious way:
generationFromPointer(address) {
  for (const [index, generation] of this._generations.entries()) {
    if (address < generation.top) {
      return index;
    }
  }
  throw new Error("Pointer not in any generation");
}
At some level, this is all overly general: this design in principle supports an arbitrary number of generations but in practice we just use two generations of equal size. There are different programming philosophies here and exponents of YAGNI would probably say this is a bad set of design tradeoffs, but I prefer to avoid baking in a bunch of temporary design decisions. In a real system we would probably want the nursery to be smaller and this lets us make that change if we decide to.
Minor GC #
Now let's take a closer look at the minor GC. As I said, this is similar to the copying GC we showed in Part VI, but with some subtle differences. Here's `process_ptr`:
process_ptr(address) {
  if (ObjectManager.isMarked(this._context, address)) {
    // The extra word is used for the forwarding address
    return [ObjectManager.readXWord(this._heap, address), false];
  }
  // Not moved yet.
  const size = ObjectManager.getSize(this._context, address);
  const new_address = this._gen1.end;
  Memory.memmove(this._heap, new_address, this._heap, address, size);
  this._gen1.end += size;
  // Now overwrite the first word to point to the new location.
  ObjectManager.setXword(this._context, address, new_address);
  ObjectManager.mark(this._context, address);
  return [new_address, true];
}
The only real difference here is that when we move an object we store the new address not in the first word but in the extra word which we've allocated for the mark-compact `moved` pointer. We could do this either way, but this is cleaner and lets us be consistent between the passes.
The actual copying phase is essentially identical to our copying GC, except that we have slightly different bookkeeping to deal with the fact that we're copying into the tenured region rather than a freshly initialized semispace. However, we have two subtle issues to address, dealing with what are called intergenerational pointers, which is to say pointers which are stored in an object of one generation but point to an object of another generation. There are two types of such pointers:
- upward pointers, which go from new to old objects
- downward pointers, which go from old to new objects
Upward pointers are easy to deal with: we don't want to examine tenured objects at all, so when we see them during the minor GC, we can just skip past them. This is fine because all they can do is keep tenured objects alive, and we're not going to free those objects anyway.
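In code, the check can be as simple as this sketch, which reuses the `generationFromPointer()` helper from above (the method name is mine; in the real code the test would live inside the minor GC's tracing loop):

// Sketch: a minor GC only traces pointers into the nursery (generation 0);
// upward pointers into the tenured region are copied through unchanged.
shouldTraceDuringMinorGC(address) {
  return this.generationFromPointer(address) === 0;
}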
Downward pointers are more complicated: because we aren't tracing tenured objects, we're not going to see them, but they may be the only reference to some object in the nursery, so without them we would incorrectly free that object, leading to a dangling pointer from a tenured object and eventually maybe a UAF. This means we need to do something when we have a situation like this:
The standard procedure is to keep track of all downward pointers using what's called a "write barrier". Whenever we do a write to a pointer slot in an object, we check to see if it's a downward pointer and if so, we store it in a lookaside list, like so:
recordIntergenerationalWrite(sourceAddress, fieldIndex, targetAddress) {
  const key = `${sourceAddress} ${fieldIndex}`;
  // Check if the new target is an intergenerational pointer from old to young (gen0).
  if (isPointer(targetAddress) && targetAddress !== NULL_POINTER) {
    const sourceGenIndex = this.generationFromPointer(sourceAddress);
    const targetGenIndex = this.generationFromPointer(targetAddress);
    if (sourceGenIndex > 0 && targetGenIndex === 0) {
      // It's an intergenerational pointer, record it.
      this._rememberedSet[key] = targetAddress;
      return;
    }
  }
  if (key in this._rememberedSet) {
    delete this._rememberedSet[key];
  }
}
From the perspective of the GC, the downward pointers are just a new set of roots, so we add them to the root list before doing the minor GC:
*_gc_minor_incremental(roots) {
  let inner_roots = [...roots];
  let ig_ptr_addrs = [];
  // Append the intergenerational pointers so that
  // we can save them.
  for (const [key, ptr] of Object.entries(this._rememberedSet)) {
    ig_ptr_addrs.push(key);
    inner_roots.push(ptr);
  }
  let new_roots = yield* this._gc_minor_incremental_inner(inner_roots);
With some careful refactoring, `._gc_minor_incremental_inner()` could be the `._gc_incremental()` function from our copying GC, but I'm trying to keep things a bit simple.
When the minor GC returns, it provides updated values for all the roots we passed in, so now we need to patch up the locations where the downward pointers were stored so that they point to the new locations in the tenured region. Note that at the end of this process we don't have anything in the nursery and so there will be no more downward pointers.
  // Now update the IG pointers.
  const new_ig_ptrs = new_roots.slice(roots.length);
  for (const i in new_ig_ptrs) {
    const new_addr = new_ig_ptrs[i];
    const key = ig_ptr_addrs[i];
    const [addr, slot] = key.split(" ").map((a) => parseInt(a, 10));
    // Note that this will call recordIntergenerationalWrite()
    // but |new_addr| is now of the same generation as |addr|.
    ObjectManager.setValue(this._context, addr, slot, new_addr);
  }
Major GC #
The major GC really is just mark-compact, with the additional complication that we need to iterate over all the regions for each generation. However, the in-use parts of the regions aren't contiguous. For instance, we might have only allocated `16`–`512` in the nursery, which means that `512`–`4999` is just unallocated blank space which might have any values (e.g., if we previously had allocated past `512` and then did a GC), and we don't want to try to scan it. This wasn't a problem in the minor GC, because a copying GC just follows pointers, but mark-compact actually does a linear scan of the whole allocated region, so if there are garbage values there, it will misbehave, quite likely in a dangerous fashion. Fortunately, we already have a list of the allocated subregions of each region, so we can just iterate over it, as in the following:
for (let i = this._generations.length - 1; i >= 0; i--) {
  const generation = this._generations[i];
  for (let scan = generation.start; scan < generation.end; ) {
    const size = ObjectManager.getSize(this._context, scan);
    if (ObjectManager.isMarked(this._context, scan)) {
      ObjectManager.setXword(this._context, scan, free_ptr);
      yield [{ op: "setmoved", addr: scan, newval: free_ptr }];
      free_ptr += size;
    }
    scan += size;
  }
}
Straightforward, right? Well, mostly. Why are we going through the list of generation regions backwards (from older to younger generations and from high to low memory) rather than forwards (from younger to older generations and from low to high memory)?
Recall that mark-compact works by sliding every object as far left (towards low memory) as possible. It does this by keeping a single `end` pointer which points to the location where the next object will be allocated (initially pointing to the start of the target region) and then incrementing it by the size of each object allocated. This is normally safe because we are also scanning left to right, and so we never overwrite any region of memory we are going to need later, even if we leave it in a corrupt state, as shown in the figure below:
What's in `28`–`36` is probably the last two words of the object previously at `24` (though the GC could have done anything it wanted with it).[5] It doesn't matter, though, because we've already advanced the scan pointer past that region to `40`, so it's just going to copy whatever is at `40` over it in a second anyway.
Now consider what happens if we have two generations and we process the nursery first, as shown below.
In this case, we copy the first value in the nursery over the first value in the tenured region (remember, the tenured region is always compacted); we've now destroyed that object. Even in the best case where the object was going to be freed anyway, we've quite likely left a piece of some object—as shown here—which will mess up the scanning process. And if we're not going to free the object we just stomped on, the data is just gone and now things are really bad.
Fortunately, this problem is easily solved by processing the generations in reverse order so that we've already processed everything in the tenured region before we start trying to copy stuff from the nursery into it.
Below I've provided a widget that lets you see the major GC in action.
Incremental and Concurrent GC #
Another way to reduce the latency of garbage collection is to do incremental garbage collection, in which you only do part of the garbage collection pass at each pause point. The obvious challenge here is that the program itself may change some pointers in between the GC phases. For example, consider the topology where `A` points to `B`, which points to `C`, and the following sequence of operations:
- Start the GC and process `A`, marking `B` and adding it to the work queue.
- The program then sets a pointer from `A`→`C` and erases the pointer from `B`→`C`.
- The next GC phase runs, processing `B`, which completes the marking phase. At this point `C` is unmarked even though it is still reachable from `A`, and it will incorrectly be cleaned up during the sweep phase.
This problem can be solved with a similar "write barrier" approach to how we handled intergenerational pointers, which detects that you have changed something important from underneath the GC. For instance, you could detect that you've changed one of the pointers from an object you've already processed (`A`) and re-add it to the work queue.[6] With the right set of barriers you can intersperse operation of the program and the GC at relatively fine granularity by running the program for a little while, doing a little bit of GC, then running the program some more, etc., thus reducing the apparent pause experienced by the user.
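Here's a sketch of what such a barrier might look like for the marking example above. The names and the `processed`/`workQueue` structure are my invention, not taken from any real collector.

// Sketch of a write barrier for incremental marking (hypothetical names).
// If the program stores a pointer into an object the GC has already
// processed, we push that object back onto the work queue so its
// outgoing pointers get examined again in a later GC slice.
function barrieredWrite(gc, sourceObject, field, targetObject) {
  sourceObject[field] = targetObject; // the actual pointer write
  if (gc.inProgress && gc.processed.has(sourceObject)) {
    gc.processed.delete(sourceObject); // no longer fully processed
    gc.workQueue.push(sourceObject); // re-examine before marking ends
  }
}

In the example above, this re-adds `A` to the work queue when the program writes the `A`→`C` pointer, so `C` gets marked after all.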
Incremental GC alone lets us reduce apparent pauses, but it doesn't let us use more than one core at once, and of course modern processors have more than one core. It's possible to extend incremental GC to actually have the GC run in parallel with the program, in what's called concurrent GC. This is obviously even trickier because we have to worry about the usual thread safety concerns when two threads try to work on the same memory at once, but the result is to further reduce the pauses experienced by the user—though not necessarily to zero, because you can still have the program waiting on some contested resource that the GC is using. For this reason concurrent GCs are very common, including in the systems I named in the previous section.
Separately, you can also have the GC run in multiple threads at once. This is called parallel GC. The obvious advantage here is that you are using more than one core for your GC and thus making more efficient use of your computer. You can use various barriers here, but as an intuition pump consider that you could have a non-concurrent mark-sweep GC that just ran the sweep phase in parallel; this can be done without any thread locking at all once you've segmented the heap. You can have parallel GC in both stop-the-world and concurrent modes, with the latter obviously providing the lowest latency impact.
Allocation Strategies #
So far we've looked at two basic allocation strategies:
- Bump allocate at the end of the allocated region
- Allocation in the first available free region (what's called "first-fit")
If we have a moving GC like mark-compact or copying, then quite possibly all we need is bump allocation; there may be garbage in the form of unreachable objects, but at least in a stop-the-world GC, we discover the garbage and compact the heap at the same time, so there's never any need to allocate in the holes. By contrast, if we are using reference counting or mark-sweep, then we do need to re-allocate out of the holes—otherwise there's not much point in GCing in the first place—and it's important to do so efficiently.
Any allocation algorithm needs to balance a number of factors:
- Fragmentation
- Efficiently finding a hole we can fit into
- Space overhead
The design of allocators is a whole separate specialty from the design of the garbage collection system, and I'm not going to go into it here in detail, but I wanted to give you a flavor of the kinds of issues you face and some of the variety of implementation options.
In the naive first-fit algorithm, we scan from the beginning of the heap until we find a suitably large location. Depending on the fragmentation pattern and the size of the new object, this can take quite a long time and involve scanning through much if not all of the heap. Moreover, this can make fragmentation worse; there may be a better location—more closely matching in size—higher up in the heap but instead we allocate in the first available location. An alternate choice is to do what's called "best-fit" where we iterate over all the available memory locations to find the ideal location, but this requires scanning the entire heap, which is obviously slower.
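Here's a sketch of naive first-fit, modeling the holes as a simple array of `{start, size}` records sorted by address rather than scanning the heap itself:

// Sketch of naive first-fit allocation over a list of holes.
// We take the first hole big enough, shrinking it in place
// (or removing it entirely if the fit is exact).
function firstFit(holes, size) {
  for (let i = 0; i < holes.length; i++) {
    const hole = holes[i];
    if (hole.size >= size) {
      const address = hole.start;
      hole.start += size;
      hole.size -= size;
      if (hole.size === 0) holes.splice(i, 1);
      return address;
    }
  }
  return null; // nothing fits: bump allocate or GC instead
}

const holes = [
  { start: 100, size: 32 },
  { start: 400, size: 128 },
];
console.log(firstFit(holes, 64)); // 400: the 32-byte hole is skipped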
There are a number of designs that allow us to have more efficient allocation, but typically at the cost of memory overhead.
Bucketed Allocations #
There are of course a lot of options here, but one natural thing to do is to group allocations into size buckets (for instance by powers of two). When you allocate a new object of size `s` you round the allocation up to the next bucket size `B(s)`, with the remainder of the allocation just being left empty. Similarly, when you free an allocation of size `s` it creates a hole of size `B(s)`.
Bucketing allocations like this gives us a compromise between first fit and best fit. For instance, if we select the first hole with the same bucket size, we'll actually accept any hole within that bucket's range, and such holes are presumably more common than exact size matches. Moreover, if we are using powers-of-two bucket sizes, we can also select any hole of the next bucket size up by splitting the original hole in two. For instance, if we are allocating a region of size 512 but have a hole of 1024, then the result is an allocated region and a remaining hole, each of size 512.
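A sketch of the powers-of-two arithmetic (the function names are mine): `bucketSize()` computes `B(s)`, and `splitHole()` splits a hole of the next bucket size up into two halves, one of which can then be allocated.

// Round a requested size up to the next power-of-two bucket size B(s).
function bucketSize(s) {
  let b = 1;
  while (b < s) b *= 2;
  return b;
}

// Split a hole in two, returning the half to allocate and the leftover hole.
function splitHole(hole) {
  const half = hole.size / 2;
  return [
    { start: hole.start, size: half },
    { start: hole.start + half, size: half },
  ];
}

console.log(bucketSize(300)); // 512
console.log(splitHole({ start: 1024, size: 1024 })); // two 512-byte halves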
Another advantage of bucketed allocations is that it allows you to easily resize objects. For example, if you allocate an object of size 300 in a bucket of size 512, you can increase the size of the object (up to 512, at least) without moving the object. This is more useful for some kinds of objects than others; for instance if you have a vector/array of objects then it's often quite useful to be able to make it bigger so you can add more entries.
Metadata Structure #
One of the reasons we've had poor efficiency so far is that we've been forced to scan the heap linearly to find a suitable location, so we end up with what's basically an `O(n)` algorithm. We need to do this because the only information we have about the memory map is in the heap itself (which, recall, is self-describing). We can do a lot better if we use some extra memory to store a map of which regions are in use and which are free (a "free list").
If we combine this idea with bucket allocation, then the obvious approach is to have a list of holes indexed by bucket size. When we free a region, we add the hole to the list for that size. When we need to allocate a region, we look to see if there are any holes of the right size. If so, we allocate from one of those holes. If not, we bump allocate in the usual fashion.
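Here's a sketch putting the two ideas together: per-bucket free lists with bump allocation as the fallback. The structure is my illustration (with `bucketSize()` repeated from the previous sketch), not the widget's code.

// Sketch of a bucketed allocator with per-bucket free lists.
function bucketSize(s) {
  let b = 1;
  while (b < s) b *= 2; // round up to the next power of two
  return b;
}

const freeLists = {}; // bucket size -> addresses of holes of that size
let bumpPointer = 0;

function free(address, size) {
  const bucket = bucketSize(size);
  (freeLists[bucket] ||= []).push(address); // record the hole
}

function allocate(size) {
  const bucket = bucketSize(size);
  const holes = freeLists[bucket];
  if (holes && holes.length > 0) {
    return holes.pop(); // O(1) reuse of a hole of the right size
  }
  const address = bumpPointer; // no suitable hole: bump allocate
  bumpPointer += bucket;
  return address;
}

console.log(allocate(300)); // 0 (rounded up to a 512-byte bucket)
free(0, 300); // creates a 512-byte hole at address 0
console.log(allocate(400)); // 0 again: the hole is reused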
Multiple Arenas #
It's also possible to have different heap regions (sometimes called "arenas") for different kinds of objects. For example, instead of bucketing allocations but allocating out of the same region you could have one region for each bucket size. This has the advantage that holes are always the same size, so as long as there is room in the region you can always allocate, but at the cost of using up memory for the unused space for each bucket size.[7] It's also possible for different arenas to have different allocation and GC strategies. For example, you might use a generation scavenging GC for your small objects but a mark-sweep GC for large objects or growable objects like arrays on the theory that they tend to be long-lived and that copying them during the promotion phase will be expensive.
GC in the Real World #
As you should have gathered by now, real world GC systems are vastly more complicated than I've described here. However, pretty much all of them use some variation or combination of these basic techniques. For example, here's how Google describes Orinoco:
Over the past years the V8 garbage collector (GC) has changed a lot. The Orinoco project has taken a sequential, stop-the-world garbage collector and transformed it into a mostly parallel and concurrent collector with incremental fallback.
And here is a description of Java's ZGC:
- Concurrent
- Region-based
- Compacting
- NUMA-aware
- Using colored pointers
- Using load barriers
- Using store barriers (in the generational mode)
Most of these words should seem familiar to you, and you should have enough background to figure out the other ones with a little searching.[8]
The design of fast garbage collectors is a whole subspecialty of CS, and often requires a good understanding of the specific dynamics of the system that the GC will be deployed in. If you want to go (much) deeper here, a good reference is the Garbage Collection Handbook, which I used extensively in preparing this and the previous post. It's fairly dense, but really digs into far more detail than you are likely to need. In addition, many widely used GCs are open source, so you can look at the code yourself.
Why wouldn't you want GC? #
As should be apparent at this point, working in a garbage-collected system is vastly easier than working in a system where you have to handle memory management yourself. In the latter case, you either end up with a system that isn't safe by default (C++) or one where you have to do a lot of gymnastics in order to write ordinary-seeming code (Rust); in either case you spend a lot of time thinking about memory management. By contrast, with a GC language, you rarely have to think about what's going on with memory management at all, and when you do, it's mostly around performance reasons or when you have to do deep copies, rather than around "this bug is going to cause a vulnerability" or "I can't make my program work". This is true even for a systems programming language like Go.
Obviously, the people who designed Rust were aware of garbage collection—all of the main techniques date back to the 80s and 90s and the basic concepts go back to the 1950s and Lisp—but they very deliberately chose to build Rust without it. There are a number of reasons why you might want to build your system without a GC.
Deterministic Performance #
As noted above, because the GC—at least tracing GC, but recall that most GC-based systems have some element of tracing—can introduce pauses, most GC-based languages do not provide completely deterministic performance. This is the kind of thing that makes systems programmers sad—though perhaps more than really necessary—as they like to think of themselves as close to the metal and knowing exactly what their program will do. By contrast, a language like Rust will have more deterministic performance (which isn't to say better) because you don't have to worry about the GC doing something when you want to use the CPU.
This isn't to say that you can't have a systems programming language that does GC. Java, Go, and C# are all garbage collected and people seem to get along just fine, but even so there is a persistent sense amongst systems programmers that there's something fishy about it.
Of course, it's important to recognize that Rust does have GC in the form of reference counting for `Rc` and `Arc`, and many if not most Rust programs use some form of reference counted pointer at least some of the time. However, many Rust fans have managed to conveniently forget that reference counting is a form of GC, because one of the main differences between Rust and Go is that Go is GCed (which is very bad) and Rust is not.[9]
Deterministic Behavior #
As you'll recall from previous posts, many languages support some mechanism (C++ destructors, Rust's Drop trait, etc.) where an object gets to do something before it's destroyed. We used this in Part III to create C++ smart pointers, but you can also do other stuff, such as flush file buffers. Some garbage collected languages have similar features (often called finalizers or cleanup functions).
However, because the destructor/finalizer/cleanup function runs when the object is destroyed, and tracing garbage collectors destroy the object at some indeterminate time, the finalizer also runs at an indeterminate time, or potentially never. For instance, here's what Go's documentation has to say:
AddCleanup attaches a cleanup function to ptr. Some time after ptr is no longer reachable, the runtime will call cleanup(arg) in a separate goroutine.
...
The cleanup(arg) call is not always guaranteed to run; in particular it is not guaranteed to run before program exit.
Reassuring, right?
By contrast in a reference-counted system, the object is destroyed at a deterministic time (when the reference count goes to zero) and so you can have higher confidence about where it will run.[10]
What about GC for C? #
I've spent much of this series dumping on C and C++ for how hard they make memory management, so it's natural to ask "why can't they be garbage collected?" The answer is: they can be, at least sort of. Back in 1988, Boehm, Demers, and Weiser designed a GC system for C and C++. The basic idea is that you (the programmer) use `GC_malloc()` and `GC_realloc()` but never call `free()`, because the GC does it for you.
The trick that makes this work is that it's a conservative GC. Recall that in all the examples above, we relied on knowing the structure of objects so we can find all the pointers. The Boehm GC doesn't have this information and so it has to assume that any pattern in memory that points anywhere in an allocated object is actually a pointer to that object.[11] This means that (for instance) if you have an integer value which just happens to have the same bit pattern as a pointer into an object, then it will be treated as a pointer to that object, which will then effectively leak.
It's also possible under certain circumstances for the garbage collector to fail to recognize a pointer. It's generally not legal in standards-compliant C code to generate a pointer value outside of the allocated region,[12] but the compiler is allowed to do anything it wants, and so with the right compiler optimizations, it's possible that the right pattern won't appear in memory. See Boehm's issues page for more on this, though this looks kind of old so I wonder if compilers have gotten more aggressive in the time since it was written.
This is obviously clever stuff, but my experience is that very few C or C++ programs use this kind of garbage collection; people just suffer in the traditions of our ancestors.
The Bigger Picture #
This brings us to the end of our series on memory management. In closing, I'd like to step back from the details and look at the big picture. As should be clear, a huge amount of the work performed by a piece of software is in managing memory in one way or another: that work can either be pushed onto the programmer or handled by the software runtime automatically.
Each approach has advantages and disadvantages: having the programmer manage memory gives them tight control of the precise behavior of the program, but at the cost of adding to the programmer's cognitive burden, reducing the attention they have to pay to other pieces of the programming task. By contrast, letting the language runtime handle memory management frees up the programmer to think about other things but at the cost of losing control of the precise memory behavior of the program. This is a familiar tradeoff in software engineering as you climb the ladder of tool sophistication: you can get more done if you hand things off to the language—for instance, using C rather than assembly, or using R or Python rather than C—or to third party components, but at the cost of having to mostly just live with whatever behavior other people have decided upon.
I started this post talking about Rust, which represents a new point in the design space: programmer-managed memory but designed so it prevents unsafe operations (unlike C and C++). In theory you might think that this would remove cognitive burden from programmers because your mistakes are less serious, but it also frontloads that burden because it forces you to think about architectural issues upfront rather than just having the program appear to work but fail in the field (as with C and C++). I think comparing Go and Rust is instructive here: they were designed at roughly the same time but Go is vastly easier to learn than Rust, in large part due to its memory model. On the other hand, while Go is clearly carving out a serious niche, I think it's clear that Rust is the new language of choice for hardcore systems programmers. This isn't only due to the memory model, of course, but the Rust memory model is part of a general philosophy of programmer control and semantic transparency, in contrast to Go, which is designed specifically for ease of use.
Despite the age of these ideas, as an industry, I don't think we're done exploring the design space here. As an analogy, consider high versus low-level languages. As I mentioned above, higher level languages often have worse performance than lower level languages, so it's not uncommon for engineers to write big parts of their program in one language and then drop down to a lower-level language for performance critical code: we see this when C programs use inline assembly and R or Python programs write modules in C or C++. One possibility is to try something similar with memory management, namely to have GC most of the time but then (safely!) escape back to non-GCed mode for certain chunks of code,[13] though hopefully in a more idiomatic way than inline assembler. I've seen some systems that seem to be exploring this space (e.g., the Boehm GC above, or Objective C's garbage collection), but nothing in really wide use. Whatever the approach, I don't think we're done here!
It was especially bad on earlier versions of Firefox because (1) much of the UI was written in JavaScript and (2) the same JS VM was used for Web content and the browser UI. ↩︎
Probably because this could be on the stack. ↩︎
Most likely not on the stack, at least in a naive implementation. ↩︎
I actually originally wrote this program in Python, but I forgot that Python has a reference counting GC and so the objects were being freed at the end of every loop anyway. ↩︎
In my code, I actually make a dummy free object in this space to make the memory display system, which also scans linearly, work properly. ↩︎
This particular approach is due to Leslie Lamport. ↩︎
Some allocators, for languages like C and C++, partition memory not just by size but by object type, which helps prevent type confusion attacks due to use-after-free, when an object of type `T` is freed and reallocated as an object of type `U` but some dangling pointer of type `T*` tries to use it as an object of type `T`. This isn't really an issue for memory-safe languages, though. ↩︎
"Region-based" means that it uses multiple chunks of memory rather than one contiguous heap. "NUMA-aware" means it understands the specific memory architecture of the machine and will try to use faster (processor-local) memory rather than slower memory. "Colored pointers" means that the GC stores metadata in otherwise-unused bits of the pointers themselves. Load and store barriers refer to whether the kind of barriers we described above happen when you read pointers or write them. ↩︎
I do want to emphasize here that Rust's design also gives you thread safety for free, whereas Go's does not. ↩︎
This isn't a guarantee because the program might, for instance, crash. ↩︎
For instance, suppose that the programmer allocates an array. They might traverse the array by incrementing a pointer relying on the count of elements to be able to recover the original pointer to the beginning. In this case, there might not be a pointer to the start of the array during the GC phase. ↩︎
It's technically legal to have a pointer to just after the end of an array to allow people to write for loops, but you can't dereference it. ↩︎
Rust sort of does the opposite where it lets you jump into unsafe mode, outside of Rust's guarantees. ↩︎