Garbage Collector
In computer science, garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced—also called garbage.
Wow, in CS241E, for A11, I need to implement an automatic garbage collector. F.
Teacher cites this classic paper:
An automatic garbage collector is a memory manager that finds free blocks automatically.
We say that a block in the Heap is reachable if:
- the address of the block is in a block on the stack, or
- the address of the block is recursively stored in a reachable block
Cheney’s Copying Garbage Collection
In CS241E, we looked at the implementation of a specific algorithm called Cheney’s Copying Garbage Collector.
Wow, this algorithm is actually pretty clever, but the main speed slowdown is copying all the data over, which I don’t know if that is efficient? Actually yes, because else you need to move the memory blocks anyways, you might need to move less memory blocks but just finding the blocks might be a pain?
Cheney’s algorithm conceptually splits the Heap into 2 halves (called semispaces):
- from-space
- to-space
New blocks are allocated in the from-space. A pointer (heapPointer) keeps track of the first free address in the from-space, so allocation requires only incrementing this pointer, much like in the simple heap allocator from Assignment 6. (THIS IS SUPER SIMPLE!!)
When the heapPointer
reaches the end of the from-space, a compaction is triggered, which copies all reachable blocks from the from-space to the beginning of the to-space.
- These blocks are placed contiguously next to each other
Now, the blocks in the from-space are no longer needed and thus abandoned. We then flip between from-space and to-space. heapPointer
is set to the new from-space.
When the heapPointer reaches the end of the new from-space, another compaction occurs. Ahh, this makes sense!
To make this general description more concrete, let us first look at how we initialize the heap. We will make use of three constants:
- heapStart, the address of the first word in the heap,
- heapEnd, the address after the last word in the heap, and
- heapMiddle, the address halfway between heapStart and heapEnd.
We can split the heap into two equal semispaces → 1 between heapStart and heapMiddle, 1 between heapMiddle and heapEnd
In CS241E A11, the first 1/4 of memory is reserved for the code of the MIPS program, the second 1/4 of memory is the first semispace of the heap, the third 1/4 of memory is the second semispace of the heap, and the last 1/4 of memory is reserved for the stack.
We will designate two registers to implement the heap:
- heapPointer, as already mentioned, holds the address of the first free word in the from-space, and
- fromSpaceEnd holds the address after the last word in the from-space.
The fromSpaceEnd register serves both to identify which of the two semispaces is currently the from-space and to enable us to check when allocation has reached the end of the fromspace. These two registers are initialized as follows, making the first semispace the from-space and the second semispace the to-space:
To allocate a block of wanted
bytes, this time including the header, we increment the heapPointer
by wanted
, (we did the same thing with the simple heap allocator in Assignment 6):
If we have called init, followed by several calls to allocate to allocate some blocks, the overall structure of memory looks like this:
Okay, let’s see the code for the actual garbage collector, which contains the logic for doing Compaction.
Wait… I don’t understand how the above algorithm works…? How does it work
- The variable
free
keeps track of the first free address in the to-space (much likeheapPointer
) - The variable
scan
is used to loop through all blocks on the stack - The procedure
forwardPtrs
- looks through a block on the stack for all words that represent addresses and calls the
copy
procedure on each such address - replaces the old from-space address with the
newAddr
- looks through a block on the stack for all words that represent addresses and calls the
- The procedure
copy
copies a block from the from-space to the to-space and returns its new address
IMPORTANT: the frame of the procedure that calls gc must be on the stack, not on the heap. This is so that the dynamicLink
correctly points to the block on the stack below the frame of gc
.
The copy
procedure:
- Returns address unchanged if block is not in the from-space
- Else, copies it to the free part of the to-space and increments the free pointer. It also marks the original block as copied, for example by setting its size word to a negative number, and leaves the address of the copy in the to-space in the block, for example in the second, next word (avoids the thing getting copied multiple times)
What is the point of the 2nd for loop??
- Because the 1st loop does not account for all reachable blocks in the fromspace, since a block can be reachable indirectly by an address in another reachable block, rather than directly by an address on the stack. In addition, the blocks that have been copied to the to-space may contain addresses inside them, and these addresses still point into the from-space. 2nd loop resolves all of these issues
Ahh, i see, and during this process of copying, you care about the forwarding address, which is why this is important.
History
Garbage collection was invented by American computer scientist John McCarthy around 1959 to simplify manual memory management in Lisp.
C and C++
C and C++ don’t have automatic garbage collection. They have a DIY Philosophy: Whatever mess you create, you need to clean up yourself.
Other Languages
Other languages like Java, C#, Python use automatic garbage collection to reclaim “dead” variable storage, so you don’t need to do anything yourself!
- The language run-time system periodically runs a little routine in the background that grabs all of the objects not currently used