Garbage Collector

In computer science, garbage collection (GC) is a form of automatic memory management. The garbage collector attempts to reclaim memory which was allocated by the program, but is no longer referenced—also called garbage.

Wow, in CS241E, for A11, I need to implement an automatic garbage collector. F.

Teacher cites this classic paper:

An automatic garbage collector is a memory manager that finds free blocks automatically.

We say that a block in the Heap is reachable if:

  1. the address of the block is in a block on the stack, or
  2. the address of the block is recursively stored in a reachable block

Cheney’s Copying Garbage Collection

In CS241E, we looked at the implementation of a specific algorithm called Cheney’s Copying Garbage Collector.

Wow, this algorithm is actually pretty clever, but the main speed slowdown is copying all the data over, which I don’t know if that is efficient? Actually yes, because else you need to move the memory blocks anyways, you might need to move less memory blocks but just finding the blocks might be a pain?

Cheney’s algorithm conceptually splits the Heap into 2 halves (called semispaces):

  1. from-space
  2. to-space

New blocks are allocated in the from-space. A pointer (heapPointer) keeps track of the first free address in the from-space, so allocation requires only incrementing this pointer, much like in the simple heap allocator from Assignment 6. (THIS IS SUPER SIMPLE!!)

When the heapPointer reaches the end of the from-space, a compaction is triggered, which copies all reachable blocks from the from-space to the beginning of the to-space.

  • These blocks are placed contiguously next to each other

What is the difference with Heap Allocation?

  • Malloc goes through the list of free blocks, and finds the first block that is big enough
  • Free finds this block we want to free, and updates the pointers

Now, the blocks in the from-space are no longer needed and thus abandoned. We then flip between from-space and to-space. heapPointer is set to the new from-space.

When the heapPointer reaches the end of the new from-space, another compaction occurs. Ahh, this makes sense!

To make this general description more concrete, let us first look at how we initialize the heap. We will make use of three constants:

  • heapStart, the address of the first word in the heap,
  • heapEnd, the address after the last word in the heap, and
  • heapMiddle, the address halfway between heapStart and heapEnd.

We can split the heap into two equal semispaces 1 between heapStart and heapMiddle, 1 between heapMiddle and heapEnd

In CS241E A11, the first 1/4 of memory is reserved for the code of the MIPS program, the second 1/4 of memory is the first semispace of the heap, the third 1/4 of memory is the second semispace of the heap, and the last 1/4 of memory is reserved for the stack.

We will designate two registers to implement the heap:

  • heapPointer, as already mentioned, holds the address of the first free word in the from-space, and
  • fromSpaceEnd holds the address after the last word in the from-space.

The fromSpaceEnd register serves both to identify which of the two semispaces is currently the from-space and to enable us to check when allocation has reached the end of the fromspace. These two registers are initialized as follows, making the first semispace the from-space and the second semispace the to-space:

def init() = {
	heapPointer = heapStart
	fromSpaceEnd = heapMiddle
}

To allocate a block of wanted bytes, this time including the header, we increment the heapPointer by wanted, (we did the same thing with the simple heap allocator in Assignment 6):

def allocate(wanted) = {
	if(heapPointer + wanted > fromSpaceEnd) {
		gc()
	}
	val oldHeapPointer = heapPointer
	heapPointer = heapPointer + wanted
	oldHeapPointer
}

If we have called init, followed by several calls to allocate to allocate some blocks, the overall structure of memory looks like this:

Okay, let’s see the code for the actual garbage collector, which contains the logic for doing Compaction.

def gc() = {
	val toSpaceStart = if(fromSpaceEnd == heapMiddle) heapMiddle
						else heapStart
	var free = toSpaceStart
	var scan = dynamicLink // top of stack, not incl. frame of gc
	while(scan < maxAddr) { // 1st while loop: Traverse the stack
	forwardPtrs(scan)
	scan = scan + size(scan)
	}
	scan = toSpaceStart
	while(scan < free) { // 2nd while loop: Traverse the to-space
		forwardPtrs(scan)
		scan = scan + size(scan)
	}
	fromSpaceEnd = toSpaceStart + semiSpaceSize
	heapPointer = free
	def forwardPtrs(block) = {
		for each offset o in block that holds an address {
			val newAddr = copy(deref(block + o))
			assignToAddr(block + o, newAddr)
		}
	}
	// copy block from from-space to to-space, return new address
	def copy(block) = {
		if(block is not in from-space) { block }
		else {
			if(size(block) >= 0) { // not yet copied
				copyChunk(free, block)
				setNext(block, free) // leave forwarding address
				setSize(block, 0 - size(block)) // mark block as copied
				free = free + size(free)
			}
			next(block) // return forwarding address
			}
		}
}

Wait… I don’t understand how the above algorithm works…? How does it work

  • The variable free keeps track of the first free address in the to-space (much like heapPointer)
  • The variable scan is used to loop through all blocks on the stack
  • The procedure forwardPtrs
    • looks through a block on the stack for all words that represent addresses and calls the copy procedure on each such address
    • replaces the old from-space address with the newAddr
  • The procedure copy copies a block from the from-space to the to-space and returns its new address

IMPORTANT: the frame of the procedure that calls gc must be on the stack, not on the heap. This is so that the dynamicLink correctly points to the block on the stack below the frame of gc.

The copy procedure:

  • Returns address unchanged if block is not in the from-space
  • Else, copies it to the free part of the to-space and increments the free pointer. It also marks the original block as copied, for example by setting its size word to a negative number, and leaves the address of the copy in the to-space in the block, for example in the second, next word (avoids the thing getting copied multiple times)

What is the point of the 2nd for loop??

  • Because the 1st loop does not account for all reachable blocks in the fromspace, since a block can be reachable indirectly by an address in another reachable block, rather than directly by an address on the stack. In addition, the blocks that have been copied to the to-space may contain addresses inside them, and these addresses still point into the from-space. 2nd loop resolves all of these issues

Ahh, i see, and during this process of copying, you care about the forwarding address, which is why this is important.

History

Garbage collection was invented by American computer scientist John McCarthy around 1959 to simplify manual memory management in Lisp.

C and C++

C and C++ don’t have automatic garbage collection. They have a DIY Philosophy: Whatever mess you create, you need to clean up yourself.

title: Some Thoughts
I mean, this philosophy is fine, cuz it forces you to keep track of your errors and not be careless, which is good, but the con is if you actually make a mistake it can be really hard to fix the memory leak.

Other Languages

Other languages like Java, C#, Python use automatic garbage collection to reclaim “dead” variable storage, so you don’t need to do anything yourself!

  • The language run-time system periodically runs a little routine in the background that grabs all of the objects not currently used