Symbols and symbol garbage collection: Arcueid vs. Ruby

I don’t really know how other Lisp implementations represent symbols, but I am somewhat familiar with how Ruby does it, having studied the source code a while back when I was hoping to be able to contribute to the efforts underway some years ago to improve Matz’s Ruby Interpreter (MRI), that later became the Ruby 1.9 series. Alas, my attempts to break into hacking the interpreter never got anywhere as I got tangled in my other duties, but I digress.

Anyhow, one thing that Arcueid has to do that Ruby doesn’t is symbol garbage collection. Symbols in Ruby are known as a possible source of memory leakage, as such creating large numbers of symbols at runtime is a practice frowned upon in Ruby programming. However, creating large numbers of symbols at runtime is standard Arc programming practice, with functions like uniq that are widely used in macro programming. I thus realized that not having symbol garbage collection was unacceptable for any Arc interpreter, so I had to take a shot at doing it.

Arcueid, influenced by MRI, represents symbols as what are essentially small-ish integers. There is a lastsym entry in the arc struct that is incremented every time arc_intern is used to intern a new symbol that has never been interned before. This becomes the symbol’s identifier (ID), and it is stored into two hash tables, which are fields symtable and rsymtable in the arc struct. The symtable is indexed by the names of symbols, and gives their ID’s, while the rsymtable is indexed by symbol IDs and gives their names. The symbol ID is converted into an actual symbol value by the macro ID2SYM (this macro is also present in MRI and does more or less the same thing), and a SYM2ID macro does the reverse when symbols are being manipulated. The arc_intern function does all this, looking for the symbol name in the symtable, returning its value, otherwise installing the new symbol in the symbol tables. This has the nice effect of making symbols cheap to compare against each other, as they are immediate values just like fixnums, and takes care of making symbols unique in the interpreter. However, I realized soon enough that this approach made garbage collecting symbols somewhat complicated, and this was probably the reason why MRI doesn’t do it at all. I nevertheless came up with a solution, which proved not to be all that complicated, but I had stumbled several times trying to get it to work right in all cases.

Anyway, my approach to garbage collecting symbols represented in this way involves treating the symbol tables specially. MRI makes the symbol tables part of the garbage collector root set, which obviously makes garbage collecting symbols impossible. What Arcueid does is something less than that. It marks the symbol table objects as immutable, and as such, they can never be garbage collected themselves. The buckets inside the symbol tables, which represent the actual symbol data, are not immutable though, and a bucket in the symbol tables must be referenced by some other object reachable from the root set, or else it will not be marked and will wind up getting freed by the sweeper. When the GC marker sees a symbol it will then look for the symbol’s buckets in symtable and rsymtable, and will mark them. Symbols which are not seen in this way will not have their buckets in the symbol tables marked, and they will be removed by the sweeper, which has to remove the entries in the slots of the symbol tables they used also. I ran into a number of pitfalls here but I think I’ve corrected most all of them in the code in Git today.

Well, when a symbol that has been garbage collected is created again somehow, it will obviously wind up with a different actual ID (a new one) from when it was created the first time. But I will ask why this should matter, because when it got garbage collected the first time no part of the garbage collection root set knew about its existence. But if it matters (and I’m not convinced that it does) I can think of a very easy way to get around this. Compute a hash value based on the string representation of the symbol with some suitable hash function (I use this hash function in Arcueid), and use that hash value as its ID rather than having a monotonically incrementing ID value as above. Hmm, I may actually use this approach in the future as it may obviate the need for having a reverse and forward symbol table. Something to consider.

I don’t see any reason why Ruby cannot use this approach as well. The only problem may be that Ruby does not treat hash table buckets as separate objects in their own right, allowing them to be garbage collected on their own. I had to write my own hash table code to do this explicitly.

~ by stormwyrm on 2012-02-10.

Posted in arcueid, devnotes, ruby

Arcueid 弧