C Extensions and the Garbage Collector

•2012-08-27 • Leave a Comment

Life really has a way of catching up with you when you least expect it. Just last week I and my whole family were very nearly killed in a car crash while heading home from a vacation. Luckily though, all that happened was that my car is a total wreck, but I, my wife, my son, and my mother all walked away from the crash completely unscathed. The huge truck that hit us only managed to destroy my car. Also, the police basically gave a report that blames the entire incident on the truck driver, so with any luck the trucking company’s insurance will cover the repairs to my car.

Well, now that the shock of that event has sort of died down for me, I should get back to work (and well, they all say that trying to get back to normal life is a good way of getting over such an event), and so here I am writing where I left off last time, and figuring out that another interesting problem has come up with the plan I came up with for dealing with C extensions and giving them the power to call Arcueid bytecode while at the same time preserving the multitasking semantics of the Arcueid  thread scheduler. And so here I am, tumbling down the rabbit hole, which winds up going much deeper than I anticipated. Several requirements, each of them seemingly very simple by themselves, have combined to increase the complexity of the Arcueid interpreter further than I expected:

  1. It should be possible to call Arcueid bytecode from within C code,
  2. Pre-emptive multitasking should be possible by means of a green thread scheduler: execution of bytecode invoked by a C extension should also be subject to this scheduler, and
  3. The garbage collector is a concurrent garbage collector scheduled itself as though it were a thread in the green thread scheduler.

The first requirement means that we have to have continuations that can restore execution to some point in C code. The second requirement means that these continuations must have essentially indefinite extent, and the third requirement means that any Arcueid values which are local to the C function and any C function that might have called it must also be examined by the garbage collector if the continuation it generated is seen in a garbage collection cycle.

Which of course makes this whole business messy in the extreme. Any continuations that contain saved stacks which are seen from the root set must now be inspected for Arcueid values, and should be marked accordingly by the garbage collector. This means that Arcueid values must then be recognizable as such from within a saved stack frame. I don’t know if this comes up as a problem in Guile, which also uses the stack copying technique I’m trying to implement, but trying to study the code of their interpreter is probably not such a good use of my time.

Well, the simplest way to do this would be to prefix a block of all Arcueid values within every C function with some fixed magic number which should be easy for my code to recognize. Follow it by the number of values within the block, and then end it with yet another magic number just to be sure. The garbage collector looks for the pattern of opening magic number, number of values, actual values, then closing magic number. We can set this up by defining a bunch of macros that must be used every time C code has at least one value in its local variables. Of course, not using these macros will cause problems if they are not used, and there’s a lot of old C code in the interpreter that needs to be amended. I’ll basically have to go over every function I ever wrote, but I guess that this was bound to happen sooner or later. Obviously this should be done for every function except those in the garbage collector itself.

Another possible way would be to save the difference between the address of a value and the bottom of the stack in a variable local to a thread. When a continuation is made from within C code, this variable gets cleared. Only trouble is, when a function returns, it must also remove the values it added from the list, and I don’t think I can write macros that do it efficiently.

Anyone out there with better ideas, or perhaps even thoughts on how better to approach this problem than this way, I’m listening.

Rethinking the implementation of Continuations

•2012-08-17 • Leave a Comment

Well, life has a way of catching up with you when you least expect it, and so I have once again left the Arcueid code base fallow for well over six months. Now, I hope, I’ll have a bit more time to work on it, and perhaps I’ll be able to do something more with it in this stretch, provided that the work I do which pays the bills doesn’t get in the way too much.

Running Conan Dalton’s testsuite on Arcueid has uncovered what appear to be some bugs in the continuation handling behaviour of the virtual machine core, especially that of ccc (call/cc).  Someone rooting around in the random writings I have in the Arcueid wiki on Github might have noticed me mentioning something about a calling convention 4, and the variant continuations it generates. In retrospect, I now think that this idea was ill-conceived, relying as it does on Simon Tatham’s coroutine technique which is rather unorthodox to say the least.

To permit C extensions (and naturally functions within the Arcueid runtime itself of course) to call Arc code, we could of course simply create another virtual machine engine context and use that to execute the code, and when it terminates, the return value is returned to the calling C code. This is what Ruby and apparently Python do. However, we want to have full pre-emptive multitasking in the virtual machine core, and doing this precludes that unless Arcueid gets native threads, and that’s a whole other can of worms which I’m not yet ready to open at this time.

A variation of the above technique could also be done, using an Appel Trampoline the way Chicken Scheme does it, but then that would mean that I’d have to rip the guts out of the garbage collector and make fundamental changes in Arcueid’s memory allocation, and that is not something I think I can do just now.

The approach I’m using today makes use of Simon Tatham’s coroutines to generate what amount to continuations capable of restoring to arbitrary locations in C code. This imposes a number of limitations on C functions that want to be able to call Arcueid bytecode, the most notable being that all variables must be Arcueid values, and these C functions cannot in turn call other C functions of this type either. It seems that interactions between these types of continuations and the normal type of continuations are the source of the bugs we have, and they seem to be fundamental.

The fourth technique, and perhaps the one I am about to code, involves copying the stack. This is actually something that we already do: it is currently done to allow I/O calls that would otherwise block the thread scheduler completely to return to the thread scheduler and also be restored when I/O will no longer block. Apparently, the technique is used by Guile and a few other Lisp and Scheme implementations to provide continuations and call/cc, so we aren’t exactly treading in strange ground here.

Other techniques for implementing reified continuations seem to require that interfacing with C code become more complicated, and well, one of my design goals for Arcueid is to produce an Arc implementation that can interface with C easily, so those techniques, while they may make for continuations and call/cc that may be much more efficient than stack copying or these other techniques I mention, run counter to my design goals.

That said, I have to wonder just what kind of a performance hit we would be looking at if we had only one continuation type that consisted of a copy of the current stack. For pure Arc code running only on the virtual machine the C stack of the virtual machine would not be getting any deeper, as these saved stacks would include only the local variables of the virtual machine main loop (arc_vmengine in vmengine.c). It’s only when we have deeply nested C calls that the stack copies will ever get much deeper than this. Perhaps multitasking pre-emption could also be done in this way…?

Well, let me code this and see how it goes.

I’ll be back

•2012-05-26 • Leave a Comment

I’ve gotten very busy with the work which pays the bills over the past two months and haven’t had as much time to work on Arcueid as I would wish. I’ll be back really soon.

Onions in the Varnish

•2012-03-03 • Leave a Comment

Work on Arcueid right now as I prepare for a 0.1.0 release comes down to three things:

  1. Minor cleanup of small loose ends
  2. Miscellaneous features, such as load paths, command line switches, and other minor controls
  3. A clean port to 32-bit x86

The last is what is consuming a large part of my development work these days, and my plans for doing it involve writing an exhaustive set of unit tests that verify the behavior of Arcueid’s compiler and built-in functions against that of Paul Graham’s Arc 3.1. Conan Dalton has directed me towards a set of unit tests that he devised for his Rainbow implementation of Arc, and running the tests against Arcueid has made me see just how much accidental behavior, which feel a lot like what Paul Graham has called onions in the varnish, but unlike most such onions, these ones are unintentional and artifacts of the implementation of Arc 3.1 under Racket.

For instance, Mr. Dalton has included a lot of tests that involve the use of vertical bars (|) inside of symbols, which behavior is, as far as I am aware, not provided by anything in ac.scm, which does absolutely nothing special with symbols, except to provide ssyntax. I suppose this is also why ssyntax doesn’t use the vertical bar either. So the vertical bar behavior is actually a property of the underlying Scheme core, and this behavior is not actually mandated by any of the R^nRS Scheme specifications, which actually don’t give a mandate on what characters are legal in symbols as far as I can tell. As a result, different Scheme implementations do different things when faced with bars in symbols: e.g. Scheme48 will declare the symbol as containing invalid characters, and Guile, like Arcueid today, will actually create a symbol with vertical bars. Racket makes use of | as an escape character from the looks of things, and well, if we are striving for maximum compatibility, I do suppose I’ll have to implement this behavior in Arcueid eventually. This does, however, smell a lot like an onion sizzling in the varnish to me. I feel that Arc really should develop its own rules for what characters inside symbols ought to be special, and how characters are supposed to be escaped, especially since Arc’s special syntax depends on it. Perhaps Arc 4 will address this issue.

Anyhow, Arcueid’s Git Head now passes 295 out of 313 of Rainbow’s tests, minus the vertical bar tests which Arcueid’s parser still will not accept as valid. There are a lot of subtleties involving numerics. Arcueid uses many naive algorithms for conversions of strings and floating point, especially floating point string constants in bases other than 10. These will need to be improved. The last 18 failed tests involve a lot of subtleties I hadn’t thought of before, and making these tests pass will probably be a lot of work.

Arcueid 0.0.12 released

•2012-02-29 • Leave a Comment

Arcueid 0.0.12 is released. You can download it here.

Release notes for Version 0.0.12:

  • 32-bit compatibility fixes (still incomplete though, but enough that you get a working REPL)
  • Miscellaneous bug fixes (e.g. with getb and string coercions)
  • is and iso will now accept any number of arguments
  • current-thread implemented
  • atstrings are now supported

Arcueid’s rendering of news.arc now sort of looks the way news.arc is rendered by standard versions of Arc. There are still enough bugs here and there that I won’t put a 0.1.0 label on it but it is a bit more stable.

There are still some issues with 32-bit support, most of them related to the fact that some parts of the code continue to believe that fixnums will have more than 31 bits.

A testsuite is being built that will attempt to validate the behavior of the entire Arc runtime as completely as possible.

32-bit support

•2012-02-25 • Leave a Comment

My primary computer is running Ubuntu Oneiric x86-64, and as such most of Arcueid’s coding and testing is done in a 64-bit environment. As such, 32-bit support is not as well tested, and I have wound up allowing it to deteriorate to the point that 0.0.11 won’t even run properly on 32-bit platforms because of alignment issues. I got a bug report about that today, and it was related to the fact that memory blocks returned by the allocator were not aligned on 16-byte boundaries on 32-bit platforms. 16-byte alignment of all allocated memory is an absolute requirement for Arcueid to properly function, because it uses the bottom four bits to distinguish various immediate values that are represented by something other than a cell structure, such as fixnums (bit 0 is 1), true, nil, undefined/unbound, and symbols. A pointer to a boxed value must have all four low bits 0. All other even values in the low four bits, 0 – nil, 2 -true, 4 – undef, 6 – unbound, 14 – symbol, have special meaning. As such, all memory allocations must be properly aligned on 16-byte boundaries for this to be true. I tried to take care of doing this but the fact that many structures came out to multiples of 16 bytes in x86-64 masked some bugs. Specifically, the Bhdr struct that is used by the garbage collector to prefix all allocated memory is precisely 48 bytes in size on x86-64, so even if I didn’t bother doing any alignment fixes at all, my allocations would still wind up being naturally aligned on 16-byte boundaries thanks to that. Of course, none of that is true on 32-bit x86, where the size of a Bhdr is only 36 bytes. This proved to be a bit messy, but after a few false starts I managed to get alignment to work on 32-bit platforms as well.

Well, I do want to get Arcueid to run on other machines besides 64-bit GNU/Linux, and while 32-bit systems on the desktop might be going away, there are still lots of other 32-bit machines out there, such as most ARM-based tablets such as my Asus Transformer.

32-bit support is partially fixed in git head at the moment, enough that Arcueid loads and runs instead of immediately segfaulting, and most basic functionality is available, however there are a number of other places where 32-bit support seems to fail. There seem to be places where the code seems to rely on the fact that fixnums are more than 31 bits in size. Arcueid can run news.arc somewhat stably on 64-bit, but more stuff is broken on 32-bit. Looks like a lot of work ahead.

Arcueid 0.0.11 released

•2012-02-20 • Leave a Comment

Arcueid 0.0.11 is released. You can download it here.

Release notes for Version 0.0.11:

  • Big bag of pages memory allocator
  • Many bug fixes, including permitting optional function arguments to refer to previously declared arguments, and proper expansion of !sym ssyntax
  • Changes in some built-in functions for better Arc 3.1 conformance

This version can actually load and run news.arc. It runs unstably though due to some interactions with threading and I/O, but this is being debugged. The server seems to have trouble with static files for some reason. To try this out, after installing Arcueid, go to the arc/news directory, create an arc directory there, and create a file there named admins with the login names of your administrative users (just as with running news.arc on Arc 3.1). Then, start Arcueid, and do the following:

Arcueid 0.0.11 REPL Copyright (c) 2012 Rafael R. Sevilla
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; for details type (warranty)
arc> (load "libs.arc")
nil
arc> (load "news.arc")
nil
arc> (nsv)
rm: cannot remove `arc/news/story/*.tmp': No such file or directory
load items: 
ready to serve port 8080
accepting requests

Then go to http://localhost:8080/news, create an account for an admin user in the admins file you created previously, and log in.

Arcueid’s interpretation of news.arc is still broken in a lot of ways, obviously, which is why this is still a 0.0.11 release instead of 0.1.0. But we’re getting there, gradually!

Big bag of pages

•2012-02-17 • Leave a Comment

Implementing the big bag of pages memory allocator proved to be not all that difficult. Arcueid still uses malloc to allocate objects greater than 512 bytes in size, but smaller objects are allocated using the big bag of pages. A simple free list is maintained for each of the sizes from 1 to 512 bytes, and if the free list for a particular size is found empty, a new chunk of memory (marked immutable) is allocated that should be big enough to hold up to 64 (by default) objects of that size. This chunk is carved out into objects of the requested size and added to the free list.

This approach has shown a large performance increase. Running the ab test as below produces the following results:

$ ab -n 100 -c 10 http://localhost:8080/said
This is ApacheBench, Version 2.3
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:
Server Hostname:        localhost
Server Port:            8080

Document Path:          /said
Document Length:        166 bytes

Concurrency Level:      10
Time taken for tests:   5.282 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      24000 bytes
HTML transferred:       16600 bytes
Requests per second:    18.93 [#/sec] (mean)
Time per request:       528.151 [ms] (mean)
Time per request:       52.815 [ms] (mean, across all concurrent requests)
Transfer rate:          4.44 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:   440  522  99.9    497     857
Waiting:      408  489  95.8    464     808
Total:        440  523 100.0    497     857

Percentage of the requests served within a certain time (ms)
  50%    497
  66%    508
  75%    518
  80%    522
  90%    615
  95%    850
  98%    856
  99%    857
 100%    857 (longest request)

This is almost twice the speed , showing that memory allocations are a large factor dominating performance: obviously the simple free list that is now used for objects in our big bag of pages is much, much faster than malloc. This has also reduced total memory usage somewhat, though not that much, using 37 megs on initial load of arc.arc, 150 megs after loading libs.arc, and 253 megs after the above test, vs. 41, 170, and 283 respectively before implementing BiBOP.

One thing that the current  BiBOP implementation cannot do is release memory back to the OS. Trying to make a shot at doing this will in any case make the free list not such a simple free list any more, and hence slow down memory allocations as well. We’ll probably have to use mmap instead of malloc to get such memory blocks, and the block size should be made larger than it is today, and some algorithms need to be devised for managing the multiple free lists efficiently.

One other thing: we are getting very close to being able to run news.arc! Some bug fixes involving ssyntax (available in git head) now allow news.arc to finally load correctly. However, there still seem to be other bugs though that prevent it from running properly, that’s also being fixed.

Watch the size of your structs…

•2012-02-15 • Leave a Comment

I’ve gotten somewhat careless. I hadn’t noticed that the struct defining a thread is now 392 bytes, so its inclusion in the Arcueid cell structure means that the struct had become 400 bytes in size. This meant that every boxed object used up at least 400 bytes! By modifying the way thread objects are allocated, I removed the explicit definition of the thread struct from a cell, and so reduced the size of the cell struct to 48 bytes. Now, on loading arc.arc, the initial virtual size of the Arcueid process has gone down from 130 megs to only 41. After loading libs.arc, the virtual size has gone down from over 400 megs to about 170. After using ab to do the same test as below, virtual size is 283 megs, from over 1.6 gigs before. It’s still a lot, but a lot less than before, and the rest of the excess memory usage seems to come largely from fragmentation. Actual memory usage as reported by the garbage collector shows only about 3 megs of actual memory used by stuff reachable via the roots, after loading arc.arc, and 7.8 megs after loading libs.arc. Fragmentation still appears to be a big problem.

Arcueid 0.0.10 released

•2012-02-14 • Leave a Comment

Arcueid 0.0.10 has been released. You can download it here.

Release notes for Version 0.0.10:

  • More garbage collection bug fixes
  • Tail recursion optimization
  • Additional Arc libraries included

This is the first version of Arcueid that can actually serve up web content and can run the Arc challenge code. To try doing this, after installing Arcueid go to the arc/news directory under the Arcueid source, run Arcueid and do the following:

Arcueid 0.0.10 REPL Copyright (c) 2012 Rafael R. Sevilla
This is free software; see the source code for copying conditions.
There is ABSOLUTELY NO WARRANTY; for details type (warranty)
arc> (load "libs.arc")
*** redefining arg
nil
arc> (defop said req
arc>   (aform [onlink "click here" (pr "you said: " (arg _ "foo"))]
arc>     (input "foo")
arc>     (submit)))
#
arc> (serve)
ready to serve port 8080

Try going to http://localhost:8080/said afterwards. Add as many defop definitions as you please. There’s still a bug that prevents us from running news.arc but I hope to fix that in the next version to be released. It’s still very slow though, but I am a firm believer in Knuth’s maxim against premature optimization.