Changing garbage collectors

2013-05-02

Well, it seems that replacing the old garbage collector with VCGC improved matters markedly. The results of ApacheBench with the new garbage collector are as follows:

$ ab -n 100 -c 1 http://localhost:8080/news
This is ApacheBench, Version 2.3
Copyright 1996 Adam Twiss, Zeus Technology Ltd,
Licensed to The Apache Software Foundation,

Benchmarking localhost (be patient).....done

Server Software:
Server Hostname: localhost
Server Port: 8080

Document Path: /news
Document Length: 2260 bytes

Concurrency Level:      1
Time taken for tests:   8.271 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      233400 bytes
HTML transferred:       226000 bytes
Requests per second:    12.09 [#/sec] (mean)
Time per request:       82.710 [ms] (mean)
Time per request:       82.710 [ms] (mean, across all concurrent requests)
Transfer rate:          27.56 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    72   83  20.6     76     193
Waiting:       41   49  14.9     44     134
Total:         72   83  20.6     76     193

Percentage of the requests served within a certain time (ms)
  50%     76
  66%     77
  75%     79
  80%     83
  90%     98
  95%    132
  98%    169
  99%    193
 100%    193 (longest request)

This is a rather major improvement, I should think. Using VCGC improved matters by a factor of nearly 3 on average, and the result is only 1.6 times slower than disabling garbage collection completely. It’s still a long way from Anarki, which gets 1.865 ms on average on my machine.

Here’s what the profiler finds:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
10.07 2.24 2.24 3185 0.00 0.00 gc
8.34 4.10 1.86 144932952 0.00 0.00 TYPE
6.97 5.65 1.55 57782583 0.00 0.00 envval
4.63 6.68 1.03 88860737 0.00 0.00 nextcont
4.00 7.57 0.89 83769798 0.00 0.00 TYPE
3.69 8.39 0.82 89950076 0.00 0.00 contenv
3.53 9.17 0.79 69084312 0.00 0.00 mark
3.33 9.91 0.74 6093257 0.00 0.00 arc_hash_final
3.19 10.62 0.71 188313998 0.00 0.00 TYPE
3.06 11.30 0.68 72275198 0.00 0.00 MARKPROP
2.83 11.93 0.63 1129927 0.00 0.00 __arc_update_cont_envs
2.38 12.46 0.53 7127191 0.00 0.00 bibop_alloc
1.80 12.86 0.40 15619398 0.00 0.00 arc_hash_update
1.75 13.25 0.39 434522 0.00 0.00 __arc_vector_marker
1.66 13.62 0.37 40391647 0.00 0.00 __arc_getenv
1.53 13.96 0.34 71091716 0.00 0.00 TENVR
1.53 14.30 0.34 1436710 0.00 0.00 __arc_vmengine
1.39 14.61 0.31 72274982 0.00 0.00 __arc_wb

The garbage collector is still the largest single component, but its share has gone down from nearly 30% in the previous version to only about 10%. The next largest contributor is the TYPE function, which is supposed to be inlined. The system is spending 8% of its time doing type dispatching, making 145 million such calls to serve the news page 100 times, so anything we can do to either reduce how often it is called or make it run faster will improve matters greatly.

However, I think this is a battle we can fight after the next release, and before then I’ll work on adding some small improvements as well, such as command line parameters, load paths, additional functions, and so forth.

Profiling Arcueid

2013-04-30

After compiling Arcueid with profiling support and running news.arc under ApacheBench to generate 100 requests, the results are rather interesting, though not entirely unexpected:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
28.50 11.29 11.29 287 0.04 0.09 gc
11.37 15.80 4.51 197333024 0.00 0.00 TYPE
3.91 17.35 1.55 3340957 0.00 0.00 __arc_vector_marker
3.89 18.89 1.54 35946478 0.00 0.00 arc_hash_update
3.57 20.30 1.42 190118632 0.00 0.00 mark
3.36 21.63 1.33 57789411 0.00 0.00 envval
2.85 22.76 1.13 88862762 0.00 0.00 nextcont
2.52 23.76 1.00 9775784 0.00 0.00 arc_hash_final
2.32 24.68 0.92 95741775 0.00 0.00 TYPE
2.09 25.51 0.83 142012842 0.00 0.00 VINDEX
1.92 26.27 0.76 89952347 0.00 0.00 contenv
1.77 26.97 0.70 188319700 0.00 0.00 TYPE
1.74 27.66 0.69 9540002 0.00 0.00 hash_lookup
1.72 28.34 0.68 287 0.00 0.00 free_unused_bibop
1.50 28.94 0.60 11201327 0.00 0.00 cons_marker
1.41 29.50 0.56 1130474 0.00 0.00 __arc_update_cont_envs
1.16 29.96 0.46 7119452 0.00 0.00 bibop_alloc
1.14 30.41 0.45 19305490 0.00 0.00 VINDEX

It seems that almost 30% of the system’s time is being spent in the garbage collector. I suppose the next step really is to make use of a more efficient GC algorithm, or to use what we have today less aggressively.

Nested quasiquotes

2013-04-26

There has been a long-standing issue in Arcueid: dealing with nested quasiquotes. Apparently reference Arc doesn’t handle this case all that well either, and as a result none of the code I use for compatibility testing, such as news.arc, makes use of nested quasiquotes. However, macro expansions in general may well produce nested quasiquotes, so it is important that this issue be fixed. In my searches I ran across a link to an implementation of a Common Lisp-style quasiquote as Arc macros. Converting it to C is an interesting exercise… you might say the result is an example of a macro in Arcueid’s C code, although it isn’t implemented precisely that way, but rather as a hook in the compiler, since quasiquotation is a fundamental part of Arc.

The basic example of a nested quasiquote is the following:

``(+ 1 ,,@(list 2 3) 4)

Current Arc 3.1 and Anarki both produce (+ 1 2 4) when asked to evaluate that. Common Lisp produces (+ 1 2 3 4), which is, I think, what we should be doing. Scheme (Racket) raises an error when presented with the same expression, and Arcueid 0.1.1 and below produce a similar error. Akkartik mentions that this worked in some previous version of reference Arc, though he can’t remember which. I think we should match Common Lisp’s results for nested quasiquotes.

Arcueid 0.1.1 released

2013-04-25

Well, that was a bit fast. Arcueid 0.1.1 is released. You can download it here.

This is just a bugfix release, but a rather significant one: it is the first version of Arcueid that can run news.arc completely. Several subtle bugs in 0.1.0 caused it to fail to run news.arc correctly, most notably in its handling of ssyntax and its treatment of the andf operator. These have been fixed, and news.arc now runs properly.

Performance is sorely lacking, however, and it’s probably unstable, but I hope to address both in forthcoming releases. The current version gives the following timings when running news.arc under ApacheBench:

Concurrency Level:      1
Time taken for tests:   23.292 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      233500 bytes
HTML transferred:       226100 bytes
Requests per second:    4.29 [#/sec] (mean)
Time per request:       232.915 [ms] (mean)
Time per request:       232.915 [ms] (mean, across all concurrent requests)
Transfer rate:          9.79 [Kbytes/sec] received

Connection Times (ms)
min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   166  233  34.6    227     372
Waiting:      108  207  34.4    200     332
Total:        166  233  34.6    227     372

Percentage of the requests served within a certain time (ms)
50%    227
66%    242
75%    251
80%    252
90%    281
95%    308
98%    349
99%    372
100%    372 (longest request)

Rather dismal, I’m sorry to say, but a large part of that seems to be the garbage collector, which runs much too aggressively; turning it off increases performance rather dramatically:

Concurrency Level:      1
Time taken for tests:   5.162 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      233500 bytes
HTML transferred:       226100 bytes
Requests per second:    19.37 [#/sec] (mean)
Time per request:       51.624 [ms] (mean)
Time per request:       51.624 [ms] (mean, across all concurrent requests)
Transfer rate:          44.17 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    44   52   7.1     50     102
Waiting:       22   24   2.3     24      37
Total:         44   52   7.1     50     103

Percentage of the requests served within a certain time (ms)
  50%     50
  66%     51
  75%     52
  80%     52
  90%     58
  95%     66
  98%     72
  99%    103
 100%    103 (longest request)

I’ll be working on improving garbage collection performance and other improvements over the next few releases, as well as addressing some stability issues observed over the testing I’ve been doing.

Arcueid 0.1.0 released

2013-04-23

Arcueid 0.1.0 is released. You can download it here.

Version 0.1.0 is an almost complete rewrite and incorporates too many changes to list, but we do have the following notable ones:

  • Better performance (I think) because we now avoid copying stuff from the stack into the heap unless absolutely necessary
  • C foreign function interface revised
  • Readline persistent history (although it has some problems that I haven’t been able to trace yet)
  • 32-bit support now seems to be working fine

It can sorta run news.arc but it still has some problems, most notably that it’s not possible to post URLs as stories.

Status update: imminent release

2013-04-23

Well, after a few more days of work, Arcueid 0.1.0 is very nearly ready for release. You can check out the experimental branch from Github if you like and see how it runs. It’s now somewhat usable, more stable, and definitely faster than the 0.0.x series. It can almost run news.arc properly, although there are some problems (e.g. posting URLs in stories crashes it with a failed apply error), and it passes only 291 of the 313 Conan Dalton unit tests, not counting the double-bar symbol tests that reference Arc inadvertently inherited from MzScheme/Racket, which I have no intention of ever implementing in full generality.

I’ll do a little cleaning up before producing the final 0.1.0 release in a few days or so.

The Garbage Man’s Blues

2013-04-19

Arcueid 0.1.0 is now almost complete. The only thing still missing before it can run news.arc is the networking code, but that shouldn’t be difficult to do. I have, however, gotten into some trouble with garbage collection, which is a painful business. I’m starting off with a basic mark-and-sweep garbage collector, and now I rather regret having removed the old write barrier code from my original implementation: it could have been used not just to support the VCGC we tried in the 0.0.1 series, but to do a lot of other interesting things.

First of all, a write barrier would have enabled us to defer actually allocating an object until its reference was actually put somewhere, like in an environment, in a register, as part of a data structure, or on the stack, permitting us to implement essentially what amounts to a 1-bit reference count. Of course, that also means that the free list for BiBOP objects has to be implemented as a queue, or something of that nature. If a value is allocated from within a foreign function and is never stored in a variable, then the allocation technically uses no memory. Since we use __arc_mkaff everywhere to call foreign functions from C code, that produces a small allocation of something that almost always immediately turns into garbage. Many other similar scenarios arise in the interpreter for values that are essentially ephemeral in the same way, so many small objects may never need to be truly “allocated” at all, as they become garbage almost instantly.

The garbage collector also seems to be slowing the current code down significantly as we use it now. Essentially, we perform a collection every time a cycle of all the other threads has finished executing, with a default quantum of 1024 instructions, and this slows the code down noticeably. I’ve been playing with the Ackermann function as a benchmark, and up to (ack 3 4) Arcueid seems comparable in speed to reference Arc on Racket. For (ack 3 5), however, it already pauses noticeably on my machine, which I suspect is the garbage collector, given that increasing the quantum to 65,536 produces a marked improvement, since GC then runs less often. I’ll try to get profiler evidence of this as well.

If I’m going to stick with mark-and-sweep rather than going back to VCGC, I wonder under what circumstances it is best for the collector to run. As I said, we now run it after a thread’s time slice completes, but that is not the only way. Ruby invokes its GC whenever the free lists become empty, and perhaps we can do something similar: run the GC before creating a new BiBOP page, say. We could also measure how much memory we are using, set thresholds, and perform a collection whenever a threshold is reached.

I’ve tried to design the system so as to make garbage collection as abstracted as possible, so that we can swap out GC algorithms as easily as possible. The only thing missing for now is a write barrier mechanism, and that needs to be used whenever value writes are performed to anything that could potentially be reachable from the root set, viz.:

  1. Writes to registers that can contain Arcueid values, e.g. to the accumulator
  2. Writes to variables in an environment.
  3. scar or scdr operations
  4. Writes to vector indices (including writes to hash tables)

This should be simple enough to do without making too many mistakes, by changing the macros that have hitherto been used to do these things to stop producing C lvalues, so that we get an error on compilation like ‘lvalue required as left operand of assignment’. This tells us where to change them so we can put in a function or macro that incorporates the write barrier. This also suffices to provide hooks where we can put in a read barrier if a GC algorithm needs it.

As it is, the compiler alone has at least 74 places that need to be updated… Now that I’ve made some progress on these changes, it is gradually becoming obvious that the code should not be written in C. While I’ve done many clever things with the C preprocessor to support all of this, I think I’ve just about reached the limits of what’s possible. The AV(…) macro for reading variables in a function’s local environment is already error-prone, and now there is a WV(…) macro for writing to them as well. Using an unsigned char type internally for these variables has gone some way toward making this less error-prone (the compiler may or may not complain when such a type is used where a value is required), but I really do think some other language halfway between C and Arc is needed. Perhaps a PreArc, analogous to Richard A. Kelsey’s PreScheme? Something to consider.

Maybe I’ll call it Shiki?