Optimising tail recursion

•2014-04-05 • Leave a Comment

I implemented the solution described by Kelsey in this last commit, and while it does seem to work (it no longer crashes, and memory usage seems bounded), it appears to be less than optimal. This is particularly obvious when running the original reference code that triggered the bug: while it no longer crashes or makes memory usage grow rapidly, it pauses embarrassingly every 7000 or so iterations and then proceeds. This appears to happen because the solution also causes large numbers of garbage environments to proliferate in the heap, creating more work for the garbage collector.

It ought to be possible to reuse a heap environment when it appears in a tail-recursive call, just as stack environments are reused, but how to do this is not entirely clear. Perhaps the env instruction can be changed to handle it. On a tail-recursive call, we can check whether the current environment is on the stack or on the heap. If it is on the stack, we need do nothing different from what we do today. If it is on the heap, we can move the required arguments from the stack into the heap environment, clear out the values of its other elements (to CUNBOUND), and allow the instructions subsequent to the env instruction to fill in the remaining environment values.

This needs some careful consideration.

Tail Recursive Stack Disciplines, redux

•2014-04-02 • 1 Comment

After taking once again a nearly year-long hiatus from Arcueid development I find that I’ve been thrust back into it thanks to Alexander Artemenko pointing out a fairly serious bug in it. The nature of the bug is fairly interesting, and is an artefact of my attempts last year to get Arcueid to do tail recursion with a stack.

The sample code that triggers the bug Mr. Artemenko discovered is fairly simple:

(let i 0 
  (while t
    (prn "Still going: " i)
    (++ i)))

This triggers a stack overflow exception because of the way the Arcueid interpreter handles moving environments from the stack to the heap. Expanding the macros and simplifying this we get the following:

((fn (i)
   (((fn (self)
       (assign self
               (fn (x)
                 ((fn ()
                    (prn "Still going: " i)
                    (assign i (+ i 1))))
                 (self t))))
     nil)
    t))
 0)

The Arc compiler compiles the innermost function

  (fn (x) ((fn () (prn "Still going: " i) (assign i (+ i 1)))) (self t))

into the following virtual machine instruction sequence:

0000  env  01,00,00  ; environment has one element (fn (x) ...)
0001  cont 07,43     ; prepare for call to anonymous function (go to 0005)
0002  ldl  00        ; load anonymous function code (fn () (prn "Still going: " i) (assign i (+ i 1)))
0003  cls            ; turn into closure
0004  apply 00       ; apply (fn () (prn "Still going: " i) (assign i (+ i 1))) with no args
0005  true           ; parameter for call to self (destination of continuation created at 0001)
0006  push
0007  lde   01,00    ; self (created by assignments above)
0008  menv  01       ; tail recursion prepare
0009  apply 01       ; apply self with t

The whole thing falls apart because the cls instruction on line 0003, which creates a closure out of a code object and the current environment, then has to move the current environment from the stack to the heap. This is required because the closure could potentially be returned or assigned somewhere beyond its original scope, and the values in that environment might be needed by a subsequent invocation of the function (ah, the joys of static binding). But what happens to the copy of the environment that's still on the stack? It's still there, but it has become garbage. And so, as lines 0005 through 0009 execute and the apply instruction on line 0009 returns control to line 0000, the garbage environments accumulate, and eventually the stack fills up. So much for our tail recursion!

There are several possible solutions. One would be to make the cls instruction destroy the environment on the stack after transferring it to the heap. That involves moving everything on the stack below it back upwards. The only problem with this is that continuations on the stack are referenced as absolute offsets into the stack, so moving a continuation like that is problematic. Kartik Agaram also discovered this when he tried to hack Arcueid to make its stack grow dynamically.

Rereading Richard Kelsey’s paper on tail-recursive stack disciplines for an interpreter yields what is possibly the simplest solution:

Another possibility is to wait until the stack is full and then reclaim any unused space. The stack can be treated as a constrained heap, containing only environments and continuations… The third possibility, copying the live data into a heap, reduces the possibility of thrashing… This is certainly the most attractive possibility for languages that already require a heap.

So what we can do is wait until stack space is exhausted, then move all environments and continuations still on the stack into the heap, and drop everything on the stack except any scratch values the program might still be computing with. Let’s see how this all works.

Three Implementation Models for Scheme

•2013-05-22 • Leave a Comment

Well, the main issue with Arcueid at the moment is that it is rather slow, probably about an order of magnitude slower than reference Arc built on MzScheme. A profiling analysis shows that one of the biggest performance-sucking parts of the system is the fact that it has to move stack-based environments onto the heap every time it makes a closure, which also means updating the pointers to those environments stored in continuations on the stack. This is the third most time-consuming operation, as can be seen from the profiling run in the last post (this is what the ‘nextcont’ and ‘contenv’ functions do). It would seem that this is a problem Scheme implementors have solved, and I found a dissertation by R. Kent Dybvig that discusses the known approaches to solving it.

I have been reading through the Dybvig dissertation and have found an idea that may solve the performance issues. The current version of Arcueid takes a great deal of care to ensure that when an environment used in a closure migrates to the heap, all continuations that refer to that environment also get updated. This is why it takes so much time: it goes through the entire continuation chain, updating every place where the migrated environment might appear. Dybvig notes, however, that it is unnecessary to be so careful. Multiple copies of the same environment can coexist, provided that assignments to their variables are forbidden. This is true for a pure functional programming style, but Arc obviously has assignment operators. He also mentions, however, that it is possible to accommodate such state mutations by means of what he calls boxes, which are used only for those variables to which assignments are done. Make a box for every variable that might conceivably be assigned from a closure, and set that as the value of the variable.

So we would add a box instruction that sets the value of a particular environment variable to a box, and modify the behaviour of the lde/ste instructions so that a load from a box yields the value the box references, and a store to a box location writes the value into the box. The box type would be essentially invisible to the rest of the system: it will never be seen outside of an environment, and only the GC and the virtual machine need know about it.

This has implications for C code that wants to use the same techniques though (e.g. the use of arc_dynamic_wind in arc_eval and our implementation of call/cc). Needs some thought on how this could be done.

Another thing about these boxes is that they remind me of a discussion I had with Pauan a few weeks ago on the Arc forum when proposing the design of a module system. Hyper-static scope was mentioned in this context, and apparently these same boxes that are pointers to other values are used as well. Perhaps they’ll wind up getting exposed as data types in their own right in the future as well, if we incorporate some of the ideas that came out of that discussion into Arcueid.

More performance tuning

•2013-05-15 • Leave a Comment

After a few more trials, I noticed that the GC quantum was set far too high, that we were using the full-blown TYPE function just to decide whether something is a symbol, and that the GC winds up running a lot, so I made a few minor changes that improved performance slightly. Reducing the GC quantum from 4096 objects checked per cycle to 768 had some effect (for the final cut in the next formal release I think I’ll set the GC quantum with an adaptive algorithm similar to how the Inferno/Dis garbage collector does things). Profiling an ApacheBench run of 100 requests on news.arc had the following rather interesting results:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 13.04      2.34     2.34     9911     0.00     0.00  gc
  7.69      3.72     1.38 57807893     0.00     0.00  envval
  5.24      4.66     0.94 71028697     0.00     0.00  MARKPROP
  4.40      5.45     0.79 64935366     0.00     0.00  nextcont
  3.85      6.14     0.69 65820037     0.00     0.00  contenv
  3.65      6.80     0.66  5852362     0.00     0.00  arc_hash_final
  3.09      7.35     0.56 48743506     0.00     0.00  TYPE
  2.90      7.87     0.52 139967108     0.00     0.00  TYPE
  2.56      8.33     0.46  7125944     0.00     0.00  bibop_alloc
  2.29      8.74     0.41 64390760     0.00     0.00  TYPE
  2.23      9.14     0.40  1116526     0.00     0.00  __arc_update_cont_envs
  1.95      9.49     0.35 13605816     0.00     0.00  arc_hash_update
  1.95      9.84     0.35 71028617     0.00     0.00  __arc_wb
  1.87     10.18     0.34  4605836     0.00     0.00  arc_restorecont
  1.78     10.50     0.32  1371627     0.00     0.00  __arc_vmengine
  1.62     10.79     0.29  6200889     0.00     0.00  __arc_mkenv
  1.62     11.08     0.29  5534865     0.00     0.00  hash_lookup

The envval function, which now takes 1.38 seconds of the total run time and was called over 57 million times, is the function that obtains environment values from the stack. Looking further down into the call graph, the following is seen:

                0.12    0.05 4891773/57807893     __arc_menv [39]
                0.30    0.13 12475968/57807893     __arc_putenv [30]
                0.97    0.41 40440152/57807893     __arc_getenv [12]
[10]    11.0    1.38    0.59 57807893         envval [10]
                0.37    0.00 57807893/64390760     TYPE [41]
                0.21    0.00 57807893/71107210     TENVR [54]
                0.00    0.01  551924/3639885     nextenv [113]

40 million of those calls came from __arc_getenv, which reads environment variables. The current __arc_putenv/__arc_getenv functions are general: they can obtain environments anywhere and check parent environments. So perhaps the next optimisation to try is special-case VM instructions for getting variables from a function’s own local environment (as opposed to its parent environments), implemented as inline functions embedded in the virtual machine itself for greater efficiency. Reading and writing function-local variables is done a great deal, and it should be as efficient as possible, preferably without even the overhead of a function call.

Arcueid 0.1.2 released

•2013-05-14 • 4 Comments

Arcueid 0.1.2 is out. You can download it here.

Lots of new features in this release. I’ve finally fixed the nested quasiquote bug that has been open for more than a year. Readline works better than it did in 0.1.1. There are bug fixes for memory leaks and garbage collector problems, and we have gone back to using the Huelsbergen-Winterbottom VCGC algorithm for garbage collection. Large file support is now available even on 32-bit systems, provided bignum support is also available. We have some regular expressions based on the Plan 9/Inferno regular expression library (and for now similarly limited). The load function is now built into the interpreter, and load paths are supported. And there is now a script mode so Arcueid can be used in a shebang line; for example, it is now possible to run a file with the executable bit set and the following contents

#!/usr/local/bin/arcueid --script
(prn (+ 1 1))

and that does what one expects.

Regular expressions are provided by means of the r/…/ syntax. Only basic stuff like character classes, the Kleene star and plus operators, the ? operator, alternation, and capturing groups is supported. Many Perl/POSIX constructions are still unavailable, but the regexp support should soon evolve to cover the most useful features; I hope to be able to add things like counted repetitions, character class abbreviations like \d, and non-capturing groups. To use a regexp, apply it to a string, e.g.:

arc> (r/(abc)(def)/ "zzzabcdefgh")
(3 ("abcdef" "abc" "def"))

This returns a list with the position of the match and a list of all the capture groups, starting with the entire substring matched by the regular expression; it returns nil if the regexp failed to match. The =~ macro can also be used. It binds $$ to the position at which the match was found, $0 to the whole matched portion, $1 to the first capture, and so on, similar to the way Perl does it, e.g.:

arc> (=~ r/(abc)(def)/ "zzzabcdefgh" (list $1 $2))
("abc" "def")

Load paths can be added by means of the loadpath-add function, which adds a directory to the load path list loadpath*.

Regular expressions progress

•2013-05-09 • Leave a Comment

Looks like I’m going to wind up using the Plan 9 regexp library after all. After reading some very informative articles by Russ Cox (this, this, and this), I think I now understand the internal workings of the Plan 9 regexp library well enough to be able to modify it to first of all work on Arcueid strings, and then add in some of the functionality to make it more useful, though it will still not be entirely compatible with Perl or POSIX regexps.  In particular, I don’t think I ever want to make Arcueid regexps support backreferences, which can only be implemented by the backtracking algorithm. However, I do hope to add in enough support for Perl/POSIX regexp features to make it sufficiently useful for most workaday uses of regexps.

I’ve also changed regexp syntax slightly. A regexp in Arcueid now looks sorta like r/…/. It seems that there’s no easy way to unambiguously parse a regexp in the Perl-like /…/ syntax and at the same time have operators like / for division. At least the r in front makes it unambiguous.

While at the moment Arcueid’s strings are simple arrays of Unicode runes (UCS-4), I have plans to make them something much more sophisticated by turning them into ropes. This should also permit cons operations like car and cdr to be used much more efficiently on strings, which is something Paul Graham has considered.

Regular Expressions in Arc

•2013-05-06 • 1 Comment

Seeing as regular expressions are so useful in my day to day programming in both Perl and Ruby, I think it would be a useful thing to have them built into Arcueid as well, but as I am doing this I wonder at how exactly it should be done.

Obviously, matching a regex should be just as easy as applying the regex, e.g. (/(a?)a?(a?aaa)/ "aaaaaaaaaaaaaaa") should be sufficient, but then what is the value of such an expression?  I suppose it should of course be nil if the match fails, but then what happens if the match succeeds?

There are several ways we could go about it:

  1. Just return an integer with the position in the string at which the match began, and then implicitly bind $1 to the first captured sub-string, $2 to the second, and so forth.  These variables are essentially global in scope.  It is how Perl and Ruby do it, though.
  2. Make a regex match return a list with the position at which the match began and any sub-string captures (backreferences) that were created.  So you can do something like: (let (pos sa sb sc) (/(a+)(b+)(c+)/ str) ... ). I do think this would be cleaner, and it would be possible I think to write a macro where one could do: (=~ /(a+)(b+)(c+)/ str ... $1 ... $2) where $1, $2, and $3 would be bound as in Perl.

And then there is the matter of a regular expression library.  I’m currently trying to adapt the Plan 9 regular expression library but have found that it doesn’t support a lot of the stuff I’ve come to expect from regular expressions: in particular, it doesn’t even support something as simple as non-capturing groups, i.e. (?: ...).

There are several other libraries I’ve been looking at:

  1. Oniguruma.  Suffers from two issues: first, it’s bigger than the rest of Arcueid at the moment.  Second, it does recursive backtracking and can take exponential time on some regexes.  Try matching /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/ against "aaaaaaaaaaaaaaaaaaaaaaaaaaaaa" in Perl or Ruby, which use the same algorithm.  That simple match takes over a minute on my 2.13 GHz Core i3 machine!  The same match is almost instantaneous with the Plan 9 regex library.
  2. libpcre.  Suffers from the same exponential time backtracking problems as Oniguruma.  Is also three-clause BSD, so I can’t use it without changing my license as well.
  3. TRE.  Seems to be better, but I’ll need to pare it down to something more minimal.

Other suggestions for a small (<5000 SLOC), efficient, Unicode-capable regex library are welcome. Text processing is something web apps do a lot of, and one of the basic tools for doing it is regular expressions. This, I think, deserves to be baked into the language runtime even more than support for big numbers (although that's something that might become baked in sometime soon too). Arcueid may even get support for taint mode à la Perl after this.

Changing garbage collectors

•2013-05-02 • Leave a Comment

Well, it seems that replacing the old garbage collector with VCGC improved matters markedly. The results of ApacheBench with the new garbage collector are as follows:

$ ab -n 100 -c 1 http://localhost:8080/news
This is ApacheBench, Version 2.3
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done

Server Software:
Server Hostname: localhost
Server Port: 8080

Document Path: /news
Document Length: 2260 bytes

Concurrency Level: 1
Time taken for tests: 8.271 seconds
Complete requests: 100
Failed requests: 0
Write errors: 0
Total transferred: 233400 bytes
HTML transferred: 226000 bytes
Requests per second: 12.09 [#/sec] (mean)
Time per request: 82.710 [ms] (mean)
Time per request: 82.710 [ms] (mean, across all concurrent requests)
Transfer rate: 27.56 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    72   83  20.6     76     193
Waiting:       41   49  14.9     44     134
Total:         72   83  20.6     76     193

Percentage of the requests served within a certain time (ms)
  50%     76
  66%     77
  75%     79
  80%     83
  90%     98
  95%    132
  98%    169
  99%    193
 100%    193 (longest request)

This is a rather major improvement, I should think. Using VCGC improved matters by a factor of nearly 3 on average, and it is now only 1.6 times slower than disabling garbage collection completely. It’s still a long way from Anarki, which can get 1.865 ms on average on my machine.

Here’s what the profiler finds:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 10.07      2.24     2.24     3185     0.00     0.00  gc
  8.34      4.10     1.86 144932952     0.00     0.00  TYPE
  6.97      5.65     1.55 57782583     0.00     0.00  envval
  4.63      6.68     1.03 88860737     0.00     0.00  nextcont
  4.00      7.57     0.89 83769798     0.00     0.00  TYPE
  3.69      8.39     0.82 89950076     0.00     0.00  contenv
  3.53      9.17     0.79 69084312     0.00     0.00  mark
  3.33      9.91     0.74  6093257     0.00     0.00  arc_hash_final
  3.19     10.62     0.71 188313998     0.00     0.00  TYPE
  3.06     11.30     0.68 72275198     0.00     0.00  MARKPROP
  2.83     11.93     0.63  1129927     0.00     0.00  __arc_update_cont_envs
  2.38     12.46     0.53  7127191     0.00     0.00  bibop_alloc
  1.80     12.86     0.40 15619398     0.00     0.00  arc_hash_update
  1.75     13.25     0.39   434522     0.00     0.00  __arc_vector_marker
  1.66     13.62     0.37 40391647     0.00     0.00  __arc_getenv
  1.53     13.96     0.34 71091716     0.00     0.00  TENVR
  1.53     14.30     0.34  1436710     0.00     0.00  __arc_vmengine
  1.39     14.61     0.31 72274982     0.00     0.00  __arc_wb

The garbage collector is still the largest component, but its share has gone down from nearly 30% in the previous version to only 10%. The next largest contributor is the TYPE function, which is supposed to be inlined. The system is spending 8% of its time doing type dispatching: it makes 145 million such calls when serving the news page 100 times, so anything we can do to either reduce the number of calls or make each one faster will improve matters greatly.

However, I think this is a battle we can fight after the next release, and before then I’ll work on adding some small improvements as well, such as command line parameters, load paths, additional functions, and so forth.

Profiling Arcueid

•2013-04-30 • Leave a Comment

After compiling Arcueid with profiling information, and running news.arc with ApacheBench to generate 100 requests, the results seem to be rather interesting, although not quite unexpected:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 28.50     11.29    11.29      287     0.04     0.09  gc
 11.37     15.80     4.51 197333024     0.00     0.00  TYPE
  3.91     17.35     1.55  3340957     0.00     0.00  __arc_vector_marker
  3.89     18.89     1.54 35946478     0.00     0.00  arc_hash_update
  3.57     20.30     1.42 190118632     0.00     0.00  mark
  3.36     21.63     1.33 57789411     0.00     0.00  envval
  2.85     22.76     1.13 88862762     0.00     0.00  nextcont
  2.52     23.76     1.00  9775784     0.00     0.00  arc_hash_final
  2.32     24.68     0.92 95741775     0.00     0.00  TYPE
  2.09     25.51     0.83 142012842     0.00     0.00  VINDEX
  1.92     26.27     0.76 89952347     0.00     0.00  contenv
  1.77     26.97     0.70 188319700     0.00     0.00  TYPE
  1.74     27.66     0.69  9540002     0.00     0.00  hash_lookup
  1.72     28.34     0.68      287     0.00     0.00  free_unused_bibop
  1.50     28.94     0.60 11201327     0.00     0.00  cons_marker
  1.41     29.50     0.56  1130474     0.00     0.00  __arc_update_cont_envs
  1.16     29.96     0.46  7119452     0.00     0.00  bibop_alloc
  1.14     30.41     0.45 19305490     0.00     0.00  VINDEX

It seems that almost 30% of the system’s time is being spent in the garbage collector. I suppose the next step really is to make use of a more efficient GC algorithm, or to use what we have today less aggressively.

Nested quasiquotes

•2013-04-26 • Leave a Comment

There has been a long-standing issue in Arcueid: dealing with nested quasiquotes. Apparently reference Arc doesn’t handle this case all too well either, and as such none of the code I use for compatibility testing, like news.arc and others, makes use of nested quasiquotes. However, macro expansions in general may well produce cases where nested quasiquotes arise, so it is important that this issue be fixed. Anyhow, in my searches I ran across a link to an implementation of a Common Lisp-style quasiquote as Arc macros. It’s an interesting exercise converting it to C. I suppose you might say that this is an example of a macro in Arcueid’s C code; however, it isn’t precisely implemented that way, but as a hook in the compiler, since quasiquotation is a fundamental part of Arc’s functionality.

The basic example of a nested quasiquote is the following:

``(+ 1 ,,@(list 2 3) 4)

Current Arc 3.1 and Anarki both produce (+ 1 2 4) when asked to evaluate that. Common Lisp produces (+ 1 2 3 4), which is, I think, what we should be doing. Scheme (Racket) produces an error when presented with the same. Arcueid 0.1.1 and below produce a similar error. Akkartik mentions that this was working in some previous version of reference Arc, though he can’t seem to remember when. I think we should produce the same results as Common Lisp for nested quasiquotes.