Sequencing Calls into CHICKEN Scheme

Embedding CHICKEN Scheme into a multithreaded C program isn’t straightforward. What happens when multiple native threads want to call into Scheme? And how fast can we make it? This post walks through experiments I ran with callseq, an experimental callback sequencer for CHICKEN Scheme 5.4.

The problem: multiple native threads, one Scheme runtime

CHICKEN Scheme embeds neatly into C. You can export a function with define-external and call it from native code. But trouble starts if multiple native threads attempt to do so. CHICKEN runtime heavily depends on the stack, keeping objects there for longer periods of time. That is a fundamental part of the garbage collection mechanism implemented in CHICKEN (see this amazing post for details). If multiple threads try to call into the runtime, references to stack objects won’t match, and the program might hang or outright crash. In general, the thread that initializes the runtime owns it during the whole execution.

The mailing list pointed me to two approaches to support multiple threads:

I wasn’t ready to jump into CRUNCH because I’d have to give up a large portion of the features that make CHICKEN powerful. Instead, I tried cncb, but found it much slower than I expected.

Benchmark

Here is a simple benchmark that measures C->Scheme call overhead with cncb and other mechanisms. Setup: a native thread (C thread) calls one Scheme function (foo) 1 million times, and we measure the time in C, around the loop:

#define SAMPLES 1000000

void bench_run(void) {
  size_t count = SAMPLES;
  nanosec_t ts = now();
  while (count--) {
    (void)foo(count);
  }
  nanosec_t then = now();
  printf("tput: %.2fop/s\tavg latency: %0.0fns/op\n",
          SAMPLES/in_sec(then-ts), 1.0*(then-ts)/SAMPLES);
  exit(0);
}

Scheme baseline:

(define-external (foo (integer x)) int
  (+ x 100))

All code built with -O3. CHICKEN module compiled with -O3 -d0. Platform: macOS, Apple M1.

Results

variant throughput (op/s) avg latency (ns/op)
plain 10741173.17 93
cncb 278135.99 3595
ucontext 673281.25 1485
assembly 8065283.63 124

The plain variant is our ideal baseline: a single thread does direct calls to CHICKEN runtime. The cncb variant employs the recommended library. As far as I understand, the caller thread encodes/serializes the arguments of the call in a request message, pushes that into a UNIX pipe; a dispatcher thread (which owns the CHICKEN runtime) pops the request message, deserializes it, executes it, serializes the response, and pushes it back to the caller. Functional, but ~40× slower.

The “dedicated stack” trick

The table above also shows the following alternative solution implemented in two ways.

Key idea: instead of proxying via a dispatcher thread, keep a reserved stack S for CHICKEN. Any native thread that needs Scheme:

  1. Takes a lock (only one thread in Scheme at a time).
  2. Swaps its stack to S.
  3. Calls the external Scheme function.
  4. Swaps back and releases the lock.

The ucontext variant employs the POSIX interface with the same name to create a dedicated stack S, which does not directly belong to any native thread (check man 2 ucontext in your system for more information). A native thread that wants to call into Scheme swaps its own stack with S, calls the runtime, and then swaps the stacks back. There is no cost of serialization or writing/reading UNIX pipes. That is definitely faster than cncb: ~1.5 μs per call.

We can do better, though. ucontext calls are syscalls, they always cross to the kernel. With the assembly variant, I create and swap a dedicated stack S, but with a tiny AArch64 assembly snippet, which is responsible for saving registers and switching the sp register (stack pointer) directly. The performance is much closer to the baseline: ~124 ns.

Note on portability: this assembly approach is architecture-specific. On Apple M1 (AArch64) it’s about 20 lines of assembly. A similar trick can be done on x86-64, but it requires writing the equivalent stack-switching stub. ucontext remains the portable fallback.

Callseq: packaging the idea and spicing it up

The callseq egg packages this into a usable workflow:

(import (chicken platform) callseq)

(define-sequenced-callback (foo (int a)) int
  (* a a))
(return-to-host)

From C:

void callseq_init(void);
int foo(int a);

int main(void) {
  callseq_init();
  printf("%d\n", foo(4)); // 16
}

See examples and benchmarks.

Flat-combining

This sequencer also implements a form of flat-combining. If multiple threads try to call the runtime at the same time, the thread that manages to take the lock not only serves its own request but also a (configurable) number of requests of the waiting threads.

In other words:

This mechanism amortizes the cost of swapping the stack and may reduce cache misses since the active thread’s core cache is already hot.

A detailed performance benchmark of flat-combining in this context is still left for future work. :D


Takeaway

⚠️ Disclaimer: This is an experimental approach. It works reliably in my benchmarks, but it is not an officially supported CHICKEN mechanism. If you depend on rock-solid portability, stick to the officially maintained libraries.