Sequencing Calls into CHICKEN Scheme
Embedding CHICKEN Scheme into a multithreaded C program isn’t straightforward.
What happens when multiple native threads want to call into Scheme?
And how fast can we make it?
This post walks through experiments I ran with callseq, an experimental
callback sequencer for CHICKEN Scheme 5.4.
The problem: multiple native threads, one Scheme runtime
CHICKEN Scheme embeds neatly into C. You can export a function with
define-external and call it from native code. But trouble starts if multiple
native threads attempt to do so. CHICKEN runtime heavily depends on the stack,
keeping objects there for longer periods of time. That is a fundamental part
of the garbage collection mechanism implemented in CHICKEN (see this amazing
post for details). If multiple threads try to call into the runtime,
references to stack objects won’t match, and the program might hang or outright
crash. In general, the thread that initializes the runtime owns it during the
whole execution.
The mailing list pointed me to two approaches to support multiple threads:
- CRUNCH Scheme — a R7RS subset where memory management is based on reference counting.
- Concurrent Native Callbacks (
cncb) — an egg that lets you proxy calls via a dispatcher thread.
I wasn’t ready to jump into CRUNCH because I’d have to give up a large portion
of the features that make CHICKEN powerful. Instead, I tried cncb, but found
it much slower than I expected.
Benchmark
Here is a simple benchmark that measures C->Scheme call overhead with cncb and
other mechanisms. Setup: a native thread (C thread) calls one Scheme function
(foo) 1 million times, and we measure the time in C, around the loop:
#define SAMPLES 1000000
void bench_run(void) {
size_t count = SAMPLES;
nanosec_t ts = now();
while (count--) {
(void)foo(count);
}
nanosec_t then = now();
printf("tput: %.2fop/s\tavg latency: %0.0fns/op\n",
SAMPLES/in_sec(then-ts), 1.0*(then-ts)/SAMPLES);
exit(0);
}
Scheme baseline:
(define-external (foo (integer x)) int
(+ x 100))
All code built with -O3. CHICKEN module compiled with -O3 -d0. Platform:
macOS, Apple M1.
Results
| variant | throughput (op/s) | avg latency (ns/op) |
|---|---|---|
| plain | 10741173.17 | 93 |
| cncb | 278135.99 | 3595 |
| ucontext | 673281.25 | 1485 |
| assembly | 8065283.63 | 124 |
The plain variant is our ideal baseline: a single thread does direct calls to CHICKEN runtime. The cncb variant employs the recommended library. As far as I understand, the caller thread encodes/serializes the arguments of the call in a request message, pushes that into a UNIX pipe; a dispatcher thread (which owns the CHICKEN runtime) pops the request message, deserializes it, executes it, serializes the response, and pushes it back to the caller. Functional, but ~40× slower.
The “dedicated stack” trick
The table above also shows the following alternative solution implemented in two ways.
Key idea: instead of proxying via a dispatcher thread, keep a reserved stack S for CHICKEN. Any native thread that needs Scheme:
- Takes a lock (only one thread in Scheme at a time).
- Swaps its stack to S.
- Calls the external Scheme function.
- Swaps back and releases the lock.
The ucontext variant employs the POSIX interface with the same name to
create a dedicated stack S, which does not directly belong to any native
thread (check man 2 ucontext in your system for more information). A native
thread that wants to call into Scheme swaps its own stack with S, calls the
runtime, and then swaps the stacks back. There is no cost of serialization or
writing/reading UNIX pipes. That is definitely faster than cncb: ~1.5 μs per
call.
We can do better, though. ucontext calls are syscalls, they always cross to
the kernel. With the assembly variant, I create and swap a dedicated stack
S, but with a tiny AArch64 assembly snippet, which is responsible for saving
registers and switching the sp register (stack pointer) directly. The
performance is much closer to the baseline: ~124 ns.
Note on portability: this assembly approach is architecture-specific. On Apple M1 (AArch64) it’s about 20 lines of assembly. A similar trick can be done on x86-64, but it requires writing the equivalent stack-switching stub.
ucontextremains the portable fallback.
Callseq: packaging the idea and spicing it up
The callseq egg packages this into a usable workflow:
(import (chicken platform) callseq)
(define-sequenced-callback (foo (int a)) int
(* a a))
(return-to-host)
From C:
void callseq_init(void);
int foo(int a);
int main(void) {
callseq_init();
printf("%d\n", foo(4)); // 16
}
See examples and benchmarks.
Flat-combining
This sequencer also implements a form of flat-combining. If multiple threads try to call the runtime at the same time, the thread that manages to take the lock not only serves its own request but also a (configurable) number of requests of the waiting threads.
In other words:
- Thread A wins the lock.
- Instead of just serving A’s request, it also batches a few pending requests from B, C, and D.
- That way, the cost of swapping to the Scheme stack is paid once for multiple calls.
This mechanism amortizes the cost of swapping the stack and may reduce cache misses since the active thread’s core cache is already hot.
A detailed performance benchmark of flat-combining in this context is still left for future work. :D
Takeaway
- If you just need correctness,
cncbworks. - If you need speed, a dedicated stack plus lightweight context switch gives near-native performance.
callseqmakes this approach usable in practice.
⚠️ Disclaimer: This is an experimental approach. It works reliably in my benchmarks, but it is not an officially supported CHICKEN mechanism. If you depend on rock-solid portability, stick to the officially maintained libraries.