Wednesday, 31 January 2018

Background reading on the reference counting vs tracing garbage collection debate

Eight years ago I answered a question on Stack Overflow about the suitability of OCaml and Haskell for soft real-time work like visualization:
"for real-time applications you will also want low pause times from the garbage collector. OCaml has a nice incremental collector that results in few pauses above 30ms but, IIRC, GHC has a stop-the-world collector that incurs arbitrarily-long pauses"
My personal experience has always been that RAII in C++ incurs long pauses when using non-trivial data (i.e. nested, structured, collections of collections of collections, trees, graphs and so on), non-deferred reference counting has the same problem for the same reason, tracing garbage collectors like OCaml work beautifully but there are many notoriously bad tools like Java that have given tracing garbage collection a bad name.
Now that I am revisiting this issue I am surprised to find many individuals and organisations repeating exactly the same experimental tests that I did and coming to the same conclusions.
Pusher published a blog post Low latency, large working set, and GHC’s garbage collector: pick two of three that explains how, after building their production solution in Haskell in the interests of correctness, they were forced to drop the Haskell programming language due to insurmountable problems with GC latency. They chose to switch to Google's Go language instead but I think OCaml would have served them well. I'd also like to stress that people in industry neglecting performance as a functional requirement is a bugbear for me!
Gabriel Schrerer published his research as a blog post Measuring GC latencies in Haskell, OCaml, Racket where he presented a related GC benchmark with results showing that OCaml attains 2ms max GC pauses compared 38ms with tuned Racket (formerly known as PLT Scheme) and to 51ms with Haskell.
Twitch published a post entitled "Go’s march to low-latency GC" that described their five year journey from GC pauses lasting tens of seconds down to 1ms using Go.
Hashtable latencies on Github is a web server benchmark ported to a variety of tracing garbage collected languages as well as Swift (which uses reference counting). Although at the beginning they say "the Swift version is mostly used as a reference of what would be ideal" at the end they actually conclude "Want to build low-latency, memory-intensive services in a GCed language? Use OCaml (or Reason if you prefer the syntax)". Note in particular their graph of measured latency profiles which shows no significant difference between OCaml and Swift at any scale:
Here is a discussion in the Rust subreddit about reference counting woes where the author ends up admitting "it might be interesting to deliberately leak the memory and just hope that the OS is smart enough to swap out pages that aren't being referenced any more".
Google's V8 team have blogged about their garbage collector including the post Getting Garbage Collection for Free about optimising their GC for latency by deferring collection work until idle time.