Tuesday, 26 December 2017

Does reference counting really use less memory than tracing garbage collection? Mathematica vs Swift vs OCaml vs F# on .NET and Mono

Our previous post caused some controversy by questioning the validity of some commonly-held beliefs: specifically, the beliefs that reference counting (RC) always requires less memory than tracing garbage collection (GC) and that tracing GCs require 3-4x more memory in order to perform adequately. We presented a benchmark written in Swift and OCaml and noted that the RC'd Swift implementation ran over 5x slower and required over 3x more memory than the tracing GC'd OCaml implementation. This observation disproves these beliefs in their strong form and calls into question whether there is any validity in their weak forms. After all, we have never seen any empirical evidence to support these beliefs in any form.
We received a lot of criticism for that post. The vast majority of it was not constructive, but two valid points did arise. Firstly, a single anomalous result is not conclusive: it would be more compelling to see the benchmark repeated across a wider variety of production languages using both RC and tracing GC, because this would help to reduce the error caused by differences other than GC strategy. Secondly, it would be more compelling to repeat the study with different benchmark tasks in case this one problem (computing symbolic derivatives) is anomalous. This post addresses the former concern.
We implemented solutions equivalent to the original Swift and OCaml, this time in F# and Mathematica. The F# uses a tracing garbage collector inherited from its runtime (either Mono or .NET). Mathematica is a domain-specific language (DSL) designed specifically for computer algebra that uses its own heavily-optimized RC implementation, so it stands the best possible chance of being able to compete with the original OCaml. These additional measurements should therefore usefully extend the applicability of our findings.
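To make the task concrete: the benchmark repeatedly differentiates a growing symbolic expression. A minimal Swift sketch of the idea (the names and exact cases are illustrative, not the benchmark code itself):

enum Expr {
    case int(Int)
    case variable(String)
    indirect case add(Expr, Expr)
    indirect case mul(Expr, Expr)
}

// Differentiate e with respect to the variable x. Every call allocates a fresh
// expression tree, which is what stresses the memory management system.
func d(_ e: Expr, _ x: String) -> Expr {
    switch e {
    case .int: return .int(0)
    case .variable(let v): return .int(v == x ? 1 : 0)
    case .add(let f, let g): return .add(d(f, x), d(g, x))
    case .mul(let f, let g): return .add(.mul(f, d(g, x)), .mul(g, d(f, x)))
    }
}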
The F# 4.0 code was run on Mono 4.6.2 directly on the Raspberry Pi 3, alongside the Swift and OCaml, and also on .NET 4.7 on Windows 10 (i.e. a different architecture and platform!). Our results for Mathematica represent excess memory consumption over the initial consumption (because Mathematica has a uniquely huge standard library) and were obtained running Mathematica 11.2 in the cloud and 10.0.1 on the Raspberry Pi 3.
According to our measurements, the solutions using tracing GCs require less memory than the solutions using RC in every single case:
Again, we find the common belief that RC requires less memory to be unjustified.
Just out of interest, here are the run times too (on Raspberry Pi 3 only):

11 comments:

Félix said...

Jon, in the last post you said that you reached out to see if someone could optimize the Swift program and that no one came forward. I gave such a suggestion in a comment on that post (use a UnicodeScalar instead of a String for the Expr.Var case) and found that it reduced the memory usage by 50MB on my Mac. Since that's a very significant amount of memory directly attributable to how strings are handled rather than to RC overhead, and since you did seek advice earlier, don't you think that it would be reasonable to incorporate that change?

Are you comfortable drawing conclusions from GCed programs where the GC never runs? Do you find that representative of how these programming languages are actually used?

Jon Harrop said...

@Felix: I hadn't seen that before and definitely want to investigate it, yes. Am I correct in thinking that it restricts the Swift to a single character while the OCaml and F# are still using arbitrary strings (albeit ASCII rather than Unicode in the case of OCaml)? If so, that is a bit dodgy. Also, by what mechanism does it reduce memory consumption by 50MB when there is only one string in the entire program?

"Are you comfortable drawing conclusions from GCed programs where the GC never runs? Do you find that representative of how these programming languages are actually used?"

The GCs are obviously running a lot on this benchmark, so that doesn't apply here, but I have used code that doesn't require the GC to run in other applications. One of the benchmarks for HLVM was a ray tracer that didn't incur any GC cycles in its steady state, and it justified my conclusion that language designs that elide the GC can be valuable. I have used allocation-free code in production to drive Infiniband efficiently from F# for a low-latency trading system, but that is rare.

Félix said...
This comment has been removed by the author.
Félix said...

(Had a slight inaccuracy in the previous comment.)

Because Swift has mutable objects, it distinguishes objects with reference semantics (classes) from objects with copy semantics (structs, enums, tuples). Objects with copy semantics are typically (but not always) passed as their own direct payload and not by address; if you have an (Int, Int) tuple or a struct Foo { Int, Int } (strawman syntax) and you pass it to a function, at the machine level both are indistinguishable from a function that accepts two Ints.

In Swift, Strings, Arrays, Dictionaries and Sets all have copy semantics (they're structs with pointers and metadata about their payload). Off the top of my head, the String struct that Swift uses looks like (payload: char*, length: Int, flags: Int), so that part gets embedded anywhere you use it, as a function parameter or as a tuple member, which is what happens in your enum payload. On a 64-bit machine, the enum type has to be at least 24 bytes just to house the String value, whereas it's at least 8 when you use a UnicodeScalar (the UnicodeScalar itself is 4 bytes, but there are other enum cases that contain a pointer/reference).
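For concreteness, this is easy to check with MemoryLayout; a tiny sketch (the enum shapes are illustrative, and the exact numbers depend on the Swift version):

enum PayloadString { case variable(String), int(Int) }
enum PayloadScalar { case variable(UnicodeScalar), int(Int) }

print(MemoryLayout<String>.size)         // three words on the Swift version discussed here
print(MemoryLayout<PayloadString>.size)  // must be able to hold the String payload inline
print(MemoryLayout<PayloadScalar>.size)  // a UnicodeScalar payload is only 4 bytes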

All of String, Array, Dictionary, Set implement copy-on-write semantics to avoid pervasive payload copies: when they're modified, they check their payload's reference count. If it's 1, then the value is uniquely owned and changes won't be seen anywhere else, so the backing buffer doesn't need to be copied. If it's greater than that, then a new copy is created before it is modified. (Even if you don't like RC, at least Swift gets a bit more mileage off of it: it would be harder to implement nice COW collections without a pre-existing reference count.)
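A minimal sketch of the mechanism (illustrative names, not the standard library's real implementation):

final class Storage {
    var values: [Int]
    init(_ values: [Int]) { self.values = values }
}

struct CowArray {
    private var storage = Storage([])

    mutating func append(_ x: Int) {
        // isKnownUniquelyReferenced reads the reference count maintained by ARC:
        // if the backing storage is shared, copy it before mutating.
        if !isKnownUniquelyReferenced(&storage) {
            storage = Storage(storage.values)
        }
        storage.values.append(x)
    }
}

var a = CowArray()
a.append(1)
let b = a        // a and b share the same Storage instance
a.append(2)      // a copies its storage before mutating, so b is unaffected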

There are also several implementation details that make it more effective. For instance, the last time that you pass a String to a function, Swift can transfer its ownership to the callee, meaning that the reference count is not increased before the call and the caller is now responsible for decreasing it, and increasing the chances that the String is uniquely referenced.

You are correct that it makes the program a tiny bit less useful to use a single character instead of a string, but as you said, there is only one string in this program and it demonstrably fits in a single character. Of course, you're welcome to make all the other versions work with a single character, or wrap a String in a Swift class, if you feel that Swift unfairly benefits. :) I am only making this comment because you claim to measure RC overhead, and you seemingly haven't attempted to break down the memory usage differences. Without any additional information, a reader would most likely reach the incorrect conclusion that the RC overhead is significantly larger than it is, while in fact the memory management system is immaterial to this memory usage difference. That's fitting for a language comparison, but misleading in the comparison of different memory management systems.

Jon Harrop said...

@Felix:

"you seemingly haven't attempted to break down the memory usage differences"

I already tried various things, including something similar, but we need an order-of-magnitude improvement in both Swift and Mathematica to justify the claim "unless you give the GC 3-4x more memory than is needed, you’ll get thrashing and incredibly poor performance", and we only got a 30% improvement, so we're clearly still a long way off. Still, every little helps.

"On a 64-bit machine, the enum type has to be at least 24 bytes just to house the String value, whereas it's at least 8 when you use a UnicodeScalar (the UnicodeScalar itself is 4 bytes, but there are other enum cases that contain a pointer/reference)."

Add, Mul and Pow have two references so surely 16 bytes + a tag (byte?).

"You are correct that it makes the program a tiny bit less useful to use a single character instead of a string, but as you said, there is only one string in this program and it demonstrably fits in a single character. Of course, you're welcome to make all the other versions work with a single character, or wrap a String in a Swift class, if you feel that Swift unfairly benefits. :)"

There is a better solution that I don't think anyone can argue with: indirect the Var enum case. That recovers virtually the same memory consumption (71MB) without sacrificing any functionality.
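Concretely, something like this (the Expr shape here is illustrative):

enum Expr {
    case int(Int)
    indirect case variable(String)   // box the String behind one pointer so it no
                                     // longer inflates every Expr value inline
    indirect case add(Expr, Expr)
    indirect case mul(Expr, Expr)
    indirect case pow(Expr, Expr)
}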

"I am only making this comment because you claim to measure RC overhead"

No, I haven't claimed anything. I'm questioning the validity of this claim by Chris and others that GCs require 3-4x more memory to perform adequately. Where is the research to justify this claim, and why, when I benchmark the memory consumption of some RC'd and tracing GC'd languages, am I seeing the exact opposite behaviour?

I'm not surprised that little things like the string representation in Swift can make a 30% difference but that hardly justifies this claim by Chris.

"while in fact the memory management system is immaterial to this memory usage difference. That's fitting for a language comparison, but misleading in the comparison of different memory management systems."

Hang on! This isn't just a remarkable coincidence. Data representations are always heavily optimised for the GC strategy. OCaml is extremely pointer heavy because its GC handles pointers very efficiently so it boxes a lot. Swift is almost certainly aggressively unboxing because RC makes pointers expensive. Indeed, this is precisely why I want to look at mature language implementations: because they have been carefully optimised in this way.

In fact, the 24-byte representation of the String type in Swift is a variant on the "fat reference" representation that was used in HLVM in 2007, i.e. before Swift was invented. HLVM was designed to provide fast numerics from a high-level OCaml-like language which it achieved by using a very simple and inefficient tracing GC combined with a data representation that was designed to unbox as much as possible in order to minimize the stress on that GC.

I actually really like the way Swift lets you choose where the indirections go and unboxes the rest. However, I think it needs a tracing GC to be competitively efficient. Also, union types and pattern matching are great but TCO really amplifies their usefulness. Shame to have these lovely immutable types only to impose mutation because you must use loops to iterate or you'll leak and stack overflow.
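For example, the kind of iteration I mean, sketched both ways (not the benchmark code):

// Apply f to x, n times. The recursive form ends in a tail call, but Swift does
// not guarantee that it will be eliminated; the loop form runs in constant stack
// space at the cost of mutation.
func nest<T>(_ n: Int, _ f: (T) -> T, _ x: T) -> T {
    return n == 0 ? x : nest(n - 1, f, f(x))
}

func nestLoop<T>(_ n: Int, _ f: (T) -> T, _ x: T) -> T {
    var acc = x
    for _ in 0..<n { acc = f(acc) }
    return acc
}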

Félix said...

> Add, Mul and Pow have two references so surely 16 bytes + a tag (byte?).

Yes, that was an obvious mistake on my end.

> Hang on! This isn't just a remarkable coincidence. Data representations are always heavily optimised for the GC strategy.

I'm glad that we're getting to this point. Let's take the opportunity to mention that OCaml and its GC could not be used as a drop-in replacement for most of the tasks that Swift is used for, because OCaml programs do not support parallel execution within the same process, whereas Swift needs to thrive in the dispatch queue world of the Apple environment. Comparing with F#, Clojure and Go is probably more representative of "what Swift's garbage collection could be".

I also don't think that Swift is as mature as you seem to think: it's not even considered safe to ship a version of the runtime as a shared library and have everything link to it. Swift programs, even on macOS, are expected to include their own version of the Swift runtime because Apple still thinks that too many changes are coming to ensure binary compatibility.

> Shame to have these lovely immutable types only to impose mutation because you must use loops to iterate or you'll leak and stack overflow.

Surely this is a matter of personal preference. In my opinion, iterating with a "nest" function is not more aesthetically pleasing or clearer than using a for loop. In fact, when I forked your gist (https://gist.github.com/zneak/ae33bb970a08632cfb2925e2049f9e7a), removing that was one of the first things that I did. :(

Regardless, I have yet to run in a case where Swift stack overflows or leaks because of missed tail calls.

Jon Harrop said...

> OCaml programs do not support parallel execution within the same process, whereas Swift needs to thrive in the dispatch queue world of the Apple environment. Comparing with F#, Clojure and Go is probably more representative of "what Swift's garbage collection could be".

OCaml is bad for parallelism but good for concurrency. Parallelism is for throughput performance, and OCaml is 5x faster than Swift here, so I imagine you must mean concurrency, in which case it is just a matter of serializing the messages.

> I also don't think that Swift is as mature as you seem to think: it's not even considered safe to ship a version of the runtime as a shared library and have everything link to it. Swift programs, even on macOS, are expected to include their own version of the Swift runtime because Apple still thinks that too many changes are coming to ensure binary compatibility.

Swift is used in production though, right?

> iterating with a "nest" function is not more aesthetically pleasing or clearer than using a for loop

Sure but there are many places where TCO pays off, e.g. continuation passing style.

> Regardless, I have yet to run in a case where Swift stack overflows or leaks because of missed tail calls.

Leaks probably go unnoticed but I'm surprised you don't see stack overflows because the Scala guys sure do.

Félix said...

> OCaml is bad for parallelism but good for concurrency. Parallelism is for throughput performance and OCaml is 5x faster than Swift here so I imagine you must mean concurrency in which case it is just a case of serializing the messages.

No, I do mean parallelism. Swift doesn't auto-parallelize code, but frameworks on Apple platforms heavily rely on independent, parallel task queues. I would encourage you to read about dispatch queues (https://developer.apple.com/library/content/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html) if you care about understanding the context in which Swift is meant to run. I think that you'd like it. Swift 5 will have async/await keywords that turn your code into a CPS form.
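A tiny sketch of that style (the work here is a stand-in, not the benchmark):

import Dispatch

// Fan work out across cores on a concurrent queue, then funnel the partial
// results back through a serial queue.
let resultQueue = DispatchQueue(label: "results")
var total = 0

DispatchQueue.concurrentPerform(iterations: 8) { _ in
    let partial = (0 ..< 1_000_000).reduce(0, +)
    resultQueue.sync { total += partial }
}
print(total)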

I'm also seeing that you tested 3 GCs, and the two that support parallel execution do use 2-3x more memory than OCaml's. That doesn't mean that Swift is doing any better in this specific case, of course.

> Swift is used in production though, right?

If maturity is a boolean attribute mapping to whether it's used in production, then yes, it's "mature" (although no shared Swift library is currently deployed with any operating system).

> Leaks probably go unnoticed but I'm surprised you don't see stack overflows because the Scala guys sure do.

Why don't you throw a piece of code at it? Swift doesn't run on the JVM. TCO is currently hindered by refcount calls as you might imagine, but Swift has ownership semantics called "guaranteed ownership" that allow the compiler to pass a reference-counted object without touching the refcount because it knows that some parent frame has already done it, and "owned to guaranteed" transitions that allow an object "owned" by a caller to be released in a callee. If the compiler doesn't take advantage of it, then it's probably because it's not... mature enough. :)

Jon Harrop said...

> No, I do mean parallelism

If you're interested in parallelism then you must also be interested in the 5x faster serial performance from OCaml, i.e. even with linear speedup you couldn't match it with Swift on a quad core.

> I'm also seeing that you tested 3 GCs, and the two that support parallel execution do use 2-3x more memory than OCaml's. That doesn't mean that Swift is doing any better in this specific case, of course.

Sure but RC'd languages like Swift are supposed to be doing 3-4x better according to its author.

> Why don't you throw a piece of code at it?

Will do.

> Swift doesn't run on the JVM. TCO is currently hindered by refcount calls as you might imagine, but Swift has ownership semantics called "guaranteed ownership" that allow the compiler to pass a reference-counted object without touching the refcount because it knows that some parent frame has already done it, and "owned to guaranteed" transitions that allow an object "owned" by a caller to be released in a callee. If the compiler doesn't take advantage of it, then it's probably because it's not... mature enough. :)

Would be great if Swift optimised that at some point.

Thanks for the tip, BTW! I'll reblog with the latest changes ASAP.

Félix said...

> If you're interested in parallelism then you must also be interested in 5x faster serial performance from OCaml, i.e. even with linear speedup you couldn't match it with Swift on a quadcore.

You should realize that most of Apple's frameworks use dispatch queues for parallel processing, and therefore Swift needs to work in a parallel environment, regardless of its performance characteristics.

I would like to present a second data point. I took your hash table performance test that showed up in the sidebar (https://flyingfrogblog.blogspot.ca/2009/04/f-vs-ocaml-vs-haskell-hash-table.html) and ported it to Swift. I get this using 64-bit integers all around:

$ fsharpc --out:htest-fs.exe -O htest-fs.fs
$ $(which time) -l mono htest-fs.exe
100
0.81 real 0.55 user 0.25 sys
541794304 maximum resident set size

$ ocamlopt -O3 -o htest-ocaml htest.ml
$ $(which time) -l ./htest-ocaml
100
4.38 real 4.17 user 0.19 sys
484954112 maximum resident set size

$ cat htest.swift
let dict = Dictionary(uniqueKeysWithValues: (1...10000000).map { ($0, $0) })
print(dict[100]!)

$ swiftc -O htest.swift -o htest-swift
$ $(which time) -l ./htest-swift
100
1.29 real 1.09 user 0.18 sys
436047872 maximum resident set size

As far as I can tell, you may have stumbled onto a case where Swift is pathologically bad, and you're getting a skewed depiction of the language's performance characteristics from your one sample. Hopefully, with this new one, we can avoid "5x faster serial performance" comments pending further evidence.

> Sure but RC'd languages like Swift are supposed to be doing 3-4x better according to its author.

That is not what Chris said:

> GC also has several *huge* disadvantages that are usually glossed over: while it is true that modern GC's can provide high performance, they can only do that when they are granted *much* more memory than the process is actually using. Generally, unless you give the GC 3-4x more memory than is needed, you’ll get thrashing and incredibly poor performance.

There is no comparison with (or mention of) reference counting mechanisms. The baseline is the memory that the process is actually using.

Regardless, we both agree that this effect is not visible in OCaml for your sample, and I think that it's fair to say that it used essentially the smallest possible amount of memory. However, the two tracing GCs that support parallel execution did end up using ~2x and ~3x that amount. The new sample probably has like 5 objects on the heap and programs almost certainly have different excess dictionary capacity, so I'm not going to try to interpret it, but I will note that Swift is very competitive.

While I can't tell from these samples whether the memory needed by the two .NET implementations actually is tied to allocator performance, and I can't tell whether the alleged factor is proportional to some heap size or object count, I know for sure that Swift is not using more memory to avoid thrashing and incredibly poor performance. The excess is part refcount overhead and part malloc waste, which makes the overhead inversely proportional to the size of each object that you're allocating. It is no surprise that a program that rains tiny boxed objects causes lots of memory overhead in Swift, and yet it still does better memory-wise than Mono.

If there's going to be more of this, I think that testing with more garbage collectors would be the most interesting improvement to the data, and that the second most interesting improvement would be to have more code samples. Since the low overhead and low complexity of the tasks that the OCaml garbage collector needs to accomplish seemingly allows it to reserve little more than the memory that the process is "actually using", I think that using it as a baseline to judge whether another garbage collector is using more memory than necessary is the most interesting thing to do.

Jon Harrop said...

> You should realize that most of Apple's frameworks use dispatch queues for parallel processing, and therefore Swift needs to work in a parallel environment, regardless of its performance characteristics.

Not sure what you mean by a "parallel environment". You can run multiple OCaml processes or use threads in OCaml.

> I would like to present a second data point. I took your hash table performance test that showed up in the sidebar (https://flyingfrogblog.blogspot.ca/2009/04/f-vs-ocaml-vs-haskell-hash-table.html) and ported it to Swift

Ah, good idea. I've actually written a meatier benchmark that uses hash tables here that might be better.

> As far as I can tell, you possibly have stepped on a case where Swift is pathologically bad, and you're getting a skewed depiction of the language's performance characteristics from your one sample. Hopefully, with this new one, we can avoid "5x faster serial performance" comments pending further evidence.

I'll check it out, thanks.

> Regardless, we both agree that this effect is not visible in OCaml for your sample, and I think that it's fair to say that it used essentially the smallest possible amount of memory

Are you sure?

> the two tracing GCs that support parallel execution did end up using ~2x and ~3x that amount

Correlation isn't causation. I suspect that has more to do with them keeping run-time type information than the fact that they support parallel execution.

> I know for sure that Swift is not using more memory to avoid thrashing and incredibly poor performance

I find that statement really bizarre given the fact that you can crank up the parameter to 10 and watch Swift die from out-of-memory while OCaml and F# on .NET Core continue to run just fine. The idea that Swift is dying from out-of-memory whilst trying to avoid "incredibly poor performance" is hardly soothing.

> The excess is part refcount overhead and part malloc waste, which makes overhead inversely proportional to size of each object that you're allocating

And part floating garbage caused by deferred deallocation, which is unrelated to the size of your objects.

> It is no surprise that a program that rains tiny boxed objects causes lots of memory overhead in Swift, and yet it still does better memory-wise than Mono.

Mono uses a conservative GC that leaks by design so I'm not surprised it requires a lot of memory! :-)

Would be interesting to benchmark HLVM too, if I can get it working again...