Monday, 17 April 2017

Xavier Leroy's "standard lecture on threads"

Xavier Leroy’s standard lecture on threads post from the caml-list in 2002 seems to have disappeared from the INRIA archives so I am reproducing it here for anyone who is interested. This post is of historical interest because it explains why OCaml has no support for multicore parallelism to this day (twelve years after the release of the first consumer-level multicore CPUs).

2002-11-25 (10:01)
Xavier Leroy <xavier.leroy@i...>
Re: [Caml-list] Why systhreads?
It seems that the annual discussion on threads started again.  Allow
me to deliver again my standard lecture on this topic.

Threads have at least three different purposes:

1- Parallelism on shared-memory multiprocessors.
2- Overlapping I/O and computation (while a thread is blocked on a network
   read, other threads may proceed).
3- Supporting the "coroutine" programming style
   (e.g. if a program has a GUI but performs long computations,
    using threads is a nicer way to structure the program than
    trying to wrap the long computation around the GUI event loop).

The goals of OCaml threads are (2) and (3) but not (1) (for reasons
that I'll get into later), with historical emphasis on (2) due to the
MMM (Web browser) and V6 (HTTP proxy) applications.

Pure user-level scheduling, or equivalently control operators (call/cc),
provide (3) but not (2).

To achieve (2) with a user-level scheduler such as OCaml's bytecode
thread library requires all sorts of hacks, such as non-blocking I/O
and select() under Unix, plus wrapping of all I/O operations so that
they call the user-level scheduler in cases where they are about to
block.  (Otherwise, the whole process would block, and not just the
calling thread.)

Not only this is ugly (read the sources of the bytecode thread library
to get an idea) and inefficient, but it interacts very poorly with
external libraries written in C.  For instance, deep inside the C
implementation of gethostbyname(), there are network reads that can
block; there is no way to wrap these with scheduler calls, short of
rewriting gethostbyname() entirely.

To make things worse, non-blocking I/O is done completely differently
under Unix and under Win32.  I'm not even sure Win32 provides enough
support for async I/O to write a real user-level scheduler.

Another issue with user-level threads, at least in native code, is the
handling of the thread stacks, especially if we wish to have thread
stacks that start small and grow on demand.  It can be done, but is
highly processor- and OS-dependent.  (For instance, stack handling on
the IA64 is, ah, peculiar: there are actually two stacks that grow in
opposite directions within the same memory area...)

One aspect of wisdom is to know when not to do something oneself, but
leave it to others.  Scheduling I/O and computation concurrently, and
managing process stacks, is the job of the operating system.  Trying
to do it entirely in a user-mode program is just not reasonable.
(For another reference point, see Java's move away from "green
threads" and towards system threads.)

What about parallelism on SMP machines?  The main issue here is that
the runtime system, and in particular the garbage collector and memory
manager, must be MP-safe.  This means minimizing global state, and
introducing locking around accesses to shared resources.  If done
naively (e.g. locking at each heap allocation), this can be extremely
costly; it also complicates the runtime system a lot.  Finally,
garbage collection can become a limiting factor if it is done in the
"stop the world" fashion (all threads stop during GC); a concurrent GC
avoids this problem, but adds tremendous complexity.

(Of course, all this SMP support stuff slows down the runtime system
even if there is only one processor, which is the case for almost all
our users...)

All this has been done before in the context of Caml: that was
Damien Doligez's Concurrent Caml Light system, in the early 90s.
Indeed, the incremental major GC that we have in OCaml is a
simplification of Damien's concurrent GC.  If you're interested, have
a look at Damien's publications.

Why was Concurrent Caml Light abandoned?  Too complex; too hard to debug
(despite the existence of a machine-checked proof of correctness);
and dubious practical interest.  Shared-memory multiprocessors have
never really "taken off", at least in the general public.  For large
parallel computations, clusters (distributed-memory systems) are the
norm.  For desktop use, monoprocessors are plenty fast.  Even if you
have a 4-processor SMP machine, it isn't clear whether you should
write your program using shared memory or using message passing -- the
latter is slightly more expensive, but scales to clusters...

What about hyperthreading?  Well, I believe it's the last convulsive
movement of SMP's corpse :-)  We'll see how it goes market-wise.  At
any rate, the speedups announced for hyperthreading in the Pentium 4
are below a factor of 1.5; probably not enough to offset the overhead
of making the OCaml runtime system thread-safe. 

In summary: there is no SMP support in OCaml, and it is very very
unlikely that there will ever be.  If you're into parallelism, better
investigate message-passing interfaces.

- Xavier Leroy
To unsubscribe, mail Archives:
Bug reports: FAQ:
Beginner's list:

Monday, 30 May 2016

Disadvantages of purely functional programming

I have been asked to elaborate on my answer on Quora so here goes:


1.       There is no efficient purely functional unsorted dictionary or set

Since the 1990s the use of dictionaries in software has gone through the roof. Dictionaries are now a stock collection type that every programmer expects to find in their standard library.

Purely functional or persistent data structures such as those found in Okasaki’s fabulous monograph on the subject can be a great tool. In particular, the persistence they offer means you can reuse old versions of collections without having to worry about mutation. In many cases (particularly for some kinds of problems such as logic programming and compiler writing) this can make solutions shorter and clearer, partly because it makes backtracking trivial. However, persistence comes at a great cost in terms of performance: purely functional dictionaries are typically 10x slower than a decent hash table and I have seen them run up to 40x slower. For some applications this is just too slow.

Furthermore, most functional programming languages (OCaml, Haskell, Scala) are incapable of expressing a fast generic mutable hash table because they lack the killer combo of: reified generics, value types and a fast GC write barrier.

BEWARE: people who try to claim that Haskell’s purely functional dictionaries are fast by comparing them with Haskell’s mutable hash tables. The correct conclusion is that Haskell’s mutable hash tables are slow.


2.       There is no purely functional weak hash table.

With a garbage collected imperative language, the relationships between the vertices and edges of a graph can be expressed using weak hash tables. The garbage collector will then collect unreachable subgraphs for you. There is no purely functional weak hash table so, in a pure language, you must write your own garbage collector.

Note that this is a really fringe disadvantage with most developers never having used a weak hash table!


3.       There are no purely functional concurrent collections.

By definition, immutable collections cannot support concurrent mutation. Consequently, if you want a shared mutable collection such as an in-memory database then there is no efficient purely functional solution.


4.       Most graph algorithms look worse and run much slower when written in an FP style.

Purely functional programming is a great tool for some kinds of problems but graph algorithms are one place where I have noticed that pure solutions are often worse both in terms of speed and clarity.

Compare Prim’s algorithm in 12 lines of Python with Prim’s algorithm in 20 lines of Haskell. And why does the Haskell use Prim’s algorithm? Probably because Kruskal’s algorithm is built upon the union-find collection and there is no known efficient purely functional union-find collection.


5.       The inertia of traditional imperative data structures and algorithms is huge.

Beyond graph algorithms, there are many parts of computer science where 65 years of published literature has focused almost entirely upon imperative solutions. Consequently, imperative programmers can easily build upon the backs of giants whereas purely functional programmers are often left starting from scratch. After all, just a few years ago memoization in Haskell was the topic of a PhD thesis!

I once challenged a group of Haskell programmers (several of whom had PhDs in Haskell) to write an efficient generic parallelised quicksort in Haskell and this is what happened.


6.       All existing implementations of functional programming languages, both pure and impure, happen to allocate far too much by design.

Around 1960, McCarthy invented Lisp. The core data structure was the singly-linked list. Each list node was a separate heap allocated block. All modern functional languages evolved from this. In the 1970s, Scheme used essentially the same data representation strategy as Lisp. In the 1980s, SML added a little unboxing with tuples heap allocated as a single block of memory. In the 1990s, OCaml added a little more with unboxed float arrays. Haskell added the ability to unbox some data. But to date no functional programming language has unboxed tuples by default. Even F#, which sits on .NET which provides arbitrary value types, still uses .NET’s boxed tuples. Consequently, all modern functional programming languages incur very high allocation rates for essentially no good reason. Consequently, they all stress their garbage collectors far more than necessary. This is a serious problem not just because it makes serial code slow but because the garbage collector is a shared resource and, therefore, stressing the GC impedes the scalability of parallel programs.

I should note that calling this a “disadvantage” is contentious. Xavier Leroy of OCaml fame regards OCaml’s Lisp-like data representation as a good thing because it is the backbone of OCaml’s excellent performance when running the Coq theorem prover. Here Xavier asserted that “OCaml's strategy is close to optimal for symbolic computing”. Functional languages are often optimised for high performance purely functional collections at the expense of low performance imperative collections. Given that imperative collections are generally faster, this puts an artificially-low ceiling on the performance of almost all functional languages.


7.       Purely functional programming is theoretically good for parallelism but bad for performance in practice, which is the sole purpose of parallelism.

There are two reasons to write parallel programs today. The first is to write objectively fast solutions. The second is to make a slow solution less slow. In most cases, parallel programming in functional languages is a form of the latter. Almost nobody in high performance computing circles (i.e. supercomputers) is running functional code directly. When most functional programmers employ parallel programming today they do so not to attain the best absolute performance but just to improve the performance they have.

Purely functional languages like Haskell are designed to abstract away space and time. This gives you a higher-level perspective of your solution but it makes it very hard to reason about the amount of memory or length of time a Haskell program will require to produce a result. In parallel programming it is always important to make sure that the gain from parallelisation outweighs the administrative overheads of running code in parallel. Haskell makes this very hard. So hard, in fact, that published research on parallel Haskell notoriously cherry picks the degree of parallelisation that maximises performance even though that degree could not be predicted before running the program many times. I have found that straightforward parallelization often yields reliable speedups in languages like C++ but not in Haskell where performance is unpredictable.

BEWARE: People who talk only about scalability and disregard absolute performance. You can improve the scalability of almost any parallel program by redundantly recomputing the Mandelbrot set after each line of code for no reason because most of the time will then be spent in embarrassingly parallel code. Scalability is necessary but insufficient. You must also look at absolute performance.


8.       It took 50 years for normal people to dilute the smug weenies to the point where you can get a useful answer about functional programming on social media.

I’ve been doing functional programming for over 20 years now. For decades there was a social chasm between functional programmers and people who had real problems to solve. Thankfully this problem is now starting to dissolve away with functional languages like Scala, Clojure and F# being used for real work but for many years the predominantly-smug-weenies dominated the functional scene, making it hard for people to get real solutions to their real problems. Part of the reason for this was that, having languished in obscurity for so many decades, some communities (most notably Lisp) had highly evolved (but wrong) arguments as to why Lisp was good. It took me many years to figure out what was wrong with these arguments.


9.       There is a huge amount of misinformation about functional programming in circulation.

If you criticise the performance of hash tables in Haskell (more recently here) then you get pure misinformation from the leading lights of the community such as someone advising people to effectively turn off garbage collection.

For years the functional programming community brandished beautifully short implementations of the Sieve of Eratosthenes and Quicksort algorithms. These were even taught to students for years. Only many years later did it emerge that their solutions did not implement those algorithms. Melissa O’Neill even published a paper correcting the Sieve of Eratosthenes in Haskell. In particular, her genuine sieve requires 100x more code than the original Haskell. Same for quicksort where Haskell’s elegant two-line sort is over 1,000x slower than Sedgewick’s Quicksort in C because the Haskell deep copies lists over and over again, completely blowing the asymptotic IO complexity of Hoare original algorithm.

See also “Why is Haskell used so little in industry?” for a thorough debunking of the Haskell in Industry page.


Saturday, 7 May 2016

ARM code generation quality

I previously looked at x86 code generation quality and found that GCC does some interesting high-level optimisations but, in that case, any performance benefit was lost in poor quality code generation. In this article I’m going to look at ARM assembly instead.


Compiling the Fibonacci function in C from last time we obtain times ranging from 2.4s to 6.6s for fib(40) using –O2 to –O0. Interestingly, using –O3 actually worsens performance over –O2. As before, inspection of the generated assembly language shows that GCC employs nifty high-level optimisations when given the –O2 option.


The simplest implementation of our Fibonacci function in ARM assembly is perhaps:



        cmp     r0, #2

        movlt   pc, lr

        stmfd   sp!, {r1, r2, lr}

        sub     r1, r0, #2

        sub     r0, r0, #1

        bl      fib

        mov     r2, r0

        mov     r0, r1

        bl      fib

        add     r0, r0, r2

        ldmfd   sp!, {r1, r2, pc}


These 11 instructions are almost identical to the output generated by –O1 except we have been more frugal in order to avoid having to save and restore R3. This takes 3.9s to run.


Perhaps the most obvious optimisation is to inline the initial test (if n<2 then n else …) and then skip it when recursing:



        cmp     r0, #2

        bxlt    lr


        stmfd   sp!, {r1, r2, lr}

        sub     r1, r0, #2

        sub     r0, r0, #1

        cmp     r0, #2

        blge    fib2i

        mov     r2, r0

        mov     r0, r1

        cmp     r0, #2

        blge    fib2i

        add     r0, r0, r2

        ldmfd   sp!, {r1, r2, pc}


This immediately reduces the time taken to 1.975, over 20% faster than any of the C solutions. So with very little effort we have written assembly by hand that is both shorter and faster than the assembly generated by GCC.


Let’s take a look at that high-level optimisation that GCC did. With –O2, GCC generates 17 instructions:



        cmp     r0, #1

        stmfd   sp!, {r4, r5, r6, lr}

        mov     r6, r0

        ble     .L4

        mov     r4, r0

        mov     r5, #0


        sub     r0, r4, #1

        bl      fib

        sub     r4, r4, #2

        cmp     r4, #1

        add     r5, r5, r0

        bgt     .L3

        and     r6, r6, #1


        add     r0, r5, r6

        ldmfd   sp!, {r4, r5, r6, pc}


        mov     r5, #0

        b       .L2


This is equivalent to the following:


let rec loop(r4, r5, r6) =

  r5 += fib(r4-1)

  if r4>3 then loop(r4-2, r5, r6) else r5+(r6 & 1)

let fib(n) =

  if n <= 1 then n else

    loop(n, 0, n)


We can rewrite this high-level code in assembly by hand as 15 instructions:



        cmp     r0, #1

        bxle    lr

        stmfd   sp!, {r1, r2, r3, lr}

        mov     r1, r0

        mov     r2, #0

        mov     r3, r0


        sub     r0, r1, #1

        bl      fib3

        add     r2, r2, r0

        cmp     r1, #3

        subgt   r1, r1, #2

        bgt     loop

        and     r3, r3, #1

        add     r0, r2, r3

        ldmfd   sp!, {r1, r2, r3, pc}


Furthermore, whereas the C code took 2.4s this hand-written assembly takes just 1.9s. This is probably because the assembly generated by GCC takes 8 instructions to implement the identity function when n<=1 whereas our solution takes just 2 instructions.


GCC’s choice of high-level optimisation is interesting. Looking at the problem, the most obvious high-level optimisation to me is inlining the recursive calls. This is particularly beneficial because it results to two identical calls to fib(n-3) in the general case and that common subexpression can be factored out. The following assembly does this and runs in just 38ms:



        cmp     r0, #4

        bge     fib5mid

        cmp     r0, #2

        moveq   r0, #1

        movgt   r0, #2

        bx      lr


        stmfd   sp!, {r1, r2, lr}

        mov     r1, r0

        sub     r0, r1, #4

        bl      fib5

        mov     r2, r0

        sub     r0, r1, #3

        bl      fib5

        add     r2, r2, r0

        cmp     r1, #3

        sublt   r0, r1, #1

        addlt   r0, r0, r2

        blt     fib5end

        add     r2, r2, r0

        sub     r0, r1, #2

        bl      fib5

        add     r0, r0, r2


        ldmfd   sp!, {r1, r2, pc}


So it seems the folklore wisdom that it is impossible to beat the assembly generated by a modern C compiler simply isn’t true, at least in this case.


Wednesday, 27 January 2016

Another deleted answer of mine from Stack Overflow:
As I can see, smart pointers are used extensively in many real-world C++ projects.
True but, objectively, the vast majority of code is now written in modern languages with tracing garbage collectors.
Though some kind of smart pointers are obviously beneficial to support RAII and ownership transfers, there is also a trend of using shared pointers by default, as a way of "garbage collection", so that the programmer do not have to think about allocation that much.
That's a bad idea because you still need to worry about cycles.
Why are shared pointers more popular than integrating a proper garbage collector like Boehm GC? (Or do you agree at all, that they are more popular than actual GCs?)
Oh wow, there are so many things wrong with your line of thinking:
  1. Boehm's GC is not a "proper" GC in any sense of the word. It is truly awful. It is conservative so it leaks and is inefficient by design. See:
  2. Shared pointers are, objectively, nowhere near as popular as GC because the vast majority of developers are using GC'd languages now and have no need of shared pointers. Just look at Java and Javascript in the job market compared to C++.
  3. You appear to be restricting consideration to C++ because, I assume, you think that GC is a tangential issue. It isn't (the only way to get a decent GC is to design the language and VM for it from the beginning) so you are introducing selection bias. People who really want proper garbage collection don't stick with C++.
What are the reasons for using reference-counting smart pointers?
You are restricted to C++ but wish you had automatic memory management.

Monday, 23 November 2015

C++ vs C# performance [deleted]

The following answer to a question about C++ vs C# performance on Stack Overflow has sadly been deleted despite having 305 upvotes:


I often heard that people prefer C++ to C# mainly in the performance critical code,because the GC might turn up on critical path, causing the performance penalty.

I have heard that in some circles but never respectable circles.

For example, I consulted for a company in London who were selling stock exchange software that had been written in 1,000,000 lines of C++. Over 40 developers had been working on it for almost 15 years and they were convinced that C++ was the correct solution for such software because latency and throughput performance were both critical. They were achieving latencies as low as 50ms (with a single trader connected!) and throughput as high as 10k trades per second (tps). However, they were struggling to support more than 2,000 traders because they had several threads per trader (no async) and, in fact, traders were reporting latencies as high as six seconds because the latency of their C++ code increased exponentially with the number of traders. I rewrote their code in 3 months using F# on .NET and achieved latencies as low as 0.1ms and throughputs over 200ktps using just 6,000 lines of F#. My solution was fully asynchronous (supported over 10,000 simultaneous trader connections) and fault tolerant.

Now, I'm not saying that C++ could not have been used to achieve even better performance than mine. On the contrary, I'm sure it could have achieved better performance but I also believe it would have taken man-decades of work by real experts and cost millions of pounds. After all, there's a reason why the London Stock Exchange paid £18m for Millenium IT and their low-latency C++ solution. However, I do believe that the vast majority of the people who prematurely optimize away garbage collection don't know what they are talking about and would not be capable of building a good solution in any language. Such people usually only know C++ and have no knowledge of garbage collection algorithms, which is scary because C++ programmers reinvent GC algorithms every day. A good test is to ask them how garbage collection works. If they describe naive mark-sweep circa 1960 then they haven't done their homework.

On the other hand, some people write excellent low-latency and high-throughput code in garbage collected languages. For example, see the LMAX Disruptor (Java) and Rapid Addition FIX engine(C#). So people have written low-latency software in Java and C# and, therefore, it clearly is possible. In particular, the use of arrays of value types is a known but under-appreciated solution for low-latency programming on .NET.

However, when I read through the C++, I realized that C++ offers the smart pointer features in which the programmer did not need to worry about memory management. For example, the shared_ptr with reference counting will manage the memory for us. Hence,we did not really care about the life-time of an object and when did it being deleted. Wouldn't that similar to the C# GC and the destructor of the object would be called at the performance critical code?

Yes. C++ programmers often complain about tracing garbage collectors being non-deterministic and causing pauses. Thread-safe shared_ptr in C++ is non-deterministic because threads race to decrement the count to zero and the winner of the race condition is burdened with calling the destructor. And shared_ptr causes pauses when decrements avalanche, e.g. when a thread releases the last reference to a tree the thread is paused for an unbounded length of time while every destructor in the tree is called. Reference counting can be made incremental by queuing destructors but that recovers the non-determinism of tracing garbage collection. Finally, reference counting with shared_ptr is several times slower than tracing garbage collection because incrementing and decrementing counts is cache unfriendly.

On a related note, C++ programmers often mistakenly claim that shared_ptr collects garbage at the earliest possible point in the program and, therefore, collects more "promptly" than a tracing garbage collector can. In fact, scope-based reference counting like shared_ptr keeps floating garbage around until it falls out of scope which increases register pressure can even increase memory consumption compared to tracing garbage collection.

So shared_ptr is indeed nothing more than a poor man's garbage collector. After all, old JVMs and CLRs both used reference counting at some point in history and both dropped it in favor of better forms of garbage collection. Reference counting is only popular in C++ because there is no easy way to walk the stack and redirect pointers so accurate tracing collection is prohibitively difficult.

Also, another question is if we didn't use smart pointer in C++ and we just resort to raw pointer, we still need to call delete to clear the heap memory. So from my understanding, every object created by C++ or C# would still be destroyed but the difference is only in we manage the memory ourselves in C++ but in C#, we let the GC to manage it. So what is the NET effect of it when comparing C++ and C# since both object still need to be deleted?

In its simplest form, allocation in C++ boils down to calling a general-purpose shared (global) memory allocator like malloc and in C# it boils down to pointer bump allocating into a thread-local nursery generation (gen0). Consequently, ordinary allocation in C# is much faster than ordinary allocation in C++. However, that misrepresents real software. In practice, C++ programmers avoid calling the general purpose global allocator in favor of using thread-local pool allocators whenever possible. On the other hand, C# developers rely on the general purpose memory management solution provided by .NET because it greatly simplifies APIs (memory ownership has been abstracted away) and is more than fast enough in the vast majority of cases. In the few cases where the simple solution is not adequate, the C# developer drops to lower level C# code and writes a pool allocator using an array of value types.

So I'd probably just make two observations:

·       Accurate tracing garbage collection is extremely useful in general and is bundled with C# and prohibitively difficult with C++.

·       Memory management bit tricks (e.g. smuggling bits in pointers) are sometimes possible in C++ but prohibited in C#.

So there is no easy way to compare C++ and C# fairly in this context.

Moreover, memory management is arguably not the biggest performance concern anyway. Many other issues can have a significant effect such as the quality of generated code on obscure architectures (where C compilers are usually much more mature) vs JIT compiling for a specific CPU, vectorization like SIMD (.NET does little), JIT-compiled run-time-generated code (like regular expressions in .NET) vs an interpreter in C++ and compilation to GPUs or FPGAs.

I think the only piece of good advice I can give you here is: do your own research and don't listen to the unwashed masses.



Monday, 24 August 2015

Bjarne Stroustrup is catching up

Bjarne Stroustrup, creator of the C++ programming language, once famously said "There are only two kinds of languages: the ones people complain about and the ones nobody uses". Interestingly, Bjarne has gone on the defensive in his recent lectures, completely changed his tune and is catching up with the conclusions that most former-C++ developers have arrived at.

In a recent lecture Bjarne made many eyebrow-raising assertions. He is happy that people are no longer talking about C++ because that means it has succeeded. In reality, C++ demand in the job market has been in freefall for years and few new software projects are choosing it. He attacked computer scientists for copying data and said that "even babies don't do that", a very strange statement to make in a technical presentation. He also implied that other languages deep copy 10,000x10,000 matrices and claimed that a shared_ptr is "like an old fashioned garbage collector except it gets resource release correct". Perhaps most interestingly the topic of his presentation was OOP without inheritance.

So C++ is moving from templates to the kind of parametric polymorphism ML offered before C++ was invented. Is the backbone of OOP, inheritance, being deprecated? And new features in C++ are closer to first-class functions and garbage collection.

Looking at "modern" C++ makes me angry. I wasted so much time learning all of this incidental complexity that just gets in the way of software development. And I am angry that so many people are still being deceived by this nonsense. Thankfully fewer and fewer people each year but where did we go wrong? How did we let this happen? I think it reflects a serious disconnect between academic and industry.

"Premature optimization is the root of all evil" considered harmful

Computer science is coming full circle on performance. For decades, people worried intensely about performance and squeezed every ounce of speed they could from their code. But today the story is changing. The growing popularity of functional programming in the mainstream is encouraging people to think at a higher-level of abstraction. These people are often found reciting Knuth's famous quote "premature optimization is the root of all evil". Unfortunately the extremists among them are taking this too far and architecting systems with no regard for performance. The only fix is then tantamount to completely redesigning and reimplementing the entire system.

We can only conclude that this extremist form of "premature optimization is the root of all evil" must be considered harmful. Joe Duffy of Microsoft already expressed a similar opinion.