On Thu, 23 Jul 2015 08:56:05 +0200, Sönke Ludwig wrote:

Yes. The difference is just that manual allocation is nothing more
than taking a pointer from/to a free list, and the collection simply
doesn't happen, so in contrast to the GC situation there will
realistically never be contention. But why do you think that the
synchronization in LockAllocator generally flushes the cache? If there
is no contention, it should basically boil down to a single CAS if the
implementation isn't totally dumb. That's of course still more expensive
than it should be (compared to thread-local allocation), but I simply
don't think we are there yet in terms of the type system when it comes
to a thread-local GC.

For that to work reliably, using shared and immutable correctly at
the point of allocation is vital, but there are a lot of places where
those attributes are (and currently have to be) added/removed after
allocation using a cast. It also requires assistance from the Isolated!T
type to tell the GC to move memory between different thread heaps when
the isolated reference is passed between threads. That, or isolated
values would live on their own heap, or on the shared heap, until they
are made thread-local mutable or immutable.

These are my references:
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
http://www.agner.org/optimize/instruction_tables.pdf

LOCK CMPXCHG became cheaper over time, so it's probably around ~20 cycles now. But the cost of contention seems to be quite high even with fast mutexes (user-space LOCK CMPXCHG): ~1000 ns per context switch, which is ~3000 cycles at 3 GHz, and this is about 3x higher on a virtual machine.

The cost of a system call (e.g. getpid) covers a wide range; estimates seem to suggest >100 cycles, so a typical mutex lock will be >100 cycles unless you're using a futex on Linux (which stays in user space when uncontended).

This discussion also had me worried about the TLB / L1 cache being blown away by a system call (1k-10k cycles to refill).

I don't know what to make of it anymore, but I prefer to avoid multi-threading / system calls when possible :)