RejectedSoftware Forums



Re: HTTPServerOption.distribute and Cassandra database driver conflict

On Wed, 22 Jul 2015 16:46:22 GMT, Etienne Cimon wrote:

On Wed, 22 Jul 2015 17:30:19 +0200, Sönke Ludwig wrote:

single multi-threaded one because of the garbage collector. The GC is
currently very inefficient in a multi-threaded scenario, so it's very
important to minimize its use for high-performance multi-threaded
applications (use -version=VibeManualMemoryManagement for vibe.d in that

The GC will lock (for all threads) during allocation and during collection; this also happens with manual memory management (though the collection is spared). Even the single-threaded scenario will flush the CPU's L1 cache at every allocation because of this synchronization, probably costing >100 cycles + contention :/

The only real solution is to use an experimental TLS GC that I added to a custom druntime/phobos here:

https://github.com/etcimon/druntime/tree/2.068-custom
https://github.com/etcimon/phobos/tree/2.068-custom

I wonder what the benchmark numbers would be in comparison! I use it because I don't construct an object in one thread and let it get collected in another, so I've never had any problems with it in my vibe.d fork.

Note: I added the "TLSGC" version tag throughout my libs to use thread-local allocations everywhere when I'm on the custom druntime, so that everything is completely lock-less except for concurrency-specific code.
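For illustration, a minimal sketch of that pattern (the helper and its malloc fallback are made up for this example, not code from the fork):

    // Hypothetical sketch: gate the allocation strategy on the "TLSGC"
    // version tag. On the custom druntime the GC heap is thread-local,
    // so a plain GC allocation takes no global lock; on stock druntime,
    // fall back to manual allocation to stay off the locked GC heap.
    import core.stdc.stdlib : malloc;

    ubyte[] allocChunk(size_t n)
    {
        version (TLSGC)
        {
            return new ubyte[n];               // thread-local heap, lock-less
        }
        else
        {
            auto p = cast(ubyte*) malloc(n);   // caller must free manually
            return p is null ? null : p[0 .. n];
        }
    }

Built with -version=TLSGC against the custom druntime, every such call stays on the calling thread's own heap.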

Re: HTTPServerOption.distribute and Cassandra database driver conflict

On 22.07.2015 at 18:46, Etienne Cimon wrote:

On Wed, 22 Jul 2015 17:30:19 +0200, Sönke Ludwig wrote:

single multi-threaded one because of the garbage collector. The GC is
currently very inefficient in a multi-threaded scenario, so it's very
important to minimize its use for high-performance multi-threaded
applications (use -version=VibeManualMemoryManagement for vibe.d in that

The GC will lock (for all threads) during allocation and during collection; this also happens with manual memory management (though the collection is spared). Even the single-threaded scenario will flush the CPU's L1 cache at every allocation because of this synchronization, probably costing >100 cycles + contention :/

The only real solution is to use an experimental TLS GC that I added to a custom druntime/phobos here:

https://github.com/etcimon/druntime/tree/2.068-custom
https://github.com/etcimon/phobos/tree/2.068-custom

I wonder what the benchmark numbers would be in comparison! I use it because I don't construct an object in one thread and let it get collected in another, so I've never had any problems with it in my vibe.d fork.

Yes, the difference is just that manual allocation is nothing more than taking a pointer from/to a free list, and the collection simply doesn't happen, so in contrast to the GC situation there will realistically never be contention. But why do you think that the synchronization in LockAllocator generally flushes the cache? If there is no contention, it should basically boil down to a single CAS if the implementation isn't totally dumb. That's of course still more expensive than it should be (compared with thread-local allocation), but I simply don't think we are there yet in terms of the type system when it comes to a thread-local GC.
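To make the free-list point concrete, here is a minimal thread-local sketch (not vibe.d's actual allocator; the real LockAllocator wraps the shared case in a mutex):

    // Minimal free-list sketch: a thread-local list of fixed-size blocks,
    // so allocation and deallocation are just pointer swaps, no locking.
    import core.stdc.stdlib : malloc;

    struct FreeList(size_t blockSize)
    {
        static assert(blockSize >= (void*).sizeof);
        private static void* head;          // static state is thread-local in D

        static void* get()
        {
            if (head is null)
                return malloc(blockSize);   // slow path: grow the pool
            auto p = head;
            head = *cast(void**) p;         // pop; next pointer lives in the block
            return p;
        }

        static void put(void* p)
        {
            *cast(void**) p = head;         // push the block back onto the list
            head = p;
        }
    }

    // Usage: auto p = FreeList!64.get(); ... FreeList!64.put(p);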

For that to work reliably, using shared and immutable correctly at the point of allocation is vital, but there are a lot of places where those attributes are (and currently have to be) added/removed after allocation using a cast. It also requires the assistance of the Isolated!T type to tell the GC to move memory between different thread heaps when the isolated reference is passed between threads. That, or isolated values would live on their own heap or on the shared heap until they are made thread-local mutable or immutable.
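As a rough sketch of how Isolated!T fits in (based on vibe.core.concurrency as it stood around 0.7.x; the exact names and signatures here are an assumption):

    // The reference returned by makeIsolated is statically known to be
    // unique, which is exactly the property a thread-local GC would need
    // in order to migrate the memory when the value changes threads.
    import vibe.core.concurrency : makeIsolated, send;
    import vibe.core.core : runWorkerTaskH;

    class Job
    {
        int id;
        this(int i) { id = i; }
    }

    void worker() { /* receive and process jobs here */ }

    void example()
    {
        auto job = makeIsolated!Job(42);
        auto task = runWorkerTaskH(&worker);
        task.send(job.move());   // ownership (and, ideally, the heap) moves with it
    }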

Re: HTTPServerOption.distribute and Cassandra database driver conflict

On Thu, 23 Jul 2015 08:56:05 +0200, Sönke Ludwig wrote:

Yes, the difference is just that manual allocation is nothing more than taking a pointer from/to a free list, and the collection simply doesn't happen, so in contrast to the GC situation there will realistically never be contention. But why do you think that the synchronization in LockAllocator generally flushes the cache? If there is no contention, it should basically boil down to a single CAS if the implementation isn't totally dumb. That's of course still more expensive than it should be (compared with thread-local allocation), but I simply don't think we are there yet in terms of the type system when it comes to a thread-local GC.

For that to work reliably, using shared and immutable correctly at the point of allocation is vital, but there are a lot of places where those attributes are (and currently have to be) added/removed after allocation using a cast. It also requires the assistance of the Isolated!T type to tell the GC to move memory between different thread heaps when the isolated reference is passed between threads. That, or isolated values would live on their own heap or on the shared heap until they are made thread-local mutable or immutable.

These are my references:
http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html
http://www.agner.org/optimize/instruction_tables.pdf

LOCK CMPXCHG became cheaper over time, so it's probably around ~20 cycles now. But the cost of contention seems to be quite high even with fast mutexes (user-space LOCK CMPXCHG): ~1000 ns per context switch, which is 2.5k cycles at 3 GHz, and this is 3x higher on a virtual machine.

The cost of a system call (e.g. getpid) covers a wide range, with estimates suggesting >100 cycles, so a typical mutex lock will be >100 cycles unless you're using a futex on Linux.
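For what it's worth, the uncontended case is easy to measure; a rough micro-benchmark sketch (absolute numbers will vary a lot per CPU and VM):

    // Uncontended CAS vs. an uncontended mutex round trip.
    import core.atomic : atomicStore, cas;
    import core.sync.mutex : Mutex;
    import core.time : MonoTime;
    import std.stdio : writefln;

    void main()
    {
        enum N = 1_000_000;
        shared int flag = 0;
        auto m = new Mutex;

        auto t0 = MonoTime.currTime;
        foreach (i; 0 .. N)
        {
            cas(&flag, 0, 1);          // single uncontended CAS
            atomicStore(flag, 0);
        }
        auto t1 = MonoTime.currTime;
        foreach (i; 0 .. N)
        {
            m.lock();                  // uncontended lock/unlock
            m.unlock();
        }
        auto t2 = MonoTime.currTime;

        writefln("CAS:   ~%s ns/op", (t1 - t0).total!"nsecs" / N);
        writefln("mutex: ~%s ns/op", (t2 - t1).total!"nsecs" / N);
    }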

This discussion also had me worried about the TLB / L1 cache being blown by a system call (1k-10k cycles).

I don't know what to make of it anymore, but I prefer to turn down multi-threading / system calls when possible :)

Re: HTTPServerOption.distribute and Cassandra database driver conflict

On Thu, 23 Jul 2015 08:56:05 +0200, Sönke Ludwig wrote:

the isolated reference is passed between threads. That, or isolated
values would live on their own heap or on the shared heap until they are made
thread-local mutable or immutable.

What I would do next is migrate the current druntime GC to a shared GC, like you say.

Allocations to this GC would be made with new shared or new immutable, or TLS data would be copied to the shared GC with .sdup.
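Since .sdup doesn't exist in druntime, here is only a sketch of the proposed semantics, written as a free function:

    // Hypothetical ".sdup": copy thread-local data into storage typed as
    // shared, which the shared GC would then own. Restricted to element
    // types without mutable references to keep the sketch safe.
    import std.traits : hasAliasing;

    shared(T)[] sdup(T)(const(T)[] threadLocal)
        if (!hasAliasing!T)
    {
        auto copy = new T[](threadLocal.length);
        copy[] = threadLocal[];            // deep copy of the elements
        return cast(shared(T)[]) copy;     // hand the copy to the shared world
    }

    // Usage: int[] tls = [1, 2, 3];
    //        shared(int)[] s = tls.sdup();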

I tried using thread-local allocators with a shared GC; it leaked because the GC calls destructors from any thread. Using scoped pools would be the only way to allow a thread-local heap to coexist at the moment.

Re: HTTPServerOption.distribute and Cassandra database driver conflict

On Wed, 22 Jul 2015 17:30:19 +0200, Sönke Ludwig wrote:

On 22.07.2015 at 16:08, Etienne Cimon wrote:

Personally, I don't like listening on multiple threads; it makes things more complicated. I would suggest you send "jobs" to worker tasks instead, using vibe.core.concurrency, and let the main thread deal with request/response.

Ditto, as long as you have payloads that you can reasonably put into
worker tasks. It simplifies the architecture a lot and can also improve
the response latency in the case of CPU-intensive tasks.
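In code, the suggested pattern looks roughly like this (API names per vibe.d 0.7.x; the exact signatures are an assumption):

    import vibe.core.concurrency : receiveOnly, send;
    import vibe.core.core : runWorkerTaskH;
    import vibe.core.task : Task;

    void worker()
    {
        while (true)
        {
            auto jobId = receiveOnly!int();   // blocks this task, not the thread
            // ... do the CPU-heavy work here ...
        }
    }

    void dispatchExample()
    {
        // The main thread keeps serving requests/responses; heavy work
        // is shipped to a worker task instead of listening on N threads.
        Task t = runWorkerTaskH(&worker);
        t.send(42);
    }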

If you have an architecture where the process doesn't hold permanent
state in-memory (but in a database such as Redis), it's also often a
good idea to start multiple single-threaded processes instead of a
single multi-threaded one because of the garbage collector. The GC is
currently very inefficient in a multi-threaded scenario, so it's very
important to minimize its use for high-performance multi-threaded
applications (use -version=VibeManualMemoryManagement for vibe.d in that
case).

Thanks for the great discussion. I'm going through Adam Ruppe's Cookbook to get a better handle on D. I'll likely start up a general database design questions thread at some point.

Re: HTTPServerOption.distribute and Cassandra database driver conflict

On 23.07.2015 at 16:48, Etienne Cimon wrote:

On Thu, 23 Jul 2015 08:56:05 +0200, Sönke Ludwig wrote:

the isolated reference is passed between threads. That, or isolated
values would live on their own heap or on the shared heap until they are made
thread-local mutable or immutable.

What I would do next is migrate the current druntime GC to a shared GC, like you say.

Allocations to this GC would be made with new shared or new immutable, or TLS data would be copied to the shared GC with .sdup.

Forgot to send this reply: just a side note about "sdup" that you may already be aware of - it's important that this is only allowed on types that have no non-shared/non-immutable references, so that it doesn't create references from the shared heap to the thread-local one.
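std.traits already has a check with that meaning; one way the restriction could be enforced (a sketch; a druntime-level sdup would likely do its own, deeper check):

    import std.traits : hasUnsharedAliasing;

    shared(T) sdupValue(T)(T value)
        if (!hasUnsharedAliasing!T)   // no non-shared/non-immutable references
    {
        return cast(shared T) value;  // nothing thread-local can leak through
    }

    struct Plain { int x; }           // fine: no references at all
    struct Leaky { int[] arr; }       // rejected: mutable thread-local reference

    // sdupValue(Plain(1));           // compiles
    // sdupValue(Leaky([1]));         // fails the template constraint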

I tried using thread-local allocators with a shared GC; it leaked because the GC calls destructors from any thread. Using scoped pools would be the only way to allow a thread-local heap to coexist at the moment.

Re: HTTPServerOption.distribute and Cassandra database driver conflict

On Sun, 26 Jul 2015 20:36:35 +0200, Sönke Ludwig wrote:

On 23.07.2015 at 16:48, Etienne Cimon wrote:

On Thu, 23 Jul 2015 08:56:05 +0200, Sönke Ludwig wrote:

the isolated reference is passed between threads. That, or isolated
values would live on their own heap or on the shared heap until they are made
thread-local mutable or immutable.

What I would do next is migrate the current druntime GC to a shared GC, like you say.

Allocations to this GC would be made with new shared or new immutable, or TLS data would be copied to the shared GC with .sdup.

Forgot to send this reply: just a side note about "sdup" that you may already be aware of - it's important that this is only allowed on types that have no non-shared/non-immutable references, so that it doesn't create references from the shared heap to the thread-local one.

I tried using thread-local allocators with a shared GC; it leaked because the GC calls destructors from any thread. Using scoped pools would be the only way to allow a thread-local heap to coexist at the moment.

I actually want to make it produce a recursive deep copy in those cases, but it would have to be possible to overload that manually. It's one of those question marks that are blocking me at the moment.
