On Wed, 28 May 2014 06:54:39 GMT, Sönke Ludwig wrote:
On Tue, 27 May 2014 20:48:12 GMT, Atila Neves wrote:
You may remember that last year I wrote an MQTT broker using vibe.d in response to a challenge from a work colleague about his implementation in Go, and that ended up being a fight between implementations in D, Go, C, C++, Java and Erlang. The D/vibe.d combo won the throughput benchmark by a mile but was middle-of-the-road in latency. That always got to me and I never did know why.
Last week at DConf I learned about the `perf` tool. It seemed pretty cool, but I had to use it on something, and I ended up looking at the Java and D implementations in terms of latency. Since the Java version had the best score in that benchmark, it was unsurprising to me that perf revealed that the CPU was idle a lot more for the D version than it was for Java. I dug deeper and, according to perf, the most time was being spent in `pthread_mutex_lock`, closely followed by `__pthread_mutex_unlock_usercnt`. The call graph points, in the former case, to calls to `FreeListAlloc.elementSize`, which in turn is called by `AutoFreeListAllocator.free`. The Java version, in contrast, had `sys_epoll_ctl` at the top, and a lower percentage for the top spot. I think I just found the reason for the performance discrepancy, at least on Linux. Then again, I only measured the original benchmarks on Linux.
Atila
Eeek. That's definitely another reason why I'd always favor the explicit way to lock, as implemented by `vibe.core.concurrency`:

```d
some_shared_obj.lock().doSomething();
```

or

```d
auto l = some_shared_obj.lock();
l.doSomething();
l.doSomethingElse();
```

That way you at least see when you lock, and rather than implicitly locking on every single operation, you can aggregate multiple accesses under one lock.
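The cost difference between per-operation locking and aggregated locking can be sketched in plain D with `core.sync.mutex` (the `Counter` class and its methods here are purely illustrative, not vibe.d's actual API):

```d
import core.sync.mutex;

// Hypothetical shared counter guarded by a mutex.
final class Counter
{
    private Mutex m;
    private int value;

    this() { m = new Mutex; }

    // Implicit style: every single call takes and releases the lock.
    void increment()
    {
        m.lock(); scope (exit) m.unlock();
        ++value;
    }

    // Aggregated style: one lock/unlock pair covers a whole batch of work.
    void incrementBatch(int n)
    {
        m.lock(); scope (exit) m.unlock();
        foreach (i; 0 .. n)
            ++value; // n operations, a single lock/unlock pair
    }

    int get()
    {
        m.lock(); scope (exit) m.unlock();
        return value;
    }
}

void main()
{
    auto c = new Counter;
    foreach (i; 0 .. 1000)
        c.increment();        // 1000 lock/unlock pairs
    c.incrementBatch(1000);   // 1 lock/unlock pair
    assert(c.get() == 2000);
}
```

With explicit locking as in `vibe.core.concurrency`, the caller decides where the batch boundaries are; with implicit locking, every method call pays the mutex round-trip.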
Okay, for now I've removed all locks from the allocator hierarchy and instead employed a single lock for all of them (8f8c9f9). The plan going forward is to make the allocators thread-local wherever possible, as far as that makes sense.
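A single shared lock guarding an otherwise lock-free allocator hierarchy might look roughly like this; the `Allocator` interface and `LockedAllocator` wrapper are illustrative sketches, not the actual vibe.d code:

```d
import core.sync.mutex;

// Illustrative allocator interface, not vibe.d's actual one.
interface Allocator
{
    void[] alloc(size_t sz);
    void free(void[] mem);
}

// Trivial GC-backed allocator used as the wrapped base.
final class GCAllocator : Allocator
{
    void[] alloc(size_t sz) { return new void[sz]; }
    void free(void[] mem) { } // GC reclaims it
}

// Serializes all access to a wrapped allocator with one shared mutex,
// instead of each allocator in the hierarchy locking individually.
final class LockedAllocator : Allocator
{
    private Allocator base;
    private Mutex m;

    this(Allocator base, Mutex m) { this.base = base; this.m = m; }

    void[] alloc(size_t sz)
    {
        m.lock(); scope (exit) m.unlock();
        return base.alloc(sz);
    }

    void free(void[] mem)
    {
        m.lock(); scope (exit) m.unlock();
        base.free(mem);
    }
}

void main()
{
    auto sharedLock = new Mutex;
    // The same mutex instance can be handed to every allocator in the
    // hierarchy, so the whole tree shares one lock.
    auto a = new LockedAllocator(new GCAllocator, sharedLock);
    auto buf = a.alloc(64);
    assert(buf.length == 64);
    a.free(buf);
}
```

One shared mutex trades contention for simplicity; making the allocators thread-local would remove the lock from the hot path entirely.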
Maybe it's time for another performance tuning session for 0.7.21. That could generate some nice D/vibe.d publicity, if it comes out really fast.
I had to add an `import std.typetuple;` to `private template iotaTuple(size_t i)` in `vibe/utils/memory.d` to get it to compile with gdc. It's a mystery to me how and why dmd compiles it as it is now. Anyway, it's definitely faster (a speed increase of 7.8%), but still 6% behind Java. Still, quite some progress from the original blog post's result of Java being 30% faster!
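For context, a template like `iotaTuple` typically builds a compile-time sequence of integers out of `std.typetuple`'s `TypeTuple`, which is why the import matters. A minimal sketch of the pattern (not the exact vibe.d code):

```d
import std.typetuple; // the import gdc required here

// Sketch of the pattern: iotaTuple!n yields the compile-time
// tuple (0, 1, ..., n-1), usable e.g. for static foreach-style expansion.
private template iotaTuple(size_t i)
{
    static if (i > 1)
        alias iotaTuple = TypeTuple!(iotaTuple!(i - 1), i - 1);
    else
        alias iotaTuple = TypeTuple!(0);
}

void main()
{
    // Expands at compile time to the literal [0, 1, 2].
    static assert([iotaTuple!3] == [0, 1, 2]);
}
```

dmd sometimes resolves such symbols through other transitively imported modules, which would explain why it compiled without the explicit import while gdc did not.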
After I measured the speed I ran perf again. Results:
```
2.80% pthread_mutex_unlock_usercnt
    89.23% _D4core4sync5mutex5Mutex6unlockMFNeZv
    75.39% _D2gc2gc2GC6mallocMFmkPmZPv
    59.02% gc_qalloc
    41.19% _d_arrayappendcTX
    20.72% lev_unlock_mutex
    40.30% event_base_loop
    37.86% event_add
    12.39% event_del
     9.33% event_pending
     6.59% lev_free
     4.17% _D4vibe5utils6memory13LockAllocator5allocMFmZAv
2.43% pthread_mutex_lock
    79.73% _D4core4sync5mutex5Mutex4lockMFNeZv
    52.52% _D2gc2gc2GC6mallocMFmkPmZPv
    39.88% lev_lock_mutex
     7.56% _D2gc2gc2GC6extendMFPvmmZm
    14.51% _D4vibe5utils6memory13LockAllocator4freeMFAvZv
     5.76% _D4vibe5utils6memory13LockAllocator5allocMFmZAv
```
Still a bit of participation from vibe.d code, but now it's mostly the GC. Good thing it's a bank holiday today in Switzerland, it seems I have some performance tuning to do... :)
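Since `_d_arrayappendcTX` (the runtime hook behind the `~=` append operator) shows up under `GC.malloc` in both mutex call chains, one obvious tuning direction is to cut down on GC-allocating appends. A generic sketch of the usual remedies, unrelated to vibe.d's actual code:

```d
import std.array : appender;

void main()
{
    // Naive: each ~= may hit _d_arrayappendcTX and take the GC lock.
    int[] naive;
    foreach (i; 0 .. 1000)
        naive ~= i;

    // Better: reserve capacity up front so appends rarely reallocate.
    int[] reserved;
    reserved.reserve(1000);
    foreach (i; 0 .. 1000)
        reserved ~= i;

    // Or use Appender, which amortizes reallocation internally.
    auto app = appender!(int[]);
    app.reserve(1000);
    foreach (i; 0 .. 1000)
        app.put(i);

    assert(naive == reserved);
    assert(reserved == app.data);
}
```

Reusing preallocated buffers per connection would go further still, since it keeps the hot path out of `GC.malloc` (and hence out of the contended mutex) altogether.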
Atila