On Wed, 28 May 2014 06:54:39 GMT, Sönke Ludwig wrote:

On Tue, 27 May 2014 20:48:12 GMT, Atila Neves wrote:

You may remember that last year I wrote an MQTT broker using vibe.d in response to a challenge from a work colleague about his implementation in Go, and that ended up being a fight between implementations in D, Go, C, C++, Java and Erlang. The D/vibe.d combo won the throughput benchmark by a mile but was middle-of-the-road in latency. That always got to me and I never did know why.

Last week at DConf I learned about the perf tool. It seemed pretty cool but I had to use it on something and I ended up looking at the Java and D implementations in terms of latency. Since the Java version had the best score in that benchmark, it was unsurprising to me that perf revealed that the CPU was idle a lot more for the D version than it was for Java.

I dug deeper and, according to perf, most of the time was being spent in pthread_mutex_lock, closely followed by __pthread_mutex_unlock_usercnt. In the former case, the call graph points to calls to FreeListAlloc.elementSize, which is in turn called by AutoFreeListAllocator.free.

The Java version, in contrast, had sys_epoll_ctl at the top and a lower percentage for the top spot.

I think I just found the reason for the performance discrepancy, at least on Linux. Then again, I only ran the original benchmarks on Linux.

Atila

Eeek. That's definitely another reason why I'd always favor the explicit way of locking, as implemented by vibe.core.concurrency:

some_shared_obj.lock().doSomething();

or

auto l = some_shared_obj.lock();
l.doSomething();
l.doSomethingElse();

That way you at least see when you lock, and instead of implicitly locking on every single operation you can aggregate multiple accesses under a single lock.
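A slightly more complete sketch of the pattern (the Counter class here is made up purely for illustration, it's not part of vibe.d):

import vibe.core.concurrency;

// Hypothetical shared object, only used to illustrate the locking pattern.
class Counter {
    private int m_value;
    void increment() { m_value++; }
    void add(int amount) { m_value += amount; }
    int value() const { return m_value; }
}

void updateCounter(shared(Counter) counter)
{
    // Take the lock once and perform several operations under it,
    // instead of paying for a mutex round trip on every single call.
    auto l = counter.lock();
    l.increment();
    l.add(10);
    auto current = l.value();
}

The ScopedLock returned by lock() releases the mutex when l goes out of scope, so how much you aggregate is simply a matter of how long you keep it around.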

Okay, for now I've removed all locks from the allocator hierarchy and instead employed a single lock for all of them (8f8c9f9). The plan for later is to make the allocators thread-local, as far as possible and where it makes sense.
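Roughly, the single-lock idea looks like this (just a sketch to show the shape of it, not the actual code in vibe.utils.memory):

import core.sync.mutex;

// Sketch only: one wrapper guards an existing allocator with a single
// mutex, so the allocators behind it no longer need locks of their own.
interface Allocator {
    void[] alloc(size_t size);
    void free(void[] memory);
}

final class LockAllocator : Allocator {
    private Allocator m_base;
    private Mutex m_mutex;

    this(Allocator base)
    {
        m_base = base;
        m_mutex = new Mutex;
    }

    void[] alloc(size_t size)
    {
        synchronized (m_mutex) return m_base.alloc(size);
    }

    void free(void[] memory)
    {
        synchronized (m_mutex) m_base.free(memory);
    }
}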

Maybe it's time for another performance tuning session for 0.7.21. That could generate some nice D/vibe.d publicity, if it comes out really fast.

I had to add an import std.typetuple to private template iotaTuple(size_t i) in vibe/utils/memory.d to get it to compile with gdc. It's a mystery to me how and why dmd compiles it as it is now. Anyway, it's definitely faster (a speed increase of 7.8%), but still 6% behind Java. Still, that's quite some progress from the original blog post's results of Java being 30% faster!
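For anyone wanting to reproduce the fix: it amounts to a local import inside the template, roughly like this (the actual template body in vibe/utils/memory.d may differ, this is only to show where the import goes):

private template iotaTuple(size_t i)
{
    import std.typetuple; // gdc needs this in scope; dmd compiled without it

    static if (i > 1)
        alias iotaTuple = TypeTuple!(iotaTuple!(i - 1), i - 1);
    else
        alias iotaTuple = TypeTuple!(0);
}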

After I measured the speed I ran perf again. Results:

2.80% pthread_mutex_unlock_usercnt
      89.23% _D4core4sync5mutex5Mutex6unlockMFNeZv
             75.39% _D2gc2gc2GC6mallocMFmkPmZPv
                    59.02% gc_qalloc
                           41.19% _d_arrayappendcTX
                    20.72% lev_unlock_mutex
                           40.30% event_base_loop
                           37.86% event_add
                           12.39% event_del
                            9.33% event_pending
       6.59% lev_free
       4.17% _D4vibe5utils6memory13LockAllocator5allocMFmZAv

2.43% pthread_mutex_lock
      79.73% _D4core4sync5mutex5Mutex4lockMFNeZv
             52.52% _D2gc2gc2GC6mallocMFmkPmZPv
             39.88% lev_lock_mutex
              7.56% _D2gc2gc2GC6extendMFPvmmZm
      14.51% _D4vibe5utils6memory13LockAllocator4freeMFAvZv
       5.76% _D4vibe5utils6memory13LockAllocator5allocMFmZAv

Still a bit of participation from vibe.d code, but now it's mostly the GC. Good thing it's a bank holiday today in Switzerland; it seems I have some performance tuning to do... :)
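The _d_arrayappendcTX entry under GC.malloc points at ~= on dynamic arrays in a hot path; one common way to get rid of that is a reusable preallocated buffer. A minimal sketch of the idea (names and structure made up, not actual broker or vibe.d code):

// Illustration only: one preallocated, reusable buffer per connection,
// so building a packet no longer goes through _d_arrayappendcTX / the GC.
struct PacketBuffer {
    private ubyte[] m_data;
    private size_t m_used;

    this(size_t capacity) { m_data = new ubyte[capacity]; }

    void reset() { m_used = 0; }

    void put(const(ubyte)[] chunk)
    {
        assert(m_used + chunk.length <= m_data.length, "buffer too small");
        m_data[m_used .. m_used + chunk.length] = chunk[];
        m_used += chunk.length;
    }

    const(ubyte)[] data() const { return m_data[0 .. m_used]; }
}

The assert is only there to keep the sketch short; a real implementation would grow the buffer or flush it instead.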

Atila