On 17.12.2015 at 04:32, Nikolay Tolstokulakov wrote:

I'd still wait a little. It looks like there might be more optimization
opportunities and I'll create a new pre-release version when those have
been investigated. But in any case, at least the latest alpha release
should be used, as that fixes the multi-core scaling issues.

I would like to share my thoughts about vibe.d performance.

I use Linux and git master with the latest multi-core fixes and improvements. I think vibe.d's bottleneck is now in the libevent2 driver, specifically the Libevent2TCPConnection class. I am not an expert in libevent, but I am fairly sure that Libevent2TCPConnection currently uses an expensive and inefficient call sequence. I wrote an alternative implementation, libevent2_tcp.d.

It works only for small requests like the hello-world one from WebFrameworkBenchmark/benchmarks/vibed, but it is roughly 2.5 times faster than the current version. The main idea is simple: read all data from one libevent2 chunk at once and do not use bufferevent_read in the read method. You can take a look at the peek() and read() methods in my implementation. I could not find a correct way to advance reading to the next libevent2 data chunk and to integrate this into vibe.d.

This is great to know. I actually experimented a little with an
implementation that works directly on select/epoll and it was also much
faster. So it seems like the bufferevent API of libevent is inefficient
and we should simply ditch it in favor of our own read buffer.
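
To make the "own read buffer" idea concrete, here is a minimal sketch in Python rather than D. It is not the code from libevent2_tcp.d and not vibe.d's API: the ReadBuffer class, its chunk_size parameter, and the demo() helper are hypothetical names chosen for illustration. The point is only that one recv() call pulls in whatever is available, and peek()/read() are then served from memory.

```python
import socket


class ReadBuffer:
    """Hypothetical connection wrapper that keeps its own read buffer."""

    def __init__(self, sock, chunk_size=64 * 1024):
        self._sock = sock
        self._chunk_size = chunk_size  # assumed upper bound per recv() call
        self._buf = bytearray()        # bytes received but not yet consumed

    def _fill(self):
        # A single recv() takes everything currently available,
        # up to chunk_size bytes, instead of many small reads.
        data = self._sock.recv(self._chunk_size)
        if not data:
            raise EOFError("connection closed by peer")
        self._buf += data

    def peek(self):
        # Return all buffered bytes without consuming them.
        if not self._buf:
            self._fill()
        return bytes(self._buf)

    def read(self, n):
        # Consume exactly n bytes; touch the socket only if the
        # buffer does not already hold enough data.
        while len(self._buf) < n:
            self._fill()
        out = bytes(self._buf[:n])
        del self._buf[:n]
        return out


def demo():
    # Local socket pair standing in for a client and a server connection.
    client, server = socket.socketpair()
    try:
        client.sendall(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")
        conn = ReadBuffer(server)
        first = conn.read(4)   # served after one recv()
        rest = conn.peek()     # served from the buffer, no further recv()
        return first, rest
    finally:
        client.close()
        server.close()
```

Advancing to the next chunk here is simply another _fill() call once the buffer runs dry, which is the part that was hard to express on top of bufferevent.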

Also, I suppose the read method is a problem in itself. I do not think it is important right now, but it takes a ubyte[] argument, which makes a zero-copy approach impossible: I always have to copy data in this method. That may be a problem for high-speed processing with zero-copy solutions like PFQ, DPDK, or Netmap.

This is true regarding the current implementation, and there is also a
discussion about adding a new read overload somewhere. But for the
HTTP request benchmark game with its ~20MB/s per thread it should indeed
not matter.
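
For illustration, the difference between a copying read and a possible zero-copy overload can be sketched as follows, again in Python and not as a proposal for the actual vibe.d signature. ChunkReader, read_into, and read_view are hypothetical names; read_into mirrors a read that fills a caller-supplied buffer, while read_view hands out a slice of the received chunk without copying it.

```python
class ChunkReader:
    """Hypothetical reader over one already-received data chunk."""

    def __init__(self, chunk):
        self._chunk = memoryview(chunk)  # a view, not a copy of the chunk
        self._pos = 0

    def _check(self, n):
        if n > len(self._chunk) - self._pos:
            raise ValueError("not enough data in chunk")

    def read_into(self, dst):
        # Copying read: the caller owns dst, so the bytes must be
        # copied out of the chunk into it.
        n = len(dst)
        self._check(n)
        dst[:] = self._chunk[self._pos:self._pos + n]
        self._pos += n

    def read_view(self, n):
        # Zero-copy read: return a view that still points into the chunk.
        # The chunk must stay alive and unmodified while the view is used.
        self._check(n)
        view = self._chunk[self._pos:self._pos + n]
        self._pos += n
        return view


def demo():
    data = b"hello world"
    reader = ChunkReader(data)
    dst = bytearray(5)
    reader.read_into(dst)       # copies the first five bytes into dst
    view = reader.read_view(6)  # references the remaining bytes in place
    return data, dst, view
```

The price of the second variant is a lifetime constraint: the returned view is only valid as long as the underlying chunk is, which is why such an overload needs a different contract than the existing read.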

Test results for my version:

 wrk -t 4 -d 2s "http://localhost:8081/"
 Running 2s test @ http://localhost:8081/
 4 threads and 10 connections
 Thread Stats   Avg      Stdev     Max   +/- Stdev
 Latency     2.10ms    5.24ms  48.73ms   94.95%
 Req/Sec    56.81k    13.56k   78.04k    65.85%
 463299 requests in 2.10s, 78.32MB read
 Socket errors: connect 0, read 717, write 0, timeout 0
 Non-2xx or 3xx responses: 717
 Requests/sec: 220691.82
 Transfer/sec:     37.31MB

Please note that my version has 717 errors even with small requests, and its average latency is worse than 2 ms (2.10 ms).

git master:

 wrk -t 4 -d 2s "http://localhost:8081/"
 Running 2s test @ http://localhost:8081/
 4 threads and 10 connections
 Thread Stats   Avg      Stdev     Max   +/- Stdev
 Latency   318.67us    1.60ms  24.78ms   97.33%
 Req/Sec    21.92k     2.33k   30.23k    73.49%
 180981 requests in 2.10s, 30.20MB read
 Requests/sec:  86188.69
 Transfer/sec:     14.38MB
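
As a quick sanity check of the two wrk runs above, the ratio and the error share can be computed directly from the reported numbers. The variable names are only for illustration; the values are copied from the output.

```python
# Requests/sec as reported by wrk for the two runs above.
new_rps = 220691.82   # alternative Libevent2TCPConnection implementation
old_rps = 86188.69    # git master

# Totals for the alternative implementation's run.
new_requests = 463299
new_errors = 717      # read errors, equal to the non-2xx/3xx responses

speedup = new_rps / old_rps
error_rate = new_errors / new_requests

print(f"speedup: {speedup:.2f}x")
print(f"error rate: {error_rate:.3%}")
```

This gives a factor of about 2.56 in requests per second, with roughly 0.15% of the requests failing in the faster run.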

What CPU do you have? I'd be interested in how this roughly translates
to the system I tested on for the previous results.