Potential issue with libevent2 implementation and ManualEvents

Pages: 1 2

Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

punkUser

Posted Tue, 16 Apr 2013 21:52:53 GMT

So a while back we discussed implementing "broadcast"-like functionality between fiber's, each owning a TCPConnection and I was pointed at the broadcast example that uses Signal's (now called ManualEvents it seems) and rawYield(). While a bit finicky, using these with queues proved to accomplish my goals; namely being able to block a fiber on several different events (TCP read activity or pending sends implemented in a queue in the application).

However I've not got my application up and running on Linux and I'm having some crashing issues that seem like they might be related here. It's a bit hard to tell since the errors seem to happen when the GC decides to clean something up; it reports "corrupt or double freed" memory at that point, but I can't track it back to where such a thing happened. Valgrind is not particularly helpful either... it reports a bunch of irrelevant stuff from the GC's initial setup (apparently expected by the D folks and not a real issue), warns when fibers do various things (switching stacks) but otherwise doesn't give any indications of problems until the crash actually happens.

So on a whim I threw a GC.Collect in my main loop and indeed this seems to trigger the issue with more regularity and now I'm actually able to trigger something similar on Windows as well (I believe the windows GC tends to be lazier and thus I didn't see this issue during development). On windows it manifests as a libevent warning that says something like "[warn] eventdel: event has no eventbase set." or similar. After tracking a bit deeper into the vibe code, I found a potentially worrying warning in the ManualEvent destructor, namely:
"// FIXME: this is illegal (accessing GC memory)"

This seems like it might be related, although it's hard for me to be sure. Does this sound consistent with the evidence to you folks?

More importantly though, I'm left with what to do about it... I'd happily use the Win32 driver but I need this ultimately to run on Linux too and would prefer to test in both places with the same driver. (Aside: Win32 driver is still missing one thing that I need as well, namely the ability to get the remote address of a connection.) I briefly looked at Libev on Linux but even once I got that compiling it spit a bunch of warnings about unimplemented stuff in vibe at me so I got scared off :) Not sure the status or plans for that implementation.

Then I started thinking about how to avoid ManualEvent entirely... I can obviously do some sort of mutex and acquire/release on the TCPConnection itself for sending but that would block fibers that trigger sends to a client who is misbehaving, which is undesirable. I was trying to think whether there was perhaps a way to make the regular case work like this and then "detect" badness and kill the offending client, but it gets a bit messy.

Any suggestions on a way to work around this? I'm really just looking to get the application robust and working at this point ASAP (got some deadlines to hit) so non-ideal workarounds might still be reasonable, although typically such things tend to be less robust...

Thanks in advance!

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

Sönke Ludwig

Posted Wed, 17 Apr 2013 17:20:45 GMT in reply to punkUser

"// FIXME: this is illegal (accessing GC memory)"

This could indeed be a reason for the observed behavior. I've pushed a fix that uses a vibe.utils.hashmap.HashMap together with a manual memory allocator to fix this.

More importantly though, I'm left with what to do about it... I'd happily use the Win32 driver but I need this ultimately to run on Linux too and would prefer to test in both places with the same driver. (Aside: Win32 driver is still missing one thing that I need as well, namely the ability to get the remote address of a connection.) I briefly looked at Libev on Linux but even once I got that compiling it spit a bunch of warnings about unimplemented stuff in vibe at me so I got scared off :) Not sure the status or plans for that implementation.

Yeah, I want to get the libev driver to a complete and stable state at some point to hopefully get better performance than with the libevent driver, but I'm simply missing the time for that (the priority is very low, as it is basically only a performance improvement).

I'll implement the Win32TCPConnection.peerAddress property later today.

Then I started thinking about how to avoid ManualEvent entirely... I can obviously do some sort of mutex and acquire/release on the TCPConnection itself for sending but that would block fibers that trigger sends to a client who is misbehaving, which is undesirable. I was trying to think whether there was perhaps a way to make the regular case work like this and then "detect" badness and kill the offending client, but it gets a bit messy.

Another possibility would now be to use vibe.core.concurrency to send and receive messages and then start two tasks for each connection, one for reading and one for writing - I've just implemented read/write separation for TCP connections after a recent discussion with Vladimir Panteleev, which makes this possible. setMaxMailboxSize can be used to determine what happens if a certain connection stacks up too many messages. I also started a generic MessageHub class today, but I can't say when exactly I'm going to get to finish it.

But to just get it working quickly, hopefully the ManualEvent fix was the right one for this issue...

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

punkUser

Posted Wed, 17 Apr 2013 18:25:16 GMT in reply to Sönke Ludwig

This could indeed be a reason for the observed behavior. I've pushed a fix that uses a vibe.utils.hashmap.HashMap together with a manual memory allocator to fix this.

Cool. In the mean time I thought of a work-around that involved basically pulling the event I was using to trigger fibers to wake "up a level" so it is shared by all of the fibers in a room (i.e. "channel" or similar). This works great for broadcasts but of course needlessly wakes up fibers when the server only needs to send to a single client. Not currently a huge deal in my application (as those fibers fairly quickly just check their TCPConnections and send queues and go back to sleep), but not totally ideal.

The upside though is that it pretty much confirmed that was indeed the issue, as the crashes went completely away when I changed it.

Great to hear that you've put a fix in, as it was a rather difficult issue for me to track down (across two operating systems and GC implementations). Now I just need to decide if I should stick with the current solution that appears to be pretty robust or go back to the slightly more efficient per-fiber event one... hmm :)

Yeah, I want to get the libev driver to a complete and stable state at some point to hopefully get better performance than with the libevent driver, but I'm simply missing the time for that (the priority is very low, as it is basically only a performance improvement).

No rush from me, my application isn't particularly performance sensitive. I was just trying different drivers to try and work around potential issues :)

I'll implement the Win32TCPConnection.peerAddress property later today.

Great, I'll give that a try on Windows then as well. That said, the production home of my application is now on Linux, so while I do develop primarily on Windows, it's probably best for me to stick with libevent in both places to avoid as much dev/production divergence as possible :)

Another possibility would now be to use vibe.core.concurrency to send and receive messages and then start two tasks for each connection, one for reading and one for writing

That's an interesting possibility, and it sounds similar to the system that I had before I switched to vibe in the first place (i.e. some minimal event collection threads and signalling via mailboxes; i.e. one that was sitting on select, another for HTTP, etc). Might consider that in the future, but as I mentioned, I have no real need for true multithreading/parallelism currently, so introducing additional threads may just complicate things.

But to just get it working quickly, hopefully the ManualEvent fix was the right one for this issue...

Indeed, I'll give it a try and then make the call on whether to use the fixed version or my current workaround :)

Thanks for the quick response and fixes!

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

punkUser

Posted Wed, 01 May 2013 17:11:19 GMT in reply to punkUser

Hmm so I recently tried updating to .7.15 and ran into some issues that might be related to this fix, even in the case where I'm not deleting events (i.e. my workaround implementation).

Specifically, I'm seeing access violations and crashes inside "rawYield" and similar functions (and some of them seem related to HashMap, which I believe was the change that is made). Similar to before, if I force GC.collect() frequently it seems to be easy to reproduce, but I haven't had time to hunt more into what is going on.

I can maybe try to come up with a small-ish repro in the next week or two, but since I'm fairly busy in actually finishing this project I've had to revert to .7.14 in the mean time, which works fine if I avoid freeing manual events. If you have any ideas on what I can try in the mean time in .7.15, please let me know.

Thanks!

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

Sönke Ludwig

Posted Fri, 03 May 2013 08:24:31 GMT in reply to punkUser

On Wed, 01 May 2013 17:11:19 GMT, punkUser wrote:

Hmm so I recently tried updating to .7.15 and ran into some issues that might be related to this fix, even in the case where I'm not deleting events (i.e. my workaround implementation).

Specifically, I'm seeing access violations and crashes inside "rawYield" and similar functions (and some of them seem related to HashMap, which I believe was the change that is made). Similar to before, if I force GC.collect() frequently it seems to be easy to reproduce, but I haven't had time to hunt more into what is going on.

I can maybe try to come up with a small-ish repro in the next week or two, but since I'm fairly busy in actually finishing this project I've had to revert to .7.14 in the mean time, which works fine if I avoid freeing manual events. If you have any ideas on what I can try in the mean time in .7.15, please let me know.

Thanks!

A repro case (doesn't necessarily have to be reduced) would be great. But apart from that I'll keep it in mind to do some stress tests with explicit GC.collect() calls.

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

Sönke Ludwig

Posted Mon, 06 May 2013 18:54:49 GMT in reply to punkUser

On Wed, 01 May 2013 17:11:19 GMT, punkUser wrote:

Hmm so I recently tried updating to .7.15 and ran into some issues that might be related to this fix, even in the case where I'm not deleting events (i.e. my workaround implementation).

Specifically, I'm seeing access violations and crashes inside "rawYield" and similar functions (and some of them seem related to HashMap, which I believe was the change that is made). Similar to before, if I force GC.collect() frequently it seems to be easy to reproduce, but I haven't had time to hunt more into what is going on.

I can maybe try to come up with a small-ish repro in the next week or two, but since I'm fairly busy in actually finishing this project I've had to revert to .7.14 in the mean time, which works fine if I avoid freeing manual events. If you have any ideas on what I can try in the mean time in .7.15, please let me know.

Thanks!

Okay, I think I got it. HashMap missed to call GC.addRange() for its table. Since the table contains a bool[Task] map, this could cause access violations, as the map would be collected immediately by the first GC run.

https://github.com/rejectedsoftware/vibe.d/commit/97d87ccdf3c7fdd9aef29032e02c2e26fa60e1ef

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

punkUser

Posted Wed, 08 May 2013 16:47:10 GMT in reply to Sönke Ludwig

Okay, I think I got it. HashMap missed to call GC.addRange() for its table. Since the table contains a bool[Task] map, this could cause access violations, as the map would be collected immediately by the first GC run.

https://github.com/rejectedsoftware/vibe.d/commit/97d87ccdf3c7fdd9aef29032e02c2e26fa60e1ef

Ah, sweet, I will definitely check that out. That sounds consistent with what I was seeing (i.e. dies immediately even without any events being killed). Sorry for the lack of repro in the mean time, the issue was mostly that I didn't have a good way to emulate the client side (i.e. the game) of it in a test case, not so much that it was hard to pare down the server part.

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

punkUser

Posted Sun, 12 May 2013 03:39:23 GMT in reply to punkUser

Hmm just updated to the latest snapshot and now ".acquire/release" won't compile on my ManualEvents... any ideas? Is there now a better way to implement the "wait on multiple objects" for broadcast functionality I'm currently doing with acquire/rawYield?

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

Sönke Ludwig

Posted Sun, 12 May 2013 20:58:34 GMT in reply to punkUser

On Sun, 12 May 2013 03:39:23 GMT, punkUser wrote:

Hmm just updated to the latest snapshot and now ".acquire/release" won't compile on my ManualEvents... any ideas? Is there now a better way to implement the "wait on multiple objects" for broadcast functionality I'm currently doing with acquire/rawYield?

I made this change that removes the explicit ownership handling after some conversation with Vladimir Panteleev. A project that I'm working on finally convinced me that the explicit ownership model can get really involved up to a point where the advantages (mostly better avoidance of concurrecy bugs and the possibility to do low level trickery) start to lose all of their attraction. It also is consistent with how ordinary synchronous I/O works. Just "wait for multiple objects" now needs a different solution than using rawYield (which is a good thing, but this is still unresolved).

But together with this change it has become possible to read and write in parallel from different tasks. So a broadcasting facility could now be implemented similar to this (untested), using the new message passing facilities:

import vibe.core.concurrency;
import vibe.core.core;
import vibe.core.net;

struct Message { /*...*/ }
Message readMessage(InputStream str) { /*...*/ }

struct ConnInfo { Task readTask, writeTask; }
ConnInfo[] conns;

void onConnection(TCPConnection conn)
{
    ConnInfo ci;
    ci.readTask = runTask({
        while (!conn.empty) {
            auto msg = readMessage(conn);
            foreach (c; conns)
                if (c.readTask !is Task.getThis())
                    c.writeTask.send(msg);
        }
        ci.writeTask.interrupt(); // TODO: nicer shutdown
    });
    ci.writeTask = runTask({
        while (conn.connected) {
            receive((Message msg) {
                msg.writeOut(conn);
            });
        }
    });
    conns ~= ci;
}

My apologies for the breakage, it was unfortunately not avoidable in this case - but for the better, I think.

Re: Potential issue with libevent2 implementation and ManualEvents

Permalink: HTTP NNTP

punkUser

Posted Tue, 14 May 2013 00:23:29 GMT in reply to Sönke Ludwig

But together with this change it has become possible to read and write in parallel from different tasks. So a broadcasting facility could now be implemented similar to this (untested), using the new message passing facilities:

Hmm ok, well that was what I was originally doing anyways. I assume this means it's now safe to write to a TcpConnection from a fiber that doesn't own it?

I implemented something similar (didn't bother with the writer task to start with and just write directly to the other TcpConnections where necessary); unfortunately I'm having the issue that it appears that somehow the socket is now getting into a weird state where it doesn't detect disconnects while waiting in blocking functions like read(). Strangely, if I switch it to do a busy wait loop on dataavailablefor_read() around the read() it never even goes inside. But yet stream.connected remains true even when the client application has long since disconnected and even quit... this is all with the libevent2 backend.

Not sure if this is something you may have seen, but any suggestions on how to debug would be helpful.

My apologies for the breakage, it was unfortunately not avoidable in this case - but for the better, I think.

Meh, it's ok. I knew the rawYield thing was a temporary solution at best. I just need an alternative now...

Pages: 1 2