<div dir="ltr"><div>Right.<br><br>So, this email will start with an apology for all the noise, and false alarm.<br><br></div>Having said that, I do owe some explanation.<br><div><br>[ I also found it strange that leaking that socket helped. ]<br>
</div><div></div><div>That socket, is valid though, since it was returned by usock() in ubus_reconnect().<br><br></div><div>What really happened [and threw me off] was that I was calling ubus_complete_deferred_request() on an un-deferred request.<br>
This happened after adding libcares to our module. And ares_query() has this annoying habit of calling our callback during it's call, if it detects/has any errors during the call.<br></div><div>This is annoying, because it upsets the whole async-ness of things, where we would not expect to have our callback called, before exiting the ares_query() exits.<br>
So, we would do the following (sketched in code right after this list):

1) we get a ubus request
2) we start the async call setup (link FDs to uloop, etc.)
3a) if all went well up to this point, we defer the ubus request
3b) if we got an error during setup, we do not defer the ubus request and reply over ubus immediately
4) if 3a), our async callback eventually finishes and completes the deferred request

Normally we would not expect step 4) to happen before step 3), but that is exactly what was happening.
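For reference, here is a minimal sketch of that flow as a ubus method handler.
The names (lookup_state, start_async_lookup, lookup_done) are made up for
illustration and are not our actual code:

============================================================================
#include <stdlib.h>
#include <libubus.h>

/* Hypothetical per-request state, kept alive for the whole async call. */
struct lookup_state {
	struct ubus_context *ctx;
	struct ubus_request_data req;	/* storage for the deferred request */
};

/* Stand-in for the real c-ares / uloop setup; returns < 0 on error. */
static int start_async_lookup(struct lookup_state *s, struct blob_attr *msg);

/* 1) ubus invokes this handler when a request comes in */
static int lookup_method(struct ubus_context *ctx, struct ubus_object *obj,
			 struct ubus_request_data *req, const char *method,
			 struct blob_attr *msg)
{
	struct lookup_state *s = calloc(1, sizeof(*s));

	if (!s)
		return UBUS_STATUS_UNKNOWN_ERROR;

	s->ctx = ctx;

	/* 2) start the async call setup (link FDs to uloop, etc.) */
	if (start_async_lookup(s, msg) < 0) {
		/* 3b) error during setup: reply over ubus right away, no defer */
		free(s);
		return UBUS_STATUS_UNKNOWN_ERROR;
	}

	/* 3a) setup went well: defer the ubus request */
	ubus_defer_request(ctx, req, &s->req);
	return 0;
}

/* 4) the async callback eventually lands here and completes the request */
static void lookup_done(struct lookup_state *s, int status)
{
	ubus_complete_deferred_request(s->ctx, &s->req, status);
	free(s);
}
============================================================================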
The ubus_context was fine, but ubus_complete_deferred_request() would end up
sending a reply for an invalid (never deferred) request.

One idea would be to allow ubus_complete_deferred_request() to do its work
only if the deferred flag is set [ since that flag only gets set by the
ubus_defer_request() call ].
Something like this: https://github.com/commodo/ubus/commit/9e4a1d76859918ced3bd7f8d5d6b4e8c5aaa7482
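I won't paste the commit itself here. The same kind of guard can also live
entirely on the application side, which is what this rough sketch shows
(reusing the hypothetical lookup_state from the sketch above, plus a flag of
our own), whereas the linked commit puts the check inside libubus:

============================================================================
#include <stdbool.h>
#include <stdlib.h>
#include <libubus.h>

/* Hypothetical state as in the earlier sketch, plus a flag that is set
 * only right after ubus_defer_request() has actually been called. */
struct lookup_state {
	struct ubus_context *ctx;
	struct ubus_request_data req;
	bool deferred;
};

static void lookup_done(struct lookup_state *s, int status)
{
	/* Only complete the request if it really was deferred; otherwise
	 * the error path already replied over ubus in step 3b). */
	if (s->deferred)
		ubus_complete_deferred_request(s->ctx, &s->req, status);
	free(s);
}
============================================================================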
But that only hides/avoids the problem.
Our fix was to have the callback we pass to ares_query() do nothing more than
arm a zero-delay uloop timeout (uloop_timeout_set() with 0 ms); the timeout
handler then calls our real callback, and the order of async things is
restored for all cases (including premature errors).
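Shape-wise it is roughly the following; a sketch that assumes a hypothetical
query_state struct and leaves out the actual answer parsing and ubus
completion:

============================================================================
#include <stdlib.h>
#include <string.h>
#include <ares.h>
#include <libubox/uloop.h>
#include <libubox/utils.h>	/* container_of */

/* Hypothetical per-query state; the real code carries more than this. */
struct query_state {
	struct uloop_timeout defer;	/* zero-delay timer used as a trampoline */
	int status;
	unsigned char *abuf;		/* private copy of the answer, if any */
	int alen;
};

/* The real completion logic runs here, always from the uloop and never
 * from inside ares_query() itself. */
static void query_complete(struct uloop_timeout *t)
{
	struct query_state *q = container_of(t, struct query_state, defer);

	/* ... parse q->abuf / q->status, complete the deferred ubus request ... */

	free(q->abuf);
	free(q);
}

/* Callback handed to ares_query(). Even if c-ares calls it synchronously
 * because of an early error, all it does is stash the result and arm a
 * 0 ms uloop timeout, so the async ordering is preserved in every case. */
static void query_cb(void *arg, int status, int timeouts,
		     unsigned char *abuf, int alen)
{
	struct query_state *q = arg;

	q->status = status;
	if (abuf && alen > 0) {
		q->abuf = malloc(alen);	/* c-ares owns abuf only until we return */
		if (q->abuf) {
			memcpy(q->abuf, abuf, alen);
			q->alen = alen;
		}
	}

	q->defer.cb = query_complete;
	uloop_timeout_set(&q->defer, 0);
}
============================================================================

The query_state pointer is the arg we pass to ares_query(), and it has to
stay valid until query_complete() has run.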
That's all from me.
Apologies again.

Thanks
Alex

On Sun, Aug 3, 2014 at 3:15 PM, Felix Fietkau <nbd@openwrt.org> wrote:
> On 2014-08-01 18:01, Alexandru Ardelean wrote:
> > Back with some findings.
> >
> > Please check this:
> > https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899
> That looks very wrong to me. If that triggers, you're leaking the fd.
>
> > The current problem [high CPU usage] seems to have appeared after some
> > modifications we made to the code which were related to a recent
> > proposal patch I sent [ that got rejected ].
> > Specifically, this one : [PATCH] libubus: handle NULL ubus_context in
> > all sync requests by creating a local instance
> >
> > [Some context]
> > We modified our code in some parts to allocate a ubus_context object for
> > certain sync calls, and free it right after we're done with them [the
> > calls].
> > This was done in order to avoid them getting mixed with async calls on
> > the same ubus_context, and avoid races [ in our logic ].
> >
> > We then got this high CPU usage issue.
> >
> > strace showed something like this [ this is just one cycle; strace was
> > spewing this continuously ]:
> > ============================================================================
> > clock_gettime(CLOCK_MONOTONIC, {46921, 889670571}) = 0
> > clock_gettime(CLOCK_MONOTONIC, {46921, 889680749}) = 0
> > epoll_wait(4, {{EPOLLIN, {u32=31068896, u64=31068896}}}, 10, -1) = 1
> > recvmsg(0, 0x7fff5e7ce638, 0) = -1 EAGAIN (Resource
> > temporarily unavailable)
> That looks odd, recvmsg is polling fd 0. Did you close stdin?
>
> > ============================================================================
> >
> > Maybe we were also a bit too aggressive with our ubus_context alloc +
> > free approach. We were doing them for each call.
> > When we did some modifications to be a bit less aggressive, the high CPU
> > usage was not reproducible [ for the case we used to reproduce it ].
> >
> > Also, for all our main/static ubus_context objects we add a
> > connection_lost callback of our own, which essentially just does a
> > ubus_reconnect() call and some setup again.
> > We could try to use the ubus_auto_connect() call in the future. We
> > implemented the connection_lost callback before this was available.
> >
> > Seems that after the ubus_free() call the socket would get closed, which
> > would also call our connection_lost callback, which would reconnect.
> > This seemed to work fine the first few times [ for our case ].
> > It was sometime later [ in our reproducing case 1-2 minutes ] that this
> > connection_lost cb + ubus_reconnect() call would cause strace to
> > start spewing continuously.
> >
> > If we had left the ubus_default_connection_lost() callback, it
> > would have called uloop_end() and then procd would restart our process,
> > and all would be well again for a while.
> > But we did not want that.
> Are you sure that this is a bug in libubus? If this is indeed the call
> from within libubus, then it appears that something may have corrupted
> the ubus_context. I think you need to take another look at the code.
>
> > [End Some context]
> >
> > After trying out a few things, and tracing the code, it seemed that not
> > closing the socket in ubus_reconnect() helped.
> > I will say directly that I do not understand the libubus code 100%
> > yet, and I am trying not to learn libubox yet, but I think that not
> > closing the socket if ctx->sock.eof is true prevents a race condition
> > where we would close some other context's socket.
> > Not sure about this, which is why I am asking for some opinions.
> >
> > Maybe Felix or someone can point us in a better direction with the
> > investigation.
> > Still, it makes sense to not close an eof-ed socket, and for our current
> > case, I do not see the process taking up all the CPU anymore.
> I completely disagree. Why would it make sense to leak the fd if eof is
> set? The fact that it somehow prevents your 100% CPU loop could be
> something unrelated.
>
> - Felix