<div dir="ltr"><div>Right.<br><br>So, this email will start with an apology for all the noise, and false alarm.<br><br></div>Having said that, I do owe some explanation.<br><div><br>[ I also found it strange that leaking that socket helped. ]<br>


</div><div></div><div>That socket, is valid though, since it was returned by usock() in ubus_reconnect().<br><br></div><div>What really happened [and threw me off] was that I was calling ubus_complete_deferred_request()  on an un-deferred request.<br>


This happened after adding libcares to our module. And ares_query() has this annoying habit of calling our callback during it's call, if it detects/has any errors during the call.<br></div><div>This is annoying, because it upsets the whole async-ness of things, where we would not expect to have our callback called, before exiting the ares_query() exits.<br>


</div><div><br></div><div>So, we would do:<br></div><div>1) we get a request ubus request<br>

</div><div>2) start async call setup (link FDs to uloop, etc)<br></div><div>3a) if all went well up until now, defer called ubus request; <br></div><div>3b) if we got an error during setup, do not defer ubus request and reply over ubus<br>


</div><div>4) if 3a) then our async callback finishes and completes the deferred request<br></div><div><br></div><div>Normally we would not expect step 4) to happen before step 3), but it's what was happening.<br></div>


<div>

The ubus_context was fine, but during ubus_complete_deferred_request() we would send an invalid request.<br></div><div><br></div><div></div><div>One idea would be to allow the ubus_complete_deferred_request() to execute, only if the deferred flag is set [since it only gets set by the ubus_defer_request() call ).<br>


</div><div>Something like this: <a href="https://github.com/commodo/ubus/commit/9e4a1d76859918ced3bd7f8d5d6b4e8c5aaa7482">https://github.com/commodo/ubus/commit/9e4a1d76859918ced3bd7f8d5d6b4e8c5aaa7482</a><br></div><div>

But that only hides/avoids the problem.<br>

</div><div><br></div><div>Our fix, was to add a uloop_timeout_set( 0 ) in the callback which we give to ares_query() and that would call our real callback, and the order of async things would be restored for all cases (including premature errors).<br>


</div><div><br></div><div>That's all from me.<br></div><div>Apologies again.<br></div><div><br></div><div>Thanks<br>Alex<br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Aug 3, 2014 at 3:15 PM, Felix Fietkau <span dir="ltr"><<a href="mailto:nbd@openwrt.org" target="_blank">nbd@openwrt.org</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">On 2014-08-01 18:01, Alexandru Ardelean wrote:<br>

> Back with some findings.<br>

><br>

> Please check this:<br>

> <a href="https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899" target="_blank">https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899</a><br>

</div>That looks very wrong to me. If that triggers, you're leaking the fd.<br>

<div class=""><br>

> The current problem [high CPU usage] seems to have appeared after some<br>

> modifications we made to the code which were related to a recent<br>

> proposal patch I sent [ that got rejected ].<br>

> Specifically, this one : [PATCH] libubus: handle NULL ubus_context in<br>

> all sync requests by creating a local instance<br>

><br>

> [Some context]<br>

> We modified our code in some parts to allocate a ubus_context object for<br>

> certain sync calls, and free it right after we're done with them [the<br>

> calls].<br>

> This was done in order to avoid them getting mixed with async calls on<br>

> the same ubus_context, and avoid races [ in our logic ].<br>

><br>

> We then got this high CPU usage issue.<br>

><br>

> strace showed something like this [ this is just one cycle; strace was<br>

> spewing this continuously ]:<br>

> ============================================================================<br>

> clock_gettime(CLOCK_MONOTONIC, {46921, 889670571}) = 0<br>

> clock_gettime(CLOCK_MONOTONIC, {46921, 889680749}) = 0<br>

> epoll_wait(4, {{EPOLLIN, {u32=31068896, u64=31068896}}}, 10, -1) = 1<br>

> recvmsg(0, 0x7fff5e7ce638, 0)           = -1 EAGAIN (Resource<br>

> temporarily unavailable)<br>

</div>That looks odd, recvmsg is polling fd 0. Did you close stdin?<br>

<div class=""><br>

> ============================================================================<br>

><br>

> Maybe we were also a bit too aggressive with our ubus_context alloc +<br>

> free approach. We were doing them for each call.<br>

> When we did some modifications to be a bit less aggressive, the high CPU<br>

> usage was not reproduce-able [ for the case we used to reproduce it ].<br>

><br>

> Also, for all our main/static ubus_context objects we add a<br>

> connection_lost callback of our own, which essentially just does a<br>

> ubus_reconnect() call and some setup again.<br>

> We could try to use the ubus_auto_connect() call in the future. We<br>

> implemented the connection_lost callback before this was implemented.<br>

><br>

> Seems that after the ubus_free() call the socket would get closed, which<br>

> would also call our connection_lost callback, which would reconnect.<br>

> This seemed to work fine the first few sometimes [ for our case ].<br>

> It was sometime later [ in our reproducing case 1-2 minutes ] that this<br>

> connection_lost cb + ubus_reconnect() call would cause strace would<br>

> start spewing continuously.<br>

><br>

> If we would have left the ubus_default_connection_lost() callback, it<br>

> would have called uloop_end() and then procd would restart our process,<br>

> and all would be well again for a while.<br>

> But we did not want that.<br>

</div>Are you sure that this is a bug in libubus? If this is indeed the call<br>

from within libubus, then it appears that something may have corrupted<br>

the ubus_context. I think you need to take another look at the code.<br>

<div class=""><br>

> [End Some context]<br>

><br>

> After trying out a few things, and tracing the code, it seemed that not<br>

> closing the socket in ubus_reconnect() helped.<br>

> I will say directly that I am not understanding the libubus code 100%<br>

> yet, and I am trying not to learn libubox yet, but I think that not<br>

> closing the socket if ctx->sock.eof is true, prevents a racy condition<br>

> where we would close some other context's socket.<br>

> Not sure about this, which is why I am asking some opinions.<br>

><br>

> Maybe Felix or someone can point us in a better direction with the<br>

> investigation.<br>

> Still, it makes sense to not close an eof-ed socket, and for our current<br>

> case, I do not see the process taking up all the CPU anymore.<br>

</div>I completely disagree. Why would it make sense to leak the fd if eof is<br>

set? The fact that it somehow prevents your 100% CPU loop could be<br>

something unrelated.<br>

<span class="HOEnZb"><font color="#888888"><br>

- Felix<br>

</font></span></blockquote></div><br></div>