<div dir="ltr"><div><div>Back with some findings.<br><br>Please check this: <a href="https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899" target="_blank">https://github.com/commodo/ubus/commit/c601505a6b33aa208e1a3492d3ade5ae2d853899</a><br>
The current problem [high CPU usage] seems to have appeared after some modifications we made to our code, related to a recent patch proposal I sent [which got rejected].
Specifically, this one: [PATCH] libubus: handle NULL ubus_context in all sync requests by creating a local instance
[Some context]
We modified parts of our code to allocate a ubus_context object for certain sync calls and to free it right after those calls are done.
This was done to avoid the sync calls getting mixed with async calls on the same ubus_context, and to avoid races [in our logic]. Roughly, the pattern looks like the sketch below.
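For illustration, a minimal sketch of that pattern (not our actual code; the function name, the object path/method arguments and the 3 second timeout are placeholders):

============================================================================
#include <libubus.h>

/* Sketch only: run one sync request on a private, throw-away ubus_context
 * so it never shares a context (and its socket) with our async calls. */
static int sync_call_isolated(const char *obj_path, const char *method,
			      struct blob_attr *msg,
			      ubus_data_handler_t data_cb, void *priv)
{
	struct ubus_context *ctx;
	uint32_t id;
	int ret;

	ctx = ubus_connect(NULL);	/* fresh context just for this call */
	if (!ctx)
		return UBUS_STATUS_CONNECTION_FAILED;

	ret = ubus_lookup_id(ctx, obj_path, &id);
	if (!ret)
		ret = ubus_invoke(ctx, id, method, msg, data_cb, priv, 3000);

	ubus_free(ctx);			/* done: free the context, its socket gets closed */
	return ret;
}
============================================================================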
We then got this high CPU usage issue.

strace showed something like this [this is just one cycle; strace was spewing it continuously]:
============================================================================
clock_gettime(CLOCK_MONOTONIC, {46921, 889670571}) = 0
clock_gettime(CLOCK_MONOTONIC, {46921, 889680749}) = 0
epoll_wait(4, {{EPOLLIN, {u32=31068896, u64=31068896}}}, 10, -1) = 1
recvmsg(0, 0x7fff5e7ce638, 0) = -1 EAGAIN (Resource temporarily unavailable)
============================================================================

Maybe we were also a bit too aggressive with our ubus_context alloc + free approach: we were doing the alloc + free for every single call.
When we changed things to be a bit less aggressive, the high CPU usage was no longer reproducible [for the case we used to reproduce it].

Also, for all our main/static ubus_context objects we register a connection_lost callback of our own, which essentially just does a ubus_reconnect() call and some setup again.
We could try to use the ubus_auto_connect() call in the future; we implemented our connection_lost callback before that call existed.

It seems that after the ubus_free() call the socket gets closed, which also triggers our connection_lost callback, which then reconnects.
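Simplified, our callback looks roughly like this (my_setup_after_reconnect() is just a placeholder for our re-registration code, and the retry loop is simplified):

============================================================================
#include <unistd.h>
#include <libubus.h>

static void my_setup_after_reconnect(struct ubus_context *ctx)
{
	/* placeholder: re-add our objects, subscriptions, etc. */
}

/* Simplified version of the handler we install on our main/static contexts
 * via ctx->connection_lost = my_connection_lost; */
static void my_connection_lost(struct ubus_context *ctx)
{
	while (ubus_reconnect(ctx, NULL) != 0)
		sleep(1);		/* retry until ubusd is reachable again */

	ubus_add_uloop(ctx);		/* register the new fd with uloop */
	my_setup_after_reconnect(ctx);
}
============================================================================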
This seemed to work fine the first few times [for our case].
It was some time later [1-2 minutes, in our reproducing case] that this connection_lost callback + ubus_reconnect() sequence would make strace start spewing the above continuously.

If we had left the default ubus_default_connection_lost() callback in place, it would have called uloop_end(), procd would have restarted our process, and all would have been well again for a while.
But we did not want that.
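[As far as we can tell, the default handler essentially boils down to this; paraphrased, not the exact library code:]

============================================================================
#include <libubus.h>

/* Paraphrased idea of ubus_default_connection_lost(): just stop the event
 * loop, so the process exits and procd respawns it. */
static void default_connection_lost(struct ubus_context *ctx)
{
	uloop_end();
}
============================================================================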
[End Some context]

After trying out a few things and tracing the code, it seemed that not closing the socket in ubus_reconnect() helped.
I will say directly that I do not understand the libubus code 100% yet, and I am holding off on digging into libubox for now, but I think that not closing the socket when ctx->sock.eof is true prevents a race where we would close some other context's socket.
Not sure about this, which is why I am asking for opinions.

Maybe Felix or someone can point us in a better direction with the investigation.
Still, it makes sense not to close an eof-ed socket, and for our current case I no longer see the process taking up all the CPU. The idea behind the change is sketched below.
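To make the idea concrete, this is roughly what the change amounts to (paraphrased from the commit linked above, not the exact diff; the helper name is mine):

============================================================================
#include <unistd.h>
#include <libubus.h>

/* Sketch of the idea: when tearing down the old socket before reconnecting,
 * skip the close() if uloop already marked the socket EOF, so we cannot
 * close an fd number that may meanwhile belong to another context. */
static void drop_old_socket(struct ubus_context *ctx)
{
	if (ctx->sock.fd < 0)
		return;

	if (ctx->sock.registered)
		uloop_fd_delete(&ctx->sock);

	if (!ctx->sock.eof)
		close(ctx->sock.fd);
}
============================================================================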
Thanks
Alex
</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Jul 29, 2014 at 4:41 PM, Alexandru Ardelean <span dir="ltr"><<a href="mailto:ardeleanalex@gmail.com" target="_blank">ardeleanalex@gmail.com</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>I'll see what I can find.<br><br></div>At the time I created this patch I had it reproducing more often, but not often enough to trace a cause.<br>
<div class="gmail_extra">The patch seems to help, but [as you've said] the error seems like something else.<br>
</div><div class="gmail_extra"></div><div class="gmail_extra">Seems to be reproducing again, after a while, and in a certain configuration.<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">I'll check it out.<br>
</div><div class="gmail_extra">This high CPU usage thing got to annoy me enough that I gave it a bit more attention.<br></div><div class="gmail_extra"><br></div><div class="gmail_extra">Thanks<br>Alex<br></div><div><div class="h5">
<div class="gmail_extra">
<br><div class="gmail_quote">On Tue, Jul 29, 2014 at 4:30 PM, Felix Fietkau <span dir="ltr"><<a href="mailto:nbd@openwrt.org" target="_blank">nbd@openwrt.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
>> On 2014-07-29 15:23, Alexandru Ardelean wrote:
>> > Quick info follow-up: strace-ing on the PIDs (with high CPU usage)
>> > shows a lot of EAGAIN errors from recvfrom() and a lot of recvfrom() +
>> > poll() + clock_gettime_monotonic() calls.
>> >
>> > Thinking about it, would it make more sense to add a short wait instead
>> > [to reduce the throttle] and let it "infinitely" loop ? ["infinitely"
>> > loop == loop until it does not error with EAGAIN]
>> I think it's much more likely that this is a bug, caused by errors on
>> the ubus socket not being caught or handled properly.
>> If that is the case, then both approaches are wrong, and we should just
>> handle the error correctly.
>> Can you figure out what triggers this behavior?
>>
>> - Felix