[LEDE-DEV] libubox, procd: init process hangs

Mats Karrman mats.dev.list at gmail.com
Wed May 18 02:38:15 PDT 2016



On 2016-05-17 17:31, Mats Karrman wrote:
>
> On 2016-05-17 13:29, Felix Fietkau wrote:
>> I just took a look at the code and uloop's processing of signals looked
>> a bit racy to me. I've pushed a commit that makes it use signalfd if
>> available. I also found that waitpid wasn't being retried on signal
>> interrupt, so I added an extra check there. The changes are in libubox
>> git, but not in OpenWrt/LEDE yet.
>> Please test if this fixes your issue.
>>
>> Thanks,
>>
>> - Felix
> Tried that but no immediate success, but it might have provided
> some additional clues. Now the boot hangs early on *every* boot
> but after logging in I found something different in the ps list.
> There is a Broadcom utility (smd) that is called from one of the
> start scripts (S10environment). Its purpose is to set scheduling
> priority and CPU affinity for some of the Broadcom proprietary
> processes. The smd program handles fork in a rather ugly way. The
> parent only loops until it receives SIGCHLD and then exits without
> any wait. With the modified libubox I get a zombie smd child and
> sleeping smd parent and S11environment (no other zombie).
>
> Not sure exactly how this happened, but it got me thinking about
> something written in the wait man page:
>
> """
> If  a parent process terminates, then its "zombie" children (if any)
> are adopted by init(8), which automatically performs a wait to
> remove the zombies.
> """
>
> Is this wait really (unconditionally) implemented in procd or could
> that be what I accomplished with the "forced timeout" patch?
>
> I fixed the ugly fork and got the system to boot once.
> Then tried the original libubox with the fixed smd program but
> this was not enough to get things working (25 reboots to hang).
>
> Now I'm running reboot tests with your new libubox and fixed smd...
More than 250 reboots without problem :)
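
For reference, a minimal sketch of the kind of fork handling meant by
"fixed" above: the parent reaps its child with waitpid, retrying if a
signal interrupts the call, instead of spinning until SIGCHLD and exiting.
This is not the actual smd code (which is proprietary); run_and_reap,
wait_for_child and child_work are hypothetical names used only for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the work the child does
 * (e.g. setting priority/affinity); returns its exit code. */
int child_work(void)
{
    return 7;
}

/* Wait for one specific child, retrying when a signal interrupts waitpid.
 * Returns the child's exit code, or -1 on error or abnormal termination. */
int wait_for_child(pid_t pid)
{
    int status;
    pid_t ret;

    assert(pid > 0);
    do {
        ret = waitpid(pid, &status, 0);
    } while (ret == -1 && errno == EINTR);

    if (ret == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Fork, run child_fn in the child, and reap the child in the parent
 * so that no zombie is left behind when the parent exits. */
int run_and_reap(int (*child_fn)(void))
{
    pid_t pid = fork();

    if (pid == -1)
        return -1;
    if (pid == 0)
        _exit(child_fn());   /* child: never returns into the parent's code */
    return wait_for_child(pid);
}
```

Calling run_and_reap(child_work) from the parent returns only after the
child has been collected, so the parent cannot exit with an unreaped child.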

Clearly the smd program is broken, but it still doesn't feel good that it
manages to hang the init process. Since timing is involved, it's difficult
to draw any firm conclusions, but it seems like having uloop's epoll_wait
time out occasionally isn't such a bad idea?
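
To illustrate the idea, here is a rough sketch of a loop iteration that
gives epoll_wait a finite timeout and then sweeps for exited children
whether or not any event arrived. This is not uloop or procd code;
loop_once and reap_children are hypothetical names, and a real event loop
would also dispatch the returned events.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child that has already exited, without blocking.
 * Returns the number of children reaped. */
int reap_children(void)
{
    int reaped = 0;

    for (;;) {
        pid_t pid = waitpid(-1, NULL, WNOHANG);

        if (pid > 0) {
            reaped++;
            continue;
        }
        if (pid == -1 && errno == EINTR)
            continue;
        /* 0: children exist but none has exited yet;
         * -1 with ECHILD: there are no children at all. */
        break;
    }
    return reaped;
}

/* One loop iteration: wait at most timeout_ms for events, then sweep for
 * zombies even if nothing woke us up. Returns the number of children
 * reaped, or -1 if epoll_wait failed for a reason other than EINTR. */
int loop_once(int epfd, int timeout_ms)
{
    struct epoll_event events[8];
    int n = epoll_wait(epfd, events, 8, timeout_ms);

    if (n == -1 && errno != EINTR)
        return -1;
    /* A real loop would dispatch events[0..n-1] here. */
    return reap_children();
}
```

With a finite timeout, a missed SIGCHLD delays reaping by at most
timeout_ms instead of leaving the loop blocked indefinitely.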

// Mats



