[LEDE-DEV] libubox, procd: init process hangs

Mats Karrman mats.dev.list at gmail.com
Wed May 18 05:00:11 PDT 2016



On 2016-05-18 13:01, Felix Fietkau wrote:
> On 2016-05-18 11:38, Mats Karrman wrote:
>>
>> On 2016-05-17 17:31, Mats Karrman wrote:
>>> On 2016-05-17 13:29, Felix Fietkau wrote:
>>>> I just took a look at the code and uloop's processing of signals looked
>>>> a bit racy to me. I've pushed a commit that makes it use signalfd if
>>>> available. I also found that waitpid wasn't being retried on signal
>>>> interrupt, so I added an extra check there. The changes are in libubox
>>>> git, but not in OpenWrt/LEDE yet.
>>>> Please test if this fixes your issue.
>>>>
>>>> Thanks,
>>>>
>>>> - Felix
>>> Tried that but no immediate success, but it might have provided
>>> some additional clues. Now the boot hangs early on *every* boot
>>> but after logging in I found something different in the ps list.
>>> There is a Broadcom utility (smd) that is called from one of the
>>> start scripts (S10environment). Its purpose is to set scheduling
>>> priority and CPU affinity for some of the Broadcom proprietary
>>> processes. The smd program handles fork in a rather ugly way. The
>>> parent only loops until it receives SIGCHLD and then exits without
>>> any wait. With the modified libubox I get a zombie smd child and
>>> sleeping smd parent and S11environment (no other zombie).
>>>
>>> Not sure exactly how this happened but I got to think about
>>> something written in the wait man page:
>>>
>>> """
>>> If  a parent process terminates, then its "zombie" children (if any)
>>> are adopted by init(8), which automatically performs a wait to
>>> remove the zombies.
>>> """
>>>
>>> Is this wait really (unconditionally) implemented in procd or could
>>> that be what I accomplished with the "forced timeout" patch?
>>>
>>> I fixed the ugly fork and got the system to boot once.
>>> Then tried the original libubox with the fixed smd program but
>>> this was not enough to get things working (25 reboots to hang).
>>>
>>> Now I'm running reboot tests with your new libubox and fixed smd...
>> More than 250 reboots without problem :)
>>
>> Clearly the smd program is broken, but still it doesn't feel good that it
>> manages to hang the init process. Considering that timing is involved
>> it's difficult to draw any firm conclusions, but it seems like having
>> uloop's epoll_wait time out occasionally isn't such a bad idea?
> I agree, that definitely needs fixing. What kernel are you using?
It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old...

I also noticed that with your libubox fixes (and my fixed smd) I still
get some zombies, even though the system seems to boot OK all the way
(the corresponding services are defunct, though).
With my epoll_wait timeout fix on the original libubox, this does not
happen.
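
To make the idea concrete, here is a minimal sketch of what "time out
occasionally and reap anyway" could look like. It is not the actual uloop
or procd code and not my patch verbatim; the function names reap_children()
and poll_once_and_reap() are made up for illustration, and event dispatch
is left out. It only assumes the standard Linux epoll_wait(2) and
waitpid(2) calls.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: reap every child that has already exited,
 * without blocking. Returns the number of children reaped. */
static int reap_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            reaped++;           /* one zombie collected, look for more */
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue;           /* interrupted by a signal: retry */
        break;                  /* 0: none exited yet; ECHILD: no children */
    }
    return reaped;
}

/* Hypothetical helper: wait for events for at most timeout_ms, then
 * reap children whether or not a SIGCHLD notification was seen. */
static int poll_once_and_reap(int epfd, int timeout_ms)
{
    struct epoll_event ev[8];
    int n;

    assert(timeout_ms >= 0);    /* a negative timeout would block forever */

    do {
        n = epoll_wait(epfd, ev, 8, timeout_ms);
    } while (n < 0 && errno == EINTR);

    /* dispatching the n returned events is omitted in this sketch */
    (void)n;

    return reap_children();
}
```

The point of the bounded timeout is that a lost or coalesced SIGCHLD can
delay reaping by at most timeout_ms instead of leaving the loop asleep
indefinitely.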

BR // Mats


