[LEDE-DEV] libubox, procd: init process hangs

Mats Karrman mats.dev.list at gmail.com
Thu May 19 00:16:09 PDT 2016



On 2016-05-18 15:09, Mats Karrman wrote:
>
>
> On 2016-05-18 14:03, Felix Fietkau wrote:
>> On 2016-05-18 14:00, Mats Karrman wrote:
>>>
>>> On 2016-05-18 13:01, Felix Fietkau wrote:
>>>> On 2016-05-18 11:38, Mats Karrman wrote:
>>>>> On 2016-05-17 17:31, Mats Karrman wrote:
>>>>>> On 2016-05-17 13:29, Felix Fietkau wrote:
>>>>>>> I just took a look at the code and uloop's processing of signals 
>>>>>>> looked
>>>>>>> a bit racy to me. I've pushed a commit that makes it use 
>>>>>>> signalfd if
>>>>>>> available. I also found that waitpid wasn't being retried on signal
>>>>>>> interrupt, so I added an extra check there. The changes are in 
>>>>>>> libubox
>>>>>>> git, but not in OpenWrt/LEDE yet.
>>>>>>> Please test if this fixes your issue.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> - Felix
>>>>>> Tried that but no immediate success, but it might have provided
>>>>>> some additional clues. Now the boot hangs early on *every* boot
>>>>>> but after logging in I found something different in the ps list.
>>>>>> There is a Broadcom utility (smd) that is called from one of the
>>>>>> start scripts (S10environment). It's purpose is to set scheduling
>>>>>> priority and cpu affinity for some of the Broadcom proprietary
>>>>>> processes, The smd program handles fork rather ugly. The
>>>>>> parent only loops until it receives SIGCHLD and then exits without
>>>>>> any wait. With the modified libubox I get a zombie smd child and
>>>>>> sleeping smd parent and S11environment (no other zombie).
>>>>>>
>>>>>> Not sure exactly how this happened but I got to think about
>>>>>> something written in the wait man page:
>>>>>>
>>>>>> """
>>>>>> If  a parent process terminates, then its "zombie" children (if any)
>>>>>> are adopted by init(8), which automatically performs a wait to
>>>>>> remove the zombies.
>>>>>> """
>>>>>>
>>>>>> Is this wait really (unconditionally) implemented in procd or could
>>>>>> that be what I accomplished with the "forced timeout" patch?
>>>>>>
>>>>>> I fixed the ugly fork and got the system to boot once.
>>>>>> Then tried the original libubox with the fixed smd program but
>>>>>> this was not enough to get things working (25 reboots to hang).
>>>>>>
>>>>>> Now I'm running reboot tests with your new libubox and fixed smd...
>>>>> More than 250 reboots without problem :)
>>>>>
>>>>> Clearly the smd program is broken, but still it doesn't feel good 
>>>>> that it
>>>>> manages to hang the init process. Considering that timing is involved
>>>>> it's difficult to make any certain conclusions but it seems like 
>>>>> having
>>>>> uloop epoll_wait to time out occasionally isn't such a bad idea?
>>>> I agree, that definitely needs fixing. What kernel are you using?
>>> It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old...
>>>
>>> Now I also noticed, with your libubox fixes (and my fixed smd) I 
>>> still get
>>> some zombies, even though the system seems to boot OK all the way
>>> (the corresponding services being defunct though).
>>> With my epoll_wait timeout fix on the original libubox, this does not
>>> happen.
>> Can you try backporting this to your kernel?
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=128dd1759d96ad36c379240f8b9463e8acfd37a1 
>>
>>
>> - Felix
> OK, did that.
>
> First, I tested original libubox without my epoll_wait timeout fix 
> (but fixed smd),
> init hung after 7 reboots. Same state as before.
>
> Second, I tested your libubox (with fixed smd), seems to be no change, 
> i.e.
> some processes/scripts ends up as zombies.
>
> Third, added my epoll_wait timeout fix to your libubox, made no 
> difference..

Also noted that other stuff is not working properly with your fixes in 
libubox,
e.g. system upgrade hangs with a zombie sh and after scp copying files
to the box I get a zombie dropbear. Child reaping in general seems to be
broken.

Backed down to latest libubox before your fixes and the zombies/problems
disappear so it has no relation to the kernel patch or smd (original 
intit hang
problem still there of course).

// Mats



More information about the Lede-dev mailing list