[LEDE-DEV] libubox, procd: init process hangs
Felix Fietkau
nbd at nbd.name
Wed May 18 05:03:52 PDT 2016
On 2016-05-18 14:00, Mats Karrman wrote:
>
>
> On 2016-05-18 13:01, Felix Fietkau wrote:
>> On 2016-05-18 11:38, Mats Karrman wrote:
>>>
>>> On 2016-05-17 17:31, Mats Karrman wrote:
>>>> On 2016-05-17 13:29, Felix Fietkau wrote:
>>>>> I just took a look at the code and uloop's processing of signals looked
>>>>> a bit racy to me. I've pushed a commit that makes it use signalfd if
>>>>> available. I also found that waitpid wasn't being retried on signal
>>>>> interrupt, so I added an extra check there. The changes are in libubox
>>>>> git, but not in OpenWrt/LEDE yet.
>>>>> Please test if this fixes your issue.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Felix
>>>> Tried that without immediate success, but it might have provided
>>>> some additional clues. Now the boot hangs early on *every* boot
>>>> but after logging in I found something different in the ps list.
>>>> There is a Broadcom utility (smd) that is called from one of the
>>>> start scripts (S10environment). Its purpose is to set scheduling
>>>> priority and cpu affinity for some of the Broadcom proprietary
>>>> processes. The smd program handles fork rather badly. The
>>>> parent only loops until it receives SIGCHLD and then exits without
>>>> any wait. With the modified libubox I get a zombie smd child and
>>>> sleeping smd parent and S11environment (no other zombie).
>>>>
>>>> Not sure exactly how this happened but it made me think of
>>>> something written in the wait man page:
>>>>
>>>> """
>>>> If a parent process terminates, then its "zombie" children (if any)
>>>> are adopted by init(8), which automatically performs a wait to
>>>> remove the zombies.
>>>> """
>>>>
>>>> Is this wait really (unconditionally) implemented in procd or could
>>>> that be what I accomplished with the "forced timeout" patch?
>>>>
>>>> I fixed the ugly fork and got the system to boot once.
>>>> Then tried the original libubox with the fixed smd program but
>>>> this was not enough to get things working (25 reboots to hang).
>>>>
>>>> Now I'm running reboot tests with your new libubox and fixed smd...
>>> More than 250 reboots without problem :)
>>>
>>> Clearly the smd program is broken, but it still doesn't feel good that it
>>> manages to hang the init process. Since timing is involved it's
>>> difficult to draw any firm conclusions, but it seems like having
>>> uloop's epoll_wait time out occasionally isn't such a bad idea?
>> I agree, that definitely needs fixing. What kernel are you using?
> It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old...
>
> Now I also noticed, with your libubox fixes (and my fixed smd) I still get
> some zombies, even though the system seems to boot OK all the way
> (the corresponding services being defunct though).
> With my epoll_wait timeout fix on the original libubox, this does not
> happen.
Can you try backporting this to your kernel?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=128dd1759d96ad36c379240f8b9463e8acfd37a1
- Felix
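
For readers following the thread, here is a minimal sketch of the two ideas
discussed above: receiving SIGCHLD through a signalfd instead of an
asynchronous handler, and retrying waitpid when it is interrupted by a
signal. This is not the libubox code; the function names reap_children and
demo_signalfd_reap are made up for illustration, and signalfd is assumed to
be available (Linux only).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child that has already exited; returns how many were reaped.
 * waitpid is retried on EINTR so an interrupting signal cannot leave a
 * zombie behind. */
static int reap_children(void)
{
	int reaped = 0;

	for (;;) {
		int status;
		pid_t pid = waitpid(-1, &status, WNOHANG);

		if (pid > 0) {
			reaped++;
			continue;
		}
		if (pid < 0 && errno == EINTR)
			continue;
		break;	/* 0: children still running; ECHILD: no children */
	}
	return reaped;
}

/* Hypothetical demo: block SIGCHLD, read it from a signalfd, then reap.
 * Returns the number of children reaped, or -1 on error. */
static int demo_signalfd_reap(void)
{
	struct signalfd_siginfo si;
	sigset_t mask;
	ssize_t n;
	pid_t pid;
	int fd, reaped;

	/* SIGCHLD must be blocked, otherwise it is delivered the normal way
	 * and never shows up on the descriptor. */
	sigemptyset(&mask);
	sigaddset(&mask, SIGCHLD);
	if (sigprocmask(SIG_BLOCK, &mask, NULL) < 0)
		return -1;

	fd = signalfd(-1, &mask, SFD_CLOEXEC);
	if (fd < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fd);
		return -1;
	}
	if (pid == 0)
		_exit(0);	/* child exits immediately */

	/* Blocks until SIGCHLD is pending; retried if interrupted. */
	do {
		n = read(fd, &si, sizeof(si));
	} while (n < 0 && errno == EINTR);

	if (n != (ssize_t)sizeof(si)) {
		close(fd);
		return -1;
	}
	assert(si.ssi_signo == SIGCHLD);

	reaped = reap_children();
	close(fd);
	return reaped;
}
```

In an event loop the signalfd would be added to the epoll set rather than
read with a blocking read, so that child exits are handled in the same
place as every other event.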