[LEDE-DEV] dnsmasq sometimes fails to start on 17.01.0

Gui Iribarren gui at altermundi.net
Thu Mar 23 12:10:06 PDT 2017


Thanks both for the replies!

I continued yeterday further debugging this, i played with this
particular number in /e/i/dnsmasq
(line 815 in http://pastebin.com/FV09f2jG)

    procd_add_raw_trigger "interface.*" 2000 /etc/init.d/dnsmasq reload

this number, as the following link suggests
http://wiki.prplfoundation.org/wiki/Procd_reference#procd_add_raw_trigger.28event.2C_timeout.2C_.5Bscript.5D.29
is the number of milliseconds that procd will wait after the trigger (in
this case, anything related to an "interface" AFAIU) before executing
"dnsmasq reload"

i put 60000 (60 seconds) and now the reload only happens once (after ~60
seconds from the first "/e/i/dnsmasq boot") and dnsmasq starts without
problem

so it looks to me like a race condition, where two "interface.*" events
are happening one after the other, triggering two consecutive reloads,
the first reload doesn't finish its work before the second reload comes,
and the second reload kills the first reload, and suicides itself for
some reason.

setting a long raw_trigger timeout works around the problem because the
"interface.*" events happen all inside the 60 second window, and procd
runs "/e/i/dnsmasq reload" only once

i can imagine that few people came across this since dnsmasq normally
takes less than 2 seconds to start, and so the first reload is already
done starting when the second reload is triggered.

my added "sleeps" help reproduce the issue by artificially making
dnsmasq take always longer than 2 seconds, but without them the race
condition is still there, only it doesn't end badly every boot (only
some boots, and on some hardware)

is this enough info to point the right direction for someone to look
into ....procd? (not me, not enough proficient in C, sorry)

(running two long "/e/i/dnsmasq reload" in parallel manually via
console, the second one doesn't kill the first one, which makes me think
that procd is the murderer)

On 22/03/17 23:29, Yousong Zhou wrote:
> From the log, there are at least 4 runs of /etc/init.d/dnsmasq
> 
> This is "/e/c/dnsmasq boot" and returns immediately because the script skips
> this phase and the dnsmasq service is expected to be brought up by interface
> trigger events
> 
> 	root at coco:~# logread | grep dns
> 	Wed Mar 22 18:25:56 2017 user.notice root: guidebug dnsmasq init
> 	Wed Mar 22 18:25:56 2017 user.notice root: guidebug dnsmasq boot
> 	Wed Mar 22 18:25:56 2017 user.notice root: guidebug dnsmasq startsrv 1,
> 	Wed Mar 22 18:25:56 2017 user.notice root: guidebug dnsmasq srvtrg
> 
> This one I do not have much clue about
> 
> 	Wed Mar 22 18:25:57 2017 user.notice root: guidebug dnsmasq init
> 
> This one was brought up by the interface trigger event.  Your log stopped at
> "wait 18" without "wait 16" because the service_triggers call in the
> /e/c/dnsmasq has set the timeout of event handling to be 2000ms.  That's how
> the logger and sleep call were impeding the process
> 
> 	Wed Mar 22 18:26:06 2017 user.notice root: guidebug dnsmasq init
> 	Wed Mar 22 18:26:06 2017 user.notice root: guidebug dnsmasq reload
> 	Wed Mar 22 18:26:06 2017 user.notice root: guidebug dnsmasq startsrv ,
> 	Wed Mar 22 18:26:06 2017 user.notice root: guidebug dnsmasq realstart cfg02411c wait 20
> 	Wed Mar 22 18:26:08 2017 user.notice root: guidebug dnsmasq realstart cfg02411c wait 18
> 
> This one I also have no clue about.  It didn't even make it to the "realstart"
> part.  So checking further what happened in between was causing that should be
> helpful
> 
> 	Wed Mar 22 18:26:10 2017 user.notice root: guidebug dnsmasq init
> 	Wed Mar 22 18:26:10 2017 user.notice root: guidebug dnsmasq reload
> 	Wed Mar 22 18:26:10 2017 user.notice root: guidebug dnsmasq startsrv ,
> 



More information about the Lede-dev mailing list