[RFC PATCH 1/1] sched/deadline: Fix RT task potential starvation when expiry time passed
Juri Lelli
juri.lelli at redhat.com
Fri Jun 20 08:22:17 PDT 2025
On 20/06/25 11:00, Kuyo Chang wrote:
...
> "DL replenish lagged too much" means the fair_server took much longer
> than expected to use up its running time,
> so the deadline fell way behind the clock (which is also why
> start_dl_timer() failed).
> In this situation, just replenishing one dl_period isn’t enough to
> catch up.
>
> A corner case is when there are too many IRQs or IPIs in the system.
> In this case, runtime gets consumed very slowly, and the fair_server
> keeps running without being throttled.
> Even when the runtime is finally exhausted, the fair_server is
> restarted immediately.
> In the end, IRQs, IPIs, and fair tasks can take over the whole system,
> leaving no chance for RT tasks to run.
Thanks for the additional explanation.
The way I understand it now is the following (of course please correct
me if I am still not getting it :)
- a dl_server is actively servicing NORMAL tasks, but suffers a lot of
IRQ load and cannot make much progress
- it does make progress anyway, but it reaches the throttle check in
update_curr_dl_se() only after rq_clock has already passed its current
deadline
- dl_runtime_exceeded() branch is entered, but start_dl_timer() fails as
the computed act is still in the past
- enqueue_dl_entity(REPLENISH) calls replenish_dl_entity(), which tries to
add runtime and advance the deadline, but time moved on so far that
deadline is still behind rq_clock() and so "DL replenish ..." is
printed
- replenish_dl_new_period() updates runtime and deadline from current
clock and the dl-server is put back to run (so it continues to run
over/starve FIFO tasks)
It looks like your proposed fix might work in this particular corner
case, but I am not 100% comfortable with not trying to replenish
properly (catch up with runtime) at all. I wonder if we might then start
missing some other corner case. Maybe we could try to catch this
particular corner case before even attempting to start the dl_timer,
since we know it will fail, and do something at that point?
Thanks,
Juri