[PATCH v4] ufs: core: wlun send SSU timeout recovery

Peter Wang (王信友) peter.wang at mediatek.com
Tue Sep 26 03:23:47 PDT 2023


On Mon, 2023-09-25 at 15:07 -0700, Bart Van Assche wrote:
>  	 
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
>  On 9/25/23 01:51, Peter Wang (王信友) wrote:
> > On Fri, 2023-09-22 at 09:06 -0700, Bart Van Assche wrote:
> >> On 9/22/23 02:09, peter.wang at mediatek.com wrote:
> >> > When runtime pm send SSU times out, the SCSI core invokes
> >> > eh_host_reset_handler, which hooks function
> ufshcd_eh_host_reset_handler
> >> > schedule eh_work and stuck at wait flush_work(&hba->eh_work).
> >> > However, ufshcd_err_handler hangs in wait rpm resume.
> >> > Do link recovery only in this case.
> >> > Below is IO hang stack dump in kernel-6.1
> >>
> >> What does kernel-6.1 mean? Has commit 7029e2151a7c ("scsi: ufs:
> Fix a
> >> deadlock between PM and the SCSI error handler") been backported
> to
> >> that kernel?
> > 
> > Yes, our kernel-6.1 have backport 7029e2151a7c ("scsi: ufs: Fix a
> > deadlock between PM and the SCSI error handler")
> 
> Hi Peter,
> 
> Thanks for having confirmed this.
> 
> Please use text mode instead of HTML mode when replying. As one can
> see
> here, your reply did not make it to the linux-scsi mailing list:
> 
https://lore.kernel.org/linux-scsi/20230922090925.16339-1-peter.wang@mediatek.com/
> 
> > And this IO hang issue still happen.
> > 
> > This issue is easy trigger if we drop the SSU response to let SSU
> timeout.
> > 
> > 
> > 
> >>
> >> > diff --git a/drivers/ufs/core/ufshcd.c
> b/drivers/ufs/core/ufshcd.c
> >> > index c2df07545f96..7608d75bb4fe 100644
> >> > --- a/drivers/ufs/core/ufshcd.c
> >> > +++ b/drivers/ufs/core/ufshcd.c
> >> > @@ -7713,9 +7713,29 @@ static int
> ufshcd_eh_host_reset_handler(struct scsi_cmnd *cmd)
> >> >   int err = SUCCESS;
> >> >   unsigned long flags;
> >> >   struct ufs_hba *hba;
> >> > +struct device *dev;
> >> >   
> >> >   hba = shost_priv(cmd->device->host);
> >> >   
> >> > +/*
> >> > + * If runtime pm send SSU and got timeout, scsi_error_handler
> >> > + * stuck at this function to wait flush_work(&hba->eh_work).
> >> > + * And ufshcd_err_handler(eh_work) stuck at wait runtime pm
> active.
> >> > + * Do ufshcd_link_recovery instead shedule eh_work can prevent
> >> > + * dead lock happen.
> >> > + */
> >> > +dev = &hba->ufs_device_wlun->sdev_gendev;
> >> > +if ((dev->power.runtime_status == RPM_RESUMING) ||
> >> > +(dev->power.runtime_status == RPM_SUSPENDING)) {
> >> > +err = ufshcd_link_recovery(hba);
> >> > +if (err) {
> >> > +dev_err(hba->dev, "WL Device PM: status:%d, err:%d\n",
> >> > +dev->power.runtime_status,
> >> > +dev->power.runtime_error);
> >> > +}
> >> > +return err;
> >> > +}
> >> > +
> >> >   spin_lock_irqsave(hba->host->host_lock, flags);
> >> >   hba->force_reset = true;
> >> >   ufshcd_schedule_eh_work(hba);
> >>
> >> I think this change is racy because the runtime power management
> status
> >> may change after the above checks have been performed and before
> >> ufshcd_err_handling_prepare() is called.
> > 
> > If this function invoke and RPM state is in the RPM_RESUMING or
> RPM_SUSPENDING state,
> > it means error happen in ufs driver resume/suspend callback
> function and no chance
> > move to next state if callback function is not finished, just like
> backtrace stuck in send SSU cmd.
> > Or we can make it simple by check pm_op_in_progress, what do you
> think?
> 
> I'm concerned that this approach will fix some cases but not all
> cases. The
> UFS error handler (ufshcd_err_handler()) and runtime suspend or
> resume code
> may be called concurrently. No matter which checks are added at the 
> start of
> ufshcd_eh_host_reset_handler(), I think these checks won't protect
> against
> concurrent execution of runtime power management code just after
> these checks
> have finished.
> 

Hi Bart,

Yes, but ufshcd_err_handler will wait runtime pm resume to actvie if
concurrently run with runtime suspend or resume.
(ufshcd_err_handler -> ufshcd_err_handling_prepare ->
ufshcd_rpm_get_sync)
So, if runtime suspend or resume send SSU timeout, scsi error handler
stuck at this function and deadlock happen.
The ideas is, if runtime pm flow get error, do ufshcd_link_recovery is
enough.


> > Since the hang db backtrace is not involde below function, it is
> not help.
> 
> I think that conclusion is wrong. Can you please test the patch that
> I provided?
> 

Yes, I have check this patch is include in our code base, and io hang
still happen. The stack dump of this deadlock can tell how it cause io
hang.

Thanks.

> Thanks,
> 
> Bart.


More information about the Linux-mediatek mailing list