[PATCH v3 1/1] nvme-pci : Fix EEH failure on ppc after subsystem reset

Mon Jun 24 09:07:28 PDT 2024

On Sat, Jun 22, 2024 at 08:37:02PM +0530, Nilay Shroff wrote:
> On 6/21/24 22:07, Keith Busch wrote:
> >  static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
> >  {
> > +	u32 val;
> >  	int ret;
> >  
> >  	if (!ctrl->subsystem)
> > @@ -657,10 +660,10 @@ static inline int nvme_reset_subsystem(struct nvme_ctrl *ctrl)
> >  		return -EBUSY;
> >  
> >  	ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
> > -	if (ret)
> > -		return ret;
> > -
> > -	return nvme_try_sched_reset(ctrl);
> > +	nvme_change_ctrl_state(ctrl, NVME_CTRL_LIVE);
> > +	if (!ret)
> > +		ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &val);
> > +	return ret;

> This is a nice idea! These changes look good. I have tested it on powerpc with 
> EEH and I observed that post nvme subsystem-reset, EEH is able to recover the disk. 
> I have also tested it on a platform which *doesn't* support EEH or pci error recovery
> and on this platform I observed that nvme disk falls through the dead state. 
> 
> So I think you may submit a formal patch with this change.

Just a little concerned about the reg_read32 at the end there. A hot
plug event is a potentially expected outcome of the reg write, and that
may unmap the PCI BAR before the read.

And come to think of it, a hot plug could occur before the reg_write32,
too, for a reason unrelated to the requested subsys-reset operation...

Anyway, I think this needs a driver specific op to handle it safely.
I'll send a patch.
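
To make the idea concrete, here is a rough userspace model of what a
driver specific op could look like. This is only a sketch and not the
patch itself: every name in it (model_ctrl, model_ctrl_ops,
subsystem_reset, bar_mapped) is hypothetical, the register file is a
plain array rather than MMIO, and the locking a real driver would need
against removal is only noted in a comment.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; all names are hypothetical, not the kernel API. */
#define MODEL_REG_CSTS   0x1c         /* Controller Status offset */
#define MODEL_REG_NSSR   0x20         /* NVM Subsystem Reset offset */
#define MODEL_NSSR_MAGIC 0x4E564D65u  /* "NVMe" */

struct model_ctrl;

struct model_ctrl_ops {
	/* Optional; only the transport knows if its registers are reachable. */
	int (*subsystem_reset)(struct model_ctrl *ctrl);
};

struct model_ctrl {
	const struct model_ctrl_ops *ops;
	bool subsystem;     /* controller advertises subsystem reset support */
	bool bar_mapped;    /* cleared once a hot unplug has unmapped the BAR */
	uint32_t regs[16];  /* stand-in for the memory-mapped register file */
};

/* PCI-flavoured op: touch registers only while the mapping is valid. */
static int model_pci_subsystem_reset(struct model_ctrl *ctrl)
{
	uint32_t csts;

	/*
	 * Assumption: a real driver would hold a lock here that its remove
	 * path also takes, so the BAR cannot go away between the check,
	 * the write, and the read-back.
	 */
	if (!ctrl->bar_mapped)
		return -ENODEV;

	ctrl->regs[MODEL_REG_NSSR / 4] = MODEL_NSSR_MAGIC;

	/* Read back so that a failed access is noticed right away. */
	csts = ctrl->regs[MODEL_REG_CSTS / 4];
	(void)csts;
	return 0;
}

/* Generic entry point: defer entirely to the transport. */
static int model_reset_subsystem(struct model_ctrl *ctrl)
{
	if (!ctrl->subsystem)
		return -ENOTTY;
	if (!ctrl->ops || !ctrl->ops->subsystem_reset)
		return -ENOTTY;
	return ctrl->ops->subsystem_reset(ctrl);
}
```

The point of the split is that the generic helper no longer touches
registers directly, so a transport that can lose its mapping gets to
check for that before both the write and the read.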