[PATCH v2 2/2] nvme: handle persistent internal error AER from NVMe controller

Sat Jun 4 07:28:11 PDT 2022

From: Keith Busch <kbusch at kernel.org> Sent: Friday, June 3, 2022 12:23 PM
> 
> On Fri, Jun 03, 2022 at 10:56:01AM -0700, Michael Kelley wrote:
> 
> This series looks good to me. Just one concern below that may amount to
> nothing.
> 
> > +static void nvme_handle_aer_persistent_error(struct nvme_ctrl *ctrl)
> > +{
> > +	u32 csts;
> > +
> > +	trace_nvme_async_event(ctrl, NVME_AER_ERROR);
> > +
> > +	if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts) != 0 ||
> 
> The reg_read32() is non-blocking for pcie, so this is safe to call from that
> driver's irq handler. The other transports block on register reads, though, so
> they can't call this from an atomic context. The TCP context looks safe, but
> I'm not sure about RDMA or FC.

Good point.  But even if the RDMA and FC contexts are safe, if a
persistent error is reported, the controller is already in trouble and
may not respond to a request to retrieve the CSTS anyway.  Perhaps
we should just trust the AER error report and not bother checking
CSTS to decide whether to do the reset.  We can still check ctrl->state
and skip the reset if there's already one in progress.
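
A minimal user-space sketch of that idea, for illustration only. The
types and names below (struct ctrl, start_reset(), and so on) are
hypothetical stand-ins, not the kernel's definitions, and the sketch
ignores the locking that a real ctrl->state transition would need:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
enum ctrl_state { CTRL_LIVE, CTRL_RESETTING };

struct ctrl {
	enum ctrl_state state;
	int resets_started;	/* counts resets, for illustration only */
};

/* Stand-in for nvme_reset_ctrl(): only records that a reset began. */
static void start_reset(struct ctrl *c)
{
	c->state = CTRL_RESETTING;
	c->resets_started++;
}

/*
 * Trust the AER report: no CSTS register read, so nothing here can
 * block on a transport. Returns true if this call started a reset.
 */
static bool handle_aer_persistent_error(struct ctrl *c)
{
	assert(c != NULL);

	if (c->state == CTRL_RESETTING)
		return false;	/* a reset is already in progress */

	fprintf(stderr, "resetting controller due to persistent internal error\n");
	start_reset(c);
	return true;
}
```

With this shape, a second AER that arrives while the reset is still
running is a no-op instead of queuing another reset.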

> 
> > +	    nvme_should_reset(ctrl, csts)) {
> > +		dev_warn(ctrl->device, "resetting controller due to AER\n");
> > +		nvme_reset_ctrl(ctrl);
> > +	}
> > +}
> > +
> >  void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
> >  		volatile union nvme_result *res)
> >  {
> >  	u32 result = le32_to_cpu(res->u32);
> >  	u32 aer_type = result & 0x07;
> > +	u32 aer_subtype = (result & 0xff00) >> 8;
> 
> Since the above mask + shift is duplicated with nvme_handle_aen_notice(), an
> inline helper function seems reasonable.

Yep.  Will do in v3.
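
A rough sketch of what such helpers might look like, using the same
masks as the quoted patch. The function names are hypothetical (the
thread does not name them), and uint32_t stands in for the kernel's
u32 so the snippet builds in user space:

```c
#include <assert.h>
#include <stdint.h>

/* AER type: low three bits of the completion result, per the patch. */
static inline uint32_t aer_type(uint32_t result)
{
	return result & 0x07;
}

/* AER subtype: bits 15:8 of the completion result, per the patch. */
static inline uint32_t aer_subtype(uint32_t result)
{
	return (result & 0xff00) >> 8;
}

/* Small sanity check on an arbitrary made-up value. */
void aer_decode_selfcheck(void)
{
	assert(aer_type(0x0301) == 0x1);
	assert(aer_subtype(0x0301) == 0x3);
}
```

Both nvme_complete_async_event() and nvme_handle_aen_notice() could
then decode the result through one definition instead of repeating
the mask and shift.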

Michael