[PATCH v2 2/2] nvme: handle persistent internal error AER from NVMe controller

Keith Busch kbusch at kernel.org
Mon Jun 6 09:38:06 PDT 2022


On Sat, Jun 04, 2022 at 02:28:11PM +0000, Michael Kelley (LINUX) wrote:
> From: Keith Busch <kbusch at kernel.org> Sent: Friday, June 3, 2022 12:23 PM
> > 
> > On Fri, Jun 03, 2022 at 10:56:01AM -0700, Michael Kelley wrote:
> > 
> > This series looks good to me. Just one concern below that may amount to
> > nothing.
> > 
> > > +static void nvme_handle_aer_persistent_error(struct nvme_ctrl *ctrl)
> > > +{
> > > +	u32 csts;
> > > +
> > > +	trace_nvme_async_event(ctrl, NVME_AER_ERROR);
> > > +
> > > +	if (ctrl->ops->reg_read32(ctrl, NVME_REG_CSTS, &csts) != 0 ||
> > 
> > The reg_read32() is non-blocking for pcie, so this is safe to call from that
> > driver's irq handler. The other transports block on register reads, though, so
> > they can't call this from an atomic context. The TCP context looks safe, but
> > I'm not sure about RDMA or FC.
> 
> Good point.  But even if the RDMA and FC contexts are safe, if a
> persistent error is reported, the controller is already in trouble and
> may not respond to a request to retrieve the CSTS anyway.  Perhaps
> we should just trust the AER error report and not bother checking
> CSTS to decide whether to do the reset.  We can still check ctrl->state
> and skip the reset if there's already one in progress.

That sounds good to me. Christoph noted that RDMA can't safely do this from the
callback anyway, and it's probably a bad idea in general to dispatch new
requests from within another request's completion: that may prevent reclaiming
the only available tag, and then deadlock.

So with that in mind, this AER persistent error handler could call
nvme_should_reset() with NVME_CSTS_CFS as a constant value for the csts
parameter.
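
Roughly something like the below. This is only an untested userspace sketch of
the control flow, not the driver change itself: the mock_* types and functions
are hypothetical stand-ins for struct nvme_ctrl, nvme_should_reset() and the
reset scheduling, and I'm assuming the should-reset check is reachable from the
core code. The only real value is NVME_CSTS_CFS, which is bit 1 of CSTS.

```c
#include <stdbool.h>
#include <stdint.h>

/* CSTS.CFS (Controller Fatal Status) is bit 1 of the CSTS register. */
#define NVME_CSTS_CFS	(1u << 1)

/* Hypothetical stand-in for the controller states that matter here. */
enum mock_ctrl_state {
	MOCK_CTRL_LIVE,
	MOCK_CTRL_RESETTING,
	MOCK_CTRL_CONNECTING,
};

/* Hypothetical stand-in for struct nvme_ctrl. */
struct mock_ctrl {
	enum mock_ctrl_state state;
	int resets_scheduled;
};

/*
 * Assumed shape of the should-reset check: skip if a reset or reconnect
 * is already in progress, otherwise reset only when CFS is set.
 */
static bool mock_should_reset(const struct mock_ctrl *ctrl, uint32_t csts)
{
	switch (ctrl->state) {
	case MOCK_CTRL_RESETTING:
	case MOCK_CTRL_CONNECTING:
		return false;
	default:
		break;
	}

	return (csts & NVME_CSTS_CFS) != 0;
}

/*
 * AER persistent error handler: trust the AER and pass CFS as a constant
 * instead of reading CSTS, so no register read (and no new request) is
 * issued from the completion context. Returns true if a reset was
 * scheduled.
 */
static bool mock_handle_aer_persistent_error(struct mock_ctrl *ctrl)
{
	if (!mock_should_reset(ctrl, NVME_CSTS_CFS))
		return false;

	/* Stand-in for scheduling the real controller reset. */
	ctrl->state = MOCK_CTRL_RESETTING;
	ctrl->resets_scheduled++;
	return true;
}
```

With the constant passed in, the only thing that can veto the reset is the
controller state check, so a second AER arriving while the first reset is still
in flight is ignored.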


