[PATCH 1/1] nvme: retry security commands if media not ready

Wed Oct 2 09:51:56 PDT 2024

On Wed, 2024-10-02 at 10:16 +0200, Christoph Hellwig wrote:
> On Mon, Sep 30, 2024 at 11:48:43AM -0500, gjoyce at linux.ibm.com wrote:
> > +static u32 nvme_get_timeout(struct nvme_ctrl *ctrl)
> 
> get_timeout feels a bit too generic for this specific
> controller/media
> ready  timeout.

Agreed. I'll change the name to something more specific.

> 
> > +	timeout = NVME_CAP_TIMEOUT(ctrl->cap);
> > +	if (ctrl->cap & NVME_CAP_CRMS_CRWMS) {
> > +		u32 crto, ready_timeout;
> > +
> > +		ret = ctrl->ops->reg_read32(ctrl, NVME_REG_CRTO,
> > &crto);
> > +		if (ret) {
> > +			dev_err(ctrl->device, "Reading CRTO failed
> > (%d)\n",
> > +				ret);
> > +			return ret;
> > +		}
> 
> And we really should be caching these values instead of reading the
> register for every security command.

If we cache the timeout value(s) in nvme_ctrl then it may be possible
to just eliminate nvme_get_timeout() entirely. Is this the approach
that you were thinking?

> 
> > +	u32 timeout;
> > +	unsigned long timeout_jiffies;
> > +	int ret;
> > +
> > +	timeout = nvme_get_timeout(ctrl);
> > +	timeout_jiffies = jiffies + timeout * HZ;
> >  
> >  	if (send)
> >  		cmd.common.opcode = nvme_admin_security_send;
> > @@ -2335,8 +2376,19 @@ static int nvme_sec_submit(void *data, u16
> > spsp, u8 secp, void *buffer, size_t l
> >  	cmd.common.cdw10 = cpu_to_le32(((u32)secp) << 24 |
> > ((u32)spsp) << 8);
> >  	cmd.common.cdw11 = cpu_to_le32(len);
> >  
> > -	return __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, NULL,
> > buffer, len,
> > +	ret = __nvme_submit_sync_cmd(ctrl->admin_q, &cmd, NULL,
> > buffer, len,
> >  			NVME_QID_ANY, NVME_SUBMIT_AT_HEAD);
> > +	while (ret == NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY) {
> > +		if (time_after(jiffies, timeout_jiffies)) {
> > +			dev_err(ctrl->device,
> > +				"Device media not ready;
> > aborting\n");
> > +			return -ENODEV;
> > +		}
> > +		ssleep(1);
> > +		ret =  __nvme_submit_sync_cmd(ctrl->admin_q, &cmd,
> > NULL, buffer,
> > +				len, NVME_QID_ANY,
> > NVME_SUBMIT_AT_HEAD);
> > +	}
> 
> And this also feels a bit odd in that it doesn't catch
> NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY when it should be ready.
> I think just marking when the controller is past the timeout and
> only doing the retry until then might be the better approach.  And
> maybe we should have it in the __nvme_submit_sync_cmd helper for
> admin command as Security Send/Receive aren't the only commands with
> this issue.
> 

I'll look at the spec some more but you're probably correct that when
the status code transitions from NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY
to success, then NVME_SC_ADMIN_COMMAND_MEDIA_NOT_READY should not be
returned again.

I'm a little confused about what you're saying about the timeout.
nvme_enable_ctrl() does determine the correct timeout value and passes
it to nvme_wait_ready() but NVME_CSTS_RDY is set well before the media
is ready (if CC.CRIME is set). Unfortunately there doesn't seem to be
any controller status that indicates when the media is ready. I thought
about having nvme_wait_ready() wait the whole timeout if CC.CRIME is
set, but I think that is contrary to intent of CC.CRIME. And on the SSD
that I'm looking at the timeout is 15 seconds which would be a pretty
big hit to boot time.

I was looking to solve the specific case of the security commands but
certainly any of the commands in Figure 103 could have the same issue.
The heart of the problem is that CC.CRIME support is not complete. I'll
see if I can address that in a more comprehensive manner.

-Greg