[PATCH v2] nvme: continue keep alive on error

Mon May 14 08:21:37 PDT 2018

On 5/12/2018 6:33 AM, Christoph Hellwig wrote:
> On Fri, May 11, 2018 at 04:22:29PM -0700, James Smart wrote:
>> Currently, if the keep_alive command failed, an error message is
>> generated and keep alive is stopped. This guarantees the target will
>> eventually not see a keep_alive in a KATO window and fail.
>>
>> The keep_alive command may complete in error in cases where the
>> transport or lldd are temporarily out of resources. As such, the
>> command should be retried rather than letting the controller die.
>>
>> If the command completes in error, retry another one after a short
>> delay. Track whether keep alive has had an error to reduce printing
>> the error message to the first failure only.
> This seems pretty much counter to the definition of keep alive.
> What kinds of errors do you see when you'd want to retry?  How can
> we figure out we hit exactly that case instead of just wasting
> our time retrying?

The error is a temporary adapter resource error that wants the upper 
layer to retry the command, whatever command it was, after a short delay.

Taking another look at the code and the history of this issue: the
transport should be returning BLK_STS_RESOURCE, and the blk-mq layer will
take care of the retry. But the transport only does so if the LLDD
returned a -EBUSY response. And lpfc, until the recent patches, didn't
always return -EBUSY. So it was a driver error.
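
To make that flow concrete, here is a rough user-space model of it. This is
only a sketch, not the kernel code: every model_* name is made up for
illustration, the enum merely stands in for the kernel's blk_status_t values,
and the bounded retry loop is a simplification of the block layer's requeue
behavior. The one point it tries to show is that only -EBUSY from the LLDD
becomes a retryable status; any other error fails the command on the first
submission.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel's blk_status_t values. */
enum model_blk_status {
	MODEL_BLK_STS_OK,
	MODEL_BLK_STS_RESOURCE,	/* block layer requeues and retries later */
	MODEL_BLK_STS_IOERR	/* command fails back to its submitter */
};

/* Hypothetical LLDD: fails the next busy_left submissions with busy_errno. */
struct model_lldd {
	int busy_left;
	int busy_errno;
};

static int model_lldd_submit(struct model_lldd *lldd)
{
	if (lldd->busy_left > 0) {
		lldd->busy_left--;
		return lldd->busy_errno;
	}
	return 0;
}

/* Transport side: only -EBUSY from the LLDD maps to a retryable status. */
static enum model_blk_status model_transport_status(int lldd_ret)
{
	if (lldd_ret == 0)
		return MODEL_BLK_STS_OK;
	if (lldd_ret == -EBUSY)
		return MODEL_BLK_STS_RESOURCE;
	return MODEL_BLK_STS_IOERR;
}

/*
 * Block-layer side (simplified): resubmit while the transport reports a
 * resource shortage, up to max_tries submissions. Stores the number of
 * submissions made in *tries and returns the final status.
 */
static enum model_blk_status model_dispatch(struct model_lldd *lldd,
					    int max_tries, int *tries)
{
	enum model_blk_status sts;
	int n = 0;

	assert(max_tries > 0);
	do {
		n++;
		sts = model_transport_status(model_lldd_submit(lldd));
	} while (sts == MODEL_BLK_STS_RESOURCE && n < max_tries);

	*tries = n;
	return sts;
}
```

With an LLDD that reports -EBUSY twice, model_dispatch() succeeds on the
third submission. With one that reports, say, -ENOMEM for the same temporary
condition, the command fails on the first submission, which is the behavior
that looked like a keep alive problem.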

Ok - I'll drop this one too.

-- james