[PATCH 3/3] nvme: add KConfig options for debug features

Sagi Grimberg sagi at grimberg.me
Mon Dec 13 01:27:09 PST 2021


>>> Add KConfig menu option to enable and disable gencounter debug
>>> feature that uses config NVME_DEBUG_USE_CID_GENCTR.
>>>
>>> Signed-off-by: Chaitanya Kulkarni <kch at nvidia.com>
>>> ---
>>>    drivers/nvme/host/Kconfig | 10 ++++++++++
>>>    1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
>>> index dc0450ca23a3..dfa2609b7006 100644
>>> --- a/drivers/nvme/host/Kconfig
>>> +++ b/drivers/nvme/host/Kconfig
>>> @@ -1,4 +1,14 @@
>>>    # SPDX-License-Identifier: GPL-2.0-only
>>> +menu "Debug (Enable driver debug features)"
>>> +config NVME_DEBUG_USE_CID_GENCTR
>>> +     bool "Enable command ID gen counter for spurious request
>>> completion"
>>> +     depends on NVME_CORE
>>> +     help
>>> +       The NVM Express driver will use generation counter
>>> +       when calculating the command id. This is needed to debug the
>>> +       spurious request completions coming from a buggy controller.
>>
>> This is not just to debug - it is also to protect against such a
>> controller. What is the purpose of this config option anyways?
>> The main distributions will (as they should) enable it anyways...
> 
> I can rewrite the text and rename it to "driver features".

It is not a feature.

> We are protecting against such a controller which is not stable
> (buggy), i.e. it is doing things which it shouldn't be
> doing at the first place.

Yes, this is exactly what we are doing.

> Consider a case if controller is not
> buggy then it adds instructions in the fast path which are not
> needed at all.

We keep repeating this, if the controller assumes _anything_ on
the command identifier then it _is_ buggy, period.

Again, the fact that the controller is buggy is perfectly fine
really, that's why we have quirks.

What happened to your nvme-cli argument? that would make it
specific to that controller.

The host side cost is very cheap as shown before.

> A controller(s) that is used in the production environment goes
> through qualification process from vendors and from the consumers
> to make sure they are stable, something like spurious completions
> detection is a basic part of the qualification,

I don't think that assuming that a production controller does not
have any hidden bugs is a great practice, and controllers evolve during
their production lifetime.

> hence we should
> allow user to configure genctr than forcing additional
> instructions in the fast path and keep this pattern for future
> such cases.

I personally think that given that the worst-case option for such
a bug is possible data-corruption then we should not make it optional,
but I won't insist on it if others think otherwise.

> Maybe I didn't understand, can you please explain what are the
> benefits of having gen-counter where controller is stable?

The benefit is that the host code does not "assume" anything
about the controller.

> I'll wait to send V2 if you can suggest any other way
> (than kconfig something like module param, I'm not sure) so user
> can configure genctr calculations, please let me know.

As I said, the nvme-cli skip-genctr argument is perfectly valid in my
mind.



More information about the Linux-nvme mailing list