[PATCH 3/3] nvme: add KConfig options for debug features
Sagi Grimberg
sagi at grimberg.me
Mon Dec 13 01:27:09 PST 2021
>>> Add KConfig menu option to enable and disable gencounter debug
>>> feature that uses config NVME_DEBUG_USE_CID_GENCTR.
>>>
>>> Signed-off-by: Chaitanya Kulkarni <kch at nvidia.com>
>>> ---
>>> drivers/nvme/host/Kconfig | 10 ++++++++++
>>> 1 file changed, 10 insertions(+)
>>>
>>> diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
>>> index dc0450ca23a3..dfa2609b7006 100644
>>> --- a/drivers/nvme/host/Kconfig
>>> +++ b/drivers/nvme/host/Kconfig
>>> @@ -1,4 +1,14 @@
>>> # SPDX-License-Identifier: GPL-2.0-only
>>> +menu "Debug (Enable driver debug features)"
>>> +config NVME_DEBUG_USE_CID_GENCTR
>>> + bool "Enable command ID gen counter for spurious request
>>> completion"
>>> + depends on NVME_CORE
>>> + help
>>> + The NVM Express driver will use generation counter
>>> + when calculating the command id. This is needed to debug the
>>> + spurious request completions coming from a buggy controller.
>>
>> This is not just to debug - it is also to protect against such a
>> controller. What is the purpose of this config option anyways?
>> The main distributions will (as they should) enable it anyways...
>
> I can rewrite the text and rename it to "driver features".
It is not a feature.
> We are protecting against such a controller which is not stable
> (buggy), i.e. it is doing things which it shouldn't be
> doing at the first place.
Yes, this is exactly what we are doing.
> Consider a case if controller is not
> buggy then it adds instructions in the fast path which are not
> needed at all.
We keep repeating this, if the controller assumes _anything_ on
the command identifier then it _is_ buggy, period.
Again, the fact that the controller is buggy is perfectly fine
really, that's why we have quirks.
What happened to your nvme-cli argument? that would make it
specific to that controller.
The host side cost is very cheap as shown before.
> A controller(s) that is used in the production environment goes
> through qualification process from vendors and from the consumers
> to make sure they are stable, something like spurious completions
> detection is a basic part of the qualification,
I don't think that assuming that a production controller does not
have any hidden bugs is a great practice, and controllers evolve during
their production lifetime.
> hence we should
> allow user to configure genctr than forcing additional
> instructions in the fast path and keep this pattern for future
> such cases.
I personally think that given that the worst-case option for such
a bug is possible data-corruption then we should not make it optional,
but I won't insist on it if others think otherwise.
> Maybe I didn't understand, can you please explain what are the
> benefits of having gen-counter where controller is stable?
The benefit is that the host code does not "assume" anything
about the controller.
> I'll wait to send V2 if you can suggest any other way
> (than kconfig something like module param, I'm not sure) so user
> can configure genctr calculations, please let me know.
As I said, the nvme-cli skip-genctr argument is perfectly valid in my
mind.
More information about the Linux-nvme
mailing list