[PATCH 0/2] nvme: handle partially unique NID value

Thu Apr 17 09:56:18 PDT 2025

Christoph,

There is no debate about whether the NID reporting behavior is incorrect and
has to be fixed. It definitely has to be fixed and is getting fixed for new
drives.

That behavior was a defect, not a request, and I have theories on how people
who probably knew better failed to realize that.

Unfortunately, the incorrect implementation went unnoticed for quite a while and
there are drives in the field that have a correct NGUID and an invalid EUI64 in
some specific configurations. There is no simple fix for the drives in the
field.

I've seen some reflector traffic that suggests that similar behavior has been 
seen in other drives.

Since the NGUID is valid, and is the value used as the unique namespace ID (when
present), the issue didn't create problems in the environment where the drives 
were being used until a uniqueness check was performed on the EUI64.

It is a very serious error that the EUI64 is not unique and it is completely 
appropriate for that to be flagged.

A quirk of some kind that allows these drives to keep being used, given that
they worked perfectly before, seems like the right thing to do.

A discussion on how to appropriately flag this serious error seems to be in 
order if the method proposed by Hannes isn't acceptable.

Curtis

-----Original Message-----
From: Christoph Hellwig <hch at lst.de> 
Sent: Monday, April 14, 2025 5:41 AM
To: Hannes Reinecke <hare at suse.de>
Cc: Christoph Hellwig <hch at lst.de>; hare at kernel.org; Keith Busch <kbusch at kernel.org>; Sagi Grimberg <sagi at grimberg.me>; wagi at lst.de; linux-nvme at lists.infradead.org; Ballard, Curtis C (HPE Storage) <curtis.ballard at hpe.com>; Javier Gonzalez <javier.gonz at samsung.com>
Subject: Re: [PATCH 0/2] nvme: handle partially unique NID value

On Mon, Apr 14, 2025 at 01:31:29PM +0200, Hannes Reinecke wrote:
> We have discussed this at LSF, and the involved parties (ie
> Samsung as the vendor, HPe as the IHV, and us as the OS provider)
> are happy with this approach.
> And we have paying customers for which the cited patch caused a regression, 
> so ignoring it is not an option for us.

Tell them to fix their broken systems instead of shifting this broken
crap upstream.  Really, we bend over backwards for consumer hardware
that doesn't know better.  We don't add crap for vendors that absolutely
should know better, participate in the working group, and only provide
expensive enterprise hardware, just because they pay you.  If you have
so little spine that you want to accommodate this intentionally broken
behavior, do it in your tree but don't force the burden on others.