nvme_scan_ns does not remove stale namespaces anymore since kernel 6.0

Tue Feb 21 06:25:40 PST 2023

Hi everyone,

I wanted to report an issue with NVMe namespace scanning that we have been observing since commit 1a893c2bfef46ac447eead8ea7afe417942be237 ("nvme: refactor namespace probing").

On Google Cloud Platform (GCP), disks are attached to a VM via NVMe on the same controller, but as different namespaces. For example, a VM with an OS disk and a data disk attached looks like this:

Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n2          /dev/ng0n2            nvme_card-pd         nvme_card-pd                             2           0.00   B /  10.74  GB    512   B +  0 B   2       
/dev/nvme0n1          /dev/ng0n1            nvme_card-pd         nvme_card-pd                             1           0.00   B / 161.06  GB    512   B +  0 B   2       

Whenever a disk is attached or detached via GCP’s web portal or API, a namespace scan is triggered, with the intention of either detecting new disks (= new namespaces) or removing old disks (= existing namespaces):
[  155.142132] nvme nvme0: rescanning namespaces.

Before the refactoring commit, everything worked fine: namespaces were removed when disks were detached via GCP.

After the commit, namespaces are no longer cleaned up correctly. If I now remove a disk, a namespace scan is triggered, but the namespace is left behind in Linux:

Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n2          /dev/ng0n2            nvme_card-pd         nvme_card-pd                             2           0.00   B /   0.00   B      1   B +  0 B   2       
/dev/nvme0n1          /dev/ng0n1            nvme_card-pd         nvme_card-pd                             1           0.00   B / 161.06  GB    512   B +  0 B   2       

I already analyzed this issue a bit.

Before the commit, nvme_scan_ns in drivers/nvme/host/core.c called nvme_validate_ns for each allocated namespace. In the example above, when the disk is removed, nvme_validate_ns is called on both namespaces.
nvme_validate_ns in turn calls nvme_identify_ns. When a namespace is no longer allocated, nvme_identify_ns returns an error, which seems to be the expected behavior:

	if ((*id)->ncap == 0) /* namespace not allocated or attached */
		goto out_free_id;

Because an error value is returned, nvme_validate_ns then calls nvme_ns_remove on the namespace of the detached disk, and the namespace is removed from Linux.

After the commit, nvme_scan_ns calls nvme_identify_ns via nvme_ns_info_from_identify, before nvme_validate_ns. When nvme_identify_ns returns an error for the namespace of the detached disk, so does nvme_ns_info_from_identify. Back in nvme_scan_ns, that error now causes an early return, so nvme_validate_ns is never called for the stale namespace. And since nvme_validate_ns is not called, the namespace continues to exist in Linux while the underlying (virtualized) drive is gone.

Any access to the device then causes I/O errors, which certainly is not great.
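
To make the difference between the two scan flows concrete, here is a small stand-alone C model of them. This is a sketch and not kernel code: the struct, the model_* function names, MAX_NS and ERR_INVALID_NS are all made up for illustration, and the only thing modeled is the order in which identify and validate run, as described above.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NS 4          /* hypothetical limit for this toy model */
#define ERR_INVALID_NS 1  /* stand-in for a fatal "invalid namespace" error */

/* Toy controller: what the device reports vs. what the host still exposes. */
struct model_ctrl {
	bool dev_attached[MAX_NS + 1];  /* namespace is attached on the device */
	bool host_present[MAX_NS + 1];  /* host still has e.g. /dev/nvme0nN */
};

/* Stand-in for nvme_identify_ns: fails once the namespace is detached. */
static int model_identify_ns(const struct model_ctrl *c, unsigned nsid)
{
	assert(nsid >= 1 && nsid <= MAX_NS);
	return c->dev_attached[nsid] ? 0 : ERR_INVALID_NS;
}

/* Stand-in for nvme_validate_ns: an identify error removes the namespace. */
static void model_validate_ns(struct model_ctrl *c, unsigned nsid)
{
	if (model_identify_ns(c, nsid))
		c->host_present[nsid] = false;  /* models nvme_ns_remove */
}

/* Flow before the refactoring: a known namespace is always validated. */
static void model_scan_ns_before(struct model_ctrl *c, unsigned nsid)
{
	if (c->host_present[nsid]) {
		model_validate_ns(c, nsid);
		return;
	}
	if (!model_identify_ns(c, nsid))
		c->host_present[nsid] = true;   /* models nvme_alloc_ns */
}

/* Flow after the refactoring: identify runs first, an error returns early. */
static void model_scan_ns_after(struct model_ctrl *c, unsigned nsid)
{
	if (model_identify_ns(c, nsid))
		return;                         /* stale namespace is never validated */
	if (c->host_present[nsid])
		model_validate_ns(c, nsid);
	else
		c->host_present[nsid] = true;   /* models nvme_alloc_ns */
}
```

With namespace 2 marked as present on the host but detached on the device, model_scan_ns_before(&ctrl, 2) clears host_present[2], while model_scan_ns_after(&ctrl, 2) leaves it set. That matches the stale /dev/nvme0n2 entry shown above.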

I already reported this to Google, and their engineers confirmed the issue, but so far I couldn’t find any proposed fix or report here on the mailing list.
So I wanted to go ahead and fix this on my own, given that this issue has existed since kernel 6.0 and is still present in kernel 6.2.

However, I am not sure what exactly a patch would look like that keeps the “clean” code from the refactoring in place yet also removes any stale namespaces.

My suggestion would be to remove the early returns, call nvme_validate_ns on any allocated namespace, and only call nvme_put_ns (for an existing allocated namespace) or nvme_alloc_ns (for a new, not yet allocated namespace) if no error was encountered before.
But given that I am not really that familiar with the NVMe spec or the code in detail, I am not sure whether this could introduce other problems.
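
Roughly, the control flow I have in mind could be modeled like the following stand-alone C sketch. Again, this is a toy model and not a kernel patch: the fix_* names, FIX_MAX_NS and FIX_ERR_INVALID_NS are hypothetical, reference counting (nvme_find_get_ns / nvme_put_ns) is left out entirely, and passing the identify result into the validate step is an assumption made for the model.

```c
#include <assert.h>
#include <stdbool.h>

#define FIX_MAX_NS 4          /* hypothetical limit for this toy model */
#define FIX_ERR_INVALID_NS 1  /* stand-in for a fatal identify error */

/* Toy controller: what the device reports vs. what the host still exposes. */
struct fix_ctrl {
	bool dev_attached[FIX_MAX_NS + 1];
	bool host_present[FIX_MAX_NS + 1];
};

/* Stand-in for nvme_identify_ns: fails once the namespace is detached. */
static int fix_identify_ns(const struct fix_ctrl *c, unsigned nsid)
{
	assert(nsid >= 1 && nsid <= FIX_MAX_NS);
	return c->dev_attached[nsid] ? 0 : FIX_ERR_INVALID_NS;
}

/* Models the validate step: given the identify result, drop a stale namespace. */
static void fix_validate_ns(struct fix_ctrl *c, unsigned nsid, int err)
{
	if (err)
		c->host_present[nsid] = false;  /* models nvme_ns_remove */
}

/* Suggested flow: remember the identify error instead of returning early. */
static void fix_scan_ns(struct fix_ctrl *c, unsigned nsid)
{
	int err = fix_identify_ns(c, nsid);

	if (c->host_present[nsid]) {
		/* A known namespace is always validated, even on error. */
		fix_validate_ns(c, nsid, err);
		return;
	}
	if (!err)
		c->host_present[nsid] = true;   /* models nvme_alloc_ns */
}
```

In this model a detached namespace is removed, an unchanged one is kept, and a newly attached one is allocated only if identify succeeded. Whether the real code can treat every identify error as a reason to remove the namespace is exactly the part I am unsure about.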

Any advice or suggestions on what a good fix for this issue would look like?

Best regards,
Nils