What should we do about the nvme atomics mess?
Niklas Cassel
cassel at kernel.org
Tue Jul 8 02:38:09 PDT 2025
On Mon, Jul 07, 2025 at 04:18:34PM +0200, Christoph Hellwig wrote:
> Hi all,
>
> I'm a bit lost on what to do about the sad state of NVMe atomic writes.
>
> As a short reminder the main issues are:
>
> 1) there is no flag on a command to request atomic (aka non-torn)
> behavior, instead writes adhering to the atomicity requirements will
> never be torn, and writes not adhering to them can be torn any time.
> This differs from SCSI where atomic writes have to be explicitly
> requested and fail when they can't be satisfied.
> 2) the original way to indicate the main atomicity limit is the AWUPF
> field, which is in Identify Controller, but specified in logical
> blocks which only exist at a namespace layer. This a) led to
> various problems because the limit is a mess when namespaces have
> different logical block sizes, and it b) also causes additional
> issues because NVMe allows it to be different for different
> controllers in the same subsystem.
>
> Commit 8695f060a029 added some sanity checks to deal with issue 2b,
> but we kept running into more issues with it. Partially because
> the check wasn't quite correct, but also because we've gotten
> reports of controllers that change the AWUPF value when reformatting
> namespaces to deal with issue 2a.
>
> And I'm a bit lost on what to do here.
>
> We could:
>
> I. revert the check and the subsequent fixup. If you really want
> to use the nvme atomics you already better pray a lot anyway
> due to issue 1)
> II. limit the check to multi-controller subsystems
> III. don't allow atomics on controllers that only report AWUPF and
> limit support to controllers that support the more sanely
> defined NAWUPF
I like III.
But NVMe should probably push to deprecate AWUPF, and introduce a new field
that is like AWUPF but which is specified in a fixed unit, e.g. bytes or
CAP.MPSMIN. (I'm thinking of e.g. Zone Append Size Limit (ZASL) which is also
a per controller limit, but the value is specified in units of CAP.MPSMIN,
just like the Maximum Data Transfer Size (MDTS).)
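
To illustrate the unit problem, here is a minimal sketch in C. It assumes
AWUPF is a 0's based count of logical blocks, and that a replacement field
would be encoded like MDTS and ZASL, i.e. as a power of two in units of the
minimum memory page size, 2^(12 + CAP.MPSMIN) bytes. The function names are
hypothetical, and no such replacement field exists in the specification today.

```c
#include <stdint.h>

/*
 * AWUPF style: the limit is a 0's based number of logical blocks, so the
 * byte value depends on the LBA size of whichever namespace it is applied to.
 */
static uint64_t awupf_to_bytes(uint16_t awupf, uint32_t lba_size)
{
	return ((uint64_t)awupf + 1) * lba_size;
}

/*
 * MDTS/ZASL style (assumed encoding for a hypothetical new field): the limit
 * is 2^field units of the minimum page size, which is 2^(12 + MPSMIN) bytes.
 * The result is fixed per controller, independent of any namespace format.
 */
static uint64_t mpsmin_units_to_bytes(uint8_t field, uint8_t mpsmin)
{
	return (1ULL << field) << (12 + mpsmin);
}
```

With AWUPF = 7, awupf_to_bytes() gives 4096 bytes for a namespace with 512
byte logical blocks but 32768 bytes for one with 4096 byte logical blocks,
even though the controller reports a single value. By contrast,
mpsmin_units_to_bytes(3, 0) is 32768 bytes regardless of how any namespace
is formatted.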
Kind regards,
Niklas