[PATCH v2] Do not require atomic writes to be power of 2 sized and aligned on length boundary

John Garry john.g.garry at oracle.com
Tue Dec 23 01:26:20 PST 2025


On 22/12/2025 13:28, Vitaliy Filippov wrote:
> Hi linux-fsdevel,
> I recently discovered that Linux incorrectly requires all atomic
> writes to have 2^N length and to be aligned on the length boundary.
> This requirement contradicts the NVMe specification, which doesn't
> require such alignment and length, and thus heavily restricts the use of
> atomic writes with NVMe disks that support them (Micron and Kioxia).

All these alignment and size rules are specific to using RWF_ATOMIC. You
don't have to use RWF_ATOMIC if you don't want to - as you probably know,
atomic writes are implicit on NVMe.
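
For reference, a minimal sketch of what opting in to RWF_ATOMIC looks like
from userspace. The helper name write_atomic() is hypothetical, and the
fallback #define assumes the flag value from include/uapi/linux/fs.h for
toolchains whose headers predate it. The fd is assumed to be opened with
O_DIRECT on a device or file that advertises atomic write support.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

/* Fallback for older userspace headers (assumed uapi value). */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Hypothetical helper: issue a single write that must not be torn.
 * Returns the number of bytes written, or -1 with errno set.
 */
static ssize_t write_atomic(int fd, const void *buf, size_t len, off_t pos)
{
	struct iovec iov = {
		.iov_base = (void *)buf,
		.iov_len = len,
	};

	/* With RWF_ATOMIC the kernel applies its size and alignment rules. */
	return pwritev2(fd, &iov, 1, pos, RWF_ATOMIC);
}
```

Dropping the flag (passing 0 instead) skips the kernel-side rules, and any
atomicity then depends solely on what the device itself guarantees.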

> NVMe specification has its own atomic write restrictions - AWUPF and
> NABSPF/NABO, but both are already checked by the nvme subsystem.
> The 2^N restriction comes from generic_atomic_write_valid().
> I submitted a patch which removes this restriction to linux-block and
> linux-nvme. Sorry if these mailing lists weren't the right place to
> send it to, it's my first patch :).
> But the function is currently used in 3 places: block/fops.c,
> fs/ext4/file.c and fs/xfs/xfs_file.c.
> Can you tell me if ext4 and xfs really want atomic writes to be 2^N
> sized and length-aligned?

As above, these are just the kernel's atomic write rules, which exist to
support different storage technologies.
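
As a rough illustration of the two rules under discussion (power-of-2
length, offset aligned to that length), here is a userspace sketch. It is
an approximation, not the kernel's generic_atomic_write_valid() itself.
The function names are made up for the example, and the per-device
minimum and maximum unit limits are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* A value is a power of two when exactly one bit is set. */
static bool is_pow2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Approximation of the generic RWF_ATOMIC rules: the length must be a
 * power of two, and the offset must be a multiple of that length.
 */
static bool atomic_write_rules_ok(uint64_t pos, uint64_t len)
{
	if (!is_pow2(len))
		return false;

	/* len is a power of two here, so a mask tests the alignment. */
	return (pos & (len - 1)) == 0;
}
```

Under these rules a 12 KiB write at offset 0 is rejected, and so is an
8 KiB write at offset 4 KiB, regardless of what the device could do.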

> From looking at the code I'd say they don't really require it?
> Can you approve my patch if I'm right? Please :-)
> 
> On Mon, Dec 22, 2025 at 12:54 PM Vitaliy Filippov <vitalifster at gmail.com> wrote:
>>
>> Hi! Thanks a lot for your reply! This is actually my first patch ever
>> so please don't blame me for not following some standards, I'll try to
>> resubmit it correctly.
>>
>> Regarding the rest:
>>
>> 1) NVMe atomic boundaries seem to already be checked in
>> nvme_valid_atomic_write().
>>
>> 2) What's atomic_write_hw_unit_max? As I understand, Linux also
>> already checks it, at least
>> /sys/block/nvme**/queue/atomic_write_max_bytes is already limited by
>> max_hw_sectors_kb.
>>
>> 3) Yes, I've of course seen that this function is also used by ext4
>> and xfs, but I don't understand the motivation behind the 2^n
>> requirement. I suppose file systems may fragment the write according
>> to currently allocated extents for example, but I don't see how issues
>> coming from this can be fixed by requiring writes to be 2^n.
>>
>> But I understand that just removing the check may break something if
>> somebody relies on them. What do you think about removing the
>> requirement only for NVMe or only for block devices then? I see 3 ways
>> to do it:
>> a) split generic_atomic_write_valid() into two functions - first for
>> all types of inodes and second only for file systems.
>> b) remove generic_atomic_write_valid() from block device checks at all.
>> c) change generic_atomic_write_valid() just like in my original patch
>> but copy original checks into other places where it's used (ext4 and
>> xfs).
>>
>> Which way do you think would be the best?
>>
>> On Mon, Dec 22, 2025 at 2:17 AM Keith Busch <kbusch at kernel.org> wrote:
>>>
>>> On Sun, Dec 21, 2025 at 04:24:02PM +0300, Vitaliy Filippov wrote:
>>>> It contradicts NVMe specification where alignment is only required when atomic
>>>> write boundary (NABSPF/NABO) is set and highly limits usage of NVMe atomic writes
>>>
>>> Commit header is missing the "fs:" prefix, and the commit log should
>>> wrap at 72 characters.
>>>
>>> On the technical side, this is a generic function used by multiple
>>> protocols, so you can't just appeal to NVMe to justify removing the
>>> checks.
>>>
>>> NVMe still has atomic boundaries where straddling it fails to be an
>>> atomic operation. Instead of removing the checks, you'd have to replace
>>> it with a more costly operation if you really want to support more
>>> arbitrary write lengths and offsets. And if you do manage to remove the
>>> power of two requirement, then the queue limit for nvme's
>>> atomic_write_hw_unit_max isn't correct anymore.
> 



