[PATCH] fs: remove power of 2 and length boundary atomic write restrictions

Vitaliy Filippov vitalifster at gmail.com
Wed Jan 7 05:05:42 PST 2026


> What is the actual usecase you are trying to solve? You mentioned "avoid
> journaling", which does not explain what you want to achieve.
>
> You could arrange your data so that it suits the rules.

I can't. My use case is a distributed, Ceph-like SDS built on atomic
writes. Writes to a virtual block device have arbitrary lengths and
offsets, just as on a regular block device; nothing aligns them to
2^N. On disks without hardware atomic write support, atomicity is
implemented through journaling (double-write).

Then I found the new atomic write feature and SSDs that support it,
and implemented a new storage layer which takes advantage of it. With
atomic writes, my new storage layer has a write amplification (WA) of
about 1.0, i.e. almost zero overhead. It's a huge improvement for me:
the old storage layer has a WA of 3 to 4.

And everything was fine until I finally deployed it with RWF_ATOMIC
enabled (production setups should use safety features) and stumbled
upon the 2^N restriction... It was a big surprise; I never thought
that such a limitation could exist. It's absolutely irrational: the
device doesn't have that limitation, and I'm just using the raw device.

It's normal and expected in the context of simple file systems like
ext4 and xfs. But for the raw device... I only discovered it after
several days of investigation with bpftrace and reading the kernel
code. It's really unexpected. I think anyone would expect a raw NVMe
disk to have the same requirements as those described in the NVMe
spec.

> The atomic write API is based on:
> a. doing statx to find atomic write min and max limits.
> b. issuing a write with RWF_ATOMIC means that the write should be
> naturally aligned and fit within the size limits.
>
> That is the same for both raw block devices and regular FS files. And
> any atomic write boundary is not part of the API.
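
To make the rule above concrete, here is a minimal sketch of the check
as I understand it from reading the kernel code. The function names
are made up for illustration, and unit_min / unit_max stand for the
limits that statx or sysfs reports. It is not the kernel's code.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...; False for 0 and negative numbers."""
    return n > 0 and (n & (n - 1)) == 0


def passes_current_rule(offset: int, length: int,
                        unit_min: int, unit_max: int) -> bool:
    """Sketch of the current RWF_ATOMIC rule (my reading, an assumption):
    the length is a power of two within [unit_min, unit_max] and the
    offset is a multiple of the length (natural alignment)."""
    if not is_power_of_two(length):
        return False
    if length < unit_min or length > unit_max:
        return False
    return offset % length == 0
```

With unit_min = 4096 and unit_max = 65536, a 12 KiB write at offset
4 KiB is rejected by this check even though it is smaller than the
maximum.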

For raw block devices, you also have sysfs. You can look there and
determine the actual restrictions. In fact, I didn't even know about
the statx API when I was implementing atomic writes, and I don't use it.
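
For reference, reading the limits from sysfs can look roughly like
this. It is only a sketch: the helper name and the sysfs_root
parameter are mine, and it assumes a kernel that exposes the
atomic_write_* attributes under /sys/block/<dev>/queue/. Attributes
that are missing come back as None.

```python
from pathlib import Path

# Queue attributes describing the atomic write limits of a block device.
ATTRS = (
    "atomic_write_unit_min_bytes",
    "atomic_write_unit_max_bytes",
    "atomic_write_max_bytes",
    "atomic_write_boundary_bytes",
)


def read_atomic_limits(dev: str, sysfs_root: str = "/sys/block") -> dict:
    """Return {attribute: int or None} for a device name such as 'nvme0n1'.

    sysfs_root is a parameter only so the helper can be pointed at a
    fake directory tree for testing (hypothetical convenience).
    """
    queue = Path(sysfs_root) / dev / "queue"
    limits = {}
    for name in ATTRS:
        path = queue / name
        # Older kernels do not have these files; report None in that case.
        limits[name] = int(path.read_text().strip()) if path.exists() else None
    return limits
```

For example, read_atomic_limits("nvme0n1") returns a dict keyed by the
four attribute names.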

And speaking of that API, why does it have to be like this? Currently
it looks like an API designed around the existing internal
restrictions of the implementation, or more precisely of two
implementations: ext4 and xfs, both of which are classic non-CoW file
systems. I suspect that if it had been designed primarily around zfs
and btrfs, the restriction wouldn't exist.

OK, it's already designed like this, but anyway, if a user is fine
with statx and with the 2^N restriction, then removing the restriction
for block devices doesn't break anything for them. They'll send their
2^N-aligned writes just like before. That's fine for databases like
MySQL & PostgreSQL because they always overwrite a whole fixed-size
page. But even speaking of databases, it's not guaranteed that **all**
databases will always have the same layout and that arbitrary atomic
write offsets will never be useful for them.
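
To spell out the "doesn't break anything" argument, here is a rough
sketch. The relaxed rule below is only my assumption for illustration
(block-aligned offset and length, length within the maximum, no
crossing of the atomic boundary); it is not the exact logic of the
patch, and all names are hypothetical. The last function checks that
every write accepted by the strict rule is also accepted by the
relaxed one.

```python
def strict_ok(offset, length, unit_min, unit_max):
    """Current rule as I understand it: 2^N length, naturally aligned."""
    pow2 = length > 0 and (length & (length - 1)) == 0
    return pow2 and unit_min <= length <= unit_max and offset % length == 0


def relaxed_ok(offset, length, unit_min, max_bytes, boundary=0):
    """Hypothetical relaxed rule (an assumption, not the patch itself)."""
    if length <= 0 or length > max_bytes:
        return False
    # Offset and length only need to be multiples of the minimum unit.
    if offset % unit_min or length % unit_min:
        return False
    # A nonzero boundary must not be crossed by the write.
    if boundary:
        return offset // boundary == (offset + length - 1) // boundary
    return True


def strict_implies_relaxed(unit_min=4096, unit_max=65536, boundary=65536):
    """Scan a grid of writes; True if strict acceptance implies relaxed."""
    for offset in range(0, 4 * unit_max, unit_min):
        for length in range(unit_min, unit_max + 1, unit_min):
            if (strict_ok(offset, length, unit_min, unit_max)
                    and not relaxed_ok(offset, length, unit_min,
                                       unit_max, boundary)):
                return False
    return True
```

Under these assumptions, a 12 KiB write at offset 4 KiB passes
relaxed_ok but not strict_ok, while existing 2^N-aligned writes pass
both.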

So again, can we please remove the restriction for raw block devices?
I can re-submit the patch :-)


