[PATCH] fs: remove power of 2 and length boundary atomic write restrictions
John Garry
john.g.garry at oracle.com
Wed Jan 7 07:42:51 PST 2026
On 07/01/2026 13:05, Vitaliy Filippov wrote:
>> What is the actual usecase you are trying to solve? You mentioned "avoid
>> journaling", which does not explain what you want to achieve.
>>
>> You could arrange your data so that it suits the rules.
>
> I can't. My usecase is a distributed ceph-like SDS based on atomic
> writes. Writes on a virtual block device have arbitrary length &
> offset of course,
Note that the alignment rule is not just for atomic HW boundaries. We
also support atomic writes on stacked devices, where this is relevant -
specifically striped devices, like raid0. Doing an unaligned atomic
write on a striped device may result in trying to issue an atomic write
which straddles two separate devices, which would obviously be broken.
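
To make the striping concern concrete, here is a minimal sketch in plain C. It is not the block layer's actual code; the helper name straddles_chunk() and the chunk_size parameter (the per-device stripe chunk, in bytes) are assumptions made for illustration. It only tests whether a byte range crosses a chunk boundary, which is the case where a single write would have to be split across two member devices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only.
 * Returns true if the byte range [offset, offset + len) crosses a
 * stripe chunk boundary, i.e. its first and last bytes fall into
 * different chunks and therefore (on raid0) onto different devices.
 */
static bool straddles_chunk(uint64_t offset, uint64_t len, uint64_t chunk_size)
{
    assert(chunk_size > 0 && len > 0);

    uint64_t first_chunk = offset / chunk_size;
    uint64_t last_chunk = (offset + len - 1) / chunk_size;

    return first_chunk != last_chunk;
}
```

With a 64 KiB chunk, an 8 KiB write at offset 60 KiB crosses into the next chunk, while an 8 KiB write at offset 8 KiB (naturally aligned to its own length) stays inside one chunk. When the chunk size is itself a power of 2, a naturally aligned power-of-2 write no larger than the chunk can never straddle.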
> nothing like 2^N, like on a regular block device.
> Atomicity is implemented through journaling (double-write) on disks
> without hardware atomic write support.
>
> Then I found the new atomic write feature and SSDs with support for it
> and implemented a new storage layer which can take advantage of it. My
> new storage layer has a write amplification of about 1.0 with atomic
> writes (i.e. almost zero overhead). It's a huge improvement for me -
> the old storage layer has WA from 3 to 4.
>
> And everything was fine until I finally deployed it with RWF_ATOMIC
> enabled (production setups should use safety features) and stumbled
> upon the 2^N restriction... It was a big surprise, I never thought
> that such a limitation could exist. It's absolutely irrational - the
> device doesn't have that limitation and I'm just using the raw device.
This is all described in the man pages.
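
For reference, the rules as the man pages state them for RWF_ATOMIC can be sketched as a small userspace check: the total length must be a power of 2, it must lie within the advertised atomic write unit range, and the offset must be naturally aligned to the length. The function name atomic_write_ok() is hypothetical; unit_min and unit_max stand in for the stx_atomic_write_unit_min and stx_atomic_write_unit_max values reported by statx(). This is an approximation of the documented rules, not the kernel's validation code.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of 2. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical pre-check for an RWF_ATOMIC write, for illustration only.
 * unit_min / unit_max are assumed to come from statx()
 * (stx_atomic_write_unit_min / stx_atomic_write_unit_max).
 */
static bool atomic_write_ok(uint64_t offset, uint64_t len,
                            uint32_t unit_min, uint32_t unit_max)
{
    if (!is_pow2(len))
        return false;            /* length must be a power of 2 */
    if (len < unit_min || len > unit_max)
        return false;            /* length must be within the unit range */
    return (offset & (len - 1)) == 0; /* offset naturally aligned to len */
}
```

With unit_min = 4096 and unit_max = 65536, a 12 KiB write is rejected because of its length, and an 8 KiB write at offset 4 KiB is rejected because of its alignment.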
>
> It's normal and expected in the context of simple file systems like
> ext4 and xfs. But for the raw device... I only discovered it after
> several days of investigation with bpftrace and after reading the
> kernel code. It's really unexpected. I think anyone expects the raw
> NVMe disk to have the same requirements as it's described in the NVMe
> spec.
It seems that you just want to take advantage of the block layer code to
handle submission of an atomic write bio, i.e. reject anything which
cannot be atomically written. In essence, that would be to just set
REQ_ATOMIC. Maybe that could be done as a passthrough command, I'm not sure.