[PATCH] fs: remove power of 2 and length boundary atomic write restrictions
Vitaliy Filippov
vitalifster at gmail.com
Mon Jan 5 11:29:07 PST 2026
> As I said before, just don't use RWF_ATOMIC if you don't want to deal with these restrictions.
But how do I get torn-write protection then? What if the kernel
decides to fragment my 'once atomic' write?
I'll add some details:
The real NVMe disks with atomic write support which I know are:
1) Micron 7450 / 7500 and probably later
2) Kioxia CD6-R / CD7-R / CD8-R and similar
Both report an AWUPF equivalent to 256 KB and NABO=0. That means any
write of up to 256 KB is atomic regardless of its offset.
In practice atomic_write_max_bytes ends up being 128 KB when the IOMMU
is turned on, because max_hw_sectors_kb is capped by
iommu_dma_opt_mapping_size(), which is hard-coded to return
128 KB = PAGE_SIZE << (IOVA_RANGE_CACHE_MAX_SIZE - 1) = 4096 << 5. But
that's not the main point.
My use case: I use raw NVMe devices in my project and I want to use
atomic writes to avoid journaling. For me that means atomic writes at
arbitrary 4 KB aligned offsets, and I want them to be **safe**. That's
why I want to use RWF_ATOMIC - it lets the kernel guarantee that the
write is not fragmented. With the current restrictions, as a user, I
can't do that - I get EINVAL for some of my writes when I enable
RWF_ATOMIC. So I'm asking: what's the reason behind these
restrictions? Could they be removed?
On Fri, Jan 2, 2026 at 8:41 PM John Garry <john.g.garry at oracle.com> wrote:
>
> On 30/12/2025 09:01, Vitaliy Filippov wrote:
> > I think that even with the 2^N requirement the user still has to look
> > for boundaries.
> > 1) NVMe disks may have NABO != 0 (atomic boundary offset). In this
> > case 2^N aligned writes won't work at all.
>
> We don't support NABO != 0
>
> > 2) NABSPF is expressed in blocks in the NVMe spec and it's not
> > restricted to 2^N, it can be for example 3 (3*4096 = 12 KB). The spec
> > allows it. 2^N breaks this case too.
>
> We could support NABSPF which is not a power-of-2, but we don't today.
>
> If you can find some real HW which has NABSPF which is not a power-of-2,
> then it can be considered.
>
> > And the user also has to look for the maximum atomic write size
> > anyway, he can't just assume all writes are atomic out of the box,
> > regardless of the 2^N requirement.
> > So my idea is that the kernel's task is just to guarantee correctness
> > of atomic writes. It anyway can't provide the user with atomic writes
> > in all cases.
>
> What good is that to a user?
>
> Consider the user wants to atomic write a range of a file which is
> backed by disk blocks which straddle a boundary - in this case, the
> write would fail. What is the user supposed to do then? That API could
> have arbitrary failures, which effectively makes it a useless API.
>
> As I said before, just don't use RWF_ATOMIC if you don't want to deal
> with these restrictions.