[PATCH v2 00/16] block atomic writes

John Garry john.g.garry at oracle.com
Tue Dec 12 03:08:28 PST 2023


This series proposes an implementation of atomic writes in the
kernel for torn-write protection.

This series takes the approach of adding a new "atomic" flag to each of
pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively.
When set, these indicate that we want the write issued "atomically".
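
For illustration, a userspace caller might issue such a write roughly as
below. This is only a sketch under assumptions: atomic_pwrite() is a
hypothetical helper, the RWF_ATOMIC value is a placeholder for headers
which lack the flag (the real value comes from include/uapi/linux/fs.h
with this series applied), and fd is assumed to be a block device opened
with O_DIRECT.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* pwritev2() is a GNU extension in glibc */
#endif
#include <sys/types.h>
#include <sys/uio.h>

/*
 * ASSUMPTION: placeholder value, only used when the installed uapi
 * headers do not define RWF_ATOMIC. Take the real value from the
 * kernel headers that carry this series.
 */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/*
 * Hypothetical helper: write len bytes from buf at file offset pos,
 * asking the kernel to issue them as a single atomic write.
 * Returns the number of bytes written, or -1 with errno set.
 */
ssize_t atomic_pwrite(int fd, const void *buf, size_t len, off_t pos)
{
	struct iovec iov = {
		.iov_base = (void *)buf,	/* iovec is not const-qualified */
		.iov_len  = len,
	};

	return pwritev2(fd, &iov, 1, pos, RWF_ATOMIC);
}
```

The caller is still responsible for picking a length and offset which
the device can handle atomically; the helper only sets the flag.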

Only direct IO to block devices is supported here. This requires atomic
write HW support, such as SCSI ATOMIC WRITE (16). For now, the XFS support
from earlier versions is sidelined until an interface is agreed which
does not fully rely on HW offload support.

man pages update has been posted at:
https://lore.kernel.org/linux-api/20230929093717.2972367-1-john.g.garry@oracle.com/T/#t
(not updated since I posted v1 kernel series)

The goal here is to provide an interface that allows applications to use
application-specific block sizes larger than the logical block size
reported by the storage device or larger than the filesystem block size
as reported by stat().

With this new interface, application blocks will never be torn or
fractured when written. On power failure, for each individual application
block, either all or none of the data is written. A racing atomic write
and read will mean that the read sees all the old data or all the new
data, but never a mix of old and new.

Two new fields are added to struct statx - atomic_write_unit_min and
atomic_write_unit_max. For each individual atomic write, the total length
of the write must be between atomic_write_unit_min and
atomic_write_unit_max, inclusive, and a power-of-2. The write must also be
at a naturally-aligned offset in the file wrt the write length, i.e. the
offset must be a multiple of the length.
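
The rules above can be sketched as a userspace pre-check. This is not
the kernel code from the series; atomic_write_valid() and its parameter
names are hypothetical, and unit_min/unit_max are assumed to have been
read from statx by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if n is a non-zero power of two. */
bool len_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical pre-check mirroring the rules described above:
 *  - unit_min <= len <= unit_max
 *  - len is a power-of-2
 *  - pos is naturally aligned, i.e. a multiple of len
 */
bool atomic_write_valid(uint64_t pos, uint64_t len,
			uint32_t unit_min, uint32_t unit_max)
{
	if (len < unit_min || len > unit_max)
		return false;
	if (!len_is_power_of_2(len))
		return false;
	/* len is a power-of-2 here, so a mask test is enough */
	return (pos & (len - 1)) == 0;
}
```

For example, with unit_min = 4096 and unit_max = 65536, a 16K write at
offset 16K passes, while a 16K write at offset 4K fails the alignment
rule and a 12K write fails the power-of-2 rule.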

Kernel support is added for SCSI (sd.c and scsi_debug) and NVMe.

Some open questions:
- How to make API extensible for when we have no HW support? In that case,
  we would prob not have to follow rule of power-of-2 length et al.
  As a possible solution, maybe we can say that atomic writes are
  supported for the file via statx, but not set unit_min and max values,
  and this means that writes need to be just FS block aligned there.
- For block layer, should atomic_write_unit_max be limited by
  max_sectors_kb? Currently it is not.
- How to improve the requirement that iovecs are PAGE-aligned.
  There are 2x issues:
  a. We impose this rule to not split BIOs due to virt boundary for
     NVMe, but there the virt boundary is 4K (and not PAGE size, so
     broken for 16K/64K pages). Easy solution is to impose requirement
     that iovecs are 4K-aligned.
  b. We don't enforce this rule for virt boundary == 0, i.e. SCSI.
- Since debugging torn-writes due to unwanted kernel BIO splitting/merging
  would be horrible, should we add some kernel storage stack software
  integrity checks?
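
To make the iovec alignment question above concrete, a check for the
"4K-aligned" variant could look like the sketch below. iovecs_aligned()
is a hypothetical userspace helper, not code from this series, and it
assumes that both the base address and the length of every segment need
to be aligned; align must be a power-of-2.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/*
 * Hypothetical helper: return true if every iovec segment starts on an
 * align-byte boundary and has a length which is a multiple of align.
 * ASSUMPTION: align is a power-of-2 (e.g. 4096).
 */
bool iovecs_aligned(const struct iovec *iov, int iovcnt, size_t align)
{
	for (int i = 0; i < iovcnt; i++) {
		if ((uintptr_t)iov[i].iov_base & (align - 1))
			return false;
		if (iov[i].iov_len & (align - 1))
			return false;
	}
	return true;
}
```

Passing 4096 rather than the page size as align corresponds to the
"easy solution" in (a), which would behave the same on 4K, 16K and 64K
page kernels.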

This series is based on v6.7-rc5.

Changes since v1:
- Drop XFS support for now
- Tidy NVMe changes and also add checks for atomic write violating max
  AW PF length and boundary (if any)
- Reject - instead of ignoring - RWF_ATOMIC for files which do not
  support atomic writes
- Update block sysfs documentation
- Various tidy-ups

Alan Adamson (2):
  nvme: Support atomic writes
  nvme: Ensure atomic writes will be executed atomically

Himanshu Madhani (2):
  block: Add atomic write operations to request_queue limits
  block: Add REQ_ATOMIC flag

John Garry (10):
  block: Limit atomic writes according to bio and queue limits
  fs: Increase fmode_t size
  block: Pass blk_queue_get_max_sectors() a request pointer
  block: Limit atomic write IO size according to
    atomic_write_max_sectors
  block: Error an attempt to split an atomic write bio
  block: Add checks to merging of atomic writes
  block: Add fops atomic write support
  scsi: sd: Support reading atomic write properties from block limits
    VPD
  scsi: sd: Add WRITE_ATOMIC_16 support
  scsi: scsi_debug: Atomic write support

Prasad Singamsetty (2):
  fs/bdev: Add atomic write support info to statx
  fs: Add RWF_ATOMIC and IOCB_ATOMIC flags for atomic write support

 Documentation/ABI/stable/sysfs-block |  47 +++
 block/bdev.c                         |  31 +-
 block/blk-merge.c                    |  95 ++++-
 block/blk-mq.c                       |   2 +-
 block/blk-settings.c                 |  84 ++++
 block/blk-sysfs.c                    |  33 ++
 block/blk.h                          |   9 +-
 block/fops.c                         |  40 +-
 drivers/dma-buf/dma-buf.c            |   2 +-
 drivers/nvme/host/core.c             | 108 ++++-
 drivers/nvme/host/nvme.h             |   2 +
 drivers/scsi/scsi_debug.c            | 590 +++++++++++++++++++++------
 drivers/scsi/scsi_trace.c            |  22 +
 drivers/scsi/sd.c                    |  93 ++++-
 drivers/scsi/sd.h                    |   8 +
 fs/stat.c                            |  44 +-
 include/linux/blk_types.h            |   2 +
 include/linux/blkdev.h               |  41 +-
 include/linux/fs.h                   |  11 +
 include/linux/stat.h                 |   2 +
 include/linux/types.h                |   2 +-
 include/scsi/scsi_proto.h            |   1 +
 include/trace/events/scsi.h          |   1 +
 include/uapi/linux/fs.h              |   5 +-
 include/uapi/linux/stat.h            |   7 +-
 25 files changed, 1098 insertions(+), 184 deletions(-)

-- 
2.35.3



