[PATCH v4 00/28] Zone write plugging
Hans Holmberg
Hans.Holmberg at wdc.com
Wed Apr 3 01:15:13 PDT 2024
On 2024-04-02 14:39, Damien Le Moal wrote:
> The patch series introduces zone write plugging (ZWP) as the new
> mechanism to control the ordering of writes to zoned block devices.
> ZWP replaces zone write locking (ZWL) which is implemented only by
> mq-deadline today. ZWP also allows emulating zone append operations
> using regular writes for zoned devices that do not natively support this
> operation (e.g. SMR HDDs). This patch series removes the scsi disk
> driver and device mapper zone append emulation to use ZWP emulation.
>
> Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
> write plug is simply a BIO list that is atomically manipulated using a
> spinlock and a kblockd submission work. A write BIO to a zone is
> "plugged" to delay its execution if a write BIO for the same zone was
> already issued, that is, if a write request for the same zone is being
> executed. The next plugged BIO is unplugged and issued once the write
> request completes.
>
> This mechanism allows to:
> - Untangle zone write ordering from the block IO schedulers. This
> allows removing the restriction on using only mq-deadline for zoned
> block devices. Any block IO scheduler, including "none" can be used.
> - Zone write plugging operates on BIOs instead of requests. Plugged
> BIOs waiting for execution thus do not hold scheduling tags and thus
> do not prevent other BIOs from being submitted to the device (reads
> or writes to other zones). Depending on the workload, this can
> significantly improve the device use and the performance.
> - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
> device mapper) can use ZWP. It is mandatory for the
> former but optional for the latter: BIO-based driver can use zone
> write plugging to implement write ordering guarantees, or the drivers
> can implement their own if needed.
> - The code is less invasive in the block layer and in device drivers.
> ZWP implementation is mostly limited to blk-zoned.c, with some small
> changes in blk-mq.c, blk-merge.c and bio.c.
>
> Performance evaluation results are shown below.
>
> The series is based on block/for-next and organized as follows:
>
> - Patch 1 to 6 are preparatory changes for patch 7.
> - Patch 7 and 8 introduce ZWP
> - Patch 9 and 10 add zone append emulation to ZWP.
> - Patch 11 to 18 modify zoned block device drivers to use ZWP and
> prepare for the removal of ZWL.
> - Patch 19 to 28 remove zone write locking
>
> Overall, these changes do not significantly increase the amount of code
> (the higher number of addition shown by diff-stat is in fact due to a
> larger amount of comments in the code).
>
> Many thanks must go to Christoph Hellwig for comments and suggestions
> he provided on earlier versions of these patches.
>
> Performance evaluation results
> ==============================
>
> Environments:
> - Intel Xeon 16-cores/32-threads, 128GB of RAM
> - Kernel:
> - ZWL (baseline): block/for-next (based on 6.9.0-rc2)
> - ZWP: block/for-next patched kernel to add zone write plugging
> (both kernels were compiled with the same configuration turning
> off most heavy debug features)
>
> Workoads:
> - seqw4K1: 4KB sequential write, qd=1
> - seqw4K16: 4KB sequential write, qd=16
> - seqw1M16: 1MB sequential write, qd=16
> - rndw4K16: 4KB random write, qd=16
> - rndw128K16: 128KB random write, qd=16
> - btrfs workoad: Single fio job writing 128 MB files using 128 KB
> direct IOs at qd=16.
>
> Devices:
> - nullblk (zoned): 4096 zones of 256 MB, 128 max open zones.
> - NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open and
> active zones.
> - SMR HDD: 20 TB disk with 256MB zone size, 128 max open zones.
>
> For ZWP, the result show the performance percentage increase (or
> decrease) against ZWL (baseline) case.
>
> 1) null_blk zoned device:
>
> +--------+--------+-------+--------+---------+---------+
> |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
> |(MB/s) | (MB/s) |(MB/s) | (MB/s) | (KIOPS)| (KIOPS) |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWL | 940 | 840 | 18550 | 14400 | 424 | 167 |
> |mq-deadline| | | | | | |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP | 943 | 845 | 18660 | 14770 | 461 | 165 |
> |mq-deadline| (+0%) | (+0%) | (+0%) | (+1%) | (+8%) | (-1%) |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP | 756 | 668 | 16020 | 12980 | 135 | 101 |
> | bfq | (-19%) | (-20%) | (-13%)| (-9%) | (-68%) | (-39%) |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP | 2639 | 1715 | 28190 | 19760 | 344 | 150 |
> | none | (+180%)| (+104%)| (+51%)| (+37%) | (-18%) | (-10%) |
> +-----------+--------+--------+-------+--------+--------+----------+
>
> ZWP with mq-deadline gives performance very similar to zone write
> locking, showing that zone write plugging overhead is acceptable.
> But ZWP ability to run fast block devices with the none scheduler
> shows brings all the benefits of zone write plugging and results in
> significant performance increase for all workloads. The exception to
> this are random write workloads with multiple jobs: for these, the
> faster request submission rate achieved by zone write plugging results
> in higher contention on null-blk zone spinlock, which degrades
> performance.
>
> 2) NVMe ZNS drive:
>
> +--------+--------+-------+--------+--------+----------+
> |seqw4K1 |seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
> |(MB/s) | (MB/s) |(MB/s) | (MB/s) | (KIOPS)| (KIOPS) |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWL | 183 | 702 | 1066 | 1103 | 53.5 | 14.5 |
> |mq-deadline| | | | | | |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP | 183 | 719 | 1086 | 1108 | 55.6 | 14.7 |
> |mq-deadline| (-0%) | (+1%) | (+0%) | (+0%) | (+3%) | (+1%) |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP | 178 | 691 | 1082 | 1106 | 30.8 | 11.5 |
> | bfq | (-3%) | (-2%) | (-0%) | (+0%) | (-42%) | (-20%) |
> +-----------+--------+--------+-------+--------+--------+----------+
> | ZWP | 190 | 666 | 1083 | 1108 | 51.4 | 14.7 |
> | none | (+4%) | (-5%) | (+0%) | (+0%) | (-4%) | (+0%) |
> +-----------+--------+--------+-------+--------+--------+----------+
>
> Zone write plugging overhead does not significantly impact performance.
> Similar to nullblk, using the none scheduler leads to performance
> increase for most workloads.
>
> 3) SMR SATA HDD:
>
> +-------+--------+-------+--------+--------+----------+
> |seqw4K1|seqw4K16|seqw1M1|seqw1M16|rndw4K16|rndw128K16|
> |(MB/s) | (MB/s) |(MB/s) | (MB/s) | (KIOPS)| (KIOPS) |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWL | 107 | 243 | 245 | 246 | 2.2 | 0.763 |
> |mq-deadline| | | | | | |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWP | 107 | 242 | 245 | 245 | 2.2 | 0.772 |
> |mq-deadline| (+0%) | (-0%) | (+0%) | (-0%) | (+0%) | (+0%) |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWP | 104 | 241 | 246 | 242 | 2.2 | 0.765 |
> | bfq | (-2%) | (-0%) | (+0%) | (-0%) | (+0%) | (+0%) |
> +-----------+-------+--------+-------+--------+--------+----------+
> | ZWP | 115 | 235 | 249 | 242 | 2.2 | 0.763 |
> | none | (+7%) | (-3%) | (+1%) | (-1%) | (+0%) | (+0%) |
> +-----------+-------+--------+-------+--------+--------+----------+
>
> Performance with purely sequential write workloads at high queue depth
> somewhat decrease a little when using zone write plugging. This is due
> to the different IO pattern that ZWP generates where the first writes to
> a zone start being issued when the end of the previous zone are still
> being written. Depending on how the disk handles queued commands, seek
> may be generated, slightly impacting the throughput achieved. Such pure
> sequential write workloads are however rare with SMR drives.
>
> 4) Zone append tests using btrfs:
>
> +-------------+-------------+-----------+-------------+
> | null-blk | null_blk | ZNS | SMR |
> | native ZA | emulated ZA | native ZA | emulated ZA |
> | (MB/s) | (MB/s) | (MB/s) | (MB/s) |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWL | 2441 | N/A | 1081 | 243 |
> |mq-deadline| | | | |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWP | 2361 | 2999 | 1085 | 239 |
> |mq-deadline| (-1%) | | (+0%) | (-2%) |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWP | 2299 | 2730 | 1080 | 240 |
> | bfq | (-4%) | | (+0%) | (-2%) |
> +-----------+-------------+-------------+-----------+-------------+
> | ZWP | 2443 | 3152 | 1083 | 240 |
> | none | (+0%) | | (+0%) | (-1%) |
> +-----------+-------------+-------------+-----------+-------------+
>
> With a more realistic use of the device though a file system, ZWP does
> not introduce significant performance differences, except for SMR for
> the same reason as with the fio sequential workloads at high queue
> depth.
I ran this patch set through some quick RocksDB testing
which has been very good at smoking out issues in this area before.
Setup
-----
Workloads: filluniquerandom followed by three iterations of overwrites,
readrandom, fwdrange, revrange and varieties of readwhilewriting,
Config: Direct IO, 400M key/values
Userspace: db_bench / RocksDB 7.7.3, using ZenFS with libzbd as backend
IO Scheduler: [none]
NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open and
active zones.
System: 16-cores (AMD Epyc 7302P) with 128 GB of DRAM
Results:
--------
iostat: 33.6T read, 6.5T written. No errors/failures.
Comparing performance with the baseline requiring [mq-deadline] to avoid
write reordering, we can see a ~8-10% throughput improvement for read
heavy workloads.
No regressions seen for these workloads, db_bench write throughput and
read and write tail latencies look unaffected.
Looks great, cheers!
Tested-by: Hans Holmberg <hans.holmberg at wdc.com>
>
> Changes from v3:
> - Rebase on block/for-next
> - Removed old patch 1 as it is already applied
> - Addressed Bart and Christoph comment in patch 4
> - Merged former patch 8 and 9 together and changed the zone write plug
> allocation to use a mempool
> - Removed the zone_wplugs_mempool_size filed from patch 8 and instead
> directly reference mempool->min_nr
> - Added review tags
>
> Changes from v2:
> - Added Patch 1 (Christoph's comment)
> - Fixed error code setup in Patch 3 (Bart's comment)
> - Split former patch 26 into patches 27 and 28
> - Modified patch 8 (zone write plugging) introduction to remove the
> kmem_cache use and address Bart's and Christoph comments.
> - Changed from using a mempool of zone write plugs to using a simple
> free-list (patch 9)
> - Simplified patch 10 as suggested by Christoph
> - Moved common code to a helper in patch 13 as suggested by Christoph
>
> Changes from v1:
> - Added patch 6
> - Rewrite of patch 7 to use a hash table of dynamically allocated zone
> write plugs. This results in changes in patch 11 and the addition of
> patch 8 and 9.
> - Rebased everything on 6.9.0-rc1
> - Added review tags for patches that did not change
>
> Damien Le Moal (28):
> block: Restore sector of flush requests
> block: Remove req_bio_endio()
> block: Introduce blk_zone_update_request_bio()
> block: Introduce bio_straddles_zones() and bio_offset_from_zone_start()
> block: Allow using bio_attempt_back_merge() internally
> block: Remember zone capacity when revalidating zones
> block: Introduce zone write plugging
> block: Fake max open zones limit when there is no limit
> block: Allow zero value of max_zone_append_sectors queue limit
> block: Implement zone append emulation
> block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
> dm: Use the block layer zone append emulation
> scsi: sd: Use the block layer zone append emulation
> ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
> null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
> null_blk: Introduce zone_append_max_sectors attribute
> null_blk: Introduce fua attribute
> nvmet: zns: Do not reference the gendisk conv_zones_bitmap
> block: Remove BLK_STS_ZONE_RESOURCE
> block: Simplify blk_revalidate_disk_zones() interface
> block: mq-deadline: Remove support for zone write locking
> block: Remove elevator required features
> block: Do not check zone type in blk_check_zone_append()
> block: Move zone related debugfs attribute to blk-zoned.c
> block: Replace zone_wlock debugfs entry with zone_wplugs entry
> block: Remove zone write locking
> block: Do not force select mq-deadline with CONFIG_BLK_DEV_ZONED
> block: Do not special-case plugging of zone write operations
>
> block/Kconfig | 5 -
> block/Makefile | 1 -
> block/bio.c | 7 +
> block/blk-core.c | 11 +-
> block/blk-flush.c | 1 +
> block/blk-merge.c | 22 +-
> block/blk-mq-debugfs-zoned.c | 22 -
> block/blk-mq-debugfs.c | 3 +-
> block/blk-mq-debugfs.h | 6 +-
> block/blk-mq.c | 131 ++-
> block/blk-mq.h | 31 -
> block/blk-settings.c | 46 +-
> block/blk-sysfs.c | 2 +-
> block/blk-zoned.c | 1353 +++++++++++++++++++++++++++--
> block/blk.h | 69 +-
> block/elevator.c | 46 +-
> block/elevator.h | 1 -
> block/genhd.c | 3 +-
> block/mq-deadline.c | 176 +---
> drivers/block/null_blk/main.c | 22 +-
> drivers/block/null_blk/null_blk.h | 2 +
> drivers/block/null_blk/zoned.c | 23 +-
> drivers/block/ublk_drv.c | 5 +-
> drivers/block/virtio_blk.c | 2 +-
> drivers/md/dm-core.h | 2 +-
> drivers/md/dm-zone.c | 476 +---------
> drivers/md/dm.c | 75 +-
> drivers/md/dm.h | 4 +-
> drivers/nvme/host/core.c | 2 +-
> drivers/nvme/target/zns.c | 10 +-
> drivers/scsi/scsi_lib.c | 1 -
> drivers/scsi/sd.c | 8 -
> drivers/scsi/sd.h | 19 -
> drivers/scsi/sd_zbc.c | 335 +------
> include/linux/blk-mq.h | 85 +-
> include/linux/blk_types.h | 30 +-
> include/linux/blkdev.h | 104 ++-
> 37 files changed, 1677 insertions(+), 1464 deletions(-)
> delete mode 100644 block/blk-mq-debugfs-zoned.c
>
More information about the Linux-nvme
mailing list