[PATCH] nvme: zns: limit max_zone_append by max_segments

Christoph Hellwig hch at infradead.org
Mon Jul 31 06:51:26 PDT 2023


On Mon, Jul 31, 2023 at 09:03:46PM +0900, Damien Le Moal wrote:
> I feel like a lot of the special casing for zone append bio add page can be
> removed from the block layer. This issue was found with zonefs tests on real
> ZNS devices because of the huge (and incorrect) zone append limit that zns
> reports, combined with the recent zonefs iomap write change, which overlooked
> the fact that bio add page is done by iomap before the bio op is set to zone
> append. That resulted in the large BIO. This problem does not happen with
> scsi or null_blk, however, kind of proving that the regular bio add page is
> fine for zone append as long as the issuer has the correct zone append limit.
> Thoughts?

A zone append limit larger than max_sectors is odd, and maybe the
block layer should assert that it never happens.  I think the root
cause is that many NVMe devices have a very large hardware equivalent
to max_sectors (the MDTS field), but Linux still uses a much lower
limit due to memory allocation issues (the PRPs used by NVMe are very
inefficient in terms of memory usage for larger transfers).  So we cap
max_sectors to the software limit, but not max_zone_append_sectors.
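To make the gap concrete, here is a minimal sketch (not the actual
patch) of clamping the zone append limit to the effective max_sectors.
queue_max_sectors(), queue_max_zone_append_sectors() and
blk_queue_max_zone_append_sectors() are existing block layer helpers;
the function itself, where it would be called from, and the
WARN_ON_ONCE policy are assumptions for illustration:

	/*
	 * Sketch only: never advertise a zone append size that a
	 * single request cannot carry.  max_sectors has already been
	 * reduced below MDTS by the core for memory allocation
	 * reasons, so clamp the append limit to it as well.
	 */
	static void clamp_zone_append_limit(struct request_queue *q)
	{
		unsigned int max_append = queue_max_zone_append_sectors(q);

		/* Assumed assertion policy, per the suggestion above. */
		WARN_ON_ONCE(max_append > queue_max_sectors(q));
		blk_queue_max_zone_append_sectors(q,
				min(max_append, queue_max_sectors(q)));
	}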

Zone Append needs some amount of special casing in the block layer
because the splitting of Zone Append bios must happen in the file
system, as the file system needs a completion context per hardware
operation.  I think the best way to do that is to first build up a
maximum-size bio and then use the same bio_split_rw function that the
block layer would use to split it to the hardware limits, just in the
issuer.  This is what I did in btrfs, and it seems like zonefs actually
needs to do the same, but I missed that during review of the recent
direct I/O changes.
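For reference, a rough sketch of what issuer-side splitting could look
like.  bio_split_rw() is the block layer's own splitting helper
(returning the front fragment, or NULL once the remainder fits); the
loop structure, the max_append_bytes parameter, and the elided
per-fragment completion setup are illustrative assumptions, not the
btrfs or zonefs code:

	/*
	 * Sketch: split a fully built zone append bio to the hardware
	 * limit before submission, in the issuer rather than in the
	 * block layer.
	 */
	static void submit_zone_append_split(struct bio *bio,
					     struct bio_set *bs,
					     unsigned int max_append_bytes)
	{
		const struct queue_limits *lim =
			&bdev_get_queue(bio->bi_bdev)->limits;
		unsigned int nr_segs;
		struct bio *split;

		while ((split = bio_split_rw(bio, lim, &nr_segs, bs,
					     max_append_bytes))) {
			/*
			 * Unlike reads/writes, do not bio_chain() the
			 * fragments: each zone append completes at its
			 * own written sector, so the issuer must give
			 * every fragment its own completion context
			 * (elided in this sketch).
			 */
			submit_bio(split);
		}
		submit_bio(bio);	/* remainder fits in one operation */
	}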


