[PATCH] nvme: zns: limit max_zone_append by max_segments

Mon Jul 31 07:01:21 PDT 2023

On 7/31/23 22:51, Christoph Hellwig wrote:
> On Mon, Jul 31, 2023 at 09:03:46PM +0900, Damien Le Moal wrote:
>> I feel like a lot of the special casing for zone append bio add page can be
>> removed from the block layer. This issue was found with zonefs tests on real zns
>> devices because of this huge (and incorrect) zone append limit that zns has,
>> combined with the recent zonefs iomap write change which overlooked the fact
>> that bio add page is done by iomap before the bio op is set to zone append. That
>> resulted in the large BIO. This problem however does not happen with scsi or
>> null blk, kind-of proving that the regular bio add page is fine for zone append
>> as long as the issuer has the correct zone append limit. Thoughts?
> 
> A zone append limit larger than max_sectors is odd, and maybe the
> block layer should assert something.  I think the root cause is that
> many NVMe devices have a very large hardware equivalent to max_sectors
> (the MDTS field), but Linux still uses a much lower limit due to memory
> allocation issues (the PRPs used by NVMe are very inefficient in terms
> of memory usage for larger transfers).  So we cap max_sectors to the
> software limit, but not max_zoned_append_sectors.
> 
> Zone Append needs some amount of special casing in the block layer
> because the splitting of Zone Append bios must happen in the file system
> as the file system needs a completion context per hardware operation.
> I think the best way to do that is to first build up a maximum bio
> and then use the same bio_split_rw function that the block layer would
> use to split it to the hardware limits, just in the issuer.  This
> is what I did in btrfs, and it seems like zonefs actually needs to do
> the same, but I missed that during review of the recent direct I/O
> changes.

We cannot do that in zonefs because there is no metadata to handle a possible
reordering of the fragments of a split large zone append. Hence zonefs limits
write size to max zone append on entry and never tries to do larger writes. But
the ZNS limit bug resulted in the split. At least for zonefs, I think there is
no need to use the special bio add page since with a proper limit, we should
never see a split.
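
To make the limit argument concrete, here is a rough userspace sketch of the
idea, not the actual patch and not kernel code. The struct, the helper names
and the 4 KiB page size are all assumptions for illustration: the advertised
zone append limit is capped by both max_sectors and what max_segments can
carry in the worst case of one page per segment, and the issuer then clamps
each write to that capped limit on entry so that no split is ever needed.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration: 512 B sectors, 4 KiB pages. */
#define SECTORS_PER_PAGE 8u

/*
 * Hypothetical, simplified stand-in for a queue's limits.
 * This is NOT the kernel's struct queue_limits.
 */
struct limits {
	unsigned int max_sectors;            /* software cap, in sectors */
	unsigned int max_segments;           /* max segments per request */
	unsigned int hw_zone_append_sectors; /* limit the device advertises */
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Cap the advertised zone append limit. Worst case, every segment maps a
 * single page, so only max_segments pages are guaranteed to fit in one
 * command. The product is computed in 64 bits to avoid overflow.
 */
static unsigned int capped_zone_append_sectors(const struct limits *l)
{
	unsigned long long seg_sectors;
	unsigned int cap;

	assert(l != NULL);
	seg_sectors = (unsigned long long)l->max_segments * SECTORS_PER_PAGE;
	cap = l->max_sectors;
	if (seg_sectors < cap)
		cap = (unsigned int)seg_sectors;
	return min_u(l->hw_zone_append_sectors, cap);
}

/*
 * zonefs-like behaviour as described above: clamp the write size on entry
 * so that the resulting zone append never has to be split.
 */
static unsigned int clamp_write_sectors(unsigned int req_sectors,
					unsigned int za_limit)
{
	return min_u(req_sectors, za_limit);
}
```

With made-up numbers (max_sectors = 2560, max_segments = 127 and a device
advertising 1 << 20 sectors), the capped limit comes out at 127 * 8 = 1016
sectors, and a 4096-sector write would be clamped to 1016 sectors on entry.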

-- 
Damien Le Moal
Western Digital Research