[PATCH] nvme: zns: limit max_zone_append by max_segments

Damien Le Moal <dlemoal at kernel.org>
Mon Jul 31 05:03:46 PDT 2023


On 7/31/23 20:46, Shin'ichiro Kawasaki wrote:
> BIOs for zone append operations cannot be split by nature, since the
> device returns the sector at which the data of each BIO was actually
> written. Hence the number of segments of such BIOs must not exceed the
> max_segments limit of the device, so that the block layer never splits
> them on segment count.
> 
> However, the nvme driver sets the max_zone_append limit using only the
> limit reported by the ZNS device, regardless of the max_segments limit
> of the device. This allows zone append BIOs to grow large and to carry
> more segments than the max_segments limit, which causes an unexpected
> BIO split. The split results in a WARN in bio_split() followed by KASAN
> issues. The recent commit 16d7fd3cfa72 ("zonefs: use iomap for
> synchronous direct writes") triggered such a BIO split: it changed BIO
> preparation for zone append operations in zonefs so that pages are
> added to the BIOs with bio_iov_add_page() instead of bio_add_hw_page().
> Each BIO segment then holds a single page, which increases the number
> of segments. Hence the BIO split.
> 
> To avoid the unexpected BIO split, factor the max_segments limit into
> the max_zone_append limit. In the worst case, each segment holds a
> single page, so the safe max_zone_append size is max_segments
> multiplied by PAGE_SIZE. Compare it with the max_zone_append size
> reported by the ZNS device and use the smaller value as the
> max_zone_append limit.
> 
> Fixes: 240e6ee272c0 ("nvme: support for zoned namespaces")
> Cc: stable at vger.kernel.org
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki at wdc.com>
> ---
>  drivers/nvme/host/zns.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
> index ec8557810c21..9ee77626c235 100644
> --- a/drivers/nvme/host/zns.c
> +++ b/drivers/nvme/host/zns.c
> @@ -10,9 +10,11 @@
>  int nvme_revalidate_zones(struct nvme_ns *ns)
>  {
>  	struct request_queue *q = ns->queue;
> +	unsigned int max_sectors = queue_max_segments(q) << PAGE_SECTORS_SHIFT;
>  
>  	blk_queue_chunk_sectors(q, ns->zsze);
> -	blk_queue_max_zone_append_sectors(q, ns->ctrl->max_zone_append);
> +	max_sectors = min(max_sectors, ns->ctrl->max_zone_append);
> +	blk_queue_max_zone_append_sectors(q, max_sectors);

You could combine these 2 lines:

	blk_queue_max_zone_append_sectors(q,
		min(max_sectors, ns->ctrl->max_zone_append));

>  
>  	return blk_revalidate_disk_zones(ns->disk, NULL);
>  }
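
With the two calls merged as suggested above, the whole function would read
something like this (untested sketch, same logic as the proposed patch):

	int nvme_revalidate_zones(struct nvme_ns *ns)
	{
		struct request_queue *q = ns->queue;
		/* Worst case: one page per segment, so cap at max_segments pages. */
		unsigned int max_sectors = queue_max_segments(q) << PAGE_SECTORS_SHIFT;

		blk_queue_chunk_sectors(q, ns->zsze);
		blk_queue_max_zone_append_sectors(q,
				min(max_sectors, ns->ctrl->max_zone_append));

		return blk_revalidate_disk_zones(ns->disk, NULL);
	}

As a worked example, with 4 KiB pages (PAGE_SECTORS_SHIFT = 3) and, say,
max_segments = 128, the cap would be 128 << 3 = 1024 sectors, that is,
512 KiB per zone append BIO.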

Christoph,

I feel like a lot of the special casing for zone append bio add page can be
removed from the block layer. This issue was found with zonefs tests on real
ZNS devices because of the huge (and incorrect) zone append limit that zns
sets, combined with the recent zonefs iomap write change, which overlooked the
fact that iomap adds pages to the bio before the bio op is set to zone append.
That resulted in the large BIO. This problem however does not happen with scsi
or null_blk, which kind of proves that the regular bio add page path is fine
for zone append as long as the issuer uses the correct zone append limit (a
sketch of the issuer-side cap I have in mind is below). Thoughts?
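
Something like this hypothetical helper (not the actual zonefs/iomap code,
and assuming bdev_max_zone_append_sectors() is usable in that context) would
be all the issuer needs:

	/*
	 * Cap the amount of data placed into a single REQ_OP_ZONE_APPEND bio
	 * to the queue zone append limit, so the bio never needs splitting.
	 */
	static size_t zone_append_max_bytes(struct block_device *bdev, size_t len)
	{
		unsigned int max_sectors = bdev_max_zone_append_sectors(bdev);

		return min_t(size_t, len, (size_t)max_sectors << SECTOR_SHIFT);
	}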

-- 
Damien Le Moal
Western Digital Research



