[PATCH V12 2/3] nvmet: add ZBD over ZNS backend support

Damien Le Moal Damien.LeMoal at wdc.com
Fri Mar 12 07:25:58 GMT 2021


On 2021/03/12 15:29, Chaitanya Kulkarni wrote:
[...]
>>> +static inline u8 nvmet_zasl(unsigned int zone_append_sects)
>>> +{
>>> +	/*
>>> +	 * The Zone Append Size Limit is a value expressed in units of the
>>> +	 * minimum memory page size (i.e. 2^12 bytes) and is reported as a
>>> +	 * power of 2.
>>> +	 */
>>> +	return ilog2((zone_append_sects << 9) >> NVMET_MPSMIN_SHIFT);
>> s/9/SECTOR_SHIFT
>>
>> And you could rewrite this as:
>>
>> return ilog2(zone_append_sects >> (NVMET_MPSMIN_SHIFT - SECTOR_SHIFT));
> 
> SECTOR_SHIFT patches were nacked, the reason being that it doesn't make
> the code clearer. How about the following?

Weird. I personally find it much nicer than a magic value "9"... But why not...

[...]
>>> +void nvmet_execute_identify_cns_cs_ns(struct nvmet_req *req)
>>> +{
>>> +	struct nvme_id_ns_zns *id_zns;
>>> +	u64 zsze;
>>> +	u16 status;
>>> +
>>> +	if (req->cmd->identify.csi != NVME_CSI_ZNS) {
>>> +		req->error_loc = offsetof(struct nvme_common_command, opcode);
>>> +		status = NVME_SC_INVALID_OPCODE | NVME_SC_DNR;
>>> +		goto out;
>>> +	}
>>> +
>>> +	if (le32_to_cpu(req->cmd->identify.nsid) == NVME_NSID_ALL) {
>>> +		req->error_loc = offsetof(struct nvme_identify, nsid);
>>> +		status = NVME_SC_INVALID_NS | NVME_SC_DNR;
>>> +		goto out;
>>> +	}
>>> +
>>> +	id_zns = kzalloc(sizeof(*id_zns), GFP_KERNEL);
>>> +	if (!id_zns) {
>>> +		status = NVME_SC_INTERNAL;
>>> +		goto out;
>>> +	}
>>> +
>>> +	status = nvmet_req_find_ns(req);
>>> +	if (status) {
>>> +		status = NVME_SC_INTERNAL;
>>> +		goto done;
>>> +	}
>>> +
>>> +	if (!bdev_is_zoned(req->ns->bdev)) {
>>> +		req->error_loc = offsetof(struct nvme_identify, nsid);
>>> +		status = NVME_SC_INVALID_NS | NVME_SC_DNR;
>>> +		goto done;
>>> +	}
>>> +
>>> +	nvmet_ns_revalidate(req->ns);
>>> +	zsze = (bdev_zone_sectors(req->ns->bdev) << 9) >>
>>> +					req->ns->blksize_shift;
>> s/9/SECTOR_SHIFT
> 
> See my earlier reply to the same comment.

I do not agree but if that is the nvme code policy, sure.

>>> +void nvmet_bdev_execute_zone_mgmt_recv(struct nvmet_req *req)
>>> +{
>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zmr.slba);
>>> +	u32 bufsize = (le32_to_cpu(req->cmd->zmr.numd) + 1) << 2;
>>> +	struct nvmet_report_zone_data data = { .ns = req->ns };
>>> +	unsigned int nr_zones;
>>> +	int reported_zones;
>>> +	u16 status;
>>> +
>>> +	status = nvmet_bdev_zns_checks(req);
>>> +	if (status)
>>> +		goto out;
>>> +
>>> +	data.rz = __vmalloc(bufsize, GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO);
>>> +	if (!data.rz) {
>>> +		status = NVME_SC_INTERNAL;
>>> +		goto out;
>>> +	}
>>> +
>>> +	nr_zones = (bufsize - sizeof(struct nvme_zone_report)) /
>>> +			sizeof(struct nvme_zone_descriptor);
>>> +	if (!nr_zones) {
>>> +		status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
>>> +		goto out_free_report_zones;
>>> +	}
>>> +
>>> +	reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
>>> +					     nvmet_bdev_report_zone_cb, &data);
>>> +	if (reported_zones < 0) {
>>> +		status = NVME_SC_INTERNAL;
>>> +		goto out_free_report_zones;
>>> +	}
>> There is a problem here: the code as is ignores the request reporting option
>> field which can lead to an invalid zone report being returned. I think you need
>> to modify nvmet_bdev_report_zone_cb() to look at the reporting option field
>> passed by the initiator and filter the zone report since blkdev_report_zones()
>> does not handle that argument.
> 
> The reporting options are set statically by the host in
> nvme_ns_report_zones():-
>          c.zmr.zra = NVME_ZRA_ZONE_REPORT;
>          c.zmr.zrasf = NVME_ZRASF_ZONE_REPORT_ALL;
>          c.zmr.pr = NVME_REPORT_ZONE_PARTIAL;
> 
> All the above values are validated in the nvmet_bdev_zns_checks() helper
> called from nvmet_bdev_execute_zone_mgmt_recv() before we allocate the
> report zone buffer.
> 
> 1. c.zmr.zra indicates the action, which reports zone descriptor entries
>    through the Report Zones data structure.
> 
>    We validate that this value is set to NVME_ZRA_ZONE_REPORT in
>    nvmet_bdev_zns_checks(). We call report zones only after checking that
>    the zone receive action is NVME_ZRA_ZONE_REPORT, so no filtering is
>    needed in nvmet_bdev_report_zone_cb().
> 
> 2. c.zmr.zrasf indicates the action specific field, which is set to
>    NVME_ZRASF_ZONE_REPORT_ALL.
> 
>    We validate that this value is set to NVME_ZRASF_ZONE_REPORT_ALL in
>    nvmet_bdev_zns_checks(). Since the host wants all the zones, we don't
>    need to filter any zone states in nvmet_bdev_report_zone_cb().
> 
> 3. c.zmr.pr is set to NVME_REPORT_ZONE_PARTIAL (value = 1), i.e. the
>    Number of Zones field in the Report Zones data structure indicates the
>    number of fully transferred zone descriptors in the data buffer, which
>    we set from the return value of blkdev_report_zones():-
> 
>    reported_zones = blkdev_report_zones(req->ns->bdev, sect, nr_zones,
> 					     nvmet_bdev_report_zone_cb, &data);
> <snip>   data.rz->nr_zones = cpu_to_le64(reported_zones);
> 
>    So no filtering is needed in nvmet_bdev_report_zone_cb() for c.zmr.pr.
> 
> Can you please explain what filtering is missing in the current code ?
> 
> Maybe I'm looking into an old spec.

report zones command has the reporting options (ro) field (bits 15:08 of dword
13) where the user can specify the following values:

Value Description
0h List all zones.
1h List the zones in the ZSE:Empty state.
2h List the zones in the ZSIO:Implicitly Opened state.
3h List the zones in the ZSEO:Explicitly Opened state.
4h List the zones in the ZSC:Closed state.
5h List the zones in the ZSF:Full state.
6h List the zones in the ZSRO:Read Only state.
7h List the zones in the ZSO:Offline state.

to filter the zone report based on the zone condition. blkdev_report_zones()
will always do a "list all zones" report, that is, ro == 0h.

But on the initiator side, if the client issues a report zones command through
an ioctl (passthrough/direct access, not using the block layer BLKREPORTZONES
ioctl), it may specify a different value for the ro field. Processing that
command using blkdev_report_zones() like you are doing here, without any
additional filtering, will give an incorrect report. Filtering based on the
user-specified ro field needs to be added in nvmet_bdev_report_zone_cb().

The current code here is fine if the initiator/client side uses the block layer
and executes all report zones through blkdev_report_zones(). But things will
break if the client starts doing passthrough commands using the nvme ioctl. No?
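The filtering suggested above could look something like the sketch below: match each zone's condition against the zrasf value from the command before copying its descriptor out. This is only an illustration; the NVME_ZRASF_* and BLK_ZONE_COND_* values mirror the kernel/spec definitions but are redefined locally so the sketch is self-contained, and nvmet_zrasf_match() is a hypothetical helper name.

```c
#include <stdbool.h>

/* NVMe ZRASF values (spec: ro field values 0h-7h) */
enum {
	NVME_ZRASF_ZONE_REPORT_ALL	= 0x0,
	NVME_ZRASF_ZONE_STATE_EMPTY	= 0x1,
	NVME_ZRASF_ZONE_STATE_IMP_OPEN	= 0x2,
	NVME_ZRASF_ZONE_STATE_EXP_OPEN	= 0x3,
	NVME_ZRASF_ZONE_STATE_CLOSED	= 0x4,
	NVME_ZRASF_ZONE_STATE_FULL	= 0x5,
	NVME_ZRASF_ZONE_STATE_READONLY	= 0x6,
	NVME_ZRASF_ZONE_STATE_OFFLINE	= 0x7,
};

/* Block layer zone conditions (as in include/uapi/linux/blkzoned.h) */
enum {
	BLK_ZONE_COND_EMPTY	= 0x1,
	BLK_ZONE_COND_IMP_OPEN	= 0x2,
	BLK_ZONE_COND_EXP_OPEN	= 0x3,
	BLK_ZONE_COND_CLOSED	= 0x4,
	BLK_ZONE_COND_READONLY	= 0xD,
	BLK_ZONE_COND_FULL	= 0xE,
	BLK_ZONE_COND_OFFLINE	= 0xF,
};

/*
 * Would be called from nvmet_bdev_report_zone_cb() for each zone:
 * return true if the zone condition matches the requested ro filter,
 * so that only matching descriptors are copied into the report.
 */
static bool nvmet_zrasf_match(unsigned char zrasf, unsigned char cond)
{
	switch (zrasf) {
	case NVME_ZRASF_ZONE_REPORT_ALL:
		return true;
	case NVME_ZRASF_ZONE_STATE_EMPTY:
		return cond == BLK_ZONE_COND_EMPTY;
	case NVME_ZRASF_ZONE_STATE_IMP_OPEN:
		return cond == BLK_ZONE_COND_IMP_OPEN;
	case NVME_ZRASF_ZONE_STATE_EXP_OPEN:
		return cond == BLK_ZONE_COND_EXP_OPEN;
	case NVME_ZRASF_ZONE_STATE_CLOSED:
		return cond == BLK_ZONE_COND_CLOSED;
	case NVME_ZRASF_ZONE_STATE_FULL:
		return cond == BLK_ZONE_COND_FULL;
	case NVME_ZRASF_ZONE_STATE_READONLY:
		return cond == BLK_ZONE_COND_READONLY;
	case NVME_ZRASF_ZONE_STATE_OFFLINE:
		return cond == BLK_ZONE_COND_OFFLINE;
	default:
		return false;
	}
}
```

The callback would then skip non-matching zones and count only the descriptors actually written, which also keeps the partial-report nr_zones value correct.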

>>> +
>>> +	data.rz->nr_zones = cpu_to_le64(reported_zones);
>>> +
>>> +	status = nvmet_copy_to_sgl(req, 0, data.rz, bufsize);
>>> +
>>> +out_free_report_zones:
>>> +	kvfree(data.rz);
>>> +out:
>>> +	nvmet_req_complete(req, status);
>>> +}
>>> +
>>> +void nvmet_bdev_execute_zone_mgmt_send(struct nvmet_req *req)
>>> +{
>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->zms.slba);
>>> +	sector_t nr_sect = bdev_zone_sectors(req->ns->bdev);
>>> +	u16 status = NVME_SC_SUCCESS;
>>> +	u8 zsa = req->cmd->zms.zsa;
>>> +	enum req_opf op;
>>> +	int ret;
>>> +	const unsigned int zsa_to_op[] = {
>>> +		[NVME_ZONE_OPEN]	= REQ_OP_ZONE_OPEN,
>>> +		[NVME_ZONE_CLOSE]	= REQ_OP_ZONE_CLOSE,
>>> +		[NVME_ZONE_FINISH]	= REQ_OP_ZONE_FINISH,
>>> +		[NVME_ZONE_RESET]	= REQ_OP_ZONE_RESET,
>>> +	};
>>> +
>>> +	if (zsa >= ARRAY_SIZE(zsa_to_op) || !zsa_to_op[zsa]) {
>> What is the point of the "!zsa_to_op[zsa]" here ? All the REQ_OP_ZONE_XXX are
>> non 0, always...
> 
> Well, this is just making sure that we receive the right action: since the
> sparse array will return 0 for any values other than those listed above,
> the !zsa_to_op[zsa] check lets us return an error.

But zsa is unsigned and you check it against the array size, so it can only be
within 0 and array size - 1. That is enough... I really do not see the point of
cluttering the condition with something that is always true...
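A toy model of the lookup may make the trade-off concrete (all values here are stand-ins, not the real REQ_OP_ZONE_* constants): the bound check wants >= rather than > to avoid reading one past the table, and whether the !zsa_to_op[zsa] test matters depends on whether the designated-initializer table has holes; index 0 here is one.

```c
/* Stand-in op codes; the real REQ_OP_ZONE_* values are all non-zero. */
enum { OP_NONE = 0, OP_OPEN = 10, OP_CLOSE = 11, OP_FINISH = 12, OP_RESET = 13 };

/* Index 0 is unassigned, so it stays 0 (OP_NONE) - a "hole". */
static const unsigned int zsa_to_op[] = {
	[1] = OP_CLOSE,		/* NVME_ZONE_CLOSE  = 0x1 */
	[2] = OP_FINISH,	/* NVME_ZONE_FINISH = 0x2 */
	[3] = OP_OPEN,		/* NVME_ZONE_OPEN   = 0x3 */
	[4] = OP_RESET,		/* NVME_ZONE_RESET  = 0x4 */
};

static int lookup_op(unsigned char zsa)
{
	/* ">=" (not ">") keeps the index strictly inside the array;
	 * "!zsa_to_op[zsa]" then rejects holes such as index 0. */
	if (zsa >= sizeof(zsa_to_op) / sizeof(zsa_to_op[0]) || !zsa_to_op[zsa])
		return -1;	/* invalid zone send action */
	return zsa_to_op[zsa];
}
```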

[...]
>>> +
>>> +	ret = blkdev_zone_mgmt(req->ns->bdev, op, sect, nr_sect, GFP_KERNEL);
>>> +	if (ret)
>>> +		status = NVME_SC_INTERNAL;
>>> +out:
>>> +	nvmet_req_complete(req, status);
>>> +}
>>> +
>>> +void nvmet_bdev_execute_zone_append(struct nvmet_req *req)
>>> +{
>>> +	sector_t sect = nvmet_lba_to_sect(req->ns, req->cmd->rw.slba);
>>> +	u16 status = NVME_SC_SUCCESS;
>>> +	unsigned int total_len = 0;
>>> +	struct scatterlist *sg;
>>> +	int ret = 0, sg_cnt;
>>> +	struct bio *bio;
>>> +
>>> +	if (!nvmet_check_transfer_len(req, nvmet_rw_data_len(req)))
>>> +		return;
>> No nvmet_req_complete() call ? Is that done in nvmet_check_transfer_len() ?
> 
> Yes it does. You had the same comment on an earlier version; it can be
> confusing, which is why I proposed a helper for the transfer len check
> and the !req->sg_cnt check, but we don't want that helper.

Just add a comment saying that nvmet_check_transfer_len() calls
nvmet_req_complete(). That will help people like me with a short memory span :)

>>> +
>>> +	if (!req->sg_cnt) {
>>> +		nvmet_req_complete(req, 0);
>>> +		return;
>>> +	}
>>> +
>>> +	if (req->transfer_len <= NVMET_MAX_INLINE_DATA_LEN) {
>>> +		bio = &req->b.inline_bio;
>>> +		bio_init(bio, req->inline_bvec, ARRAY_SIZE(req->inline_bvec));
>>> +	} else {
>>> +		bio = bio_alloc(GFP_KERNEL, req->sg_cnt);
>>> +	}
>>> +
>>> +	bio_set_dev(bio, req->ns->bdev);
>>> +	bio->bi_iter.bi_sector = sect;
>>> +	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
>>> +	if (req->cmd->rw.control & cpu_to_le16(NVME_RW_FUA))
>>> +		bio->bi_opf |= REQ_FUA;
>>> +
>>> +	for_each_sg(req->sg, sg, req->sg_cnt, sg_cnt) {
>>> +		struct page *p = sg_page(sg);
>>> +		unsigned int l = sg->length;
>>> +		unsigned int o = sg->offset;
>>> +
>>> +		ret = bio_add_zone_append_page(bio, p, l, o);
>>> +		if (ret != sg->length) {
>>> +			status = NVME_SC_INTERNAL;
>>> +			goto out_bio_put;
>>> +		}
>>> +
>>> +		total_len += sg->length;
>>> +	}
>>> +
>>> +	if (total_len != nvmet_rw_data_len(req)) {
>>> +		status = NVME_SC_INTERNAL | NVME_SC_DNR;
>>> +		goto out_bio_put;
>>> +	}
>>> +
>>> +	ret = submit_bio_wait(bio);
>> submit_bio_wait() ? Why blocking here ? That would be bad for performance. Is it
>> mandatory to block here ? The result handling could be done in the bio_end
>> callback no ?
> 
> I did it async initially, but zonefs uses sync I/O. I'm not sure about
> btrfs; if it does async I/O please let me know and I'll make it async.
> 
> If there is no async caller in the kernel for REQ_OP_ZONE_APPEND, should
> we make this async?

This should not depend on what the user does, at all.
Yes, for now zonefs only uses zone append for sync write() calls. But I intend
to have zone append used for async writes too. And in btrfs, append writes are
used for all data IOs, sync or async. That is on the initiator side anyway. The
target side should not assume any special behavior of the initiator. So if there
is no technical reason to prevent async append write execution, I would rather
have all of them processed with async BIO execution, similar to regular write BIOs.
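The async pattern amounts to setting bi_end_io/bi_private and letting the completion handler finish the request, as the regular bdev write path does. A toy userspace model of that control flow (all structure and function names hypothetical, standing in for bio, submit_bio() and nvmet_req_complete()):

```c
#include <stddef.h>

struct toy_req {
	int completed;
	int status;
};

struct toy_bio {
	int bi_status;				/* set by the "device" */
	void *bi_private;			/* carries the request back */
	void (*bi_end_io)(struct toy_bio *);
};

/* Completion callback: runs when the append I/O finishes, completes
 * the request - the nvmet_req_complete() analogue. */
static void zone_append_end_io(struct toy_bio *bio)
{
	struct toy_req *req = bio->bi_private;

	req->status = bio->bi_status;
	req->completed = 1;
}

/* Stands in for submit_bio(): queues the I/O and returns without
 * waiting. Here the completion simply runs inline for demonstration;
 * in the kernel it would fire later from interrupt context. */
static void toy_submit_bio(struct toy_bio *bio)
{
	bio->bi_status = 0;			/* pretend the append succeeded */
	bio->bi_end_io(bio);
}
```

With this shape, nvmet_bdev_execute_zone_append() would return right after submission instead of blocking in submit_bio_wait().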

-- 
Damien Le Moal
Western Digital Research


