[PATCH v12 13/13] dm: add power-of-2 target for zoned devices with non power-of-2 zone sizes

Pankaj Raghav p.raghav at samsung.com
Mon Sep 5 05:48:59 PDT 2022


Hi Mike,

>> Note that the target does not support partial mapping of the underlying
>> device.
>>
>> Signed-off-by: Pankaj Raghav <p.raghav at samsung.com>
>> Suggested-by: Johannes Thumshirn <johannes.thumshirn at wdc.com>
>> Suggested-by: Damien Le Moal <damien.lemoal at wdc.com>
>> Suggested-by: Hannes Reinecke <hare at suse.de>
> 
> 
> This target needs more review from those who Suggested-by it.
> 
> And the header and docs needs to address:
> 
> 1) why is a partial mapping of the underlying device disallowed?
While it is technically possible, I don't see any use-case to do so for
this target. I can mention it in the documentation as well.

> 2) why is it assumed all IO is read-only? (talk to me and others like
>    we don't know the inherent limitations of this class of zoned hw)
>
TL;DR: no, we don't assume all IO to be read-only. All operations all
allowed until the zone capacity, and only reads are permitted in the
emulated gap area.

A bit of context about Zoned HW(especially ZNS SSD):

Zone: A contiguous range of logical block addresses managed as a single unit.
Zoned Block Device: A block device that consists of multiple zones.
Zone size: Size of a zone
Zone capacity: Usable logical blocks in a zone

According to ZNS spec, the LBAs from zone capacity to zone size behave like
deallocated blocks when read and are not allowed to be written. Until now,
zone capacity can be any value, but zone size needed to be a power-of-2 to
work in Linux (More information about this is also in my cover letter).

This patch series aims to allow non-po2 zone size devices with zone
capacity == zone size to work in Linux.

A non-po2 zone size device might not work correctly in filesystems that
support zoned devices such as btrfs and f2fs as they assume po2 zone sizes.
Therefore, this target is created to enable these filesystems to work with
non-po2 zone sizes until native support is added.

This target's zone capacity will be the same as the underlying device, but
the target's zone size will be the nearest po2 value of its zone capacity.
Furthermore, the area between the zone capacity and zone size of the target
(emulated gap area) will resemble the spec behavior: behave like the
deallocated blocks when read (we fill zeroes in the bio) and are not
allowed to write.

Does that clarify your question?
> On a code level:
> 1) are you certain you're properly failing all writes?
>    - are writes allowed to the "zone capacity area" but _not_
>      allowed to the "emulated zone area"? (if yes, _please document_). 
I have already documented in Documentation:

A simple remap is performed for all the BIOs that do not cross the
emulation gap area, i.e., the area between the zone capacity and size.

If a BIO lies in the emulation gap area, the following operations are
performed:

  Read:
    - If the BIO lies entirely in the emulation gap area, then zero out the
BIO and complete it.

    - If the BIO spans the emulation gap area, split the BIO across the
zone capacity boundary and remap only the BIO within the zone capacity
boundary. The other part of the split BIO will be zeroed out.

  Other operations:
    - Return an error

Maybe it is not clear enough?? Let me know.

> 2) yes, you absolutely need to implement the .status target_type hook
>    (for both STATUS and TABLE).
I already queued this change locally. I will send it as a part of the next rev.

> 3) really not loving the nested return (of DM_MAPIO_SUBMITTED or
>    DM_MAPIO_REMAPPED) from methods called from dm_po2z_map().  Would
>    prefer to not have to do a depth-first search to see where and when
>    dm_po2z_map() returns a DM_MAPIO_XXX unless there is a solid
>    justification for it.  To me it just obfuscates the DM interface a
>    bit too much. 
> 
Got it. Do you prefer having the return statements in the dm_po2z_map
itself instead of returning a helper function, which in return returns the
status code?

What about something like this:

static inline void dm_po2z_read_zeroes(struct bio *bio)
{
	zero_fill_bio(bio);
	bio_endio(bio);
}

static int dm_po2z_map(struct dm_target *ti, struct bio *bio)
{
	struct dm_po2z_target *dmh = ti->private;
	int split_io_pos;

	bio_set_dev(bio, dmh->dev->bdev);

	if (op_is_zone_mgmt(bio_op(bio)))
		goto remap_sector;

	if (!bio_sectors(bio))
		return DM_MAPIO_REMAPPED;

	if (!dm_po2z_bio_in_emulated_zone_area(dmh, bio, &split_io_pos))
		goto remap_sector;

	/*
	 * Read operation on the emulated zone area (between zone capacity
	 * and zone size) will fill the bio with zeroes.Any other operation
	 * in the emulated area should return an error.
	 */
	if (bio_op(bio) == REQ_OP_READ) {
		/*
		 * If the bio is across emulated zone boundary, split
		 * the bio at
		 * the boundary.
		 */
		if (split_io_pos > 0) {
			dm_accept_partial_bio(bio, split_io_pos);
			goto remap_sector;
		}

		dm_po2z_read_zeroes(bio);
		return DM_MAPIO_SUBMITTED;
	}
	return DM_MAPIO_KILL;

remap_sector:
	bio->bi_iter.bi_sector =
		target_to_device_sect(dmh, bio->bi_iter.bi_sector);

	return DM_MAPIO_REMAPPED;
}

> Otherwise, pretty clean code and nothing weird going on.
> 
> I look forward to seeing your next (final?) revision of this patchset.
> 
> Thanks,
> Mike
> 



More information about the Linux-nvme mailing list