[PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices

Damien Le Moal damien.lemoal at opensource.wdc.com
Fri Mar 11 23:58:08 PST 2022


On 3/12/22 07:24, Luis Chamberlain wrote:
> On Fri, Mar 11, 2022 at 01:31:02PM -0800, Keith Busch wrote:
>> On Fri, Mar 11, 2022 at 01:04:35PM -0800, Luis Chamberlain wrote:
>>> On Fri, Mar 11, 2022 at 12:51:35PM -0800, Keith Busch wrote:
>>>
>>>> I'm starting to like the previous idea of creating an unholey
>>>> device-mapper for such users...
>>>
>>> Won't that restrict nvme with chunk size crap. For instance later if we
>>> want much larger block sizes.
>>
>> I'm not sure I understand. The chunk_size has nothing to do with the
>> block size. And while nvme is a user of this in some circumstances, it
>> can't be used concurrently with ZNS because the block layer appropriates
>> the field for the zone size.
> 
> Many device mapper targets split I/O into chunks, see max_io_len(),
> wouldn't this create an overhead?

Apart from the bio clone, the overhead should not be higher than what
the block layer already has. IOs that are too large or that straddle
zones are split by the block layer, and DM splitting generally means
no further split is needed in the block layer for the underlying
device IO. DM essentially follows the same pattern: max_io_len()
depends on the target design limits, which in turn depend on the
underlying device. For a dm-unhole target, the IO size limit would
typically be the same as that of the underlying device.
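
For reference, the boundary check behind that splitting is trivial
arithmetic. A minimal sketch, assuming a power of 2 zone size in 512B
sectors (this is not the actual block layer or DM code):

  #include <stdint.h>

  /*
   * Sketch only: number of sectors an IO starting at 'sector' can
   * cover before it would cross a zone boundary, for a power of 2
   * zone size. The block layer and DM do something conceptually
   * similar when deciding where to split a bio.
   */
  static inline uint64_t sectors_to_zone_boundary(uint64_t sector,
                                                  uint64_t zone_sectors)
  {
          return zone_sectors - (sector & (zone_sectors - 1));
  }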

> Using a device mapper target also creates a divergence in strategy
> for ZNS. Some will use the block device, others the dm target. The
> goal should be to create a unified path.

If we allow non power of 2 zone sized devices, the path will *never* be
unified because we will get fragmentation on what can run on these
devices as opposed to power of 2 sized ones. E.g. f2fs will not work for
the former but will for the latter. That is really not an ideal situation.

> 
> And all this, just because SMR. Is that worth it? Are we sure?

No. This is *not* because of SMR. Never has been. The first prototype
SMR drives I received in my lab 10 years ago did not have a power of 2
zone size because zones were naturally aligned to tracks, which, like
NAND erase blocks, are not necessarily power of 2 sized. Not all zones
were even the same size. That was not usable.

The reason for the power of 2 requirement is two-fold:
1) At the time we added zone support for SMR, chunk_sectors had to be a
power of 2 number of sectors.
2) SMR users requested power of 2 zone sizes, and that all zones have
the same size, as that simplified software design. There was even a
de-facto agreement that a 256MB zone size is a good compromise between
usability and the overhead of zone reclaim/GC. But that particular
number is specific to HDDs, due to their performance characteristics.

Hence the current Linux requirements, which have been serving us well
so far. DM needed chunk_sectors to accept non power of 2 values, so
that restriction was lifted recently (I can't remember which version
added this). Allowing a non power of 2 zone size is thus more easily
feasible now, and supporting such devices is not technically difficult.

But...

The problem being raised is all about the fact that the power of 2 zone
size requirement creates a hole of unusable sectors in every zone when
the device implementation has a zone capacity lower than the zone size.

I have been arguing all along that I think this problem is a
non-problem, simply because a well designed application should *always*
use zones as storage containers without ever assuming that the next
zone in sequence can be used as well. The application should *never*
treat the entire LBA space of the device as one contiguous capacity,
ignoring this zone split. Managing capacity per zone is necessary for
any good design to deal correctly with write error recovery and
active/open zone resource management. And as Keith said, there is
always a "hole" anyway in any non-full zone, between the zone write
pointer and the last usable sector in the zone. Reads there are
nonsensical and writes can only go to one place.
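
To illustrate what I mean by using zones as containers, here is a
rough userspace sketch (the structure and helpers are hypothetical,
not an existing API) of per-zone accounting that never touches the
hole:

  #include <stdint.h>

  struct zone_desc {
          uint64_t start;         /* zone start = zone index * zone size */
          uint64_t wp;            /* current write pointer */
          uint64_t cap;           /* zone capacity in sectors */
  };

  /* Sectors still writable in this zone: from the wp to start + capacity. */
  static inline uint64_t zone_writable(const struct zone_desc *z)
  {
          return z->start + z->cap - z->wp;
  }

  /* Sectors holding valid data: from the zone start to the wp. Anything
   * beyond the wp, including the capacity/size hole, is never accessed. */
  static inline uint64_t zone_valid(const struct zone_desc *z)
  {
          return z->wp - z->start;
  }

An application managing its space this way never notices whether the
next zone starts right after the capacity or after a power of 2
boundary.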

Now, in the spirit of trying to facilitate software development for
zoned devices, we can try finding solutions to remove that hole. zonefs
is an obvious solution. But back to the previous point: with one zone ==
one file, there is no continuity in the storage address space that the
application can use. The application has to be designed to use
individual files, each representing a zone. And with such a design, an
equivalent design directly using the block device file would have no
difficulty with the sector hole between zone capacity and zone size. I
have a prototype LevelDB implementation that can use both zonefs and a
block device file on ZNS, with only a few lines of code differing, to
prove this point.
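
As a rough illustration of why the two variants end up nearly
identical, appending to a zone differs only in where the write offset
comes from (the file descriptors and per-zone state below are
hypothetical, maintained by the application from zone reports):

  #include <stdint.h>
  #include <unistd.h>

  /* zonefs: one file per sequential zone; writes go at the file size,
   * which tracks the zone write pointer. */
  static ssize_t zone_append_zonefs(int zone_fd, uint64_t wp_bytes,
                                    const void *buf, size_t len)
  {
          return pwrite(zone_fd, buf, len, (off_t)wp_bytes);
  }

  /* raw zoned block device: writes go at zone start + wp offset. */
  static ssize_t zone_append_blkdev(int dev_fd, uint64_t zone_start_sect,
                                    uint64_t wp_ofst_sect,
                                    const void *buf, size_t len)
  {
          return pwrite(dev_fd, buf, len,
                        (off_t)((zone_start_sect + wp_ofst_sect) << 9));
  }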

The other solution would be adding a dm-unhole target to remap sectors
so as to remove the holes from the device address space. Such a target
would be easy to write, but in my opinion it would still not change the
fact that applications have to deal with write error recovery and
active/open zone resource limits. So they still have to be zone aware
and operate per zone.

Furthermore, adding such a DM target would create a zoned device with a
non power of 2 zone size, which would need support from the block
layer. So some block layer functions would need to change. In the end,
this may be no different from enabling non power of 2 zone sizes
directly for ZNS devices.
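
For clarity, the remapping such a target would perform is simple:
expose zones of zone capacity sectors back to back and remap every
logical sector onto the underlying device. A sketch of the arithmetic
only (not an actual DM target):

  #include <stdint.h>

  /*
   * dm-unhole style remap sketch: the logical device has hole-less
   * zones of 'zone_cap' sectors, the real device has zones of
   * 'zone_size' sectors of which only 'zone_cap' are usable.
   */
  static inline uint64_t unhole_remap(uint64_t lsector,
                                      uint64_t zone_cap, uint64_t zone_size)
  {
          uint64_t zone = lsector / zone_cap;
          uint64_t ofst = lsector % zone_cap;

          return zone * zone_size + ofst;
  }

Note that the division and modulo are by the zone capacity, which is
generally not a power of 2, so the overhead question below applies to
this target just as much as to native support.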

And for this decision, I maintain some of my requirements:
1) The added overhead from multiplications and divisions must be
acceptable and not degrade performance (see the sketch after this
list). Otherwise, this would be a disservice to the zone ecosystem.
2) Nothing that works today on available devices should break.
3) Zone size requirements will still exist, e.g. btrfs' 64K alignment
requirement.
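
To make point 1 concrete, this is the kind of hot path calculation
that changes, sketched in userspace C (not kernel code):

  #include <stdint.h>

  /* Power of 2 zone size: zone number and in-zone offset are a shift
   * and a mask. */
  static inline uint64_t zone_no_pow2(uint64_t sector, unsigned int zone_bits)
  {
          return sector >> zone_bits;
  }

  static inline uint64_t zone_ofst_pow2(uint64_t sector, uint64_t zone_sectors)
  {
          return sector & (zone_sectors - 1);
  }

  /* Arbitrary zone size: the same lookups need a 64-bit division and
   * modulo on every IO that resolves its zone. */
  static inline uint64_t zone_no_any(uint64_t sector, uint64_t zone_sectors)
  {
          return sector / zone_sectors;
  }

  static inline uint64_t zone_ofst_any(uint64_t sector, uint64_t zone_sectors)
  {
          return sector % zone_sectors;
  }

Whether that difference is actually measurable on fast NVMe devices is
exactly what needs to be demonstrated.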

But even with all of these properly addressed, f2fs will not work on
non power of 2 zone sized devices, some in-kernel users will still have
zone size requirements (btrfs), and *all* applications using a zoned
block device file will now have to be designed around non power of 2
zone sizes so that they can work on all devices. Meaning that this also
potentially forces changes on existing applications before they can use
newer zoned devices that may not have a power of 2 zone size.

This entire discussion is about the problem that power of 2 zone size
creates (which again I think is a non-problem). However, based on the
arguments above, allowing non power of 2 zone sized devices is not
exactly problem free either.

My answer to your last question ("Are we sure?") is thus: No. I am not
sure this is a good idea. But as always, I would be happy to be proven
wrong. So far, I have not seen any argument doing that.

-- 
Damien Le Moal
Western Digital Research


