[PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices

Javier González javier at javigon.com
Mon Mar 14 03:49:38 PDT 2022


On 14.03.2022 16:45, Damien Le Moal wrote:
>On 3/14/22 16:35, Christoph Hellwig wrote:
>> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
>>> The reason for the power of 2 requirement is 2 fold:
>>> 1) At the time we added zone support for SMR, chunk_sectors had to be a
>>> power of 2 number of sectors.
>>> 2) SMR users did request power of 2 zone sizes and that all zones have
>>> the same size as that simplified software design. There was even a
>>> de-facto agreement that 256MB zone size is a good compromise between
>>> usability and overhead of zone reclaim/GC. But that particular number is
>>> for HDD due to their performance characteristics.
>>
>> Also for NVMe we initially went down the road to try to support
>> non power of two sizes.  But there was another major early host that
>> really wanted the power of two zone sizes to support hardware based
>> hosts that can cheaply do shifts but not divisions.  The variable
>> zone capacity feature (something that Linux does not currently support)
>> is a feature requested by NVMe members on the host and device side
>> and also can only be supported with the zone size / zone capacity split.
>>
>>> The other solution would be adding a dm-unhole target to remap sectors
>>> to remove the holes from the device address space. Such target would be
>>> easy to write, but in my opinion, this would still not change the fact
>>> that applications still have to deal with error recovery and active/open
>>> zone resources. So they still have to be zone aware and operate per zone.
>>
>> I don't think we even need a new target for it.  I think you can do
>> this with a table using multiple dm-linear sections already if you
>> want.
>
>Nope, this is currently not possible: DM requires the target zone size
>to be the same as the underlying device zone size. So that would not work.
>
>>
>>> My answer to your last question ("Are we sure?") is thus: No. I am not
>>> sure this is a good idea. But as always, I would be happy to be proven
>>> wrong. So far, I have not seen any argument doing that.
>>
>> Agreed. Supporting non-power of two sizes in the block layer is fairly
>> easy as shown by some of the patches seen in this series.  Supporting
>> them properly in the whole ecosystem is not trivial and will create a
>> long-term burden.  We could do that, but we'd rather have a really good
>> reason for it, and right now I don't see that.

I think that Bo's use-case is an example of a major upstream Linux host
that is struggling with unmapped LBAs. Can we focus on this use-case
and the parts that we are missing to support Bytedance?

If you agree to this, I believe we can add support for ZoneFS pretty
easily. We also have a POC in btrfs that we will follow up on. For the time
being, F2FS would fail at mkfs time if zone size is not a PO2.

What do you think?
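
For reference, the two technical points in this thread (shift versus
division for zone lookup, and remapping away the unmapped LBAs between
zone capacity and zone size) can be sketched roughly as below. This is
only an illustration in Python, not kernel or device-mapper code: all
function names are hypothetical, and it assumes every zone has the same
size and the same capacity, with sizes given in sectors.

```python
def is_po2(n: int) -> bool:
    """Return True if n is a power of 2 (the kind of check a PO2-only mkfs might do)."""
    return n > 0 and (n & (n - 1)) == 0


def zone_index(sector: int, zone_sectors: int) -> int:
    """Zone number that contains `sector`.

    With a PO2 zone size this is a plain shift; otherwise a division
    is needed, which is the cost raised for hardware-based hosts.
    """
    if is_po2(zone_sectors):
        shift = zone_sectors.bit_length() - 1  # log2 of a power of 2
        return sector >> shift
    return sector // zone_sectors


def unhole(lba: int, zone_size: int, zone_cap: int) -> int:
    """Map a hole-free logical LBA to a device LBA.

    Assumes each device zone spans `zone_size` sectors, of which only
    the first `zone_cap` are mapped; the rest of the zone is a hole.
    """
    if not 0 < zone_cap <= zone_size:
        raise ValueError("zone capacity must be in (0, zone_size]")
    zone, offset = divmod(lba, zone_cap)
    return zone * zone_size + offset
```

With a zone size of 8 sectors and a capacity of 6, logical LBA 6 lands
at device LBA 8, the start of the second zone, so the two unmapped
sectors are skipped. Note that the remap only hides the holes: the
caller still has to deal with zone boundaries, error recovery and
active/open zone limits, as pointed out earlier in the thread.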


