[PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices

Damien Le Moal damien.lemoal at opensource.wdc.com
Mon Mar 14 00:45:12 PDT 2022


On 3/14/22 16:35, Christoph Hellwig wrote:
> On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
>> The reason for the power of 2 requirement is twofold:
>> 1) At the time we added zone support for SMR, chunk_sectors had to be a
>> power of 2 number of sectors.
>> 2) SMR users did request power of 2 zone sizes and that all zones have
>> the same size as that simplified software design. There was even a
>> de-facto agreement that 256MB zone size is a good compromise between
>> usability and overhead of zone reclaim/GC. But that particular number is
>> for HDD due to their performance characteristics.
> 
> Also for NVMe we initially went down the road to try to support
> non power of two sizes.  But there was another major early host that
> really wanted the power of two zone sizes to support hardware based
> hosts that can cheaply do shifts but not divisions.  The variable
> zone capacity feature (something that Linux does not currently support),
> which was requested by NVMe members on both the host and device side,
> also can only be supported with the zone size / zone capacity split.
> 
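
To illustrate the shifts-versus-divisions point, here is a minimal sketch.
The helper names are hypothetical (these are not kernel functions) and the
512 B sector unit is an assumption. With a power of 2 zone size, the zone
number and the offset within the zone reduce to a shift and a mask; with an
arbitrary zone size, both need a 64-bit division.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only, not kernel APIs. */

/* Power of 2 zone size: zone number is a single right shift. */
static uint64_t zone_no_pow2(uint64_t sector, unsigned int zone_shift)
{
	assert(zone_shift < 64);
	return sector >> zone_shift;
}

/* Power of 2 zone size: offset within the zone is a single mask. */
static uint64_t zone_off_pow2(uint64_t sector, unsigned int zone_shift)
{
	assert(zone_shift < 64);
	return sector & ((UINT64_C(1) << zone_shift) - 1);
}

/* Arbitrary zone size: zone number requires a 64-bit division. */
static uint64_t zone_no_any(uint64_t sector, uint64_t zone_sectors)
{
	assert(zone_sectors != 0);
	return sector / zone_sectors;
}

/* Arbitrary zone size: offset within the zone requires a 64-bit modulo. */
static uint64_t zone_off_any(uint64_t sector, uint64_t zone_sectors)
{
	assert(zone_sectors != 0);
	return sector % zone_sectors;
}
```

For example, a 256MB zone is 524288 sectors of 512 B, that is, a shift of
19. Both variants return the same results for that size, but only the first
one avoids the division.
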
>> The other solution would be adding a dm-unhole target to remap sectors
>> to remove the holes from the device address space. Such target would be
>> easy to write, but in my opinion, this would still not change the fact
>> that applications still have to deal with error recovery and active/open
>> zone resources. So they still have to be zone aware and operate per zone.
> 
> I don't think we even need a new target for it.  I think you can do
> this with a table using multiple dm-linear sections already if you
> want.

Nope, this is currently not possible: DM requires the target zone size
to be the same as the underlying device zone size. So that would not work.
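
To make the mismatch concrete, here is a minimal sketch of the remapping
that a hole-removing target (or one dm-linear segment per zone) would have
to perform. The struct and function names are hypothetical and only serve
the illustration. Each zone contributes zone_cap sectors to the hole-free
address space, so the resulting target would expose zones of zone_cap
sectors while the underlying device has zones of zone_size sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry description; names are illustrative only. */
struct zone_geom {
	uint64_t zone_size;	/* sectors between two zone starts */
	uint64_t zone_cap;	/* usable sectors per zone, <= zone_size */
};

/*
 * Map a sector of the hole-free address space to the device sector.
 * Zone i of the hole-free space starts at i * zone_cap, while zone i
 * of the device starts at i * zone_size.
 */
static uint64_t unhole_map(const struct zone_geom *g, uint64_t sector)
{
	assert(g->zone_cap != 0 && g->zone_cap <= g->zone_size);

	uint64_t zno = sector / g->zone_cap;	/* zone number */
	uint64_t off = sector % g->zone_cap;	/* offset within the zone */

	return zno * g->zone_size + off;
}
```

With zone_size = 128 and zone_cap = 96, sector 96 of the hole-free space
maps to device sector 128, the start of the second zone. The sectors in
the range 96..127 of the device are the hole being skipped.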

> 
>> My answer to your last question ("Are we sure?") is thus: No. I am not
>> sure this is a good idea. But as always, I would be happy to be proven
>> wrong. So far, I have not seen any argument doing that.
> 
> Agreed. Supporting non-power of two sizes in the block layer is fairly
> easy as shown by some of the patches seen in this series.  Supporting
> them properly in the whole ecosystem is not trivial and will create a
> long-term burden.  We could do that, but we'd rather have a really good
> reason for it, and right now I don't see that.


-- 
Damien Le Moal
Western Digital Research


