[PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
Christoph Hellwig
hch at lst.de
Mon Mar 14 00:35:37 PDT 2022
On Sat, Mar 12, 2022 at 04:58:08PM +0900, Damien Le Moal wrote:
> The reason for the power of 2 requirement is 2 fold:
> 1) At the time we added zone support for SMR, chunk_sectors had to be a
> power of 2 number of sectors.
> 2) SMR users did request power of 2 zone sizes and that all zones have
> the same size as that simplified software design. There was even a
> de-facto agreement that 256MB zone size is a good compromise between
> usability and overhead of zone reclaim/GC. But that particular number is
> for HDD due to their performance characteristics.
Also for NVMe we initially went down the road to try to support
non power of two sizes. But there was another major early host that
really wanted the power of two zone sizes to support hardware based
hosts that can cheaply do shifts but not divisions. The variable
zone capacity feature (something that Linux does not currently support),
which was requested by NVMe members on both the host and device side,
can also only be supported with the zone size / zone capacity split.
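
To illustrate the shift-versus-division point and the zone size / zone
capacity split, here is a minimal sketch. It is not taken from the kernel
or from this series; the constants and function names are assumptions
chosen for illustration (a zone size of 2**19 512-byte sectors, i.e. the
256 MiB figure mentioned above, and an arbitrary smaller zone capacity).

```python
# Hypothetical values, for illustration only.
ZONE_SHIFT = 19                 # assumed: 2**19 sectors of 512 bytes = 256 MiB
ZONE_SIZE = 1 << ZONE_SHIFT     # power-of-two zone size, in sectors
ZONE_CAP = 400_000              # assumed writable capacity, smaller than ZONE_SIZE


def zone_of_pow2(sector):
    """Return (zone number, offset in zone) using only a shift and a mask."""
    return sector >> ZONE_SHIFT, sector & (ZONE_SIZE - 1)


def zone_of_generic(sector, zone_size):
    """Same lookup for an arbitrary zone size: needs a division and a modulo."""
    return divmod(sector, zone_size)


def in_capacity(sector):
    """True if the sector lies in the writable part of its zone,
    False if it falls in the hole between zone capacity and zone size."""
    _, offset = zone_of_pow2(sector)
    return offset < ZONE_CAP
```

With the split, zone starts stay at multiples of the power-of-two zone
size, so the lookup remains a shift, while the usable part of each zone
is described separately by its capacity.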
> The other solution would be adding a dm-unhole target to remap sectors
> to remove the holes from the device address space. Such target would be
> easy to write, but in my opinion, this would still not change the fact
> that applications still have to deal with error recovery and active/open
> zone resources. So they still have to be zone aware and operate per zone.
I don't think we even need a new target for it. I think you can do
this with a table using multiple dm-linear sections already if you
want.
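
A rough sketch of what such a table could look like, generated by a small
script. The dm-linear line layout ("start length linear device offset",
all in 512-byte sectors) is the standard one; the function name, the
device path and the zone geometry are assumptions for illustration, and
I have not verified that device-mapper would accept such a table on a
zoned device.

```python
def unhole_table(dev, nr_zones, zone_size, zone_cap):
    """Build a dm-linear table that maps only the writable capacity of
    each zone, back to back, so the logical address space has no holes.

    All sizes are in 512-byte sectors. 'dev' is a hypothetical device path.
    """
    lines = []
    for zone in range(nr_zones):
        logical_start = zone * zone_cap     # contiguous in the mapped device
        physical_start = zone * zone_size   # zone start on the backing device
        lines.append(
            f"{logical_start} {zone_cap} linear {dev} {physical_start}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical geometry: 4 zones of 2**19 sectors, 400000 writable each.
    print(unhole_table("/dev/nvme0n1", 4, 1 << 19, 400_000))
```

The output could in principle be fed to dmsetup create, one linear
section per zone.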
> My answer to your last question ("Are we sure?") is thus: No. I am not
> sure this is a good idea. But as always, I would be happy to be proven
> wrong. So far, I have not seen any argument doing that.
Agreed. Supporting non-power of two sizes in the block layer is fairly
easy, as shown by some of the patches seen in this series. Supporting
them properly in the whole ecosystem is not trivial and will create a
long-term burden. We could do that, but we'd rather have a really good
reason for it, and right now I don't see that.