[PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices

Mon Mar 14 07:16:36 PDT 2022

> >> Agreed. Supporting non-power of two sizes in the block layer is
> >> fairly easy as shown by some of the patches seen in this series.
> >> Supporting them properly in the whole ecosystem is not trivial and
> >> will create a long-term burden.  We could do that, but we'd rather
> >> have a really good reason for it, and right now I don't see that.
> 
> I think that Bo's use-case is an example of a major upstream Linux host that is
> struggling with unmapped LBAs. Can we focus on this use-case and the parts
> that we are missing to support Bytedance?

Any application that uses zoned storage devices has to manage unmapped LBAs, because zones can be or become offline (no reads/writes allowed). Eliminating the difference between zone cap and zone size will not remove this requirement, and holes will continue to exist. Furthermore, a write that spans zones is not allowed by the specification, so zone boundaries must also be managed.

Given the above, applications have to be conscious of zones in general and work within their boundaries. I don't understand how applications can work without per-zone knowledge. To decide where and how data is written, an application must manage writes at zone boundaries, track which specific zones are offline, and (currently) track each zone's writeable capacity. I.e., knowledge about zones and holes is required for writing to zoned devices and isn't eliminated by removing the PO2 zone size requirement.
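
To make this concrete, here is a minimal C sketch of the per-zone bookkeeping an application ends up doing regardless of zone size. It is illustrative only: struct zone_info, zone_writeable(), zone_hole(), total_holes(), and clamp_write() are hypothetical names, not the kernel's struct blk_zone or any existing API, and zone state is reduced to usable/offline for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified zone descriptor (not the kernel's struct blk_zone). */
enum zone_state { ZONE_USABLE, ZONE_OFFLINE };

struct zone_info {
	uint64_t start;        /* first LBA of the zone */
	uint64_t size;         /* LBAs spanned by the zone (zone size) */
	uint64_t capacity;     /* writeable LBAs, capacity <= size */
	enum zone_state state;
};

/* Writeable LBAs in one zone: none if offline, otherwise its capacity. */
uint64_t zone_writeable(const struct zone_info *z)
{
	assert(z->capacity <= z->size);
	return z->state == ZONE_OFFLINE ? 0 : z->capacity;
}

/* Unmapped LBAs in one zone: the whole zone if offline, else size - capacity. */
uint64_t zone_hole(const struct zone_info *z)
{
	return z->size - zone_writeable(z);
}

/* Sum of unmapped LBAs over all zones the application knows about. */
uint64_t total_holes(const struct zone_info *zones, size_t nr)
{
	uint64_t holes = 0;

	for (size_t i = 0; i < nr; i++)
		holes += zone_hole(&zones[i]);
	return holes;
}

/*
 * Clamp a write of len LBAs at absolute LBA wp so that it stays inside the
 * writeable part of zone z. Returns how many LBAs may be written; the
 * remainder has to go to another zone.
 */
uint64_t clamp_write(const struct zone_info *z, uint64_t wp, uint64_t len)
{
	uint64_t end = z->start + zone_writeable(z);

	if (wp < z->start || wp >= end)
		return 0;
	return len < end - wp ? len : end - wp;
}
```

Note that a device whose zone capacity equals its zone size still needs every one of these checks: an offline zone is a hole of the full zone size, and a write still has to be split at the zone boundary.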

For years, the PO2 requirement has been known in the Linux community and by the ZNS SSD vendors. Some SSD implementors have chosen not to support PO2 zone sizes, which is a perfectly valid decision. But they made that choice knowing that the Linux kernel didn't support NPO2 zone sizes.

I want to turn the argument around to see it from the kernel developers' point of view. They have communicated the PO2 requirement clearly, there's good precedent for working with PO2 zone sizes, and lastly, holes can't be avoided and are part of the overall design of zoned storage devices. So why should the kernel developers take on the long-term maintenance burden of NPO2 zone sizes?