[PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices
Javier González
javier at javigon.com
Tue Mar 15 06:05:34 PDT 2022
On 15.03.2022 12:32, Matias Bjørling wrote:
>>> Given the above, applications have to be conscious of zones in general and
>>> work within their boundaries. I don't understand how applications can work
>>> without having per-zone knowledge. An application would have to know about
>>> zones and their writeable capacity. To decide where and how data is written,
>>> an application must manage writing across zones, specific offline zones, and
>>> (currently) its writeable capacity. I.e., knowledge about zones and holes is
>>> required for writing to zoned devices and isn't eliminated by removing the
>>> PO2 zone size requirement.
>>
>> Supporting offline zones is optional in the ZNS spec? We are not considering
>> supporting this in the host. This will be handled by the device, precisely to
>> keep the SW stack simpler.
>
>It isn't optional. The spec allows any zone to go to the Read Only or Offline state at any point in time. A specific implementation might give some guarantees as to when such transitions happen, but they must nevertheless be managed by the host software.
>
>Given that, and the need to not issue writes that span zones, an application would have to be aware of such behaviors. The information to make those decisions is in a zone's attributes, and since applications would pull those, they would also know the writeable capacity of a zone. So, all in all, creating support for NPO2 is something that takes a lot of work, but might have little to no impact on the overall software design.
Thanks for the clarification. I can attest that we give this guarantee to
simplify the host stack. I believe we are making many assumptions in
Linux too in order to simplify ZNS support.
That said, I understand your point. I am not developing application
support myself, so I will refer again to Bo's response on the use case
where holes are problematic.
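
To make the per-zone bookkeeping above concrete, here is a minimal
userspace sketch (my own illustration, not code from this patchset) that
pulls a zone report via the BLKREPORTZONE ioctl, skips read-only and
offline zones, and computes the hole between zone capacity and zone size
that writes must avoid. The device path and zone count are placeholders:

/* Illustrative sketch only: report the first few zones of a zoned block
 * device, skip zones the host cannot use, and print the writable
 * capacity and the hole (zone size minus zone capacity) of each
 * remaining zone.  "/dev/nvme0n1" and NR_ZONES are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

#define NR_ZONES 8

int main(void)
{
        size_t bufsz = sizeof(struct blk_zone_report) +
                       NR_ZONES * sizeof(struct blk_zone);
        struct blk_zone_report *rep = calloc(1, bufsz);
        int fd = open("/dev/nvme0n1", O_RDONLY);

        if (!rep || fd < 0)
                return 1;

        rep->sector = 0;          /* start of the device */
        rep->nr_zones = NR_ZONES; /* room for this many descriptors */
        if (ioctl(fd, BLKREPORTZONE, rep) < 0)
                return 1;

        for (unsigned int i = 0; i < rep->nr_zones; i++) {
                struct blk_zone *z = &rep->zones[i];

                /* Zones may be read-only or offline; the host must notice. */
                if (z->cond == BLK_ZONE_COND_OFFLINE ||
                    z->cond == BLK_ZONE_COND_READONLY) {
                        printf("zone %u: unusable (cond 0x%x)\n", i, z->cond);
                        continue;
                }

                /* With zone size > zone capacity, the LBAs in
                 * [start + capacity, start + len) are the unmapped hole.
                 * capacity is only reported when BLK_ZONE_REP_CAPACITY
                 * is set (kernels that expose zone capacity). */
                unsigned long long cap = (rep->flags & BLK_ZONE_REP_CAPACITY) ?
                                         z->capacity : z->len;
                printf("zone %u: start %llu, capacity %llu, hole %llu sectors\n",
                       i, (unsigned long long)z->start, cap,
                       (unsigned long long)(z->len - cap));
        }

        close(fd);
        free(rep);
        return 0;
}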
>
>>>
>>> For years, the PO2 requirement has been known in the Linux community and
>>> by the ZNS SSD vendors. Some SSD implementors have chosen not to support
>>> PO2 zone sizes, which is a perfectly valid decision. But those implementors
>>> did so knowing that the Linux kernel didn't support it.
>>>
>>> I want to turn the argument around to see it from the kernel developers'
>>> point of view. They have communicated the PO2 requirement clearly, there's
>>> good precedent for working with PO2 zone sizes, and lastly, holes can't be
>>> avoided and are part of the overall design of zoned storage devices. So why
>>> should the kernel developers take on the long-term maintenance burden of
>>> NPO2 zone sizes?
>>
>> You have a good point, and that is the question we need to help answer.
>> As I see it, requirements evolve and the kernel changes with them as long
>> as there are active upstream users.
>
>True. There are also active users of custom SSDs (e.g., requiring writes larger than 4KiB), but they aren't supported by the Linux kernel, and such support isn't actively being worked on to my knowledge. Which is fine, as those customers use them in their own way and don't need Linux kernel support.
As things become stable, some might choose to push support for certain
features into the kernel. In this case, the changes in the block layer
are not big. I believe it is a process, and features should be chosen to
maximize benefit and minimize maintenance cost.
>
>>
>> The main constraint for PO2 (1) is removed in the block layer; we have (2)
>> Linux hosts stating that unmapped LBAs are a problem, and we have (3) HW
>> supporting size=capacity.
>>
>> I would be happy to hear what else you would like to see for this to be of use to
>> the kernel community.
>
>(Added numbers to your paragraph above)
>
>1. The sysfs chunksize attribute was "misused" to also represent the zone size. What has changed is that RAID controllers can now use an NPO2 chunk size. This wasn't meant to naturally extend to zones, which, as shown in the currently posted patchset, is a lot more work.
True. But this was the main constraint for the PO2 requirement.
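
As an aside for readers following the thread, a small illustration (my
own sketch, not code from the block layer or the patchset) of why the
PO2 zone size was convenient there: with a PO2 zone size, the zone
number and in-zone offset of a sector are a shift and a mask, while an
NPO2 zone size needs a 64-bit division and modulo, which is what the
posted patches have to accommodate. The helper names are hypothetical:

#include <stdint.h>

/* PO2 zone size: zone_shift = log2(zone size in sectors). */
static inline uint64_t zone_no_po2(uint64_t sector, unsigned int zone_shift)
{
        return sector >> zone_shift;
}

static inline uint64_t zone_off_po2(uint64_t sector, unsigned int zone_shift)
{
        return sector & ((1ULL << zone_shift) - 1);
}

/* NPO2 zone size: no shift exists, so fall back to div/mod. */
static inline uint64_t zone_no_npo2(uint64_t sector, uint64_t zone_sectors)
{
        return sector / zone_sectors;
}

static inline uint64_t zone_off_npo2(uint64_t sector, uint64_t zone_sectors)
{
        return sector % zone_sectors;
}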
>2. Bo mentioned that the software already manages holes. It took a bit of time to get right, but now it works. Thus, the software in question is already capable of working with holes, and fixing this would present itself as a minor optimization overall. I'm not convinced the work to do this in the kernel is proportional to the change it'll make to the applications.
I will let Bo respond to this himself.
>3. I'm happy to hear that. However, I'd like to reiterate the point that the PO2 requirement has been known for years. That there's a drive doing NPO2 zones is great, but a decision was made by the SSD implementors not to support the Linux kernel given its current implementation.
Zoned devices have been supported for years through SMR, and this is a
strong argument. However, ZNS is still very new and customers have
several requirements. I do not believe that an HDD stack should have
such an impact on NVMe.
Also, we will see new interfaces adding support for zoned devices in the
future.
We should think about the future and not the past.
>
>All that said - if there are people willing to do the work and it doesn't have a negative impact on performance, code quality, maintenance complexity, etc., then there isn't anything saying support can't be added - but it does seem like it's a lot of work for little overall benefit to applications and the host users.
Exactly.
Patches in the block layer are trivial, and this is running in
production workloads without issues. I have tried to highlight the
benefits in previous replies, and I believe you understand them.
Support for zonefs seems easy too. We have an early POC for btrfs and it
seems it can be done. We sign up for these two.
As for F2FS and dm-zoned, I do not think these are targets at the
moment. If this is the path we follow, these will bail out at mkfs time.
If we can agree on the above, I believe we can start with the code that
enables the existing customers and build support for btrfs and zonefs in
the next few months.
What do you think?