[PATCH 0/6] power_of_2 emulation support for NVMe ZNS devices

David Sterba dsterba at suse.cz
Tue Mar 15 07:27:40 PDT 2022


On Tue, Mar 15, 2022 at 02:14:23PM +0000, Johannes Thumshirn wrote:
> On 15/03/2022 14:52, Javier González wrote:
> > On 15.03.2022 14:30, Christoph Hellwig wrote:
> >> On Tue, Mar 15, 2022 at 02:26:11PM +0100, Javier González wrote:
> >>> but we do not see a use case for ZNS in F2FS, as it is a mobile
> >>> file-system. As other interfaces arrive, this work will become natural.
> >>>
> >>> ZoneFS and btrfs are good targets for ZNS, and these we can do. I would
> >>> still do the work in phases to make sure we have enough early feedback
> >>> from the community.
> >>>
> >>> Since this thread has been very active, I will wait some time for
> >>> Christoph and others to catch up before we start sending code.
> >>
> >> Can someone summarize where we stand?  Between the lack of quoting
> >> from hell and overly long lines from corporate mail clients I've
> >> mostly stopped reading this thread because it takes too much effort
> >> to actually extract the information.
> > 
> > Let me give it a try:
> > 
> >   - PO2 emulation in NVMe is a no-go. Drop this.
> > 
> >   - The arguments against supporting PO2 are:
> >       - It makes ZNS depart from the SMR assumption of PO2 zone sizes.
> >         This can create confusion for users of both SMR and ZNS.
> > 
> >       - Existing applications assume PO2 zone sizes and probably optimize
> >         for them. These applications, if they want to use ZNS, will have
> >         to change their calculations.
> > 
> >       - There is a fear of performance regressions.
> > 
> >       - It adds more work for you and other maintainers.
> > 
> >   - The arguments in favour of PO2 are:
> >       - Unmapped LBAs (the gap between zone capacity and a PO2 zone
> >         size) create holes that applications need to deal with. This
> >         affects mapping and performance due to I/O splits. Bo explained
> >         this from Bytedance's perspective in another thread. I explained
> >         in an answer to Matias how we are not letting zones transition to
> >         offline in order to simplify the host stack. Not sure if this is
> >         something we want to bring to NVMe.
> > 
> >       - As ZNS adds more features and other protocols add support for
> >         zoned devices, we will have more use cases for the zoned block
> >         device. We will have to deal with this fragmentation at some
> >         point.
> > 
> >       - This is used in production workloads on Linux hosts. I would
> >         advocate against keeping this off-tree, as that will be a
> >         headache for everyone in the future.
> > 
> >   - If you agree that removing the PO2 constraint is an option, we can do
> >     the following:
> >       - Remove the constraint in the block layer and add ZoneFS support
> >         in a first patch.
> > 
> >       - Add btrfs support in a later patch
> 
> (+ linux-btrfs )
> 
> Please also make sure to support btrfs and not only throw some patches
> over the fence. Zoned device support in btrfs is complex enough and has
> quite a bit of special casing vs. regular btrfs, which we're working on
> getting rid of. So having a non-power-of-2 zone size would also mean having
> NPO2 block groups (and thus block groups not aligned to the stripe size).
> 
> Just thinking of this and knowing I need to support it gives me a 
> headache.

PO2 is really easy to work with, and I guess allocation on the physical
device could also benefit from that, so I'm still puzzled why NPO2 is
even proposed.
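
A rough, untested sketch of the difference I mean, in kernel style
(ilog2() from <linux/log2.h>, div64_u64() from <linux/math64.h>);
finding which zone a sector falls into:

    sector_t sector, zone_size, zone_nr, offset_in_zone;

    /* PO2 zone size: a shift and a mask on every lookup. */
    zone_nr        = sector >> ilog2(zone_size);
    offset_in_zone = sector & (zone_size - 1);

    /* NPO2 zone size: 64-bit division and modulo instead. */
    zone_nr        = div64_u64(sector, zone_size);
    offset_in_zone = sector - zone_nr * zone_size;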

We can possibly hide the calculations behind some API, so I hope it will
be bearable in the end. The size of block groups is flexible; we only
want some reasonable alignment.
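
A minimal sketch of what such an API could look like (the struct and
helper names below are made up for illustration, not existing code):

    /*
     * Illustrative only: one place that knows whether the zone size
     * is a power of two, so callers never open-code the math.
     */
    struct zone_geometry {
            u64  zone_size;         /* in sectors */
            u8   zone_size_shift;   /* valid only when po2 is set */
            bool po2;
    };

    static inline u64 zone_index(const struct zone_geometry *g, u64 pos)
    {
            if (g->po2)
                    return pos >> g->zone_size_shift;
            return div64_u64(pos, g->zone_size);
    }

    static inline u64 zone_offset(const struct zone_geometry *g, u64 pos)
    {
            if (g->po2)
                    return pos & ((1ULL << g->zone_size_shift) - 1);
            return pos - zone_index(g, pos) * g->zone_size;
    }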

> Also please consult the rest of the btrfs developers for thoughts on this.
> After all, btrfs has full zoned support (including ZNS, not saying it's
> perfect) and is also the default FS for at least two Linux distributions.

I haven't read the whole thread yet, but my impression is that some
hardware is deliberately breaking existing assumptions about zoned
devices and in turn breaking btrfs support. I hope I'm wrong about that,
or at least that it's possible to work around it.


