Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

Bart Van Assche bvanassche at acm.org
Fri Sep 23 09:19:48 PDT 2022


On 9/22/22 23:29, Matias Bjørling wrote:
> With UFS, in the proposed copy I have (it may have been changed), there's
> the concept of gap zones, which are zones that cannot be accessed by
> the host. The gap zones are essentially "LBA fillers", enabling the
> next writeable zone to start at an offset that is a multiple of a
> power-of-two size. My
> understanding is that this specific approach was chosen to simplify
> standardization in UFS and avoid updating T10's ZBC with zone
> capacity support.
> 
> While UFS would technically expose non-power-of-2 zone sizes, they
> could, due to the gap zones, also be considered power-of-2 zones if
> one considers the seq. write zone + the gap zone as a single unit.
> 
> When I think about having UFS support in the kernel, the SWR and the
> gap zone could be represented as a single unit. For example:
> 
> UFS - Zone Report
>    Zone 0: SWR, LBA 0-11
>    Zone 1: Gap, LBA 12-15
>    Zone 2: SWR, LBA 16-27
>    Zone 3: Gap, LBA 28-31
>    ...
> 
> Kernel representation - Zone Report (as supported today)
>    Zone 0: SWR, LBA 0-15, Zone Capacity 12
>    Zone 1: SWR, LBA 16-31, Zone Capacity 12
>    ...
> 
> Doing it this way removes the need for filesystems, device mappers,
> and user-space applications to understand gap zones, and allows UFS to
> work out of the box with no changes to the rest of the zoned storage
> ecosystem.
> 
> Has the above representation been considered?
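A minimal sketch of the folding Matias describes, assuming a UFS zone report in which every SWR zone is directly followed by at most one gap zone. `Zone` and `fold_gap_zones` are hypothetical names for illustration only, not kernel or UFS interfaces. Each SWR zone absorbs the gap zone after it, so the zone length becomes the power-of-two stride while the zone capacity stays at the number of writable LBAs.

```python
from typing import List, NamedTuple


class Zone(NamedTuple):
    kind: str      # "SWR" or "Gap"
    start: int     # first LBA of the zone
    length: int    # number of LBAs spanned by the zone
    capacity: int  # number of writable LBAs (0 for a gap zone)


def fold_gap_zones(ufs_zones: List[Zone]) -> List[Zone]:
    """Merge every SWR zone with the gap zone that directly follows it.

    Hypothetical helper: the merged zone spans the SWR + gap LBAs, while its
    capacity stays equal to the SWR zone length, as in the kernel view above.
    """
    folded = []
    i = 0
    while i < len(ufs_zones):
        swr = ufs_zones[i]
        if swr.kind != "SWR":
            raise ValueError(f"expected an SWR zone at LBA {swr.start}")
        length = swr.length
        nxt = ufs_zones[i + 1] if i + 1 < len(ufs_zones) else None
        if nxt is not None and nxt.kind == "Gap":
            if nxt.start != swr.start + swr.length:
                raise ValueError(f"gap zone at LBA {nxt.start} is not contiguous")
            length += nxt.length
            i += 1  # the gap zone has been consumed
        folded.append(Zone("SWR", swr.start, length, swr.length))
        i += 1
    return folded


if __name__ == "__main__":
    # The UFS zone report from the example above (LBA 0-11 = 12 LBAs, etc.).
    ufs_report = [
        Zone("SWR", 0, 12, 12),
        Zone("Gap", 12, 4, 0),
        Zone("SWR", 16, 12, 12),
        Zone("Gap", 28, 4, 0),
    ]
    for zone in fold_gap_zones(ufs_report):
        print(zone)
```

Run on the example report, this yields two SWR zones of 16 LBAs each with capacity 12, matching the "Kernel representation" listing above.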

Hi Matias,

What has been described above is the approach from the first version of 
the zoned storage for UFS (ZUFS) draft standard. Support for this 
approach is available in the upstream kernel. See also "[PATCH v2 0/9] 
Support zoned devices with gap zones", 2022-04-21 
(https://lore.kernel.org/linux-scsi/20220421183023.3462291-1-bvanassche@acm.org/).

Since F2FS extents must be split at gap zones, gap zones negatively 
affect sequential read and write performance. So we abandoned the gap 
zone approach. The current approach is as follows:
* The power-of-two restriction for the offset between zone starts has 
been removed. Gap zones are no longer required. Hence, we will need the 
patches that add support for zone sizes that are not a power of two.
* The Sequential Write Required (SWR) and Sequential Write Preferred 
(SWP) zone types are supported. The feedback we received from UFS 
vendors is that the zone type that works best depends on their firmware 
and ASIC design.
* We need a queue depth larger than one (QD > 1) for writes to achieve 
the full sequential write bandwidth. We plan to support QD > 1 as follows:
   - If writes have to be serialized, submit them to the same hardware
     queue. According to the UFS host controller interface (UFSHCI)
     standard, UFS host controllers are not allowed to reorder SCSI
     commands that are submitted to the same hardware queue. One
     remaining source of command reordering is the SCSI retry mechanism;
     retries happen, e.g., after a command timeout.
   - For SWP zones, require the UFS device firmware to use its garbage
     collection mechanism to reorder data in the unlikely case that
     out-of-order writes happened.
   - For SWR zones, retry writes that failed because they were received
     out of order by a UFS device. ZBC-1 requires compliant devices to
     respond to out-of-order writes with ILLEGAL REQUEST / UNALIGNED
     WRITE COMMAND.
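The first bullet above is largely about address arithmetic. As a simplified illustration (not kernel code; `zone_number` and `zone_start` are hypothetical names), one commonly cited convenience of power-of-two zone sizes is that the zone containing a sector can be found with a shift, whereas an arbitrary zone size needs a division. The sketch assumes equally sized zones with the first zone starting at LBA 0; the 12-LBA stride is just the SWR zone size from the quoted example with the gap zones dropped.

```python
def zone_number(sector: int, zone_sectors: int) -> int:
    """Index of the zone containing `sector`, assuming equally sized zones from LBA 0."""
    if zone_sectors <= 0:
        raise ValueError("zone size must be positive")
    if zone_sectors & (zone_sectors - 1) == 0:
        # Power-of-two zone size: a right shift by log2(zone_sectors) suffices.
        return sector >> (zone_sectors.bit_length() - 1)
    # Any other zone size: an integer division is needed.
    return sector // zone_sectors


def zone_start(sector: int, zone_sectors: int) -> int:
    """First sector of the zone containing `sector`."""
    return zone_number(sector, zone_sectors) * zone_sectors


if __name__ == "__main__":
    # With gap zones: 16-LBA zone stride (a power of two).
    print(zone_number(25, 16), zone_start(25, 16))  # 1 16
    # Without gap zones: zones start every 12 LBAs (not a power of two).
    print(zone_number(25, 12), zone_start(25, 12))  # 2 24
```

Both paths give the same answer whenever the zone size is a power of two; the non-power-of-two path simply cannot use the shift.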

We have considered the zone append approach but decided not to use it 
because, if zone append commands get reordered, the data ends up 
permanently out of order on the storage medium. This affects sequential 
read performance negatively.
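To make the difference between retrying rejected writes in an SWR zone and using zone append concrete, here is a toy Python model under stated assumptions: one LBA per write, no timing, and hypothetical names (`SwrZone`, `write_with_retry`, `append_all`). `arrivals` stands for the order in which the device receives the commands after some reordering, e.g. a SCSI retry after a timeout. Requeuing a rejected write stands in for the host resubmitting a write that failed with ILLEGAL REQUEST / UNALIGNED WRITE COMMAND; a zone append is written wherever the write pointer happens to be.

```python
from collections import deque
from typing import Deque, List, Tuple

Write = Tuple[int, str]  # (target LBA, payload); one LBA per write in this toy model


class SwrZone:
    """Toy model of a sequential write required (SWR) zone."""

    def __init__(self, start: int, capacity: int) -> None:
        self.start = start
        self.capacity = capacity
        self.wp = start        # write pointer
        self.media: dict = {}  # LBA -> payload

    def _full(self) -> bool:
        return self.wp >= self.start + self.capacity

    def write(self, lba: int, payload: str) -> bool:
        """Regular WRITE: only accepted at the write pointer."""
        if lba != self.wp or self._full():
            return False  # models ILLEGAL REQUEST / UNALIGNED WRITE COMMAND
        self.media[lba] = payload
        self.wp += 1
        return True

    def append(self, payload: str) -> int:
        """Zone append: the device, not the host, chooses the LBA."""
        if self._full():
            raise RuntimeError("zone full")
        lba = self.wp
        self.media[lba] = payload
        self.wp += 1
        return lba

    def contents(self) -> List[str]:
        return [self.media[lba] for lba in sorted(self.media)]


def write_with_retry(zone: SwrZone, arrivals: List[Write]) -> List[str]:
    """Deliver writes in arrival order and requeue every rejected one."""
    pending: Deque[Write] = deque(arrivals)
    failures = 0  # consecutive rejections since the last accepted write
    while pending:
        lba, payload = pending.popleft()
        if zone.write(lba, payload):
            failures = 0
        else:
            pending.append((lba, payload))  # retry after the others
            failures += 1
            if failures >= len(pending):
                raise RuntimeError("no pending write matches the write pointer")
    return zone.contents()


def append_all(zone: SwrZone, arrivals: List[Write]) -> List[str]:
    """Issue the same payloads as zone appends; the intended LBAs are ignored."""
    for _lba, payload in arrivals:
        zone.append(payload)
    return zone.contents()


if __name__ == "__main__":
    # The host intends A, B, C, D at LBAs 0-3, but the device receives them reordered.
    arrivals = [(1, "B"), (0, "A"), (3, "D"), (2, "C")]
    print(write_with_retry(SwrZone(0, 4), arrivals))  # ['A', 'B', 'C', 'D']
    print(append_all(SwrZone(0, 4), arrivals))        # ['B', 'A', 'D', 'C']
```

In this model the retry path ends with the data in LBA order at the cost of extra round trips, whereas the appended data stays in arrival order on the medium, which is the sequential read concern described above.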

Bart.
