Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]
Bart Van Assche
bvanassche at acm.org
Fri Sep 23 09:19:48 PDT 2022
On 9/22/22 23:29, Matias Bjørling wrote:
> With UFS, in the proposed copy I have (it may have been changed), there's
> the concept of gap zones, which are zones that cannot be accessed by
> the host. The gap zones are essentially "LBA fillers", enabling the
> next writeable zone to start at an X * pow2 size offset. My
> understanding is that this specific approach was chosen to simplify
> standardization in UFS and avoid updating T10's ZBC with zone
> capacity support.
>
> While UFS would technically expose non-power-of-2 zone sizes, these
> could also, due to the gap zones, be considered power-of-2 zones
> if one considers the seq. write zone + the gap zone as a single
> unit.
>
> When I think about having UFS support in the kernel, the SWR and the
> gap zone could be represented as a single unit. For example:
>
> UFS - Zone Report
> Zone 0: SWR, LBA 0-11
> Zone 1: Gap, LBA 12-15
> Zone 2: SWR, LBA 16-27
> Zone 3: Gap, LBA 28-31
> ...
>
> Kernel representation - Zone Report (as supported today)
> Zone 0: SWR, LBA 0-15, Zone Capacity 12
> Zone 1: SWR, LBA 16-31, Zone Capacity 12
> ...
>
> Doing it this way removes the need for filesystems, device mappers,
> and user-space applications to understand gap zones, and allows UFS
> to work out of the box with no changes to the rest of the zoned
> storage eco-system.
>
> Has the above representation been considered?
Hi Matias,
What has been described above is the approach from the first version of
the zoned storage for UFS (ZUFS) draft standard. Support for this
approach is available in the upstream kernel. See also "[PATCH v2 0/9]
Support zoned devices with gap zones", 2022-04-21
(https://lore.kernel.org/linux-scsi/20220421183023.3462291-1-bvanassche@acm.org/).
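
The merged representation from the quoted example can be sketched as
follows. This is only an illustration of the idea, not the kernel code
from the patch series referenced above; the names DeviceZone, KernelZone
and merge_gap_zones are made up for this sketch, and the zone layout is
the one from the quoted zone report.

```python
from dataclasses import dataclass


@dataclass
class DeviceZone:
    kind: str    # "SWR" or "GAP" (labels chosen for this sketch)
    start: int   # first LBA of the zone
    length: int  # number of LBAs in the zone


@dataclass
class KernelZone:
    start: int     # first LBA of the zone
    size: int      # offset to the start of the next zone
    capacity: int  # number of writable LBAs at the start of the zone


def merge_gap_zones(report):
    """Fold each gap zone into the SWR zone that precedes it."""
    merged = []
    for zone in report:
        if zone.kind == "SWR":
            merged.append(KernelZone(zone.start, zone.length, zone.length))
        elif zone.kind == "GAP":
            # A gap zone must directly follow the zone it is folded into.
            if not merged or merged[-1].start + merged[-1].size != zone.start:
                raise ValueError("gap zone does not follow an SWR zone")
            merged[-1].size += zone.length
        else:
            raise ValueError(f"unknown zone type: {zone.kind}")
    return merged


# Zone report from the quoted example.
ufs_report = [
    DeviceZone("SWR", 0, 12),
    DeviceZone("GAP", 12, 4),
    DeviceZone("SWR", 16, 12),
    DeviceZone("GAP", 28, 4),
]
kernel_zones = merge_gap_zones(ufs_report)
for z in kernel_zones:
    print(f"SWR, LBA {z.start}-{z.start + z.size - 1}, Zone Capacity {z.capacity}")
```

With the report above this yields two zones of size 16 and capacity 12,
matching the "kernel representation" in the quoted text.
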
Since F2FS extents must be split at gap zones, gap zones negatively
affect sequential read and write performance. So we abandoned the gap
zone approach. The current approach is as follows:
* The power-of-two restriction for the offset between zone starts has
  been removed. Gap zones are no longer required. Hence, we will need the
  patches that add support for zone sizes that are not a power of two.
* The Sequential Write Required (SWR) and Sequential Write Preferred
  (SWP) zone types are supported. The feedback we received from UFS
  vendors is that which zone type works best depends on their firmware and
  ASIC design.
* We need a queue depth larger than one (QD > 1) for writes to achieve
  the full sequential write bandwidth. We plan to support QD > 1 as follows:
  - If writes have to be serialized, submit these to the same hardware
    queue. According to the UFS host controller interface (UFSHCI)
    standard, UFS host controllers are not allowed to reorder SCSI
    commands that are submitted to the same hardware queue. A source of
    command reordering that remains is the SCSI retry mechanism. Retries
    happen e.g. after a command timeout.
  - For SWP zones, require the UFS device firmware to use its garbage
    collection mechanism to reorder data in the unlikely case that
    out-of-order writes happened.
  - For SWR zones, retry writes that failed because these were received
    out-of-order by a UFS device. ZBC-1 requires compliant devices to
    respond with ILLEGAL REQUEST / UNALIGNED WRITE COMMAND to
    out-of-order writes.
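
The retry behavior for SWR zones can be illustrated with a small
simulation. This is a toy model under stated assumptions, not driver
code: SwrZone, UnalignedWriteError and submit_with_retry are
hypothetical names, the exception merely stands in for the ILLEGAL
REQUEST / UNALIGNED WRITE COMMAND response, and the arrival order is an
assumed reordering.

```python
class UnalignedWriteError(Exception):
    """Stands in for ILLEGAL REQUEST / UNALIGNED WRITE COMMAND."""


class SwrZone:
    """Toy model of a Sequential Write Required zone."""

    def __init__(self, start, capacity):
        self.start = start
        self.capacity = capacity
        self.write_pointer = start
        self.data = []

    def write(self, lba, blocks):
        # An SWR zone only accepts writes that start at the write pointer.
        if lba != self.write_pointer:
            raise UnalignedWriteError(
                f"write at LBA {lba}, write pointer at {self.write_pointer}")
        if lba + len(blocks) > self.start + self.capacity:
            raise ValueError("write crosses the end of the zone")
        self.data.extend(blocks)
        self.write_pointer += len(blocks)


def submit_with_retry(zone, writes, max_passes=8):
    """Submit writes in arrival order; resubmit those rejected as unaligned."""
    pending = list(writes)
    rejected = 0
    for _ in range(max_passes):
        if not pending:
            break
        retry = []
        for lba, blocks in pending:
            try:
                zone.write(lba, blocks)
            except UnalignedWriteError:
                rejected += 1
                retry.append((lba, blocks))
        pending = retry
    if pending:
        raise RuntimeError("writes still unaligned after retries")
    return rejected


zone = SwrZone(start=0, capacity=12)
# Assumed arrival order: the write for LBA 4 overtakes the write for LBA 0.
arrival_order = [
    (4, list("efgh")),
    (0, list("abcd")),
    (8, list("ijkl")),
]
retries = submit_with_retry(zone, arrival_order)
print("".join(zone.data), "rejected writes:", retries)
```

In this model the data ends up in LBA order on the medium even though
the commands arrived out of order; the cost is the rejected commands
that had to be resubmitted.
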
We have considered the zone append approach but decided not to use it
because if zone append commands get reordered the data ends up
permanently out-of-order on the storage medium. This affects sequential
read performance negatively.
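
The effect of reordered zone append commands can be sketched in the same
way. Again this is a toy model: AppendZone and count_extents are
hypothetical names and the arrival order is an assumption chosen for
illustration.

```python
class AppendZone:
    """Toy model of a zone that is written with zone append."""

    def __init__(self, start):
        self.start = start
        self.blocks = []

    def append(self, block):
        # The device chooses the LBA (the current write pointer) and
        # reports it back to the host.
        lba = self.start + len(self.blocks)
        self.blocks.append(block)
        return lba


def count_extents(lbas):
    """Count the contiguous LBA runs needed to read the blocks in file order."""
    if not lbas:
        return 0
    return 1 + sum(1 for a, b in zip(lbas, lbas[1:]) if b != a + 1)


zone = AppendZone(start=0)
file_blocks = ["blk0", "blk1", "blk2", "blk3"]
# Assumed reordering: appends for file blocks 1 and 3 overtake 0 and 2.
arrival_order = [1, 0, 3, 2]
lba_of = {}
for idx in arrival_order:
    lba_of[idx] = zone.append(file_blocks[idx])

# LBA of each file block, listed in file order.
layout = [lba_of[i] for i in range(len(file_blocks))]
print(layout, "extents:", count_extents(layout))
```

Every append succeeds, so nothing triggers a retry, and the file order
no longer matches the LBA order. Reading the file sequentially then
needs four separate extents instead of one.
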
Bart.