[RFC PATCH -next v3 01/10] block: introduce BLK_FEAT_WRITE_ZEROES_UNMAP to queue limits features

Zhang Yi yi.zhang at huaweicloud.com
Wed Apr 9 20:52:17 PDT 2025


On 2025/4/9 18:31, Christoph Hellwig wrote:
> On Tue, Mar 18, 2025 at 03:35:36PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang at huawei.com>
>>
>> Currently, disks primarily implement the write zeroes command (aka
>> REQ_OP_WRITE_ZEROES) through two mechanisms: the first involves
>> physically writing zeros to the disk media (e.g., HDDs), while the
>> second performs an unmap operation on the logical blocks, effectively
>> putting them into a deallocated state (e.g., SSDs). The first method is
>> generally slow, while the second method is typically very fast.
>>
>> For example, on certain NVMe SSDs that support NVME_NS_DEAC, submitting
>> REQ_OP_WRITE_ZEROES requests with the NVME_WZ_DEAC bit can accelerate
>> the write zeros operation by placing disk blocks into
> 
> Note that this is a can, not a must.  The NVMe definition of Write
> Zeroes is unfortunately pretty stupid.
> 
>> +		[RO] Devices that explicitly support the unmap write zeroes
>> +		operation in which a single write zeroes request with the unmap
>> +		bit set to zero out the range of contiguous blocks on storage
>> +		by freeing blocks, rather than writing physical zeroes to the
>> +		media.
> 
> This is not actually guaranteed for nvme or scsi.

Thank you for your review and comments. However, I'm not sure I fully
understand your points. Could you please provide more details?

AFAIK, the NVMe protocol has the following description in the latest
NVM Command Set Specification Figure 82 and Figure 114:

===
Deallocate (DEAC): If this bit is set to ‘1’, then the host is
requesting that the controller deallocate the specified logical blocks.
If this bit is cleared to ‘0’, then the host is not requesting that
the controller deallocate the specified logical blocks...

DLFEAT:
Write Zeroes Deallocation Support (WZDS): If this bit is set to ‘1’,
then the controller supports the Deallocate bit in the Write Zeroes
command for this namespace...
Deallocation Read Behavior (DRB): This field indicates the deallocated
logical block read behavior. For a logical block that is deallocated,
this field indicates the values read from that deallocated logical block
and its metadata (excluding protection information)...

  Value  Definition
  001b   A deallocated logical block returns all bytes cleared to 0h
===

At the same time, the current kernel decides whether to set the unmap
bit when submitting the write zeroes command based on the protocol
fields above. So I think these rules should be clear now.

Are you saying that what the protocol describes is not a mandatory
requirement? That is, a disk that claims to support the unmap write
zeroes command (WZDS=1, DRB=1) may in fact still write actual zero data
to the storage media? Or were you referring to nonconforming disks
that do not obey the protocol and mislead users?

Thanks,
Yi.



