[RFC PATCH 0/8] block: add support for REQ_OP_VERIFY

Douglas Gilbert dgilbert at interlog.com
Thu Nov 4 08:16:41 PDT 2021


On 2021-11-04 2:46 a.m., Chaitanya Kulkarni wrote:
> From: Chaitanya Kulkarni <kch at nvidia.com>
> 
> Hi,
> 
> One of the responsibilities of the Operating System, along with managing
> resources, is to provide a unified interface to the user by creating
> hardware abstractions. In the Linux Kernel storage stack that
> abstraction is created by implementing generic request operations such
> as REQ_OP_READ/REQ_OP_WRITE or REQ_OP_DISCARD/REQ_OP_WRITE_ZEROES, etc.,
> which are mapped to the corresponding low-level commands of the specific
> hardware protocol, e.g. SCSI or NVMe.
> 
> With that in mind, this RFC patch-series implements a new block layer
> operation to offload data verification onto the controller if supported,
> or to emulate the operation if not. The main advantage is to free up the
> CPU and reduce host link traffic: for some devices the internal
> bandwidth is higher than the host link, so offloading this operation can
> improve the performance of proactive error detection applications such
> as file-system level scrubbing.
> 
> * Background *
> -----------------------------------------------------------------------
> 
> The NVMe Specification provides a controller-level Verify command [1],
> similar to the ATA Verify [2] command, where the controller is
> responsible for data verification without transferring the data to the
> host (offloading LBA verification). This is designed to proactively
> discover any data corruption while the device is idle, so that
> applications can protect sensitive data and take corrective action
> instead of waiting for a failure to occur.
> 
> The NVMe Verify command was added in order to provide low-level media
> scrubbing and possibly move the data to the right place in case of
> correctable media degradation. It also provides a way to enhance
> file-system level scrubbing/checksum verification and optionally offload
> this CPU intensive task to the kernel (when emulated), over the fabric,
> and to the controller (when supported).
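> 
> On the driver side the mapping is mechanical. Below is a minimal sketch
> of how the NVMe driver could set up such a command, modeled on the
> existing write-zeroes setup path; the helper name and exact field usage
> here are illustrative and may differ from the actual patches :-
> 
>   static inline blk_status_t nvme_setup_verify(struct nvme_ns *ns,
>                   struct request *req, struct nvme_command *cmnd)
>   {
>           cmnd->rw.opcode = nvme_cmd_verify;      /* Verify, opcode 0x0c */
>           cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
>           cmnd->rw.slba =
>                   cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
>           /* 0's based count of logical blocks to verify, no data phase */
>           cmnd->rw.length =
>                   cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
>           return BLK_STS_OK;
>   }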
> 
> This is useful when the controller's internal bandwidth is higher than
> the host's bandwidth, giving a sharp increase in performance due to
> _no host traffic or host CPU involvement_.
> 
> * Implementation *
> -----------------------------------------------------------------------
> 
> Right now there is no generic interface that in-kernel components such
> as file-systems, or userspace applications, can use to issue a verify
> command through a central block layer API (other than passthru commands
> or some combination of write/read/compare). This can lead to each
> userspace application carrying its own protocol-specific IOCTL, which
> defeats the purpose of having the OS provide a H/W abstraction.
> 
> This patch series introduces a new payloadless block layer request
> operation, REQ_OP_VERIFY, that allows in-kernel components & userspace
> applications to verify a range of LBAs by offloading checksum
> scrubbing/verification to the controller. For direct attached devices
> this decreases host DMA traffic and CPU usage; for fabrics attached
> devices it decreases network traffic and CPU usage on both host &
> target.
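> 
> A minimal sketch of the kind of helper an in-kernel user could call,
> assuming a blk-lib.c style interface named blkdev_issue_verify() modeled
> on blkdev_issue_discard() (the exact name and signature in the patches
> may differ) :-
> 
>   int blkdev_issue_verify(struct block_device *bdev, sector_t sector,
>                           sector_t nr_sects, gfp_t gfp_mask)
>   {
>           struct bio *bio = NULL;
> 
>           while (nr_sects) {
>                   /* arbitrary cap per payloadless bio, purely illustrative */
>                   sector_t chunk = min_t(sector_t, nr_sects, 1U << 20);
> 
>                   /* chains & submits the previous bio, allocates the next */
>                   bio = blk_next_bio(bio, 0, gfp_mask);
>                   bio_set_dev(bio, bdev);
>                   bio->bi_opf = REQ_OP_VERIFY;
>                   bio->bi_iter.bi_sector = sector;
>                   bio->bi_iter.bi_size = chunk << SECTOR_SHIFT;
> 
>                   sector += chunk;
>                   nr_sects -= chunk;
>           }
> 
>           /* wait for the whole chain to complete */
>           return bio ? submit_bio_wait(bio) : 0;
>   }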
> 
> * Scope *
> -----------------------------------------------------------------------
> 
> Please note this only covers the operating system level overhead.
> Analyzing controller verify command performance for common protocols
> (SCSI/NVMe) is out of scope for REQ_OP_VERIFY.
> 
> * Micro Benchmarks *
> -----------------------------------------------------------------------
> 
> When verifying 500GB of data over NVMeOF with nvme-loop and null_blk as
> the target backend block device, the results show almost an 80%
> performance increase :-
> 
> With Verify resulting in REQ_OP_VERIFY to null_blk :-
> 
> real	2m3.773s
> user	0m0.000s
> sys	0m59.553s
> 
> With Emulation resulting in REQ_OP_READ to null_blk :-
> 
> real	12m18.964s
> user	0m0.002s
> sys	1m15.666s
> 
> 
> A detailed test log is included at the end of the cover letter.
> Each of the following was tested:
> 
> 1. Direct Attached REQ_OP_VERIFY.
> 2. Fabrics Attached REQ_OP_VERIFY.
> 3. Multi-device (md) REQ_OP_VERIFY.
> 
> * The complete picture *
> -----------------------------------------------------------------------
> 
>    For completeness, the whole kernel stack support is divided into
>    two phases :-
>   
>    Phase I :-
>   
>     Add and stabilize the support for the block layer & low-level
>     drivers such as SCSI, NVMe, MD, and NVMeOF, implement the necessary
>     emulations in the block layer if needed, and provide block level
>     tools such as _blkverify_. Also, add appropriate test cases for
>     code coverage.
> 
>    Phase II :-
>   
>     Add and stabilize the support for upper layer kernel components
>     such as file-systems and provide userspace tools such as _fsverify_
>     to route requests from file-systems through the block layer to the
>     low-level device drivers.
> 
> 
> Please note that the interfaces for the blk-lib.c REQ_OP_VERIFY
> emulation were put together for the scope of this RFC and will change
> in the future.
> 
> Any comments are welcome.

Hi,
You may also want to consider higher-level support for the NVMe COMPARE
and SCSI VERIFY(BYTCHK=1) commands. Since PCIe and SAS transports are
full duplex, replacing two READs (plus a memcmp in host memory) with
one READ and one COMPARE may be a win on a bandwidth-constrained
system. It is safe to assume that the data-in transfers on a storage
transport exceed (probably by a significant margin) the data-out
transfers. An offloaded COMPARE switches one of those data-in transfers
to a data-out transfer, so it should improve the bandwidth utilization.
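
As a rough userspace illustration of that pattern (assuming the
namespace actually implements the optional Compare command, ONCS bit 0),
an extent read from one device can be checked against another device
with a single data-out COMPARE through the NVMe passthru interface,
instead of a second READ plus a memcmp:

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/nvme_ioctl.h>

  /* Compare nlb+1 logical blocks starting at slba against the data in
   * buf (e.g. data just read from a mirror device). */
  static int nvme_compare_extent(int fd, void *buf, uint64_t slba,
                                 uint16_t nlb /* 0's based */)
  {
          struct nvme_user_io io;

          memset(&io, 0, sizeof(io));
          io.opcode  = 0x05;              /* NVMe Compare */
          io.addr    = (uintptr_t)buf;    /* data-out from the host */
          io.slba    = slba;
          io.nblocks = nlb;

          return ioctl(fd, NVME_IOCTL_SUBMIT_IO, &io);
  }

A non-zero return indicates either an ioctl failure or a command status
such as a compare failure.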

I did some brief benchmarking of an NVMe SSD's COMPARE command (it's
optional) and the results were underwhelming. OTOH using my own dd
variants (which can do a compare instead of a copy) and a scsi_debug
target (i.e. RAM) I have seen compare rates of > 15 GB/s while a copy
rarely exceeds 9 GB/s.


BTW, the SCSI VERIFY(BYTCHK=3) command compares one block sent from
the host with a sequence of logical blocks on the media. So, for example,
it would be a quick way of checking that a sequence of blocks contains
zeroed data.
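
For instance, a rough SG_IO sketch of that zero-check (VERIFY(10) with
BYTCHK set to 11b, sending a single zeroed block that the device
compares against every block in the range; CDB layout per SBC, error
handling omitted):

  #include <stdint.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <scsi/sg.h>

  /* Ask the device to compare 'nr_blocks' blocks starting at 'lba'
   * against the single block of data in 'one_block'. */
  static int scsi_verify_same(int fd, uint32_t lba, uint16_t nr_blocks,
                              void *one_block, unsigned int block_size)
  {
          unsigned char cdb[10] = { 0x2f };       /* VERIFY(10) */
          unsigned char sense[32];
          struct sg_io_hdr hdr;

          cdb[1] = 0x3 << 1;                      /* BYTCHK = 11b */
          cdb[2] = lba >> 24; cdb[3] = lba >> 16;
          cdb[4] = lba >> 8;  cdb[5] = lba;
          cdb[7] = nr_blocks >> 8;                /* verification length */
          cdb[8] = nr_blocks;

          memset(&hdr, 0, sizeof(hdr));
          hdr.interface_id    = 'S';
          hdr.cmdp            = cdb;
          hdr.cmd_len         = sizeof(cdb);
          hdr.dxfer_direction = SG_DXFER_TO_DEV;  /* one block data-out */
          hdr.dxferp          = one_block;
          hdr.dxfer_len       = block_size;
          hdr.sbp             = sense;
          hdr.mx_sb_len       = sizeof(sense);
          hdr.timeout         = 60000;            /* milliseconds */

          if (ioctl(fd, SG_IO, &hdr) < 0)
                  return -1;
          /* MISCOMPARE shows up as CHECK CONDITION in the SCSI status */
          return hdr.status;
  }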

Doug Gilbert


