Buffer I/O Errors from Zoned NVME devices

Tue Feb 2 10:22:50 EST 2021

On Tue, Feb 02, 2021 at 03:06:22PM +0000, Jeffrey Lien wrote:
> Keith, Christoph, Damien,
> These errors are happening on both the 5.9 and 5.10.7 kernels.  CONFIG_BLK_DEV_ZONED is set to y in the .config file.
> 
> I will try the patch to disable partition scanning that Keith suggested.  I'll also get the latest FW loaded and see if that resolves the issue.  

After re-reading FW dev's explanation, it sounds like something is off
with the implementation. The spec only allows a "boundary error" if
you're crossing zones, but you said the reads are in the last zone, so
there's no opportunity to cross to the next zone.

What did you mean by the "zone's hole"? Does this drive have ZCAP less
than ZSZE and we're reading from unmapped LBAs? If so, I think we are
supposed to be allowed to read these, but we just can't write them.
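
To make the two cases concrete, here is a minimal sketch (plain Python, not
kernel code) that classifies a read as crossing a zone boundary, touching the
ZCAP..ZSZE hole, or neither. It assumes a uniform zone size and capacity
expressed in logical blocks. The function name, its parameters, and the
numbers in the usage lines are hypothetical and are not taken from the drive
in this report.

```python
def classify_read(slba, nlb, zsze, zcap):
    """Classify a read of `nlb` blocks starting at LBA `slba`.

    Assumes every zone has the same size (zsze) and capacity (zcap),
    both in logical blocks, with zcap <= zsze. Illustrative only.
    """
    if nlb < 1 or zcap > zsze:
        raise ValueError("need nlb >= 1 and zcap <= zsze")

    zone = slba // zsze            # zone containing the first block
    offset = slba % zsze           # offset of the first block in that zone
    last_zone = (slba + nlb - 1) // zsze

    # A read crosses a boundary only if its last block is in another zone.
    crosses_boundary = last_zone != zone

    # The hole is the range [zcap, zsze) inside each zone; it only exists
    # when the capacity is smaller than the zone size.
    touches_hole = zcap < zsze and offset + nlb > zcap

    return {
        "zone": zone,
        "crosses_boundary": crosses_boundary,
        "touches_hole": touches_hole,
    }


# Hypothetical geometry: 1000-block zones with 800 usable blocks each.
print(classify_read(850, 8, zsze=1000, zcap=800))  # in the hole, same zone
print(classify_read(996, 8, zsze=1000, zcap=800))  # spans into the next zone
```

With this geometry, only the second read spans two zones. The first stays
inside one zone and merely lands in unmapped LBAs, which is the case I would
not expect to fail with a boundary error.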

> -----Original Message-----
> From: Damien Le Moal <Damien.LeMoal at wdc.com> 
> Sent: Monday, February 1, 2021 3:05 PM
> To: hch at lst.de; Keith Busch <kbusch at kernel.org>
> Cc: Jeffrey Lien <Jeff.Lien at wdc.com>; linux-nvme at lists.infradead.org
> Subject: Re: Buffer I/O Errors from Zoned NVME devices
> 
> On 2021/02/02 3:03, hch at lst.de wrote:
> > On Mon, Feb 01, 2021 at 09:53:06AM -0800, Keith Busch wrote:
> >> On Mon, Feb 01, 2021 at 02:36:12PM +0000, Jeffrey Lien wrote:
> >>> Christoph, Keith
> >>> We're seeing a lot of these Buffer I/O errors with our zoned nvme devices.  One of the FW developers looked into it and had the following explanation:
> >>> All these reads are issued by the kernel during enumeration and target LBAs in the last zone's hole, so they are expected to return a boundary error, which the kernel then logs.
> >>>
> >>> [65281.936988] Buffer I/O error on dev nvme1n2, logical block 3800039296, async page read
> >>> [65281.937165] blk_update_request: I/O error, dev nvme1n2, sector 3800039297 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> >>> [65281.937166] Buffer I/O error on dev nvme1n2, logical block 3800039297, async page read
> >>> [65281.937335] blk_update_request: I/O error, dev nvme1n2, sector 3800039298 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> >>> [65281.937336] Buffer I/O error on dev nvme1n2, logical block 3800039298, async page read
> >>> [65281.937498] blk_update_request: I/O error, dev nvme1n2, sector 3800039299 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> >>>
> >>> Are you aware of this issue and, if so, do you have any recommendations on how to avoid or resolve it?
> >>
> >> Is this from the partition scanning? We don't partition zoned 
> >> devices, so I think we can skip it. Does the following resolve the issue?
> > 
> > We already have special zoned device handling in the partitioning code.
> 
> Partitions are ignored and a warning is printed, but the partition table is still being read...
> 
> > 
> > But NVMe should make sure to never span a zone boundary as we set the 
> > chunk size to avoid that.
> > 
> > What kernel version is this?  Is CONFIG_BLK_DEV_ZONED enabled?
> 
> I had a very similar problem doing zonefs tests on Matias' machine on a ZNS drive last week. The problem was the firmware... An upgrade to the latest version fixed the issue. Not sure what FW rev you are running here, but upgrading might solve this.
> 
> > 
> 
> 
> --
> Damien Le Moal
> Western Digital Research