Problem with SPCC 256GB NVMe 1.3 drive - refcount_t: underflow; use-after-free.

Bradley Chapman chapman6235 at comcast.net
Wed Jan 20 21:33:08 EST 2021


Good evening!

On 1/19/21 10:08 PM, Chaitanya Kulkarni wrote:
> On 1/18/21 10:33 AM, Bradley Chapman wrote:
>> Good afternoon!
>>
>> On 1/17/21 11:36 PM, Chaitanya Kulkarni wrote:
>>> On 1/17/21 11:05 AM, Bradley Chapman wrote:
>>>> [ 2836.554298] nvme nvme1: I/O 415 QID 3 timeout, disable controller
>>>> [ 2836.672064] blk_update_request: I/O error, dev nvme1n1, sector 16350
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672072] blk_update_request: I/O error, dev nvme1n1, sector 16093
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672074] blk_update_request: I/O error, dev nvme1n1, sector 15836
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672076] blk_update_request: I/O error, dev nvme1n1, sector 15579
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672078] blk_update_request: I/O error, dev nvme1n1, sector 15322
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672080] blk_update_request: I/O error, dev nvme1n1, sector 15065
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672082] blk_update_request: I/O error, dev nvme1n1, sector 14808
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672083] blk_update_request: I/O error, dev nvme1n1, sector 14551
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672085] blk_update_request: I/O error, dev nvme1n1, sector 14294
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672087] blk_update_request: I/O error, dev nvme1n1, sector 14037
>>>> op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672121] nvme nvme1: failed to mark controller live state
>>>> [ 2836.672123] nvme nvme1: Removing after probe failure status: -19
>>>> [ 2836.689016] Aborting journal on device dm-0-8.
>>>> [ 2836.689024] Buffer I/O error on dev dm-0, logical block 25198592,
>>>> lost sync page write
>>>> [ 2836.689027] JBD2: Error -5 detected when updating journal superblock
>>>> for dm-0-8.
>>> Without knowing the fs mount/format command, I can only suspect that the
>>> superblock zeroing is issued as a write-zeroes request, which is
>>> translated into REQ_OP_WRITE_ZEROES, which the controller is not able to
>>> process, resulting in the error. This analysis may be wrong.
>>>
>>> Can you please share the following details:
>>>
>>> nvme id-ns /dev/nvme0n1 -H (we are interested in oncs part here)
>> I ran the requested command against /dev/nvme1n1 (since /dev/nvme0n1
>> works perfectly so far) and here is the result:
> Sorry, my bad, it was supposed to be nvme id-ctrl /dev/nvme0n1 -H

$ nvme id-ctrl /dev/nvme1n1 -H

NVME Identify Controller:
vid       : 0x2263
ssvid     : 0x1d97
sn        : P2002287000000001296
mn        : SPCC M.2 PCIe SSD
fr        : V1.0
rab       : 6
ieee      : 000000
cmic      : 0
   [3:3] : 0     ANA not supported
   [2:2] : 0     PCI
   [1:1] : 0     Single Controller
   [0:0] : 0     Single Port

mdts      : 5
cntlid    : 1
ver       : 10300
rtd3r     : 249f0
rtd3e     : 13880
oaes      : 0x200
   [9:9] : 0x1   Firmware Activation Notices Supported
   [8:8] : 0     Namespace Attribute Changed Event Not Supported

ctratt    : 0
   [5:5] : 0     Predictable Latency Mode Not Supported
   [4:4] : 0     Endurance Groups Not Supported
   [3:3] : 0     Read Recovery Levels Not Supported
   [2:2] : 0     NVM Sets Not Supported
   [1:1] : 0     Non-Operational Power State Permissive Not Supported
   [0:0] : 0     128-bit Host Identifier Not Supported

rrls      : 0
oacs      : 0x7
   [8:8] : 0     Doorbell Buffer Config Not Supported
   [7:7] : 0     Virtualization Management Not Supported
   [6:6] : 0     NVMe-MI Send and Receive Not Supported
   [5:5] : 0     Directives Not Supported
   [4:4] : 0     Device Self-test Not Supported
   [3:3] : 0     NS Management and Attachment Not Supported
   [2:2] : 0x1   FW Commit and Download Supported
   [1:1] : 0x1   Format NVM Supported
   [0:0] : 0x1   Security Send and Receive Supported

acl       : 3
aerl      : 3
frmw      : 0x2
   [4:4] : 0     Firmware Activate Without Reset Not Supported
   [3:1] : 0x1   Number of Firmware Slots
   [0:0] : 0     Firmware Slot 1 Read/Write

lpa       : 0xa
   [3:3] : 0x1   Telemetry host/controller initiated log page Suporrted
   [2:2] : 0     Extended data for Get Log Page Not Supported
   [1:1] : 0x1   Command Effects Log Page Supported
   [0:0] : 0     SMART/Health Log Page per NS Not Supported

elpe      : 63
npss      : 0
avscc     : 0x1
   [0:0] : 0x1   Admin Vendor Specific Commands uses NVMe Format

apsta     : 0
   [0:0] : 0     Autonomous Power State Transitions Not Supported

wctemp    : 354
cctemp    : 363
mtfa      : 0
hmpre     : 16384
hmmin     : 16384
tnvmcap   : 0
unvmcap   : 0
rpmbs     : 0
  [31:24]: 0     Access Size
  [23:16]: 0     Total Size
   [5:3] : 0     Authentication Method
   [2:0] : 0     Number of RPMB Units

edstt     : 5
dsto      : 1
fwug      : 0
kas       : 0
hctma     : 0
   [0:0] : 0     Host Controlled Thermal Management Not Supported

mntmt     : 0
mxtmt     : 0
sanicap   : 0
   [2:2] : 0     Overwrite Sanitize Operation Not Supported
   [1:1] : 0     Block Erase Sanitize Operation Not Supported
   [0:0] : 0     Crypto Erase Sanitize Operation Not Supported

hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
anatt     : 0
anacap    : 0
   [7:7] : 0     Non-zero group ID Not Supported
   [6:6] : 0     Group ID does not change
   [4:4] : 0     ANA Change state Not Supported
   [3:3] : 0     ANA Persistent Loss state Not Supported
   [2:2] : 0     ANA Inaccessible state Not Supported
   [1:1] : 0     ANA Non-optimized state Not Supported
   [0:0] : 0     ANA Optimized state Not Supported

anagrpmax : 0
nanagrpid : 0
sqes      : 0x66
   [7:4] : 0x6   Max SQ Entry Size (64)
   [3:0] : 0x6   Min SQ Entry Size (64)

cqes      : 0x44
   [7:4] : 0x4   Max CQ Entry Size (16)
   [3:0] : 0x4   Min CQ Entry Size (16)

maxcmd    : 0
nn        : 1
oncs      : 0x1d
   [6:6] : 0     Timestamp Not Supported
   [5:5] : 0     Reservations Not Supported
   [4:4] : 0x1   Save and Select Supported
   [3:3] : 0x1   Write Zeroes Supported
   [2:2] : 0x1   Data Set Management Supported
   [1:1] : 0     Write Uncorrectable Not Supported
   [0:0] : 0x1   Compare Supported

fuses     : 0
   [0:0] : 0     Fused Compare and Write Not Supported

fna       : 0x3
   [2:2] : 0     Crypto Erase Not Supported as part of Secure Erase
   [1:1] : 0x1   Crypto Erase Applies to All Namespace(s)
   [0:0] : 0x1   Format Applies to All Namespace(s)

vwc       : 0x5
   [7:3] : 0x2   Reserved
   [0:0] : 0x1   Volatile Write Cache Present

awun      : 0
awupf     : 0
nvscc     : 0
   [0:0] : 0     NVM Vendor Specific Commands uses Vendor Specific Format

nwpc      : 0
   [2:2] : 0     Permanent Write Protect Not Supported
   [1:1] : 0     Write Protect Until Power Supply Not Supported
   [0:0] : 0     No Write Protect and Write Protect Namespace Not Supported

acwu      : 0
sgls      : 0
  [1:0]  : 0     Scatter-Gather Lists Not Supported

mnan      : 0
subnqn    :
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
   [0:0] : 0     Dynamic Controller Model

msdbd     : 0
ps    0 : mp:3.30W operational enlat:5 exlat:5 rrt:0 rrl:0
           rwt:0 rwl:0 idle_power:- active_power:-
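
For reference, the oncs value above can be decoded by hand. This is a minimal sketch, assuming only the bit positions that nvme-cli prints in the output above; the helper name bit_set is made up for illustration. It shows that bit 3 (Write Zeroes) is set in 0x1d, i.e. the controller advertises the command even though the requests in the log fail.

```shell
#!/bin/sh
# Hypothetical helper: succeed if bit $2 of the value $1 is set.
bit_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

oncs=0x1d   # value reported by "nvme id-ctrl" above

# Bit 3 is the Write Zeroes bit according to the decoded output above.
if bit_set "$oncs" 3; then
    echo "Write Zeroes advertised"
else
    echo "Write Zeroes not advertised"
fi
```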

>>> Also, for the above device, what is the value of the queue write-zeroes
>>> parameter in /sys/block/<nvmeXnY>/queue/write_zeroes_max_bytes ?
>> $ cat /sys/block/nvme1n1/queue/write_zeroes_max_bytes
>> 131584
> So write-zeroes is enabled in this setup.
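
A quick arithmetic check ties this value to the log. The sketch below assumes the 512-byte sector unit the block layer uses in blk_update_request messages; the function name is hypothetical. The limit works out to 257 sectors, which matches the spacing between consecutive failed sectors in the log (for example 16350 and 16093), suggesting that each failed request was a maximum-size Write Zeroes.

```shell
#!/bin/sh
# Hypothetical helper: convert a byte limit into a count of sectors.
sectors_per_request() {
    echo $(( $1 / $2 ))
}

max_bytes=131584    # from write_zeroes_max_bytes above
sector_size=512     # block-layer sector unit (assumption stated in the text)

# Sectors covered by one maximum-size Write Zeroes request.
sectors_per_request "$max_bytes" "$sector_size"

# Distance between two consecutive failed sectors in the log.
echo $(( 16350 - 16093 ))
```

Both commands print 257.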
>>> You can also try blkdiscard -z 0 -l 1024 /dev/<nvmeXnY> to see if the
>>> problem is with write zeroes.
>> # blkdiscard -z -l 1024 /dev/nvme1n1
>> blkdiscard: /dev/nvme1n1: BLKZEROOUT ioctl failed: Device or resource busy
> This is exactly what I thought. We need to add a quirk for this model so
> that we don't set the write-zeroes support and blk-lib emulates
> write-zeroes instead.

I am ready to take patches for the NVMe driver to test this out - this 
device is not a boot device and I have no data on it that needs to be 
preserved.

>>> Can you please also try the latest nvme tree branch nvme-5.11 ?
>>>
>> Where do I get that code from? Is it already in the 5.11-rc tree or do I
>> need to look somewhere else? I checked https://github.com/linux-nvme but
>> I did not see it there.
> Here is the link: git://git.infradead.org/nvme.git
> Branch 5.12.

I tried fetching the entire repo but it was huge and would have taken a 
long time, so I tried to fetch a single branch instead and got this result:

$ git clone --branch 5.12 --single-branch git://git.infradead.org/nvme.git
Cloning into 'nvme'...
warning: Could not find remote branch 5.12 to clone.
fatal: Remote branch 5.12 not found in upstream origin
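
One way to find the exact branch name without downloading the whole repository is to list the remote heads first. This is a minimal sketch using standard git commands; the two function names are made up for illustration, and the real branch name is whatever the listing shows (the earlier mention of nvme-5.11 suggests the names may carry an nvme- prefix, but that is an assumption).

```shell
#!/bin/sh
# Hypothetical helper: print the branch names on a remote.
# "git ls-remote --heads" fetches only the ref list, not any objects.
list_remote_branches() {
    git ls-remote --heads "$1" | sed 's#^.*refs/heads/##'
}

# Hypothetical helper: clone a single branch with history truncated to
# the most recent commit, which keeps the download small.
# Arguments: <url> <branch> <target directory>
shallow_clone_branch() {
    git clone --depth 1 --single-branch --branch "$2" "$1" "$3"
}

# Usage (substitute a branch name taken from the listing):
#   list_remote_branches git://git.infradead.org/nvme.git
#   shallow_clone_branch git://git.infradead.org/nvme.git <branch> nvme
```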

I haven't compiled any out-of-tree kernel code in a very long time - how 
easy is it to add this code to a kernel tree and compile it into the 
kernel once I've figured out how to get it?

Brad


