Problem with SPCC 256GB NVMe 1.3 drive - refcount_t: underflow; use-after-free.
Bradley Chapman
chapman6235 at comcast.net
Wed Jan 20 21:33:08 EST 2021
Good evening!
On 1/19/21 10:08 PM, Chaitanya Kulkarni wrote:
> On 1/18/21 10:33 AM, Bradley Chapman wrote:
>> Good afternoon!
>>
>> On 1/17/21 11:36 PM, Chaitanya Kulkarni wrote:
>>> On 1/17/21 11:05 AM, Bradley Chapman wrote:
>>>> [ 2836.554298] nvme nvme1: I/O 415 QID 3 timeout, disable controller
>>>> [ 2836.672064] blk_update_request: I/O error, dev nvme1n1, sector 16350 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672072] blk_update_request: I/O error, dev nvme1n1, sector 16093 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672074] blk_update_request: I/O error, dev nvme1n1, sector 15836 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672076] blk_update_request: I/O error, dev nvme1n1, sector 15579 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672078] blk_update_request: I/O error, dev nvme1n1, sector 15322 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672080] blk_update_request: I/O error, dev nvme1n1, sector 15065 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672082] blk_update_request: I/O error, dev nvme1n1, sector 14808 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672083] blk_update_request: I/O error, dev nvme1n1, sector 14551 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672085] blk_update_request: I/O error, dev nvme1n1, sector 14294 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672087] blk_update_request: I/O error, dev nvme1n1, sector 14037 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
>>>> [ 2836.672121] nvme nvme1: failed to mark controller live state
>>>> [ 2836.672123] nvme nvme1: Removing after probe failure status: -19
>>>> [ 2836.689016] Aborting journal on device dm-0-8.
>>>> [ 2836.689024] Buffer I/O error on dev dm-0, logical block 25198592, lost sync page write
>>>> [ 2836.689027] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
>>> Without knowing the filesystem format/mount commands, I can only suspect
>>> that the superblock zeroing, issued as a write-zeroes request, is
>>> translated into REQ_OP_WRITE_ZEROES, which the controller is not able to
>>> process, resulting in the error. This analysis may be wrong.
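As a quick check of this hypothesis against the log above, the short Python sketch below (an illustration, not part of the original thread) pulls the failing requests out of the unwrapped dmesg lines. Every failed request is a WRITE_ZEROES, and consecutive start sectors are exactly 257 sectors apart, which suggests one large zeroing operation split into fixed-size write-zeroes commands.

```python
import re

# The blk_update_request lines from the log above, with the mail client's
# line wrapping undone so that each request sits on one line.
LOG = """\
[ 2836.672064] blk_update_request: I/O error, dev nvme1n1, sector 16350 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672072] blk_update_request: I/O error, dev nvme1n1, sector 16093 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672074] blk_update_request: I/O error, dev nvme1n1, sector 15836 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672076] blk_update_request: I/O error, dev nvme1n1, sector 15579 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672078] blk_update_request: I/O error, dev nvme1n1, sector 15322 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672080] blk_update_request: I/O error, dev nvme1n1, sector 15065 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672082] blk_update_request: I/O error, dev nvme1n1, sector 14808 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672083] blk_update_request: I/O error, dev nvme1n1, sector 14551 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672085] blk_update_request: I/O error, dev nvme1n1, sector 14294 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
[ 2836.672087] blk_update_request: I/O error, dev nvme1n1, sector 14037 op 0x9:(WRITE_ZEROES) flags 0x0 phys_seg 0 prio class 0
"""

# Capture the start sector and the symbolic operation name of each request.
PATTERN = re.compile(r"sector (\d+) op 0x[0-9a-f]+:\((\w+)\)")


def failed_requests(log: str) -> list[tuple[int, str]]:
    """Return (start sector, operation name) for each failed request."""
    return [(int(sector), op) for sector, op in PATTERN.findall(log)]


reqs = failed_requests(LOG)
ops = {op for _, op in reqs}                          # distinct operations seen
sectors = [sector for sector, _ in reqs]
strides = {a - b for a, b in zip(sectors, sectors[1:])}  # gaps between requests
print(len(reqs), ops, strides)
```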
>>>
>>> Can you please share the following details:
>>>
>>> nvme id-ns /dev/nvme0n1 -H (we are interested in oncs part here)
>> I ran the requested command against /dev/nvme1n1 (since /dev/nvme0n1
>> works perfectly so far) and here is the result:
> Sorry, my mistake: it is supposed to be nvme id-ctrl /dev/nvme0n1 -H
$ nvme id-ctrl /dev/nvme1n1 -H
NVME Identify Controller:
vid : 0x2263
ssvid : 0x1d97
sn : P2002287000000001296
mn : SPCC M.2 PCIe SSD
fr : V1.0
rab : 6
ieee : 000000
cmic : 0
[3:3] : 0 ANA not supported
[2:2] : 0 PCI
[1:1] : 0 Single Controller
[0:0] : 0 Single Port
mdts : 5
cntlid : 1
ver : 10300
rtd3r : 249f0
rtd3e : 13880
oaes : 0x200
[9:9] : 0x1 Firmware Activation Notices Supported
[8:8] : 0 Namespace Attribute Changed Event Not Supported
ctratt : 0
[5:5] : 0 Predictable Latency Mode Not Supported
[4:4] : 0 Endurance Groups Not Supported
[3:3] : 0 Read Recovery Levels Not Supported
[2:2] : 0 NVM Sets Not Supported
[1:1] : 0 Non-Operational Power State Permissive Not Supported
[0:0] : 0 128-bit Host Identifier Not Supported
rrls : 0
oacs : 0x7
[8:8] : 0 Doorbell Buffer Config Not Supported
[7:7] : 0 Virtualization Management Not Supported
[6:6] : 0 NVMe-MI Send and Receive Not Supported
[5:5] : 0 Directives Not Supported
[4:4] : 0 Device Self-test Not Supported
[3:3] : 0 NS Management and Attachment Not Supported
[2:2] : 0x1 FW Commit and Download Supported
[1:1] : 0x1 Format NVM Supported
[0:0] : 0x1 Security Send and Receive Supported
acl : 3
aerl : 3
frmw : 0x2
[4:4] : 0 Firmware Activate Without Reset Not Supported
[3:1] : 0x1 Number of Firmware Slots
[0:0] : 0 Firmware Slot 1 Read/Write
lpa : 0xa
[3:3] : 0x1 Telemetry host/controller initiated log page Suporrted
[2:2] : 0 Extended data for Get Log Page Not Supported
[1:1] : 0x1 Command Effects Log Page Supported
[0:0] : 0 SMART/Health Log Page per NS Not Supported
elpe : 63
npss : 0
avscc : 0x1
[0:0] : 0x1 Admin Vendor Specific Commands uses NVMe Format
apsta : 0
[0:0] : 0 Autonomous Power State Transitions Not Supported
wctemp : 354
cctemp : 363
mtfa : 0
hmpre : 16384
hmmin : 16384
tnvmcap : 0
unvmcap : 0
rpmbs : 0
[31:24]: 0 Access Size
[23:16]: 0 Total Size
[5:3] : 0 Authentication Method
[2:0] : 0 Number of RPMB Units
edstt : 5
dsto : 1
fwug : 0
kas : 0
hctma : 0
[0:0] : 0 Host Controlled Thermal Management Not Supported
mntmt : 0
mxtmt : 0
sanicap : 0
[2:2] : 0 Overwrite Sanitize Operation Not Supported
[1:1] : 0 Block Erase Sanitize Operation Not Supported
[0:0] : 0 Crypto Erase Sanitize Operation Not Supported
hmminds : 0
hmmaxd : 0
nsetidmax : 0
anatt : 0
anacap : 0
[7:7] : 0 Non-zero group ID Not Supported
[6:6] : 0 Group ID does not change
[4:4] : 0 ANA Change state Not Supported
[3:3] : 0 ANA Persistent Loss state Not Supported
[2:2] : 0 ANA Inaccessible state Not Supported
[1:1] : 0 ANA Non-optimized state Not Supported
[0:0] : 0 ANA Optimized state Not Supported
anagrpmax : 0
nanagrpid : 0
sqes : 0x66
[7:4] : 0x6 Max SQ Entry Size (64)
[3:0] : 0x6 Min SQ Entry Size (64)
cqes : 0x44
[7:4] : 0x4 Max CQ Entry Size (16)
[3:0] : 0x4 Min CQ Entry Size (16)
maxcmd : 0
nn : 1
oncs : 0x1d
[6:6] : 0 Timestamp Not Supported
[5:5] : 0 Reservations Not Supported
[4:4] : 0x1 Save and Select Supported
[3:3] : 0x1 Write Zeroes Supported
[2:2] : 0x1 Data Set Management Supported
[1:1] : 0 Write Uncorrectable Not Supported
[0:0] : 0x1 Compare Supported
fuses : 0
[0:0] : 0 Fused Compare and Write Not Supported
fna : 0x3
[2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
[1:1] : 0x1 Crypto Erase Applies to All Namespace(s)
[0:0] : 0x1 Format Applies to All Namespace(s)
vwc : 0x5
[7:3] : 0x2 Reserved
[0:0] : 0x1 Volatile Write Cache Present
awun : 0
awupf : 0
nvscc : 0
[0:0] : 0 NVM Vendor Specific Commands uses Vendor Specific Format
nwpc : 0
[2:2] : 0 Permanent Write Protect Not Supported
[1:1] : 0 Write Protect Until Power Supply Not Supported
[0:0] : 0 No Write Protect and Write Protect Namespace Not Supported
acwu : 0
sgls : 0
[1:0] : 0 Scatter-Gather Lists Not Supported
mnan : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
ctrattr : 0
[0:0] : 0 Dynamic Controller Model
msdbd : 0
ps 0 : mp:3.30W operational enlat:5 exlat:5 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
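The field that matters most in this output is oncs : 0x1d. The sketch below (an illustration, using the bit labels nvme-cli prints above) decodes it: bit 3 is set, so the controller advertises Write Zeroes support, which the Linux driver checks before enabling write-zeroes for the namespace.

```python
# Bit positions of the ONCS field, as labelled by `nvme id-ctrl -H` above.
ONCS_BITS = {
    0: "Compare",
    1: "Write Uncorrectable",
    2: "Data Set Management",
    3: "Write Zeroes",
    4: "Save and Select",
    5: "Reservations",
    6: "Timestamp",
}


def decode_oncs(oncs: int) -> dict[str, bool]:
    """Map each ONCS capability name to whether its bit is set."""
    return {name: bool((oncs >> bit) & 1) for bit, name in ONCS_BITS.items()}


caps = decode_oncs(0x1D)  # value reported by this SPCC controller
for name, supported in caps.items():
    print(f"{name:20} {'supported' if supported else 'not supported'}")
```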
>>> Also, for the above device, what is the value of the block queue
>>> write-zeroes parameter in /sys/block/<nvmeXnY>/queue/write_zeroes_max_bytes?
>> $ cat /sys/block/nvme1n1/queue/write_zeroes_max_bytes
>> 131584
> So write-zeroes is configured on this setup.
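The 131584-byte value can be reconciled with mdts : 5 from the id-ctrl output. The sketch below rests on assumptions not confirmed in this thread: the controller's minimum page size (CAP.MPSMIN) is 4 KiB, the namespace uses 512-byte LBAs, and the driver of this era allowed max_hw_sectors + 1 blocks per write-zeroes command. Under those assumptions the limit works out to 257 sectors, the same spacing seen between the failing sectors in the error log.

```python
SECTOR = 512  # bytes per kernel sector


def write_zeroes_max_bytes(mdts: int, page_size: int = 4096) -> int:
    """Estimate write_zeroes_max_bytes from the controller's MDTS field.

    Assumptions: CAP.MPSMIN is 4 KiB, the namespace uses 512-byte LBAs,
    and the driver permits max_hw_sectors + 1 blocks per write-zeroes.
    """
    if mdts == 0:
        raise ValueError("MDTS 0 means no transfer-size limit")
    # MDTS is a power of two in units of the minimum memory page size.
    max_transfer = (1 << mdts) * page_size
    max_hw_sectors = max_transfer // SECTOR
    return (max_hw_sectors + 1) * SECTOR


limit = write_zeroes_max_bytes(5)  # mdts : 5 from nvme id-ctrl
print(limit, limit // SECTOR)
```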
>>> You can also try blkdiscard -z -o 0 -l 1024 /dev/<nvmeXnY> to see if the
>>> problem is with write zeroes.
>> # blkdiscard -z -l 1024 /dev/nvme1n1
>> blkdiscard: /dev/nvme1n1: BLKZEROOUT ioctl failed: Device or resource busy
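The error message shows that blkdiscard -z is a thin wrapper around the BLKZEROOUT ioctl. The sketch below issues the same ioctl from Python to make the request explicit: a 16-byte {start, length} pair in bytes, both multiples of 512. The helper names are illustrative. Running zero_range() is destructive and needs root; an OSError with errno EBUSY would correspond to the "Device or resource busy" failure above.

```python
import fcntl
import os
import struct

# BLKZEROOUT is _IO(0x12, 127) in <linux/fs.h>; _IO encodes no size or direction.
BLKZEROOUT = (0x12 << 8) | 127  # 0x127F


def zeroout_arg(offset: int, length: int) -> bytes:
    """Pack the uint64_t[2] {start, len} byte range that BLKZEROOUT expects."""
    if offset % 512 or length % 512:
        raise ValueError("offset and length must be multiples of 512 bytes")
    return struct.pack("=QQ", offset, length)


def zero_range(device: str, offset: int, length: int) -> None:
    """Ask the kernel to zero [offset, offset + length) on a block device.

    DESTRUCTIVE: overwrites data on the device. Requires write access (root).
    """
    fd = os.open(device, os.O_WRONLY)  # the ioctl needs a writable descriptor
    try:
        fcntl.ioctl(fd, BLKZEROOUT, zeroout_arg(offset, length))
    finally:
        os.close(fd)
```

For example, zero_range("/dev/nvme1n1", 0, 1024) mirrors blkdiscard -z -l 1024 /dev/nvme1n1.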
> This is exactly what I suspected: we need to add a quirk for this model
> so that we don't advertise write-zeroes support, and instead let blk-lib
> emulate write-zeroes.
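The proposed fix can be modelled in user space. The sketch below is not driver code: the quirk constant and helper names are hypothetical. In the kernel, the quirk would be an entry in the NVMe PCI driver's device table, checked when the write-zeroes limit is configured. When the advertised limit is 0, the block layer's zeroout path falls back to ordinary writes of zeroed pages, which the emulated branch imitates.

```python
import io
from typing import BinaryIO, Callable

ONCS_WRITE_ZEROES = 1 << 3       # ONCS bit 3, "Write Zeroes Supported"
QUIRK_NO_WRITE_ZEROES = 1 << 0   # hypothetical quirk bit, illustration only


def write_zeroes_limit(oncs: int, quirks: int, max_bytes: int) -> int:
    """Write-zeroes limit the queue would advertise; 0 means unsupported."""
    if not (oncs & ONCS_WRITE_ZEROES) or (quirks & QUIRK_NO_WRITE_ZEROES):
        return 0
    return max_bytes


def zeroout(dev: BinaryIO, offset: int, length: int, limit: int,
            write_zeroes: Callable[[int, int], None],
            chunk: int = 64 * 1024) -> str:
    """Zero [offset, offset + length): offload if allowed, else write zeros."""
    if limit:
        # Offload path: split into commands no larger than the advertised limit.
        pos, end = offset, offset + length
        while pos < end:
            n = min(limit, end - pos)
            write_zeroes(pos, n)
            pos += n
        return "offloaded"
    # Fallback path: ordinary writes of a zero-filled buffer.
    zeros = bytes(chunk)
    dev.seek(offset)
    remaining = length
    while remaining:
        n = min(chunk, remaining)
        dev.write(zeros[:n])
        remaining -= n
    return "emulated"


# Demo on an in-memory "disk" filled with 0xFF, with the quirk applied.
disk = io.BytesIO(b"\xff" * 4096)
limit = write_zeroes_limit(0x1D, QUIRK_NO_WRITE_ZEROES, 131584)
how = zeroout(disk, 512, 1024, limit, write_zeroes=lambda pos, n: None)
print(how, disk.getvalue()[512:1536] == bytes(1024))
```

The design point is that the quirk only changes what is advertised; the emulation path already exists for devices that never claimed write-zeroes support.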
I am ready to take patches for the NVMe driver to test this out - this
device is not a boot device and I have no data on it that needs to be
preserved.
>>> Also, can you please try the latest nvme tree branch, nvme-5.11?
>>>
>> Where do I get that code from? Is it already in the 5.11-rc tree or do I
>> need to look somewhere else? I checked https://github.com/linux-nvme but
>> I did not see it there.
> Here is the link: git://git.infradead.org/nvme.git
> Branch 5.12.
I tried fetching the entire repo but it was huge and would have taken a
long time, so I tried to fetch a single branch instead and got this result:
$ git clone --branch 5.12 --single-branch git://git.infradead.org/nvme.git
Cloning into 'nvme'...
warning: Could not find remote branch 5.12 to clone.
fatal: Remote branch 5.12 not found in upstream origin
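The clone failed because --branch must name a ref that exists exactly on the remote. Chaitanya's earlier message mentions a branch called nvme-5.11, so the branches may carry an nvme- prefix. Listing the remote's heads without cloning is the reliable way to check. The sketch below wraps git ls-remote --heads and parses its tab-separated output; remote_heads() needs network access.

```python
import subprocess


def parse_heads(ls_remote_output: str) -> list[str]:
    """Extract branch names from `git ls-remote --heads` output."""
    prefix = "refs/heads/"
    heads = []
    for line in ls_remote_output.splitlines():
        parts = line.split("\t")  # each line is "<sha>\trefs/heads/<name>"
        if len(parts) == 2 and parts[1].startswith(prefix):
            heads.append(parts[1][len(prefix):])
    return heads


def remote_heads(url: str) -> list[str]:
    """List the branches on a remote repository without cloning it."""
    out = subprocess.run(["git", "ls-remote", "--heads", url],
                         check=True, capture_output=True, text=True).stdout
    return parse_heads(out)
```

For example, remote_heads("git://git.infradead.org/nvme.git") lists the available branches. Pass one of those names verbatim to git clone --branch <name> --single-branch.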
I haven't compiled any out-of-tree kernel code in a very long time - how
easy is it to add this code to a kernel tree and compile it into the
kernel once I've figured out how to get it?
Brad
More information about the Linux-nvme
mailing list