processor reboots if nvme host controller is surprise removed
Kallol Biswas
kallol at nucleodyne.com
Mon Sep 21 16:31:20 EDT 2020
The lspci dump for the root port is copied below. The slot seems to have
the presence detect bit set (PresDet+ in SltSta).
root@earley:~# lspci -vvv -s 0:3.2
00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1483 (prog-if 00 [Normal decode])
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin ? routed to IRQ 27
    Bus: primary=00, secondary=10, subordinate=10, sec-latency=0
    I/O behind bridge: 0000f000-00000fff
    Memory behind bridge: f7000000-f70fffff
    Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
    Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
    BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
        PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0
            ExtTag+ RBE+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 512 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L0s unlimited, L1 <64us
            ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 16GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
        SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
            Slot #0, PowerLimit 0.000W; Interlock- NoCompl+
        SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
            Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
        SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
            Changed: MRL- PresDet- LinkState+
        RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
        RootCap: CRSVisible+
        RootSta: PME ReqID 0000, PMEStatus- PMEPending-
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
        DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
        LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee0a000 Data: 4021
    Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
    Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [270 v1] #19
    Capabilities: [2a0 v1] Access Control Services
        ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
        ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Capabilities: [370 v1] L1 PM Substates
        L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
        L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
        L1SubCtl2:
    Capabilities: [3c4 v1] #23
    Capabilities: [400 v1] #25
    Capabilities: [410 v1] #26
    Capabilities: [440 v1] #27
    Kernel driver in use: pcieport
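
To make the hot-plug related bits in the dump easier to pick out, here is
a minimal Python sketch that turns the "Name+" / "Name-" tokens lspci prints
into booleans. The helper names (lspci_flags, summarize) and the dictionary
keys are made up for illustration; the three input strings are copied from
the dump above, with wrapped lines joined.

```python
def lspci_flags(line):
    """Map every 'Name+' / 'Name-' token on one lspci line to True / False."""
    flags = {}
    for token in line.replace(",", " ").split():
        if len(token) > 1 and token[-1] in "+-":
            flags[token[:-1]] = token[-1] == "+"
    return flags


# Flag lines taken from the root port dump (wrapped lines joined).
LNKCAP = "ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+"
SLTCAP = "AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-"
SLTSTA = "Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-"


def summarize(lnkcap, sltcap, sltsta):
    """Collect the bits that matter for surprise removal (key names are illustrative)."""
    link, slot, status = (lspci_flags(s) for s in (lnkcap, sltcap, sltsta))
    return {
        "link_surprise_down_reporting": link.get("Surprise", False),
        "link_active_reporting": link.get("LLActRep", False),
        "slot_hotplug_capable": slot.get("HotPlug", False),
        "slot_hotplug_surprise": slot.get("Surprise", False),
        "presence_detected": status.get("PresDet", False),
    }


summary = summarize(LNKCAP, SLTCAP, SLTSTA)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Read this way, the port reports a present card (PresDet+) and link active
reporting (LLActRep+), but it advertises neither Surprise Down error
reporting in LnkCap nor hot-plug support in SltCap.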
On Mon, Sep 21, 2020 at 12:24 PM Keith Busch <kbusch at kernel.org> wrote:
>
> On Mon, Sep 21, 2020 at 11:38:48AM -0700, Kallol Biswas wrote:
> > I have an issue with powering down an NVMe host controller while fio is active.
> > Hoping someone from this list can provide some input so that the
> > problem can be resolved or worked around.
> >
> >
> > System info:
> >
> > description: Motherboard
> > product: X570 Phantom Gaming X
> > vendor: ASRock
> >
> > *-cpu
> > description: CPU
> > product: AMD Ryzen 5 3600 6-Core Processor
> >
> > Fio is running a 50/50 read/write mix when power to the
> > device is removed by external means.
> >
> >
> > A few commands are active in a submission queue.
> >
> > I/Os time out.
> >
> > The nvme_timeout routine is called. Its first register access is a
> > read of CSTS. Sometimes that read returns 0xffffffff; sometimes it
> > causes the processor to restart. When it returns 0xffffffff, the
> > processor restarts on the next access, a read of the PCIe config
> > register PCI_STATUS.
> >
> > The root port had a large CTO (completion timeout) value. I changed
> > it to 0, but that did not help.
> >
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not
> > Supported ARIFwd-
> > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR+, OBFF
> > Disabled ARIFwd-
> >
> > To:
> >
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not
> > Supported ARIFwd-
> > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF
> > Disabled ARIFwd-
> >
> > This change does not prevent the processor restart in the nvme_timeout() routine.
>
> It doesn't sound like your platform handles an unexpected link down.
> What does your root port's Link Capabilities register show?
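
On the completion timeout change quoted above: the value written to DevCtl2
is a 4-bit code, and code 0 is the default range, which is why lspci shows
"50us to 50ms" after writing 0. A minimal Python sketch of that decoding
follows. The function name and the two example register values are
hypothetical (chosen to match the before/after lspci text, not read from
the hardware); the range strings are the ones lspci prints.

```python
# Completion Timeout Value encodings (DevCtl2 bits 3:0), as printed by lspci.
COMPLETION_TIMEOUT_RANGES = {
    0x0: "50us to 50ms",   # default range
    0x1: "50us to 100us",
    0x2: "1ms to 10ms",
    0x5: "16ms to 55ms",
    0x6: "65ms to 210ms",
    0x9: "260ms to 900ms",
    0xA: "1s to 3.5s",
    0xD: "4s to 13s",
    0xE: "17s to 64s",
}

COMP_TIMEOUT_MASK = 0x000F     # DevCtl2 bits 3:0, Completion Timeout Value
COMP_TIMEOUT_DISABLE = 0x0010  # DevCtl2 bit 4, Completion Timeout Disable


def decode_devctl2(value):
    """Decode the completion timeout fields of a 16-bit DevCtl2 value."""
    code = value & COMP_TIMEOUT_MASK
    return {
        "timeout_range": COMPLETION_TIMEOUT_RANGES.get(code, "reserved"),
        "timeout_disabled": bool(value & COMP_TIMEOUT_DISABLE),
    }


# Hypothetical register values matching the two lspci lines in the report.
before = decode_devctl2(0x0006)
after = decode_devctl2(0x0000)
print(before["timeout_range"], "->", after["timeout_range"])
```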