processor reboots if nvme host controller is surprise removed

Kallol Biswas kallol at nucleodyne.com
Mon Sep 21 16:31:20 EDT 2020


The lspci dump for the root port is copied below. The slot status
reports presence detect (SltSta: PresDet+).

root@earley:~# lspci -vvv -s 0:3.2
00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1483 (prog-if 00 [Normal decode])
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 27
        Bus: primary=00, secondary=10, subordinate=10, sec-latency=0
        I/O behind bridge: 0000f000-00000fff
        Memory behind bridge: f7000000-f70fffff
        Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff
        Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort+ <SERR- <PERR-
        BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
                PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0
                        ExtTag+ RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 512 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L0s unlimited, L1 <64us
                        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 16GT/s, Width x4, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
                SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
                        Slot #0, PowerLimit 0.000W; Interlock- NoCompl+
                SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
                        Control: AttnInd Unknown, PwrInd Unknown, Power- Interlock-
                SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
                        Changed: MRL- PresDet- LinkState+
                RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible+
                RootCap: CRSVisible+
                RootSta: PME ReqID 0000, PMEStatus- PMEPending-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not Supported ARIFwd-
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR+, OBFF Disabled ARIFwd-
                LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                         EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee0a000  Data: 4021
        Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
        Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [270 v1] #19
        Capabilities: [2a0 v1] Access Control Services
                ACSCap: SrcValid+ TransBlk+ ReqRedir+ CmpltRedir+ UpstreamFwd+ EgressCtrl- DirectTrans+
                ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2- PCI-PM_L1.1+ ASPM_L1.2- ASPM_L1.1+ L1_PM_Substates+
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                L1SubCtl2:
        Capabilities: [3c4 v1] #23
        Capabilities: [400 v1] #25
        Capabilities: [410 v1] #26
        Capabilities: [440 v1] #27
        Kernel driver in use: pcieport
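
To make the relevant bits easier to spot, here is a rough Python sketch
that pulls the +/- flags out of an excerpt of the dump above. It is not
part of any tool: register_flags is a made-up helper, and it assumes
that wrapped lines have been rejoined and that continuation lines are
indented deeper than the register name.

```python
import re

# Excerpt of the lspci dump above, with mail-wrapped lines rejoined.
DUMP = """\
LnkCap: Port #0, Speed 16GT/s, Width x8, ASPM L1, Exit Latency L0s unlimited, L1 <64us
        ClockPM- Surprise- LLActRep+ BwNot+ ASPMOptComp+
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
        Slot #0, PowerLimit 0.000W; Interlock- NoCompl+
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet+ Interlock-
        Changed: MRL- PresDet- LinkState+
"""

FLAG = re.compile(r"([A-Za-z]\w*)([+-])(?=[\s,;]|$)")


def register_flags(dump, register):
    """Return {flag: bool} for one register block, e.g. 'LnkCap'.

    Hypothetical helper. The first occurrence of a flag wins, so for
    SltSta the 'Status:' group takes precedence over 'Changed:'.
    """
    flags = {}
    indent = None
    for line in dump.splitlines():
        depth = len(line) - len(line.lstrip())
        if indent is None:
            if not line.lstrip().startswith(register + ":"):
                continue          # still looking for the register line
            indent = depth
        elif depth <= indent:
            break                 # the next register block starts here
        for name, sign in FLAG.findall(line):
            flags.setdefault(name, sign == "+")
    return flags


lnkcap = register_flags(DUMP, "LnkCap")
sltcap = register_flags(DUMP, "SltCap")
sltsta = register_flags(DUMP, "SltSta")

print("LnkCap surprise-down reporting:", lnkcap["Surprise"])
print("LnkCap link-active reporting:  ", lnkcap["LLActRep"])
print("SltCap hot-plug capable:       ", sltcap["HotPlug"])
print("SltCap hot-plug surprise:      ", sltcap["Surprise"])
print("SltSta card present:           ", sltsta["PresDet"])
```

For this port the sketch reports Surprise- and LLActRep+ in LnkCap,
HotPlug- and Surprise- in SltCap, and PresDet+ in SltSta.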

On Mon, Sep 21, 2020 at 12:24 PM Keith Busch <kbusch at kernel.org> wrote:
>
> On Mon, Sep 21, 2020 at 11:38:48AM -0700, Kallol Biswas wrote:
> > I have an issue with powering down an NVMe host controller while fio is active.
> > Hoping someone from this list can provide some input so that the
> > problem can be resolved or worked around.
> >
> >
> > System info:
> >
> > description: Motherboard
> >        product: X570 Phantom Gaming X
> >        vendor: ASRock
> >
> > *-cpu
> >           description: CPU
> >           product: AMD Ryzen 5 3600 6-Core Processor
> >
> > Fio is running a 50/50 read/write workload when the power to the
> > device is removed by external means.
> >
> >
> > A few commands are active in a submission queue.
> >
> > I/Os time out.
> >
> > The nvme_timeout routine is called. Its first register access is a
> > read of CSTS. Sometimes that read returns 0xffffffff; sometimes it
> > causes the processor to restart. When it returns 0xffffffff, the
> > processor restarts on the next access instead, a read of the PCIe
> > config register PCI_STATUS.
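
For reference, an all-ones value is what a read from a device that has
dropped off the bus typically looks like, so it helps to tell it apart
from a real CSTS value before trusting any of the bits. A minimal Python
sketch of that check follows. The bit positions are the CSTS fields from
the NVMe specification; describe_csts is a hypothetical helper for
illustration, not a driver function.

```python
CSTS_RDY = 1 << 0        # controller ready
CSTS_CFS = 1 << 1        # controller fatal status
ALL_ONES = 0xFFFFFFFF    # typical result of reading a device that is gone


def describe_csts(raw):
    """Classify a raw 32-bit CSTS read (hypothetical helper)."""
    if raw == ALL_ONES:
        # Every bit set, including reserved ones: treat the device as
        # removed instead of decoding individual fields.
        return "device gone"
    if raw & CSTS_CFS:
        return "controller fatal status"
    if raw & CSTS_RDY:
        return "ready"
    return "not ready"


for raw in (0xFFFFFFFF, 0x00000001, 0x00000002, 0x00000000):
    print(f"{raw:#010x}: {describe_csts(raw)}")
```

The all-ones test comes first on purpose: with every bit set, both RDY
and CFS would otherwise appear to be asserted.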
> >
> > The root port had a large completion timeout (CTO) value. I changed
> > it to 0, but that still did not help.
> >
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not
> > Supported ARIFwd-
> > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR+, OBFF
> > Disabled ARIFwd-
> >
> > To:
> >
> > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+, OBFF Not
> > Supported ARIFwd-
> > DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF
> > Disabled ARIFwd-
> >
> > This change does not prevent the processor restart in the
> > nvme_timeout() routine.
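
For reference, the Completion Timeout Value is the low four bits of
Device Control 2, and lspci prints the range that each encoding selects.
A small Python sketch of that decoding follows. The raw value 0x0406 is
an assumption chosen to match the dump (LTR enabled, 65ms to 210ms), not
something read from this machine, and the helper names are made up.

```python
COMP_TIMEOUT_MASK = 0x000F   # DevCtl2 bits 3:0, Completion Timeout Value
COMP_TIMEOUT_DIS = 0x0010    # DevCtl2 bit 4, Completion Timeout Disable

# Encodings defined by the PCIe specification; other values are reserved.
TIMEOUT_RANGES = {
    0x0: "50us to 50ms",     # default range
    0x1: "50us to 100us",
    0x2: "1ms to 10ms",
    0x5: "16ms to 55ms",
    0x6: "65ms to 210ms",
    0x9: "260ms to 900ms",
    0xA: "1s to 3.5s",
    0xD: "4s to 13s",
    0xE: "17s to 64s",
}


def decode_devctl2(value):
    """Return (timeout range, timeout disabled) for a raw DevCtl2 word."""
    encoding = value & COMP_TIMEOUT_MASK
    return (TIMEOUT_RANGES.get(encoding, "reserved"),
            bool(value & COMP_TIMEOUT_DIS))


def with_timeout(value, encoding):
    """Return value with only the timeout field replaced by encoding."""
    return ((value & ~COMP_TIMEOUT_MASK) | (encoding & COMP_TIMEOUT_MASK)) & 0xFFFF


before = 0x0406                      # assumed raw value: LTR+, 65ms to 210ms
after = with_timeout(before, 0x0)    # the change described above

print(f"{before:#06x}: {decode_devctl2(before)}")
print(f"{after:#06x}: {decode_devctl2(after)}")
```

Under that assumption, writing 0 to the field turns 0x0406 into 0x0400,
which is the "50us to 50ms" range shown in the second dump. The timeout
itself stays enabled in both cases (TimeoutDis-).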
>
> It doesn't sound like your platform handles an unexpected link down.
> What does your root port's Link Capabilities register show?



-- 
------
Kallol Biswas
Phone: 408-718-8164 (c)
Phone: 408-725-7527 (o)
NucleoDyne Systems, Inc.


“From the intrinsic evidence of his creation, the Great Architect

of the Universe now begins to appear as a pure mathematician.”


