I/O 0 QID 0 timeout, disable controller - kernel 4.4 / 4.5 NVMe controller dropouts

Jens Axboe axboe at fb.com
Thu Apr 14 07:08:16 PDT 2016


On 04/14/2016 07:21 AM, Keith Busch wrote:
> On Thu, Apr 14, 2016 at 03:13:22PM +1000, Sam McLeod wrote:
>> We have 6 Supermicro servers, all of the same (or very similar) spec.
>>
>> Since Kernel 4.4 / 4.5 we've had NVMe devices randomly dropping.
>> It does not relate to a particular server, disk, controller etc... and downgrading to kernel 4.1.
>>
>> With kernel 4.4 the servers would load and a disk would randomly disappear.
>> With 4.5 the server loads with one of the disks missing every time.
>>
>>
>> ```
>> [   66.856719] nvme 0000:03:00.0: I/O 0 QID 0 timeout, disable controller
>> [   66.957911] nvme 0000:03:00.0: Identify Controller failed (-4)
>> [   66.957961] nvme 0000:03:00.0: Removing after probe failure status: -5
>> ```
>
> Looks like more fallout from reducing the scope of admin queue completion
> polling...
>
> Jens:
>
> Could we please apply the MSI-x fix commit to 4.6 instead of 4.7 so 4.6
> isn't equally broken? Currently staged in for-next here:
>
>    http://git.kernel.dk/?p=linux-block.git;a=commitdiff;h=788e15abbb9408c9399d7e3445ac9afb3b2fd7d6;hp=e0489487ec9cd79ee1fa0dc5d3789c08b0e51a2c
>
> I'd also like to submit an appropriate port to stable if there are no objections.

It feels awfully risky for the current series. Yes, we know this patch 
fixes the reported cases, but I'm worried that there are other 
controllers that will now fail because we don't probe with legacy 
interrupts. But the alternative is polling, which isn't great either and 
would (once again) cause the current and next series to diverge in weird 
and interesting ways.

Hmm

-- 
Jens Axboe



