IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again

Scott Bauer sbauer at eng.utah.edu
Fri Apr 6 12:00:37 PDT 2018



On 04/06/2018 12:04 PM, Keith Busch wrote:
> On Fri, Apr 06, 2018 at 12:46:06PM -0500, Alex G. wrote:
>> On 04/06/2018 12:16 PM, Scott Bauer wrote:
>>> You're using AER inject, right?
>>
>> No. I'm causing the errors in hardware with hot-unplug.
> 
> I think Scott's still on the right track for this particular sighting.
> The AER handler looks unsafe under changing topologies. It might need to run
> under pci_lock_rescan_remove() before walking the bus to prevent races
> with the surprise removal, but it's not clear to me yet if holding that
> lock is okay to do in this context.

I think we may get into a deadlock if we grab pci_lock_rescan_remove().
The hotplug unconfigure code will eventually call driver->remove(), which I believe
can end up in aer_remove(), which does a flush_work(). If the delegated AER
IRQ handler is waiting on the rescan/remove lock (which the hotplug code holds)
before it walks the bus, we've deadlocked: the hotplug code is waiting on remove()
to finish, remove() is waiting on flush_work() to finish, and the work being
flushed is waiting on the lock.

Although I haven't checked whether flush_work() waits for already-running work or not.
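
To make the suspected cycle concrete, here is a small user-space C sketch of the
wait-for chain described above. It is only a model, not kernel code: the actor
names, build_scenario() and has_wait_cycle() are made up for illustration, and
the two flags stand in for the open questions (does the AER work take the
rescan/remove lock, and does flush_work() wait for work that is already
running). The kernel-doc for flush_work() says it waits until the work has
finished execution, which, if it applies to this path, would correspond to
flush_waits_for_running = true.

```c
#include <stdbool.h>

/* Actors in the suspected cycle. Illustrative names, not kernel symbols. */
enum actor {
    NOBODY = -1,
    HOTPLUG_THREAD,   /* holds the rescan/remove lock, calls driver->remove() */
    AER_REMOVE,       /* remove() path, calls flush_work() on the AER work */
    AER_WORK,         /* delegated handler, wants the lock before walking the bus */
    NUM_ACTORS
};

/*
 * Fill in who waits on whom. Both flags are assumptions to toggle:
 *   work_takes_lock         - the AER work grabs the rescan/remove lock
 *   flush_waits_for_running - flush_work() blocks on already-running work
 */
void build_scenario(int waits_on[NUM_ACTORS],
                    bool work_takes_lock, bool flush_waits_for_running)
{
    /* The hotplug code cannot proceed until remove() returns. */
    waits_on[HOTPLUG_THREAD] = AER_REMOVE;
    /* remove() blocks in flush_work() only if it waits for running work. */
    waits_on[AER_REMOVE] = flush_waits_for_running ? AER_WORK : NOBODY;
    /* The work blocks on the lock owner only if it takes the lock. */
    waits_on[AER_WORK] = work_takes_lock ? HOTPLUG_THREAD : NOBODY;
}

/* Follow the wait-for chain from 'start'; a return to 'start' is a deadlock. */
bool has_wait_cycle(const int waits_on[NUM_ACTORS], int start)
{
    int cur = start;

    for (int step = 0; step < NUM_ACTORS; step++) {
        cur = waits_on[cur];
        if (cur == NOBODY)
            return false;   /* chain ends, someone can make progress */
        if (cur == start)
            return true;    /* came back around: circular wait */
    }
    return false;
}
```

In this model the cycle only closes when both flags are true, i.e. the work takes
the lock and the flush waits for running work; dropping either edge breaks it.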



> 
> This however does not appear to resemble your previous sightings. In your
> previous sightings, it looks like something has lost track of commands,
> and we're freeing the resources with them a second time.