nvmet_rdma crash - DISCONNECT event with NULL queue

Sagi Grimberg sagi at grimberg.me
Tue Nov 1 15:34:57 PDT 2016


>>> But: I'll try this patch and run for a few hours and see what happens.  I
>>> believe that, regardless of any keep-alive issue, the above patch is still needed.
>>
>> In your tests, can you enable dynamic debug on:
>> nvmet_start_keep_alive_timer
>> nvmet_stop_keep_alive_timer
>> nvmet_execute_keep_alive
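
(In case it helps: assuming CONFIG_DYNAMIC_DEBUG is enabled and debugfs is
mounted, the pr_debug output in those three functions can be turned on at
runtime by writing "func <function name> +p" for each of them into
/sys/kernel/debug/dynamic_debug/control.)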
>
> Hey Sagi.  I hit another crash on the target.  This was with 4.8.0 plus the
> patch to skip disconnect events if the cm_id->qp is NULL.  This time the crash
> is in _raw_spin_lock_irqsave(), called from nvmet_rdma_recv_done().  The log is
> too big to include everything inline, so I'm attaching the full log.  Looks
> like at around 4988.169 seconds in the log we see 5 controllers created, all
> named "controller 1"!  And 32 queues assigned to controller 1, five times!
> Shortly after that we hit the BUG.

So I think you're creating multiple subsystems and provisioning each
subsystem differently, correct?  The controller IDs are unique within
a subsystem, so two different subsystems can each have ctrl id 1.  Perhaps
our logging should mention the subsysnqn too?
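
Something like the below is what I have in mind for that message (just a
sketch, not a tested patch; the message text and variable names are from
memory, but nvmet does keep the subsystem NQN in ctrl->subsys->subsysnqn):

	pr_info("adding queue %d to ctrl %d of subsys %s.\n",
		qid, ctrl->cntlid, ctrl->subsys->subsysnqn);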

Anyway, is there traffic going on?

The only way we can get recv_done with corrupted data is if we posted
something after the qp drain completed.  Can you check if that can happen?
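
What I'd have in mind as a sanity check is roughly the below (fragments only;
the state_lock and dead flag names are made up for illustration, not the
actual nvmet-rdma fields): mark the queue dead under a lock before draining,
and refuse to repost recvs once that flag is set, so nothing new can be
posted once ib_drain_qp() has started:

	/* teardown path */
	spin_lock_irqsave(&queue->state_lock, flags);
	queue->dead = true;			/* hypothetical flag */
	spin_unlock_irqrestore(&queue->state_lock, flags);
	ib_drain_qp(queue->cm_id->qp);		/* flushes everything posted so far */

	/* recv repost path */
	spin_lock_irqsave(&queue->state_lock, flags);
	if (queue->dead) {
		spin_unlock_irqrestore(&queue->state_lock, flags);
		return -EIO;			/* don't post against a drained QP */
	}
	ret = ib_post_recv(queue->cm_id->qp, &cmd->wr, &bad_wr);
	spin_unlock_irqrestore(&queue->state_lock, flags);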

Can you share your test case?


