[PATCH v3 0/3] Handle number of queue changes

James Smart jsmart2021 at gmail.com
Mon Aug 29 11:44:41 PDT 2022


On 8/29/2022 1:38 AM, Daniel Wagner wrote:
> Updated this to a proper patch series with Hannes' and Sagi's
> feedback addressed.
> 
> I tested this with nvme-tcp, but due to lack of hardware the
> nvme-rdma side is only compile tested.
> 
> Daniel
> 
> 
>  From the previous cover letter:
> 
> We got a report from our customer that a firmware upgrade on the
> storage array is able to 'break' the host. This is caused by a change
> in the number of queues the target supports after a reconnect.
> 
> Let's assume the number of queues is 8 and all is working fine. Then
> the connection is dropped and the host starts trying to
> reconnect. Eventually this succeeds, but now the new number of queues
> is 10:
> 
> nvme0: creating 8 I/O queues.
> nvme0: mapped 8/0/0 default/read/poll queues.
> nvme0: new ctrl: NQN "nvmet-test", addr 10.100.128.29:4420
> nvme0: queue 0: timeout request 0x0 type 4
> nvme0: starting error recovery
> nvme0: failed nvme_keep_alive_end_io error=10
> nvme0: Reconnecting in 10 seconds...
> nvme0: failed to connect socket: -110
> nvme0: Failed reconnect attempt 1
> nvme0: Reconnecting in 10 seconds...
> nvme0: creating 10 I/O queues.
> nvme0: Connect command failed, error wo/DNR bit: -16389
> nvme0: failed to connect queue: 9 ret=-5
> nvme0: Failed reconnect attempt 2
> 
> As you can see, queue number 9 is not able to connect.
> 
> As the order of starting and unfreezing is important, we can't just
> move the start of the queues to after the tagset update. So my stupid
> idea was to start just the old number of queues first and then the
> rest.
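> 
> A minimal sketch of that idea (nr_old/nr_new are I/O queue counts,
> and the start_io_queues_range() helper and its signature are
> hypothetical, not the actual patch):
> 
>     /* Sketch only: helper name and signature are hypothetical. */
>     static int reconnect_io_queues(struct nvme_ctrl *ctrl,
>                                    int nr_old, int nr_new)
>     {
>             int ret;
> 
>             /*
>              * Start only the queues the old tagset already knows
>              * about, so unfreezing in-flight requests still works.
>              */
>             ret = start_io_queues_range(ctrl, 1, min(nr_old, nr_new));
>             if (ret)
>                     return ret;
> 
>             /* Resize the tagset to the new I/O queue count... */
>             blk_mq_update_nr_hw_queues(ctrl->tagset, nr_new);
> 
>             /* ...then bring up the additional queues, if any. */
>             return start_io_queues_range(ctrl, min(nr_old, nr_new) + 1,
>                                          nr_new);
>     }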

Since you're in the area....

I recommend adding something to ensure that, after a reconnect, if 
I/O queues were present in the prior association, the new controller 
must return support for at least 1 I/O queue or the reconnect fails and 
retries. This was a bug we hit on a subsystem in FC.
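
Something along these lines (the prev_io_queue_count field is 
hypothetical) at the point where the controller reports its I/O queue 
count during reconnect:

    /* Sketch only: prev_io_queue_count is a hypothetical field. */
    if (ctrl->prev_io_queue_count && nr_io_queues == 0) {
            dev_err(ctrl->device,
                    "controller no longer reports I/O queues, retrying reconnect\n");
            return -EAGAIN;  /* treat as a retryable reconnect failure */
    }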

-- james
