[PATCH v3 0/3] Handle number of queue changes
James Smart
jsmart2021 at gmail.com
Mon Aug 29 11:44:41 PDT 2022
On 8/29/2022 1:38 AM, Daniel Wagner wrote:
> Updated this series to a proper patch series with Hannes and Sagi's
> feedback addressed.
>
> I tested this with nvme-tcp but due to lack of hardware the nvme-rdma
> is only compile tested.
>
> Daniel
>
>
> From the previous cover letter:
>
> We got a report from our customer that a firmware upgrade on the
> storage array is able to 'break' the host. This is caused by a
> change in the number of queues which the target supports after a
> reconnect.
>
> Let's assume the number of queues is 8 and all is working fine. Then
> the connection drops and the host starts trying to reconnect.
> Eventually this succeeds, but now the number of queues is 10:
>
> nvme0: creating 8 I/O queues.
> nvme0: mapped 8/0/0 default/read/poll queues.
> nvme0: new ctrl: NQN "nvmet-test", addr 10.100.128.29:4420
> nvme0: queue 0: timeout request 0x0 type 4
> nvme0: starting error recovery
> nvme0: failed nvme_keep_alive_end_io error=10
> nvme0: Reconnecting in 10 seconds...
> nvme0: failed to connect socket: -110
> nvme0: Failed reconnect attempt 1
> nvme0: Reconnecting in 10 seconds...
> nvme0: creating 10 I/O queues.
> nvme0: Connect command failed, error wo/DNR bit: -16389
> nvme0: failed to connect queue: 9 ret=-5
> nvme0: Failed reconnect attempt 2
>
> As you can see, queue number 9 is not able to connect.
>
> As the order of starting and unfreezing is important, we can't just
> move the start of the queues to after the tagset update. So my
> stupid idea was to start just the old number of queues first and
> then the rest.
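
Roughly, that ordering would look something like the sketch below
(illustration only, not the actual patches; nvme_tcp_start_io_queues()
taking a half-open queue range [first, last) is an assumed helper here,
and the freeze-wait and error handling are elided):

static int nvme_tcp_configure_io_queues(struct nvme_ctrl *ctrl, bool new)
{
	/*
	 * Queues covered by the existing tag set vs. what the target
	 * granted on this (re)connect; queue 0 is the admin queue.
	 */
	int nr_old = ctrl->tagset->nr_hw_queues + 1;
	int nr_new = ctrl->queue_count;
	int ret;

	/* first start only the queues the current tag set knows about */
	ret = nvme_tcp_start_io_queues(ctrl, 1, min(nr_old, nr_new));
	if (ret)
		return ret;

	if (!new) {
		/*
		 * Existing association: unquiesce, resize the tag set,
		 * then unfreeze -- the ordering that must be preserved.
		 */
		nvme_start_queues(ctrl);
		blk_mq_update_nr_hw_queues(ctrl->tagset, nr_new - 1);
		nvme_unfreeze(ctrl);
	}

	/* only now bring up any additional queues the target granted */
	return nvme_tcp_start_io_queues(ctrl, min(nr_old, nr_new), nr_new);
}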
Since you're in the area....
I recommend adding something to ensure that, after a reconnect, if
I/O queues were present in the prior association, the new controller
must return support for at least 1 I/O queue, or the reconnect fails
and retries. This was a bug we hit on a subsystem in FC.
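
Something along these lines, as a rough sketch (the wrapper function
and its name are made up; nvme_set_queue_count() and ctrl->queue_count
are the existing core helper/field it leans on):

static int nvme_xxx_reconfigure_io_queues(struct nvme_ctrl *ctrl)
{
	int nr_io_queues, ret;

	ret = nvme_set_queue_count(ctrl, &nr_io_queues);
	if (ret)
		return ret;

	/*
	 * The prior association had I/O queues (queue_count includes the
	 * admin queue) but the reconnected controller grants none: fail
	 * this attempt so the normal retry logic kicks in instead of
	 * continuing without I/O queues.
	 */
	if (ctrl->queue_count > 1 && nr_io_queues == 0) {
		dev_err(ctrl->device,
			"no I/O queues after reconnect, failing attempt\n");
		return -ENODEV;
	}

	/* ... continue with the usual I/O queue setup ... */
	return 0;
}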
-- james