[PATCHv11 00/18] nvme: In-kernel TLS support for TCP
Sagi Grimberg
sagi at grimberg.me
Mon Sep 18 05:42:55 PDT 2023
On 9/18/23 15:15, Hannes Reinecke wrote:
> On 9/14/23 20:10, Sagi Grimberg wrote:
>>
>>
>> On 9/11/23 15:09, Hannes Reinecke wrote:
>>> On 9/6/23 12:04, Hannes Reinecke wrote:
>>>> On 9/6/23 10:48, Pawel Baldysiak wrote:
>>>>> Hi Hannes,
>>>>>
>>>>> I'm testing the v15 patchset extensively and observing that it is
>>>>> not stable.
>>>>> I'm running it on two connected machines - one used as the host, the
>>>>> second as the target.
>>>>> Once connected I'm running I/Os with dd.
>>>>> Eventually I got I/O timeouts from the nvme layer on the host,
>>>>> causing connect/disconnect loops and hung I/O.
>>>>> It does not matter whether the workload is reads or writes.
>>>>>
>>>> Do you have the kernel message log here?
>>>> It _might_ be that we're overloading the TLS stack, such that
>>>> processing of individual packets is taking too long and triggering a
>>>> KATO.
>>>>
>>>>> The second test is connect/disconnect loops.
>>>>> On disconnect I've got occasional delays due to nvme command timeouts.
>>>>> The disconnect happens eventually, but first it needs to hit the timeout.
>>>>>
>>>> Again, do you have message logs?
>>>>
>>> Managed to validate it with a larger setup.
>>> Message log is:
>>> nvme nvme0: mapped 16/0/0 default/read/poll queues.
>>> nvme nvme0: new ctrl: NQN "nqn.exuma", addr 10.168.136.124:4420
>>> nvme nvme0: Removing ctrl: NQN "nqn.exuma"
>>> nvme nvme0: queue 1: timeout cid 0x3071 type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999927, 9 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> nvme nvme0: queue 1: timeout cid 0x2078 type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999872, 4 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> nvme nvme0: queue 1: timeout cid 0x2079 type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999877, 1 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> nvme nvme0: queue 1: timeout cid 0x207a type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999879, 2 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> nvme nvme0: queue 1: timeout cid 0x207b type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999882, 4 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> nvme nvme0: queue 1: timeout cid 0x207c type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999887, 21 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> nvme nvme0: queue 1: timeout cid 0x207d type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999909, 6 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> nvme nvme0: queue 1: timeout cid 0x207e type 4 opcode 0x2 (Read)
>>> nvme0c0n1: Read(0x2) @ LBA 15999916, 10 blocks, Host Aborted Command
>>> (sct 0x3 / sc 0x71)
>>> block nvme0n1: no available path - failing I/O
>>> Buffer I/O error on dev nvme0n1, logical block 15999872, async page read
>>> nvme nvme0: failed to send request -32
>>> nvme nvme0: Property Set error: 880, offset 0x14
>>
>> Can you add timestamps to the log trace?
>>
>>>
>>> Which actually points to an issue in the teardown flow.
>>> nvme_do_delete_ctrl() does this:
>>>
>>> nvme_stop_ctrl(ctrl);
>>> nvme_remove_namespaces(ctrl);
>>> ctrl->ops->delete_ctrl(ctrl);
>>>
>>> and ->delete_ctrl()
>>> ->nvme_tcp_delete_ctrl()
>>> ->nvme_tcp_teardown_ctrl()
>>> ->nvme_tcp_teardown_io_queues()
>>>
>>> has:
>>> nvme_quiesce_admin_queue(ctrl);
>>> nvme_quiesce_io_queues(ctrl);
>>> nvme_sync_io_queues(ctrl);
>>> nvme_tcp_stop_io_queues(ctrl);
>>>
>>> Now, if you look at 'nvme_sync_io_queues()' you'll find
>>>
>>> down_read(&ctrl->namespaces_rwsem);
>>> list_for_each_entry(ns, &ctrl->namespaces, list)
>>> blk_sync_queue(ns->queue);
>>> up_read(&ctrl->namespaces_rwsem);
>>>
>>> but we've removed all namespaces right at the start in
>>> nvme_do_delete_ctrl(), so this is a no-op and we'll never
>>> catch any I/O in flight, causing the timeouts to occur :-(
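>>> To illustrate, the effective ordering is roughly the following (a
>>> simplified sketch of the flow described above, not the literal kernel
>>> code):
>>>
>>>   /* nvme_do_delete_ctrl(), simplified */
>>>   nvme_stop_ctrl(ctrl);
>>>   nvme_remove_namespaces(ctrl);    /* empties ctrl->namespaces */
>>>   ctrl->ops->delete_ctrl(ctrl);    /* -> nvme_tcp_teardown_io_queues(): */
>>>       nvme_quiesce_io_queues(ctrl);
>>>       nvme_sync_io_queues(ctrl);   /* namespace list is already empty, so
>>>                                       blk_sync_queue() never runs and any
>>>                                       in-flight I/O is never caught */
>>>       nvme_tcp_stop_io_queues(ctrl);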
>>
>> Hannes, it's not clear to me what you're saying here. What is different
>> in TLS that makes the teardown so slow?
>>
>> Also, what is being blocked here? Are there still I/Os to an already
>> deleted namespace?
>
> Hmm. Seems to be an interrelated issue. The problem is that TLS sessions
> occasionally time out right after initiating the connection, and it'll
> take some reconnect attempts to get things sorted.
Sounds like the trigger of the problem. If there are I/O stalls,
that would explain why there are resets due to either an I/O or
the keep-alive timing out.
> Still debugging; seems like an issue with the network stack.
Interesting, please do share when you have some further insights.
> And while the reconnect is underway the I/O is of course aborted, but
> depending on the timing, namespace scanning might be underway, so we're
> blocked in 'flush_work(&ctrl->scan_work)' in nvme_remove_namespaces()
> and have to wait for the command timeout on namespace identify et al.
> So that is probably expected,
It is fine for scan_work to block if there is an ongoing admin command
that needs to time out first. However, this should be bounded by the admin
timeout, not block forever. After that, any reconnect attempts should not
continue, as the ctrl state is DELETING (see the sketch below).
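For illustration only, a simplified sketch of the kind of state guard I
mean in the reconnect path (not a quote of the actual nvme-tcp code):

    /* reconnect work, simplified: bail out unless still CONNECTING */
    if (ctrl->state != NVME_CTRL_CONNECTING) {
            /* DELETING (or any other state) stops further attempts */
            return;
    }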
Is this a permanent phenomenon, or is it just stuck and released at some
point?
BTW, in your tests you are triggering a disconnect (delete); was that
step included in Pawel's report?
> and I'll be concentrating on figuring out why we're seeing reconnections.
Sure, it would also be good to confirm that you are running the same tests.