nvme-tls and TCP window full

Sagi Grimberg sagi at grimberg.me
Tue Jul 11 05:05:04 PDT 2023


>> Hey Hannes,
>>
>> Any progress on this one?
>>
> Oh well; slow going.
> After some lengthy debugging I found that the read_sock() implementation 
> wasn't exactly up to scratch (it would read each byte individually, 
> instead of each skb individually ... D'oh).
> But even with that fixed I'm still seeing I/O stalls due to window full:

What does your current read_sock look like?

One thing that looks incorrect is that you should be calling
tls_rx_rec_wait() in a loop for as long as it returns 1, no? This way
you will continue to consume as long as you have something in the
sk ready for processing...

Something like this (completely untested):
do {
	if (!skb) {
		if (!skb_queue_empty(&ctx->rx_list)) {
			skb = __skb_dequeue(&ctx->rx_list);
		} else {
			struct tls_decrypt_arg darg;

			err = tls_rx_rec_wait(sk, NULL, true, true);
			if (err <= 0) {
				tls_rx_reader_release(sk, ctx);
				return err;
			}

			memset(&darg.inargs, 0, sizeof(darg.inargs));

			err = tls_rx_one_record(sk, NULL, &darg);
			if (err < 0) {
				tls_err_abort(sk, -EBADMSG);
				tls_rx_reader_release(sk, ctx);
				return err;
			}

			tls_rx_rec_done(ctx);
			skb = darg.skb;
		}
	}

	rxm = strp_msg(skb);
	tlm = tls_msg(skb);

	/* read_sock does not support reading control messages */
	if (tlm->control != TLS_RECORD_TYPE_DATA) {
		err = -EINVAL;
		goto read_sock_requeue;
	}

	used = read_actor(desc, skb, rxm->offset, rxm->full_len);
	if (used <= 0) {
		err = used;
		goto read_sock_end;
	}

	copied += used;
	if (used < rxm->full_len) {
		rxm->offset += used;
		rxm->full_len -= used;
		if (!desc->count)
			goto read_sock_requeue;
	} else {
		consume_skb(skb);
		if (desc->count && !skb_queue_empty(&ctx->rx_list))
			skb = __skb_dequeue(&ctx->rx_list);
		else
			skb = NULL;
	}
} while (skb);
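
The read_sock_requeue/read_sock_end labels aren't shown above; assuming
they follow the same convention as tls_sw_recvmsg (stash a partially
consumed record back on ctx->rx_list, then drop the reader lock), the
tail could look roughly like this (equally untested):

read_sock_requeue:
	/* keep the unconsumed remainder for the next read_sock call */
	__skb_queue_head(&ctx->rx_list, skb);
read_sock_end:
	tls_rx_reader_release(sk, ctx);
	return copied ? : err;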

> 
> [85639.669120] nvme nvme0: queue 2: C2H data cid 0x4d offset 0 len 4096
> [85639.669885] nvme nvme0: queue 2: PDU Data cid 0x4d offset 29 len 4120 
> rem 4096

what is rem? remaining? what is this derived from? the PDU or skb?

> [85639.670809] nvme nvme0: queue 2: RSP cid 0x4d status 0
> [85639.671449] nvme nvme0: queue 2: pdu 24 data 0 consumed 4144 result 0
> [85639.674928] nvme nvme0: queue 2: skb 00000000bee9b76d len 8262 offset 
> 5 len 8240

What is offset? skb offset?

> [85639.676474] nvme nvme0: queue 2: C2H data cid 0x4e offset 0 len 8192
> [85639.678013] nvme nvme0: queue 2: PDU Data cid 0x4e offset 29 len 8216 
> rem 8192

Offset is 29? Is that the pdu offset? Should it be 24? Does it somehow
combine the offset 5 from above?

It is not very clear what the prints are...

> [85639.679810] nvme nvme0: queue 2: RSP cid 0x4e status 0
> [85639.681247] nvme nvme0: queue 2: pdu 24 data 0 consumed 8240 result 0
> [85639.690985] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.692321] nvme nvme0: queue 2: C2H data cid 0x4f offset 0 len 126976
> [85639.694012] nvme nvme0: queue 2: PDU Data cid 0x4f offset 29 len 
> 16360 rem 126976
> [85639.695882] nvme nvme0: queue 2: pdu 0 data 110616 consumed 16384 
> result 0
> [85639.697667] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.699339] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 110616
> [85639.701122] nvme nvme0: queue 2: pdu 0 data 94232 consumed 16384 
> result 0
> [85639.702831] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.704918] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 94232
> [85639.706759] nvme nvme0: queue 2: pdu 0 data 77848 consumed 16384 
> result 0
> [85639.708367] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.711219] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 77848
> [85639.712108] nvme nvme0: queue 2: pdu 0 data 61464 consumed 16384 
> result 0
> [85639.714182] nvme nvme0: queue 2: skb 00000000bee9b76d len 16406 
> offset 5 len 16384
> [85639.715107] nvme nvme0: queue 2: PDU Data cid 0x4f offset 5 len 16384 
> rem 61464
> [85639.715989] nvme nvme0: queue 2: pdu 0 data 45080 consumed 16384 
> result 0
> [85671.510572] nvme nvme0: queue 2: timeout cid 0x4f type 4 opcode 0x2 
> (Read)
> 
> (These are just client-side logs).
> Which looks to me as if we're waiting for the server to continue sending
> PDU data frames, only these never arrive.

I think it is more likely that you are losing sync on the stream...
Maybe the server is done sending data, but if you have a bug processing
it, the client thinks that more is coming...

What is the workload you are testing? Is this workload-dependent? Small
block sizes (bs)?
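
To illustrate what I mean by losing sync: presumably the receive side
keeps some per-PDU "remaining" counter (the "rem" in your prints?) that
only moves by whatever the read_actor hands it. Something along these
lines (made-up names, just to show the failure mode, not the real
driver code):

/* hypothetical illustration only -- not the actual nvme-tcp code */
struct demo_queue {
	size_t data_remaining;	/* bytes still expected for this C2H data PDU */
};

static size_t demo_recv_data(struct demo_queue *queue, struct sk_buff *skb,
			     unsigned int offset, size_t len)
{
	size_t consume = min_t(size_t, len, queue->data_remaining);

	/* ... copy 'consume' bytes starting at 'offset' into the request ... */

	queue->data_remaining -= consume;
	return consume;
}

If read_sock re-delivers bytes that were already counted (or skips
some), data_remaining and the actual stream position drift apart, the
counter never reaches zero, nothing completes the command, and you get
exactly the timeout above with an idle wire.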

> So really I wonder whether it's not rather an issue on the server side.

The variable you introduced here is the read_sock implementation, so I
would suspect that it is your offender.

> Maybe the server doesn't retire skbs (or not all of them), causing the 
> TCP window to shrink.
> That, of course, is wild guessing, as I have no idea if and how calls to 
> 'consume_skb' reflect back to the TCP window size.

skbs are unrelated to the TCP window. They relate to the socket send
buffer. skbs left dangling would cause the server side to run out of
memory, not the TCP window to close. The two are completely unrelated.


