[LSF/MM/BPF TOPIC] NVMe-TCP performance improvements
Hannes Reinecke
hare at suse.de
Fri Mar 21 00:30:37 PDT 2025
Hi all,
a partner of ours reported a severe performance and scalability issue
with TLS on NVMe-TCP, where spurious I/O timeouts have been occurring
once the system scales up to more controllers. With enough controllers
he even ran into timeouts during the initial connect command, rendering
the system unusable.
There have been several attempts to fix this (my attempts [1], spreading
load to CPUs by Sagi [2]), but they didn't really fix the issue.
I have since done quite a bit of performance analysis, but found that
'classical' statistical methods can't easily be used, as we have a really
high variance in performance; the standard deviation vastly exceeds
the mean value, rendering the calculation meaningless.
Eventually I nailed down the problem with [3], but the performance
analysis problem remains.
In the discussion I would like to present the performance analysis,
and discuss ways to analyze the performance in light of the
high noise in the measurements.
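
To illustrate the kind of data described above, here is a minimal
Python sketch comparing mean/standard deviation with order statistics
(median, median absolute deviation, nearest-rank percentile). The
sample values and the robust_summary() helper are made up for
illustration; they are not the actual measurements or tooling, and
this is only one possible approach.

```python
import statistics


def robust_summary(samples):
    """Summarize samples with both classical and order-based statistics."""
    data = sorted(samples)
    n = len(data)
    median = statistics.median(data)
    # Median absolute deviation: a spread estimate that a few extreme
    # outliers barely move, unlike the standard deviation.
    mad = statistics.median([abs(x - median) for x in data])
    # Nearest-rank 99th percentile, using integer arithmetic to avoid
    # floating-point rounding in the rank computation.
    rank = (n * 99 + 99) // 100
    p99 = data[min(n, max(rank, 1)) - 1]
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),
        "median": median,
        "mad": mad,
        "p99": p99,
    }


# Hypothetical latencies in ms: mostly fast I/O plus two extreme stalls,
# e.g. a slow request and one that hit an I/O timeout.
samples = [1.0] * 98 + [500.0, 30000.0]
summary = robust_summary(samples)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

With data like this the two stalls dominate both mean and standard
deviation, while the median still describes the typical request and
the high percentile exposes the tail.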
[1]
https://lore.kernel.org/linux-nvme/20240618120345.64761-1-hare@kernel.org/#r
[2]
https://lore.kernel.org/linux-nvme/20241224120457.576100-1-sagi@grimberg.me/
[3]
https://lore.kernel.org/linux-nvme/20250307132802.111513-1-hare@kernel.org/
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich