[LSF/MM BPF TOPIC] NUMA topology metrics for NVMe-oF
Hannes Reinecke
hare at suse.de
Tue Feb 20 00:03:08 PST 2024
Hi all,
having recently played around with CXL I started to wonder what
implications that would have for NVMe-over-Fabrics, and how path
selection would play out on such a system.
Thing is, on heavily NUMA systems we really should have a look at
the inter-node latencies, especially as the HW latencies are getting
close to the NUMA latencies: on an Intel two-socket node I'm seeing
inter-node latencies of around 200ns, and it's not unheard of to get
around 5M IOPS from the device, which works out to a latency of 2000ns.
And that's on PCIe 4.0. With PCIe 5.0 or CXL one expects the latency to
decrease even further.
So I think we need to factor in the NUMA topology for PCI devices, too.
We do have a 'numa' I/O policy, but that only looks at the latency
between nodes.
What we're missing is a NUMA latency for the PCI devices themselves.
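For reference, the existing 'numa' iopolicy boils down to something
like the sketch below (heavily simplified from
drivers/nvme/host/multipath.c, path-state and ANA handling omitted);
the only input is node_distance(), the device itself never shows up:

/* Simplified sketch of the current 'numa' path selection: pick the
 * sibling path whose controller lives on the node with the smallest
 * node_distance() to the submitting node.
 */
static struct nvme_ns *numa_find_path(struct nvme_ns_head *head, int node)
{
	struct nvme_ns *ns, *best = NULL;
	int dist, best_dist = INT_MAX;

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		/* path state / ANA checks omitted for brevity */
		dist = node_distance(node, ns->ctrl->numa_node);
		if (dist < best_dist) {
			best_dist = dist;
			best = ns;
		}
	}
	return best;
}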
So this discussion would be around how we could model (or even measure)
the PCI latency, and how we could modify the NVMe-oF iopolicies to take
the NUMA latencies into account when selecting the 'best' path.
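Just to make the idea concrete, one purely hypothetical direction would
be to fold a per-controller latency term into the path cost;
'link_latency_ns' below does not exist today, it would have to come
from modelling the PCIe/CXL topology or from an actual measurement at
connect time:

/* Hypothetical: combine the inter-node distance with a per-controller
 * latency estimate.  'link_latency_ns' is an invented field; the scale
 * factor merely maps the dimensionless node_distance() (LOCAL_DISTANCE
 * == 10) onto something comparable to nanoseconds.
 */
static u64 nvme_path_cost(struct nvme_ns *ns, int node)
{
	u64 numa_cost = node_distance(node, ns->ctrl->numa_node) * 20;

	return numa_cost + ns->ctrl->link_latency_ns;
}

The path selector would then pick the path with the minimal
nvme_path_cost() instead of the minimal node_distance().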
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich