[LSF/MM BPF TOPIC] NUMA topology metrics for NVMe-oF

Chaitanya Kulkarni chaitanyak at nvidia.com
Tue Feb 20 21:49:15 PST 2024


Hannes,

On 2/20/24 00:03, Hannes Reinecke wrote:
> Hi all,
>
> having recently played around with CXL I started to wonder what
> implications that would have for NVMe-over-Fabrics, and how path
> selection would play out on such a system.
>
> Thing is, with heavy NUMA systems we really should have a look at
> the inter-node latencies, especially as the HW latencies are getting
> closer to the NUMA latencies: for an Intel two-socket node I'm seeing
> latencies of around 200ns, and it's not unheard of to get around 5M
> IOPS from the device, which results in a latency of 2000ns.
> And that's on PCIe 4.0. With PCIe 5.0 or CXL one expects the latency
> to decrease even further.
>
> So I think we need to factor in the NUMA topology for PCI devices,
> too. We do have a NUMA I/O policy, but that only looks at the latency
> between nodes.
> What we're missing is a NUMA latency for the PCI devices themselves.
>
> So this discussion would be around how we could model (or even measure)
> the PCI latency, and how we could modify the NVMe-oF iopolicies to 
> take the NUMA latencies into account when selecting the 'best' path.
>
> Cheers,
>
> Hannes

I'm interested in this topic. I also think it would help the overall
discussion if, before LSFMM, we can get some baseline data on where we
stand with the current architecture and where we want the numbers to be.
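
As a strawman for collecting that baseline, something like the small
userspace sketch below could dump where each controller sits in the NUMA
topology together with the SLIT distance row for that node. It only relies
on the standard sysfs numa_node and node distance attributes; the tool
itself and its output format are purely illustrative, not a proposal for
the tree:

/*
 * Rough sketch: print each NVMe controller's NUMA node and the SLIT
 * distance row for that node, as a first cut at a "where do we stand"
 * baseline. Uses only /sys/class/nvme/<ctrl>/device/numa_node and
 * /sys/devices/system/node/node<N>/distance.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>

static int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

int main(void)
{
	DIR *d = opendir("/sys/class/nvme");
	struct dirent *de;
	char path[256], val[256];

	if (!d) {
		perror("/sys/class/nvme");
		return 1;
	}

	while ((de = readdir(d))) {
		int node;

		if (de->d_name[0] == '.')
			continue;

		/* NUMA node the controller's underlying device sits on */
		snprintf(path, sizeof(path),
			 "/sys/class/nvme/%s/device/numa_node", de->d_name);
		if (read_line(path, val, sizeof(val)))
			continue;
		node = atoi(val);
		printf("%s: numa_node %d", de->d_name, node);

		/* SLIT distance row for that node (relative latencies) */
		if (node >= 0) {
			snprintf(path, sizeof(path),
				 "/sys/devices/system/node/node%d/distance",
				 node);
			if (!read_line(path, val, sizeof(val)))
				printf(", distances [%s]", val);
		}
		printf("\n");
	}
	closedir(d);
	return 0;
}

Pairing that with per-path latency numbers, e.g. from fio runs pinned to
each node, should give a reasonable picture of how far the current numa
iopolicy is from picking the 'best' path.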

-ck



