[PATCH 1/1] nvme-pci: Add CPU latency pm-qos handling

Tero Kristo tero.kristo at linux.intel.com
Tue Oct 15 02:25:37 PDT 2024


On Wed, 2024-10-09 at 11:24 +0300, Tero Kristo wrote:
> On Wed, 2024-10-09 at 10:00 +0200, Christoph Hellwig wrote:
> > On Wed, Oct 09, 2024 at 09:45:07AM +0300, Tero Kristo wrote:
> > > Initially, I posted the patch against the block layer, but there
> > > the recommendation was to move this closer to the HW, i.e. the
> > > NVMe driver level.
> > 
> > Even if it is called from NVMe, a lot of the code is not nvme
> > specific.
> > Some of it appears block specific and other parts are entirely
> > generic.
> > 
> > But I still don't see how walking cpumasks and updating parameters
> > far away (in terms of cache lines and pointer dereferences) for
> > every single I/O could work without having a huge performance
> > impact.
> > 
> 
> Generally, the cpumask only has a couple of CPUs on it; yes, it's
> true that on certain setups every CPU of the system may end up on
> it, but then the user has the option to not enable this feature at
> all. On my test system, there is a separate NVMe IRQ for each CPU,
> so the affinity mask only contains one bit.
> 
> Also, the code tries to avoid calling the heavy PM QoS machinery by
> checking whether the request is already active, and by updating the
> values later from a workqueue. Generally, the heavier parameter
> update only happens on the first access of a burst of NVMe activity.
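To make the above concrete, the fast-path/slow-path split could look
roughly like the sketch below. This is illustrative only; the names
(nvme_qos_ctx and friends) are made up here and not from the actual
patch, but the PM QoS and workqueue calls are the standard kernel
APIs:

#include <linux/pm_qos.h>
#include <linux/workqueue.h>
#include <linux/cpu.h>

struct nvme_qos_ctx {
	struct dev_pm_qos_request req;	/* per-CPU resume-latency request */
	struct work_struct work;	/* deferred (heavy) QoS update */
	atomic_t active;		/* fast-path "already active" flag */
};

/* Hot path: called per I/O; a single atomic op when already active. */
static void nvme_qos_mark_active(struct nvme_qos_ctx *ctx)
{
	if (atomic_cmpxchg(&ctx->active, 0, 1) == 0)
		schedule_work(&ctx->work);	/* first I/O of a burst */
}

/* Slow path: runs from a workqueue, off the I/O submission path. */
static void nvme_qos_work(struct work_struct *work)
{
	struct nvme_qos_ctx *ctx = container_of(work, struct nvme_qos_ctx,
						work);

	/*
	 * Cap the exit latency of the CPU that handles the completion
	 * IRQ. A later timer/work item would clear ->active and relax
	 * the request again once the burst ends (not shown here).
	 */
	dev_pm_qos_update_request(&ctx->req, 0);
}

/* Setup, e.g. at queue creation; one context per CPU in the mask. */
static int nvme_qos_init(struct nvme_qos_ctx *ctx, int cpu)
{
	atomic_set(&ctx->active, 0);
	INIT_WORK(&ctx->work, nvme_qos_work);
	return dev_pm_qos_add_request(get_cpu_device(cpu), &ctx->req,
				      DEV_PM_QOS_RESUME_LATENCY,
				      PM_QOS_RESUME_LATENCY_NO_CONSTRAINT);
}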

I've been giving this some thought offline, but can't really think of
how this could be done in the generic layers; the code needs to figure
out which interrupt gets fired by the activity, to prevent the CPU
that is going to handle that interrupt from going into deep idle,
which would otherwise hurt both the latency and the throughput of the
request. Knowledge of this interrupt mapping only exists at the driver
level, in this case NVMe.
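The driver-level knowledge in question is the IRQ-vector-to-CPU
affinity mapping. A sketch of how the NVMe PCI driver could walk it
(nvme_qos_mark_active() is the hypothetical helper from the sketch
above; pci_irq_get_affinity() is the standard helper for querying a
vector's affinity mask):

#include <linux/pci.h>
#include <linux/cpumask.h>

/* Constrain every CPU that may service a given queue's completions. */
static void nvme_qos_constrain_queue(struct pci_dev *pdev, int vector,
				     struct nvme_qos_ctx *per_cpu_ctx)
{
	const struct cpumask *mask = pci_irq_get_affinity(pdev, vector);
	int cpu;

	if (!mask)
		return;

	for_each_cpu(cpu, mask)
		nvme_qos_mark_active(&per_cpu_ctx[cpu]);
}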

One thing that could be done is to prevent the whole feature from
being used on setups where the number of CPUs per IRQ is above some
threshold; let's say 4 as an example.
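Expressed in code, that guard would be a one-liner at setup time
(again just a sketch; the threshold constant is made up):

#define NVME_QOS_MAX_CPUS_PER_IRQ	4

/* Skip the feature when one IRQ fans out to too many CPUs. */
static bool nvme_qos_usable(const struct cpumask *irq_mask)
{
	return cpumask_weight(irq_mask) <= NVME_QOS_MAX_CPUS_PER_IRQ;
}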

-Tero


