[PATCH 2/2] nvme-pci: allow unmanaged interrupts
Keith Busch
kbusch at kernel.org
Fri May 10 17:41:58 PDT 2024
On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
> On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > > From: Keith Busch <kbusch at kernel.org>
> > > >
> > > > Some people _really_ want to control their interrupt affinity.
> > >
> > > So let them argue why. I'd rather have a really, really, really
> > > good argument for this crap, and I'd like to hear it from the horses
> > > mouth.
> >
> > It's just prioritizing predictable user task scheduling for a subset of
> > CPUs instead of having consistently better storage performance.
> >
> > We already have "isolcpus=managed_irq," parameter to prevent managed
> > interrupts from running on a subset of CPUs, so the use case is already
> > kind of supported. The problem with that parameter is it is a no-op if
> > the starting affinity spread contains only isolated CPUs.
>
> Can you explain a bit why it is a no-op? If only isolated CPUs are
> spread on one queue, there will be no IO originated from these isolated
> CPUs, that is exactly what the isolation needs.
The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.
It only limits where the managed irq will assign the effective_cpus as a
best effort.
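One quick way to see that steering (or the lack of it) from userspace is
to compare each vector's computed spread with where it actually got
programmed. A rough sketch, assuming the device shows up as nvme0 and the
kernel exposes effective_affinity_list (CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK):

# For each nvme0 vector, print the spread mask the irq core computed
# (smp_affinity_list) next to the CPU(s) it was actually programmed to
# (effective_affinity_list).
grep nvme0 /proc/interrupts | awk '{print int($1)}' | while read irq; do
    echo "irq $irq: spread=$(cat /proc/irq/$irq/smp_affinity_list)" \
         "effective=$(cat /proc/irq/$irq/effective_affinity_list)"
done

When a spread contains at least one housekeeping CPU, the effective CPU
gets steered there; when the spread is only isolated CPUs, it is left
alone, which is the case the example below runs into.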
Example: I boot a system with 4 CPU threads, one nvme device, and this
kernel parameter:
isolcpus=managed_irq,2-3
Run this:
for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done
Check the interrupt counts ("grep nvme0 /proc/interrupts"):
            CPU0       CPU1       CPU2       CPU3
 ...
 26:        1000          0          0          0   PCI-MSIX-0000:00:05.0   1-edge  nvme0q1
 27:           0       1004          0          0   PCI-MSIX-0000:00:05.0   2-edge  nvme0q2
 28:           0          0       1000          0   PCI-MSIX-0000:00:05.0   3-edge  nvme0q3
 29:           0          0          0       1043   PCI-MSIX-0000:00:05.0   4-edge  nvme0q4
The isolcpus option did nothing because each vector's mask had just one
CPU; there was nowhere else the managed irq could be sent. The
documentation seems to indicate that was by design, as a "best effort".
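If you want to double check that, the per-vector masks are visible in
procfs (IRQ numbers taken from the output above; they will differ on
other systems):

# Each nvme0 queue vector should report exactly one CPU here (0, 1, 2, 3),
# leaving no housekeeping CPU for the best effort to fall back to.
for irq in 26 27 28 29; do
    echo "irq $irq: $(cat /proc/irq/$irq/smp_affinity_list)"
done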