[PATCH 2/2] nvme-pci: allow unmanaged interrupts
Keith Busch
kbusch at kernel.org
Fri May 10 17:41:58 PDT 2024
On Sat, May 11, 2024 at 07:50:21AM +0800, Ming Lei wrote:
> On Fri, May 10, 2024 at 10:20:02AM -0600, Keith Busch wrote:
> > On Fri, May 10, 2024 at 05:10:47PM +0200, Christoph Hellwig wrote:
> > > On Fri, May 10, 2024 at 07:14:59AM -0700, Keith Busch wrote:
> > > > From: Keith Busch <kbusch at kernel.org>
> > > >
> > > > Some people _really_ want to control their interrupt affinity.
> > >
> > > So let them argue why. I'd rather have a really, really, really
> > > good argument for this crap, and I'd like to hear it from the horses
> > > mouth.
> >
> > It's just prioritizing predictable user task scheduling for a subset of
> > CPUs instead of having consistently better storage performance.
> >
> > We already have "isolcpus=managed_irq," parameter to prevent managed
> > interrupts from running on a subset of CPUs, so the use case is already
> > kind of supported. The problem with that parameter is it is a no-op if
> > the starting affinity spread contains only isolated CPUs.
>
> Can you explain a bit why it is a no-op? If only isolated CPUs are
> spread on one queue, there will be no IO originated from these isolated
> CPUs, that is exactly what the isolation needs.
The "isolcpus=managed_irq," option doesn't limit the dispatching CPUs.
It only limits where the managed irq will assign the effective_cpus as a
best effort.
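One quick way to see that steering (or the lack of it) from userspace is
to compare each vector's computed spread with where it actually got
programmed. A rough sketch, assuming the device shows up as nvme0 and the
kernel exposes effective_affinity_list (CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK):

# For each nvme0 vector, print the spread mask the irq core computed
# (smp_affinity_list) next to the CPU(s) it was actually programmed to
# (effective_affinity_list).
grep nvme0 /proc/interrupts | awk '{print int($1)}' | while read irq; do
    echo "irq $irq: spread=$(cat /proc/irq/$irq/smp_affinity_list)" \
         "effective=$(cat /proc/irq/$irq/effective_affinity_list)"
done

When a spread contains at least one housekeeping CPU, the effective CPU
gets steered there; when the spread is only isolated CPUs, it is left
alone, which is the case the example below runs into.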
Example: I boot a system with 4 CPU threads, one nvme device, and this
kernel parameter:
isolcpus=managed_irq,2-3
Run this:
for i in $(seq 0 3); do taskset -c $i dd if=/dev/nvme0n1 of=/dev/null bs=4k count=1000 iflag=direct; done
Check the interrupt counts ("grep nvme0 /proc/interrupts"):
            CPU0       CPU1       CPU2       CPU3
 ...
 26:        1000          0          0          0   PCI-MSIX-0000:00:05.0   1-edge  nvme0q1
 27:           0       1004          0          0   PCI-MSIX-0000:00:05.0   2-edge  nvme0q2
 28:           0          0       1000          0   PCI-MSIX-0000:00:05.0   3-edge  nvme0q3
 29:           0          0          0       1043   PCI-MSIX-0000:00:05.0   4-edge  nvme0q4
The isolcpus option did nothing because each vector's mask had just one
CPU; there was nowhere else the managed irq could be sent. The
documentation seems to indicate that was by design, as a "best effort".
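If you want to double check that, the per-vector masks are visible in
procfs (IRQ numbers taken from the output above; they will differ on
other systems):

# Each nvme0 queue vector should report exactly one CPU here (0, 1, 2, 3),
# leaving no housekeeping CPU for the best effort to fall back to.
for irq in 26 27 28 29; do
    echo "irq $irq: $(cat /proc/irq/$irq/smp_affinity_list)"
done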