[PATCH V3] nvme-pci: allow unmanaged interrupts

Tue Jul 2 18:51:02 PDT 2024

On Tue, Jul 02, 2024 at 06:28:19PM +0200, Daniel Wagner wrote:
> On Tue, Jul 02, 2024 at 08:12:11PM GMT, Ming Lei wrote:
> > On Tue, Jul 02, 2024 at 01:50:02PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> > > > From: Keith Busch <kbusch at kernel.org>
> > > > 
> > > > People _really_ want to control their interrupt affinity in some
> > > > cases, such as Openshift with Performance profile, in which each
> > > > irq's affinity is completely specified from userspace. Turns out
> > > > that 'isolcpus=managed_irqs' isn't enough.
> > > > 
> > > > Add module parameter to allow unmanaged interrupts, just as some
> > > > SCSI drivers are doing.
> > > 
> > > Same as before: hell no.  We can't just add hacky global kernel
> > > parameters everywhere.  We need the cpu isolation infrastructure to
> > > work properly instead of piling hacks of hacks in every relevant driver.
> > 
> > Per my understanding, the cpu isolation infrastructure can't work for
> > Openshift here: IO workloads can be run by applications executing on
> > isolated CPUs, while userspace expects interrupts to be triggered only
> > on user-specified CPU cores, in a controllable way.
> > 
> > Marcelo and Lawrence may have more input in this area.
> > 
> > Also, irq allocation really belongs to the device and driver, so how can
> > that be a hack? We may not even abstract a public API in the block layer
> > for handling irq-related things.
> 
> I am confused. I thought you told me that my series 'nvme-pci: honor
> isolcpus configuration' is not necessary. But you still need this patch

Your patch basically fixes nothing, and meanwhile it introduces a regression.
But I don't object to the approach if the blk-mq regressions can be solved.

> to get the affinity sorted out? Wouldn't it make sense to figure out how
> we can make my series work for your use case as well? E.g. we could
> introduce another HK type (io_queue) to control the affinity. This would
> decouple it from the managed_irq option.

Adding a new HK type can't help with this issue, because the Openshift
environment needs to control each irq's affinity by itself dynamically, and
IO workloads may even be run on isolated CPUs.
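
To illustrate what "controlling each irq's affinity from userspace" means in
practice, here is a minimal sketch that writes a CPU list to
/proc/irq/<irq>/smp_affinity_list. The helper names and the proc_root
parameter are hypothetical, added only for illustration; this is not how
Openshift implements it. As far as I know the kernel rejects such a write
for a managed interrupt (usually with an I/O error), which is why userspace
needs the vectors to be unmanaged.

```python
import os


def irq_affinity_path(irq, proc_root="/proc"):
    """Path of the CPU-list affinity file for one irq (hypothetical helper)."""
    return os.path.join(proc_root, "irq", str(irq), "smp_affinity_list")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Pin `irq` to `cpus` by writing a comma-separated CPU list.

    Needs root on a real system. For a managed interrupt the write is
    expected to raise OSError, because the kernel owns that affinity.
    """
    cpulist = ",".join(str(c) for c in sorted(set(cpus)))
    with open(irq_affinity_path(irq, proc_root), "w") as f:
        f.write(cpulist)
    return cpulist
```

A userspace agent would call something like set_irq_affinity(irq, cpus)
again whenever its CPU partitioning changes, which is the dynamic part that
a boot-time HK type cannot cover.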

Thanks, 
Ming