[PATCH V3] nvme-pci: allow unmanaged interrupts

Tue Jul 2 05:20:30 PDT 2024

OpenShift needs the ability to dynamically move the IRQs of all drivers away from a specific set of CPUs at the point when an isolated workload starts running on those CPUs and requires high performance guarantees, i.e. no HW interrupts occurring on them. To achieve this, smp_affinity is set dynamically for all drivers. At the moment the NVMe driver does not support this, so the NVMe IRQs remain on CPUs they should not be on and impact the performance of the isolated workload.
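
For illustration only, here is a minimal sketch of what "dynamic setting of smp_affinity" means from userspace. It is not the OpenShift implementation; the function names and the proc_root parameter are hypothetical, and it assumes the usual /proc/irq/<N>/smp_affinity interface, which takes a hex CPU bitmask in comma-separated 32-bit groups. As far as I understand, the kernel rejects such a write for a managed interrupt (typically with EIO), which is the behaviour described above for the NVMe IRQs.

```python
import errno
from pathlib import Path


def cpus_to_mask(cpus):
    """Format CPU numbers as an smp_affinity string (32-bit hex groups)."""
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    groups = []
    while True:
        # Emit the lowest 32 bits, then shift to the next group.
        groups.append(f"{bits & 0xFFFFFFFF:08x}")
        bits >>= 32
        if not bits:
            break
    # The most significant group comes first in the proc file format.
    return ",".join(reversed(groups))


def housekeeping_cpus(online, isolated):
    """CPUs that may still take interrupts: online minus isolated."""
    return sorted(set(online) - set(isolated))


def move_irq(irq, cpus, proc_root="/proc"):
    """Try to retarget one IRQ; return False if the kernel refuses.

    Assumption: a refused write (e.g. for a managed IRQ) surfaces as EIO.
    """
    path = Path(proc_root) / "irq" / str(irq) / "smp_affinity"
    try:
        path.write_text(cpus_to_mask(cpus))
    except OSError as exc:
        if exc.errno == errno.EIO:
            return False
        raise
    return True
```

With 8 online CPUs and CPUs 2-3 isolated, cpus_to_mask(housekeeping_cpus(range(8), [2, 3])) gives the mask that would be written for every IRQ; any IRQ for which move_irq() returns False stays where the kernel placed it.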

The ability to set smp_affinity seems like a generic feature that should be supported by all drivers. I'm unclear how adding this feature to the NVMe driver can be viewed as a hack.

Thanks,
Lawrence

-----Original Message-----
From: Ming Lei <ming.lei at redhat.com> 
Sent: Tuesday, July 2, 2024 1:12 PM
To: Christoph Hellwig <hch at lst.de>
Cc: Keith Busch <kbusch at kernel.org>; linux-nvme at lists.infradead.org; Sagi Grimberg <sagi at grimberg.me>; Lawrence Troup (ltroup) <ltroup at cisco.com>; Marcelo Tosatti <mtosatti at redhat.com>; ming.lei at redhat.com
Subject: Re: [PATCH V3] nvme-pci: allow unmanaged interrupts

On Tue, Jul 02, 2024 at 01:50:02PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 02, 2024 at 06:41:12PM +0800, Ming Lei wrote:
> > From: Keith Busch <kbusch at kernel.org>
> > 
> > People _really_ want to control their interrupt affinity in some 
> > cases, such as Openshift with Performance profile, in which each 
> > irq's affinity is completely specified from userspace. Turns out 
> > that 'isolcpus=managed_irqs' isn't enough.
> > 
> > Add module parameter to allow unmanaged interrupts, just as some 
> > SCSI drivers are doing.
> 
> Same as before: hell no.  We can't just add hacky global kernel 
> parameters everywhere.  We need the cpu isolation infrastructure to 
> work properly instead of piling hacks of hacks in every relevant driver.

Per my understanding, the cpu isolation infrastructure can't work for OpenShift here: IO workloads can be run by applications that execute on isolated CPUs, while userspace expects interrupts to be triggered only on user-specified CPU cores, in a controllable way.

Marcelo and Lawrence may have more input in this area.

Also, irq allocation really belongs to the device and driver, so how can that be a hack? We may not even abstract a public API in the block layer for handling irq-related things.

Thanks,
Ming