Seeking help with NVMe arbitration questions

Wang Yicheng wangyicheng1209 at gmail.com
Fri Apr 28 11:18:03 PDT 2023


> You might expect that a new command placed on a shallow queue will
> be handled ahead of a command placed on a deep queue at the same
> time. Indeed, some implementations may even show desirable results
> with that scheme, but the spec doesn't really guarantee that.
Thanks Keith, that makes total sense... This also aligns with what I
observed in my experiments (please see below). I was hoping task2 would
benefit from the larger queue proportion, but it didn't.
1. With the default queue set-up, I ran 2 identical randwrite FIO tasks
(3 identical jobs each) against the same set of CPUs. The 2 tasks got
the same performance.
2. After changing to a 1/64/2 queue set-up with the same FIO tasks, the
result was very close to test 1, even though task1 went to default
queues and task2 went to poll queues. Task2 did run faster, but the
difference was trivial and seemed to come from IO polling rather than
from the queue distribution. (A rough sketch of the setup is below.)
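
For reference, a rough sketch of how I drove the two tasks (not my exact
job files -- the device path, CPU set, block size, runtime and poll-queue
count are placeholders):

  # Poll queues are carved out via the nvme module parameter at load time
  modprobe -r nvme && modprobe nvme poll_queues=2

  # Task 1: interrupt-driven submission, lands on the default queues
  fio --name=task1 --filename=/dev/nvme0n1 --ioengine=io_uring \
      --rw=randwrite --bs=4k --direct=1 --numjobs=3 \
      --cpus_allowed=0-2 --time_based --runtime=60

  # Task 2: identical workload, but --hipri makes io_uring submit polled
  # I/O, which is routed to the poll queues
  fio --name=task2 --filename=/dev/nvme0n1 --ioengine=io_uring --hipri \
      --rw=randwrite --bs=4k --direct=1 --numjobs=3 \
      --cpus_allowed=0-2 --time_based --runtime=60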

This leaves me confused about the motivation for introducing different
queue types. Isn't the aim to provide some sort of prioritization?

Best,
Yicheng

On Fri, Apr 28, 2023 at 8:43 AM Keith Busch <kbusch at kernel.org> wrote:
>
> On Thu, Apr 27, 2023 at 05:36:12PM -0700, Wang Yicheng wrote:
> > Thanks a lot Keith! This is very helpful!
> >
> > 1. Then do you see a way to prioritize a specific set of IOs (favoring
> > small writes over large writes) from the IO queue's perspective?
> > Initially I was thinking of WRR, which turned out not to be
> > supported. If I want to leverage the IO queues to achieve the same
> > goal, from what I understand I can simply send small writes to poll
> > queues and allocate more of those queues. Say small writes make up
> > 20% of the total IOs on average. If I then allocate 40% of the total
> > queues as poll queues, in some sense I give more weight to small
> > writes and thus prioritize them.
>
> You might expect that a new command placed on a shallow queue will
> be handled ahead of a command placed on a deep queue at the same
> time. Indeed, some implementations may even show desirable results
> with that scheme, but the spec doesn't really guarantee that.
>
> For a purely software-side solution, you could use an I/O scheduler
> and set your ioprio accordingly.
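
[Noting this down for myself: I think the software-side route would look
roughly like the below; the scheduler choice, device name and priority
values are only examples of mine, not something Keith suggested.]

  # Pick a scheduler on the namespace that honors I/O priorities, e.g. bfq
  echo bfq > /sys/block/nvme0n1/queue/scheduler

  # Run the favored small writes at a higher best-effort priority level
  ionice -c 2 -n 0 fio --name=small-writes ...
  ionice -c 2 -n 7 fio --name=large-writes ...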
>
> > > > 3. Say I have only 1 default queue and I submit an I/O from some CPU,
> > > > then there is a chance that the I/O would need to cross CPUs if the
> > > > default queue happens not to be on the same core, right?
> > >
> > > If you only have one queue of a particular type, then the sysfs mq
> > > directory for that queue should show cpu_list having all CPUs set,
> > > so no CPU crossing is necessary for dispatch. In fact, for any queue
> > > count and CPU topology, dispatch should never need to reschedule to
> > > another core (that was the point of the design). Completions, on the
> > > other hand, are typically affinitized to a single specific CPU, so
> > > the completion may happen on a different core than your submission
> > > in this scenario.
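
[Side note to myself: I take this to mean the dispatch mapping can be read
straight from sysfs, e.g. as below, with nvme0n1 as a placeholder device.]

  # One directory per hardware context; cpu_list shows which CPUs dispatch to it
  grep . /sys/block/nvme0n1/mq/*/cpu_list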
> >
> > 2. You mentioned that completions are affinitized to a single specific
> > CPU, which is exactly what I observed in my test. It also seems to
> > hurt performance. Is there a way to query that affinity, or is it
> > invisible from outside?
>
> To query a queue's affinity, check /proc/irq/<#>/effective_affinity.
> You can check /proc/interrupts to determine which irq# goes with
> which queue.
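
For completeness, I read that as roughly the following two steps (the
nvme0q* interrupt naming is what I expect to see, but it may vary):

  # Find the irq number assigned to each nvme queue...
  grep nvme /proc/interrupts
  # ...then read that queue's effective affinity mask
  cat /proc/irq/<irq#>/effective_affinity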


