[PATCH rfc 0/6] convert nvme pci to use irq-poll service

Wed Oct 5 15:49:57 PDT 2016

On Thu, Oct 06, 2016 at 12:55:07AM +0300, Sagi Grimberg wrote:
> 
> > > I ran some tests with this and it seemed to work pretty well with
> > > my low-end nvme devices. One phenomenon I've encountered was that
> > > for single core long queue-depth'ed randread workload I saw around
> > > ~8-10% iops decrease. However when running multi-core IO I didn't
> > > see any noticeable performance degradation. non-polling Canonical
> > > randread latency doesn't seem to be affected as well. And also
> > > polling mode IO is not affected as expected.
> > > 
> > > So in addition for review and feedback, this is a call for testing
> > > and benchmarking as this touches the critical data path.
> > 
> > Hi Sagi,
> > 
> > Just reconfirming your findings with another data point: I ran this on
> > controllers with 3D XPoint media, and queue-depth-1 4k random read
> > latency increased almost 7%. I'll try to see if there's anything else
> > we can do to bring that in.
> 
> Actually I didn't notice latency increase with QD=1 but I'm using
> low-end devices so I might have missed it.
> Did you use libaio or psync (for polling mode)?

I used 'sync' ioengine for random read. For sequential, I used 'dd',
and it showed the same difference.

If I use pvsync2 with --hipri in fio, then I see little to no difference.

> I'm a bit surprised that scheduling soft-irq (on the same core) is
> so expensive (the networking folks are using it all over...)
> Perhaps we need to look into napi and see if we're doing something
> wrong there...

The latency increase I observed was ~0.5 microseconds. That may get lost
in the noise for most controllers. Maybe that's within the expected
increase when using this soft-irq method?

> I wonder if we kept the cq processing in queue_rq but budget it to
> something normally balanced? maybe poll budget to 4 completions?
>
> Does this have any effect with your Xpoint?

No difference with that.
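
For anyone following along, here is a rough userspace model of what
"budgeted" completion processing means. It is only a sketch under stated
assumptions: struct cq, cq_device_post() and cq_process() are made-up
names for illustration, not the driver's __nvme_process_cq(), and the
simulated device never overruns the host.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CQ_DEPTH 8

/* Hypothetical, simplified completion entry; only the phase bit matters. */
struct cqe {
	uint16_t command_id;
	uint8_t phase;
};

struct cq {
	struct cqe entries[CQ_DEPTH];
	uint16_t head;      /* host: next slot to consume */
	uint8_t phase;      /* host: phase expected for a new entry */
	uint16_t dev_tail;  /* simulated device: next slot to write */
	uint8_t dev_phase;  /* simulated device: phase it writes */
};

static void cq_init(struct cq *q)
{
	memset(q, 0, sizeof(*q));
	q->phase = 1;       /* zeroed entries carry phase 0, i.e. "not valid yet" */
	q->dev_phase = 1;
}

/* Simulated device posting one completion (assumes the ring is not full). */
static void cq_device_post(struct cq *q, uint16_t command_id)
{
	q->entries[q->dev_tail].command_id = command_id;
	q->entries[q->dev_tail].phase = q->dev_phase;
	if (++q->dev_tail == CQ_DEPTH) {
		q->dev_tail = 0;
		q->dev_phase ^= 1;
	}
}

/* Consume at most 'budget' valid entries; return how many were consumed. */
static int cq_process(struct cq *q, int budget)
{
	int done = 0;

	assert(budget >= 0);
	while (done < budget && q->entries[q->head].phase == q->phase) {
		/* a real driver would complete the request here */
		if (++q->head == CQ_DEPTH) {
			q->head = 0;
			q->phase ^= 1;
		}
		done++;
	}
	return done;
}
```

With six completions pending and a budget of 4, the first call reaps
four entries and leaves two for the next call, so the submission path
never spends unbounded time reaping.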

I was actually never sure about having this opportunistic polling in the
IO submission path. For one, it can create higher latency outliers if
a different controller's interrupt, affinitized to the same CPU, posts
a completion first.

Another thing we can do if we don't poll for completions in the submission
path is split the single q_lock into submission and completion locks. Then
we don't have to disable IRQs in the submission path. I saw that demo'ed
a while ago, and it was a small micro-improvement, but I never saw the
patch proposed on the public lists.
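
To make the lock-split idea concrete, here is a minimal userspace sketch.
Everything in it is an assumption for illustration (struct demo_queue,
demo_submit(), demo_complete(), and the toy C11 spinlock); it is not the
patch I saw and not the driver's code. Userspace cannot model interrupt
context, so the sketch only shows that submission and completion no longer
contend on the same lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define Q_DEPTH 8

/* Toy spinlock built on C11 atomic_flag; stands in for spinlock_t. */
struct spinlock {
	atomic_flag flag;
};

static void spin_init(struct spinlock *l)
{
	atomic_flag_clear(&l->flag);
}

static void spin_lock(struct spinlock *l)
{
	while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
		; /* busy-wait until the holder releases */
}

static void spin_unlock(struct spinlock *l)
{
	atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

/* Hypothetical queue with the single q_lock split in two. */
struct demo_queue {
	struct spinlock sq_lock; /* protects sq_tail; submission path only */
	struct spinlock cq_lock; /* protects cq_head; completion path only */
	uint16_t sq_tail;
	uint16_t cq_head;
};

static void demo_queue_init(struct demo_queue *q)
{
	spin_init(&q->sq_lock);
	spin_init(&q->cq_lock);
	q->sq_tail = 0;
	q->cq_head = 0;
}

/* Submission touches only sq_lock. In the kernel analogue this lock would
 * never be taken from interrupt context, so IRQs could stay enabled. */
static uint16_t demo_submit(struct demo_queue *q)
{
	uint16_t slot;

	spin_lock(&q->sq_lock);
	slot = q->sq_tail;
	q->sq_tail = (uint16_t)((q->sq_tail + 1) % Q_DEPTH);
	spin_unlock(&q->sq_lock);
	return slot;
}

/* Completion touches only cq_lock; returns the new head. */
static uint16_t demo_complete(struct demo_queue *q, uint16_t n)
{
	uint16_t head;

	spin_lock(&q->cq_lock);
	q->cq_head = (uint16_t)((q->cq_head + n) % Q_DEPTH);
	head = q->cq_head;
	spin_unlock(&q->cq_lock);
	return head;
}
```

Calling demo_submit() while cq_lock is held still makes progress, which
is the whole point of the split; with a single shared lock the same call
would spin forever.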

> --
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 150941a1a730..28b33f518a3d 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -608,6 +608,7 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>                 goto out;
>         }
>         __nvme_submit_cmd(nvmeq, &cmnd);
> +       __nvme_process_cq(nvmeq, 4);
>         spin_unlock_irq(&nvmeq->q_lock);
>         return BLK_MQ_RQ_QUEUE_OK;
>  out:
> --