[PATCH] Remove redundant writes to uncached sqe memory
Sam Bradshaw (sbradshaw)
sbradshaw at micron.com
Mon May 12 12:41:52 PDT 2014
> -----Original Message-----
> From: Matthew Wilcox [mailto:willy at linux.intel.com]
> Sent: Saturday, May 10, 2014 7:53 PM
> To: Sam Bradshaw (sbradshaw)
> Cc: linux-nvme at lists.infradead.org
> Subject: Re: [PATCH] Remove redundant writes to uncached sqe memory
>
> On Fri, May 09, 2014 at 01:44:47PM -0700, Sam Bradshaw wrote:
> > The memset to clear the SQE in nvme_submit_iod() is made partially
> > redundant by subsequent writes. This patch explicitly clears each
> > SQE structure member in ascending order, eliminating the need for
> > the memset. With this change, our perf runs show ~1.5% less time
> > spent in the IO submission path and minor reduced q lock contention.
>
> I'm shocked! I thought that zeroing the cacheline first would be better
> performing than storing into parts of the cacheline. But I can't argue
> with your numbers. I think your patch is missing a store to the metadata
> element though; care to rerun your test with that added?
Yes, thanks for pointing out the missing store to ->metadata.
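For anyone following along, the pattern under discussion looks roughly like
the sketch below. This is not the kernel's actual code path: the struct is a
simplified stand-in for the NVMe submission queue entry (the real layout is
struct nvme_rw_command in the kernel headers), and the fill functions and
their parameters are hypothetical. The point is that the explicit variant
stores every member exactly once, in ascending offset order, so the memset
becomes unnecessary -- provided no member (such as ->metadata) is forgotten.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified, illustrative stand-in for the 64-byte SQE; field names
 * mirror struct nvme_rw_command but this is not the kernel definition. */
struct sqe {
	uint8_t  opcode;
	uint8_t  flags;
	uint16_t command_id;
	uint32_t nsid;
	uint64_t rsvd2;
	uint64_t metadata;	/* the store Matthew noted was missing */
	uint64_t prp1;
	uint64_t prp2;
	uint64_t slba;
	uint16_t length;
	uint16_t control;
	uint32_t dsmgmt;
	uint32_t reftag;
	uint16_t apptag;
	uint16_t appmask;
};

/* Old approach: zero the whole SQE, then overwrite most of it. */
static void fill_sqe_memset(struct sqe *cmd, uint8_t op, uint16_t cid,
			    uint64_t slba, uint16_t len)
{
	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode = op;
	cmd->command_id = cid;
	cmd->prp1 = 0x1000;	/* dummy PRP for illustration */
	cmd->slba = slba;
	cmd->length = len;
}

/* Patched approach: no memset; every member is stored exactly once,
 * in ascending offset order, with unused members written as zero. */
static void fill_sqe_explicit(struct sqe *cmd, uint8_t op, uint16_t cid,
			      uint64_t slba, uint16_t len)
{
	cmd->opcode = op;
	cmd->flags = 0;
	cmd->command_id = cid;
	cmd->nsid = 0;
	cmd->rsvd2 = 0;
	cmd->metadata = 0;
	cmd->prp1 = 0x1000;	/* dummy PRP for illustration */
	cmd->prp2 = 0;
	cmd->slba = slba;
	cmd->length = len;
	cmd->control = 0;
	cmd->dsmgmt = 0;
	cmd->reftag = 0;
	cmd->apptag = 0;
	cmd->appmask = 0;
}
```

A quick way to check such a conversion is to poison the destination (e.g.
with 0xAA) before the explicit fill and compare the result byte-for-byte
against the memset version; any member the explicit path forgot shows up
immediately. (This struct has no padding, so memcmp is valid here.)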
Without memset:
+ 35.35% fio nvme_process_cq
+ 10.85% fio nvme_make_request
+ 10.64% fio nvme_map_bio
+ 10.49% fio free_cmdid
+ 9.65% fio alloc_cmdid
+ 9.52% fio bio_completion
+ 5.18% fio nvme_submit_iod
+ 2.38% fio nvme_submit_bio_queue
+ 2.26% fio nvme_alloc_iod
+ 2.17% fio nvme_setup_prps
+ 0.90% fio nvme_free_iod
+ 0.61% fio nvme_irq
With memset:
+ 36.24% fio nvme_process_cq
+ 11.04% fio nvme_make_request
+ 10.76% fio nvme_map_bio
+ 9.48% fio free_cmdid
+ 9.24% fio alloc_cmdid
+ 8.51% fio bio_completion
+ 6.33% fio nvme_submit_iod
+ 2.38% fio nvme_submit_bio_queue
+ 2.30% fio nvme_alloc_iod
+ 2.18% fio nvme_setup_prps
+ 0.91% fio nvme_free_iod
+ 0.62% fio nvme_irq
The numbers pretty consistently show nvme_submit_iod() taking more
cpu time with the memset. But, to be fair, there is run-to-run
variation of +/- ~1% in cpu time for any of the bigger offenders.
The test setup is a bit odd for several (deliberate) reasons:
1) the hardware supports fewer queue pairs than there are cpu cores.
2) fio io submission threads are pegged to cores on nodes other than
the nodes owning the SQ/CQ physical memory, causing the SQE update
to cross a QPI link.
3) io submission threads concurrently run on more than one remote
core and contend for the SQ lock.
If you're not so keen on these data as justification, I could rerun
the experiment on platforms with differing coherency models. May
take a bit of time, though.
-Sam