[PATCH] NVMe: Remove superfluous cqe_seen
Matthew Wilcox
willy at linux.intel.com
Thu Jun 19 09:59:57 PDT 2014
On Thu, May 22, 2014 at 12:10:19AM +0000, Sam Bradshaw (sbradshaw) wrote:
> Performance problem, though not very easily measured. At very high iops
> rates, most if not all cqe's are processed via nvme_process_cq() in
> make_request(), leaving nvme_irq() with no work to do. Nevertheless, it
> always writes cqe_seen, which invalidates a very hot cacheline. This
> is somewhat exacerbated when IO submissions originate on a remote node
> relative to the cpu handling the irq.
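(For reference, the write being described is the unconditional clear at the
end of the interrupt handler; this is a from-memory sketch of nvme_irq(), so
the exact lines may differ slightly:)

        static irqreturn_t nvme_irq(int irq, void *data)
        {
                irqreturn_t result;
                struct nvme_queue *nvmeq = data;

                spin_lock(&nvmeq->q_lock);
                nvme_process_cq(nvmeq);
                result = nvmeq->cqe_seen ? IRQ_HANDLED : IRQ_NONE;
                nvmeq->cqe_seen = 0;    /* stored even when no CQE was reaped */
                spin_unlock(&nvmeq->q_lock);
                return result;
        }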
I was thinking "Hey, we should move cqe_seen to a different cacheline".
So I looked at the cacheline assignments for the different variables,
and cqe_seen is on the same cacheline as cq_head and cq_phase, so that
cacheline is already being dirtied. Indeed, it's in the same Dword as
cq_phase, so I'd be amazed if the CPU didn't coalesce the two writes.
That might be a more fruitful patch ... rearrange nvme_queue to put
cq_head, cq_phase and cqe_seen in the same Dword, and expect the CPU to
optimise the three assignments into a single Dword store.
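The three assignments I mean are the back-to-back stores at the end of
nvme_process_cq() (again sketched from memory, so treat the details as
approximate):

        /* tail of nvme_process_cq(), roughly */
        writel(head, nvmeq->q_db + ...);        /* ring the CQ head doorbell */
        nvmeq->cq_head = head;                  /* u16 */
        nvmeq->cq_phase = phase;                /* u8  */
        nvmeq->cqe_seen = 1;                    /* u8  */
        return 1;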
I'll let you try it out since you have the setup to benchmark it. Right now,
this is the layout I see:
        /* --- cacheline 3 boundary (192 bytes) --- */
        u32 *                      q_db;                 /*   192     8 */
        u16                        q_depth;              /*   200     2 */
        u16                        cq_vector;            /*   202     2 */
        u16                        sq_head;              /*   204     2 */
        u16                        sq_tail;              /*   206     2 */
        u16                        cq_head;              /*   208     2 */
        u16                        qid;                  /*   210     2 */
        u8                         cq_phase;             /*   212     1 */
        u8                         cqe_seen;             /*   213     1 */
        u8                         q_suspended;          /*   214     1 */
I notice a 4-byte hole after q_lock, so moving cq_head, cq_phase and
cqe_seen into that space would probably be a good idea (since that
cacheline is definitely dirty). I really haven't tried to optimise the
frequently-updated parts of the data structure into the same cacheline,
and doing so should really help your bizarre setup :-).
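Concretely, something like this is what I have in mind (untested sketch; it
assumes spinlock_t is 4 bytes, i.e. no lock debugging, which is where the
hole comes from):

        struct nvme_queue {
                ...
                spinlock_t q_lock;
                /* pack the hot CQ state into the 4-byte hole after q_lock */
                u16 cq_head;
                u8 cq_phase;
                u8 cqe_seen;
                ...             /* everything else unchanged */
        };

(with the current declarations of those three fields dropped from further
down the struct, of course).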