[PATCH for-4.5 06/13] NVMe: Remove WQ_MEM_RECLAIM from nvme work queue

Keith Busch keith.busch at intel.com
Wed Feb 10 15:37:47 PST 2016


On Wed, Feb 10, 2016 at 10:46:41AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 10, 2016 at 11:17:23AM -0700, Keith Busch wrote:
> > This isn't used for work in the memory reclaim path, and we may need
> > to sync with work queues that also are not flagged memory reclaim. This
> > fixes a kernel warning if we ever do sync with such a work queue.
> 
> We do need it during memory reclaim: memory reclaim in general
> does I/O, which can be on NVMe.  We then need the workqueue to
> abort a command or reset an overloaded controller to make progress.
> Not having WQ_MEM_RECLAIM risks deadlocks in heavily loaded systems.

Darn. Invalidating a disk drains the LRU, which syncs with work scheduled
on the system_wq. Syncing with that from a WQ_MEM_RECLAIM work queue
hits a kernel warning.

That lru drain work is reclaiming memory, though. Does this need
to be using a WQ_MEM_RECLAIM queue, then?
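
To make the dependency concrete, here is a minimal userspace sketch of the
rule that produces the warning. It is a simplified model, not the kernel's
code: struct model_wq, MODEL_WQ_MEM_RECLAIM, flush_would_warn() and the
three queue objects are hypothetical names invented for illustration, and
the flag value is arbitrary. The assumption modeled is that a work item
running on a WQ_MEM_RECLAIM queue must not wait on a queue that lacks the
flag, since the latter has no forward-progress guarantee under memory
pressure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value for this model only, not the kernel's. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)

/* Hypothetical stand-in for a workqueue: just a name and its flags. */
struct model_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Returns true when the flush would trip the warning: the flusher is a
 * work item running on a reclaim-flagged queue and the target queue is
 * not reclaim-flagged. flusher == NULL models a flush issued from
 * ordinary process context rather than from a work item.
 */
static bool flush_would_warn(const struct model_wq *flusher,
			     const struct model_wq *target)
{
	assert(target != NULL);

	/* Waiting on a reclaim-safe queue is always acceptable. */
	if (target->flags & MODEL_WQ_MEM_RECLAIM)
		return false;

	return flusher != NULL && (flusher->flags & MODEL_WQ_MEM_RECLAIM);
}

/* Hypothetical queues mirroring the situation discussed in this thread. */
static const struct model_wq model_nvme_wq = {
	"nvme", MODEL_WQ_MEM_RECLAIM
};
static const struct model_wq model_system_wq = {
	"events", 0
};
static const struct model_wq model_system_mem_wq = {
	"events_mem_unbound", MODEL_WQ_MEM_RECLAIM
};
```

In this model, flushing model_system_wq from work on model_nvme_wq is
flagged, while flushing model_system_mem_wq from the same context is not,
which is the effect the alternate patch is after.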

This is the alternate patch I didn't plan to submit:

---
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 0e32bc7..f7cc91e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -356,6 +359,7 @@ extern struct workqueue_struct *system_unbound_wq;
 extern struct workqueue_struct *system_freezable_wq;
 extern struct workqueue_struct *system_power_efficient_wq;
 extern struct workqueue_struct *system_freezable_power_efficient_wq;
+extern struct workqueue_struct *system_mem_wq;

 extern struct workqueue_struct *
 __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 61a0264..57a50d2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5483,10 +5483,13 @@ static int __init init_workqueues(void)
        system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
                                              WQ_FREEZABLE | WQ_POWER_EFFICIENT,
                                              0);
+       system_mem_wq = alloc_workqueue("events_mem_unbound", WQ_UNBOUND | WQ_MEM_RECLAIM,
+                                           WQ_UNBOUND_MAX_ACTIVE);
        BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
               !system_unbound_wq || !system_freezable_wq ||
               !system_power_efficient_wq ||
-              !system_freezable_power_efficient_wq);
+              !system_freezable_power_efficient_wq ||
+              !system_mem_wq);

        wq_watchdog_init();

diff --git a/mm/swap.c b/mm/swap.c
index 09fe5e9..eecf98a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -685,7 +685,7 @@ void lru_add_drain_all(void)
                    pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
                    need_activate_page_drain(cpu)) {
                        INIT_WORK(work, lru_add_drain_per_cpu);
-                       schedule_work_on(cpu, work);
+                       queue_work_on(cpu, system_mem_wq, work);
                        cpumask_set_cpu(cpu, &has_work);
                }
        }
--


