[PATCH] mm: Add Kcompressd for accelerated memory compression
Nhat Pham
nphamcs at gmail.com
Fri Jun 27 16:21:32 PDT 2025
On Sun, Jun 22, 2025 at 10:16 PM Barry Song <21cnbao at gmail.com> wrote:
>
> Hi Nhat,
>
> On Wed, Jun 18, 2025 at 2:21 AM Nhat Pham <nphamcs at gmail.com> wrote:
> >
> > On Sun, Jun 15, 2025 at 8:41 PM Barry Song <21cnbao at gmail.com> wrote:
> > > >>
> > > >> That seems unnecessary. There is an existing method for asynchronous
> > > >> writeback, and pageout() is naturally fully set up to handle this.
> > > >>
> > > >> IMO the better way to do this is to make zswap_store() (and
> > > >> zram_bio_write()?) asynchronous. Make those functions queue the work
> > > >> and wake the compression daemon, and then have the daemon call
> > > >> folio_end_writeback() / bio_endio() when it's done with it.
> > >
> > > > +1.
> > >
> > >
> > > But,
> > > How could this be possible for zswap? zswap_store() is only a frontend —
> > > we still need its return value to determine whether __swap_writepage()
> > > is required. Waiting for the result of zswap_store() is inherently a
> > > synchronous step.
> >
> > Hmm, I might be misunderstanding either of you, but it sounds like
> > what you're describing here does not contradict what Johannes is
> > proposing?
>
> It seems contradictory: Johannes proposes that zswap could behave like zRAM
> by invoking `folio_end_writeback()` or `bio_endio()`, but that doesn't match
> the actual behavior: a failed `zswap_store()` does not end the writeout in
> `swap_writeout()`; it still falls through to `__swap_writepage()` to complete
> the final steps.
>
> Meanwhile, Qun-wei’s RFC has already explored using `folio_end_writeback()` and
> `bio_endio()` at the end of `__swap_writepage()` for zRAM, though that approach
> also has its own issues.
Hmm OK. I'll let Johannes comment on this then :)
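(For context, the synchronous dependency being described comes from the
current shape of swap_writeout(); roughly, from memory, with error handling
and mTHP accounting elided:

        int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
        {
                /*
                 * The caller needs zswap_store()'s verdict before it can
                 * decide whether the folio still has to go to the backing
                 * device.
                 */
                if (zswap_store(folio)) {
                        folio_unlock(folio);
                        return 0;
                }
                if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
                        folio_mark_dirty(folio);
                        return AOP_WRITEPAGE_ACTIVATE;
                }
                __swap_writepage(folio, swap_plug);
                return 0;
        }

That return-value-driven fall-through is the synchronous step Barry means.)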
>
> >
> > >
> > > My point is that folio_end_writeback() and bio_endio() can only be
> > > called after the entire zswap_store() → __swap_writepage() sequence is
> > > completed. That’s why both are placed in the new kcompressd.
> >
> > Hmm, how about:
> >
> > 1. Inside zswap_store(), we first obtain the obj_cgroup reference,
> > check cgroup and pool limit, and grab a zswap pool reference (in
> > effect, determining the slot allocator and compressor).
> >
> > 2. Next, we try to queue the work to kcompressd, saving the folio and
> > the zswap pool (and whatever else we need for the continuation). If
> > this fails, we can proceed with the old synchronous path.
> >
> > 3. In the kcompressd daemon, we perform the continuation of
> > zswap_store(): compression, slot allocation, storing, zswap's LRU
> > modification, etc. If this fails, we check if the mem_cgroup enables
> > writeback. If it's enabled, we can call __swap_writepage(). Ideally,
> > if writeback is disabled, we should activate the page, but it might
> > not be possible since shrink_folio_list() might already re-add the
> > page to the inactive lru. Maybe some modification of pageout() and
> > shrink_folio_list() can make this work, but I haven't thought too
> > deeply about it :) If it's impossible, we can perform async
> > compression only for cgroups that enable writeback for now. Once we
> > fix zswap's handling of incompressible pages, we can revisit this
> > decision (+ SJ).
> >
> > TLDR: move the work-queueing step forward a bit, into the middle of
> > zswap_store().
> >
> > One benefit of this is we skip pages of cgroups that disable zswap, or
> > when zswap pool is full.
>
> I assume you meant something like the following:
>
> bool try_to_sched_async_zswap_store(folio)
> {
>         get_obj_cgroup_from_folio();
>         if (err) goto xxx;
>         zswap_check_limits();
>         if (err) goto xxx;
>         zswap_pool_current_get();
>         if (err) goto xxx;
>
>         queue_folio_to_kcompressd(folio);
Something like this, yeah. Can queue_folio_to_kcompressd() fail? If
so, we can also try synchronous compression on failure here
(__zswap_store() ?).
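Roughly, at the queueing site (sketch only, reusing the names from your
pseudocode; how the error is reported back is hand-waved):

        if (!queue_folio_to_kcompressd(folio)) {
                /*
                 * Queue full or kcompressd not running: compress inline,
                 * like the current synchronous path. Returning false on
                 * failure lets the caller fall back to __swap_writepage().
                 */
                return __zswap_store(folio) == 0;
        }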
>         return true;
>
> xxx:
>         /* error handling: drop the references taken above */
>         return false;
> }
>
> If this function returns true, it means the compression request has been
> queued to kcompressd. Following that, in kcompressd():
>
> int __zswap_store(folio)
> {
>         for (i = 0; i < nr_pages; i++) {
>                 zswap_store_page();
>                 if (err)
>                         return err;
>         }
>         return 0;
> }
>
> kcompressd()
> {
>         while (folio_queue_is_not_empty) {
>                 folio = dequeue_folio();
>                 if (folio_queued_by_zswap(folio)) {
>                         if (!__zswap_store(folio))
>                                 continue;
>                 }
>                 if ((zswap_store_page_fails && mem_cgroup_zswap_writeback_enabled()) ||
>                     folio_queued_by_zram) {
If !mem_cgroup_zswap_writeback_enabled(), I wonder if we can activate
the page here?
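Something like this, I mean (sketch; "memcg" and the failure flag are
whatever state we end up carrying along with the queued folio, and it
assumes the folio is still isolated and under writeback at this point,
which is exactly the part I'm unsure about):

        if (zswap_store_page_fails && !mem_cgroup_zswap_writeback_enabled(memcg)) {
                /*
                 * Keep the folio in memory rather than writing it out:
                 * mark it active so the next reclaim pass skips it, and
                 * finish the writeback we started.
                 */
                folio_set_active(folio);
                folio_end_writeback(folio);
                continue;
        }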
>                         __swap_writepage();
>                 }
>         }
> }
>
> In the kswapd reclaim path, we will need to do:
>
> int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
> {
>         ...
>         if (try_to_sched_async_zswap_store(folio))
>                 return 0;
>         if (is_sync_comp_blkdev(swap)) {
>                 queue_folio_to_kcompressd(folio);
>                 return 0;
>         }
>         __swap_writepage();
> }
>
> To be honest, I'm not sure if there's a flag that indicates whether the
> folio was queued by zswap or zram. If not, we may need to add a member
I don't think there is.
> associated with folio pointers in the queue between kswapd and kcompressd,
> since we need to identify zswap cases. Maybe we can reuse bit 0 of the
> folio pointer?
>
> What I mean is: while queuing, if the folio is queued by zswap, we do
> `pointer |= BIT(0)`. Then in kcompressd, we restore the original folio
> with `folio = pointer & ~BIT(0)`. It's a bit ugly, but I’m not sure
> there’s a better approach.
I think this approach is fine.
We can also hack struct zswap_entry, but that would require an extra
xarray lookup. OTOH, if we can assume that zram users will not enable
zswap, we might optimize that lookup away? Not sure if it's much
cleaner than just pointer tagging though.
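FWIW the tagging side can stay pretty small (sketch, helper names made
up; it relies on struct folio pointers being at least word-aligned so
bit 0 is free):

        #define KCOMPRESSD_ZSWAP_BIT    BIT(0)

        /* enqueue side: tag the pointer when the request comes from zswap */
        static unsigned long kcompressd_encode(struct folio *folio, bool from_zswap)
        {
                return (unsigned long)folio | (from_zswap ? KCOMPRESSD_ZSWAP_BIT : 0);
        }

        /* dequeue side: recover the folio and who queued it */
        static struct folio *kcompressd_decode(unsigned long val, bool *from_zswap)
        {
                *from_zswap = val & KCOMPRESSD_ZSWAP_BIT;
                return (struct folio *)(val & ~KCOMPRESSD_ZSWAP_BIT);
        }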
>
> Thanks
> Barry