[PATCH] nvmet: allow associating port to a cgroup via configfs

Ofir Gal ofir.gal at volumez.com
Thu Jul 6 06:28:18 PDT 2023


On 04/07/2023 10:54, Sagi Grimberg wrote:
>>>> A full blown nvmet cgroup controller may be a complete solution, but it
>>>> may take some time to achieve,
>>>
>>> I don't see any other sane solution here.
>>>
>>> Maybe Tejun/others think differently here?
>>
>> I'm not necessarily against the idea of enabling subsystems to assign
>> cgroup membership to entities which aren't processes or threads. It
>> does make sense for cases where a kernel subsystem is serving multiple
>> classes of users which aren't processes as here and it's likely that
>> we'd need something similar for certain memory regions in a limited
>> way (e.g. tmpfs chunk shared across multiple cgroups).
> 
> That makes sense.
> 
> From the nvme target side, the prime use-case is I/O, which can be
> against bdev backends, file backends or passthru nvme devices.
> 
> What we'd want is for something that is agnostic to the backend type
> hence my comment that the only sane solution would be to introduce a
> nvmet cgroup controller.
> 
> I also asked the question of what is the use-case here? because the
> "users" are remote nvme hosts accessing nvmet, there is no direct
> mapping between a nvme namespace (backed by say a bdev) to a host, only
> indirect mapping via a subsystem over a port (which is kinda-sorta
> similar to a SCSI I_T Nexus). Implementing I/O service-levels
> enforcement with blkcg seems like the wrong place to me.

We have a server with a bdev that reaches 15K IOPS, but once it serves
more than 10K IOPS its latency doubles. We want to limit this bdev to
10K IOPS so the clients keep seeing low latency.

For the sake of simplicity, the server exposes the bdev via the nvme
target under a single subsystem, with a single namespace, through a
single port.

Two clients are connected to the server; they aren't aware of each
other and each submits an unknown amount of IOPS.
Without a limit on the server side we have to restrict each client to
half of the bdev's capability, which in our example means 5K IOPS per
client.

That split is non-optimal: sometimes client #1 is idle while client #2
needs to submit 10K IOPS.
With a server-side limit, both clients can submit up to 10K IOPS
combined, at the cost of being exposed to the "noisy-neighbor" effect.
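
To make the goal concrete, the cap we are after maps naturally onto the
cgroup v2 io controller, roughly as in the sketch below (the cgroup
directory and the "259:0" device numbers are made up for the example).
The catch is that such a limit only affects I/O charged to that cgroup,
and today nvmet's bios are not charged to any non-root cgroup.

/*
 * Sketch only: cap the bdev at 10K read/write IOPS through the cgroup
 * v2 io controller.  The cgroup directory and the "259:0" device
 * numbers are made-up examples.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int set_bdev_iops_limit(const char *cgroup_dir, const char *dev,
			       unsigned int iops)
{
	char path[256];
	int fd, n;

	snprintf(path, sizeof(path), "%s/io.max", cgroup_dir);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	/* io.max line format: "MAJ:MIN riops=<n> wiops=<n>" */
	n = dprintf(fd, "%s riops=%u wiops=%u\n", dev, iops, iops);
	close(fd);
	return n < 0 ? -1 : 0;
}

/* e.g. set_bdev_iops_limit("/sys/fs/cgroup/nvmet", "259:0", 10000); */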

>> That said, because we haven't done this before, we haven't figured out
>> how the API should be like and we definitely want something which can
>> be used in a similar fashion across the board. Also, cgroup does
>> assume that resources are always associated with processes or threads,
>> and making this work with non-task entity would require some
>> generalization there. Maybe the solution is to always have a tying
>> kthread which serves as a proxy for the resource but that seems a bit
>> nasty at least on the first thought.

A general API for associating non-task entities with cgroups sounds great!


On 03/07/2023 13:21, Sagi Grimberg wrote:
> cgroupv2 didn't break anything, this was never an intended feature of
> the linux nvme target, so it couldn't have been broken. Did anyone
> know that people are doing this with nvmet?
> 
> I'm pretty sure others on the list are treating this as a suggested
> new feature for nvmet, and designing this feature as something that
> is only supported for blkdevs is undesirable.

I understand that throttling an nvme target was never an intended
feature. However, it used to be possible to set a global limit on any
bdev, which in practice allowed throttling an nvme target with bdev
backends. With the new cgroup architecture this is no longer possible.
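
To spell out what I mean by a global limit: with the v1 blkio controller
something along these lines worked (a sketch from memory; the path and
the "259:0" device numbers are examples), because a cap written into the
root cgroup's throttle files applied to all I/O on the device, no matter
which task (if any) submitted it.

/*
 * Sketch of the old cgroup v1 approach: write "MAJ:MIN <iops>" into the
 * root blkio cgroup's throttle file.  Path and device numbers are
 * made-up examples.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int blkio_v1_global_write_iops(const char *dev, unsigned int iops)
{
	int fd, n;

	fd = open("/sys/fs/cgroup/blkio/blkio.throttle.write_iops_device",
		  O_WRONLY);
	if (fd < 0)
		return -1;

	/* v1 throttle format: "MAJ:MIN <iops>" */
	n = dprintf(fd, "%s %u\n", dev, iops);
	close(fd);
	return n < 0 ? -1 : 0;
}

/* e.g. blkio_v1_global_write_iops("259:0", 10000); */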

I don't see how this doesn't count as a user-space breakage.
I think this patch can serve as a temporary solution until a general API
is designed and implemented.
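
Roughly, the kind of association the patch subject describes could be
sketched as below. This is a simplified illustration, not the actual
diff: the function names are made up, only the cgroup/blk-cgroup helpers
are existing kernel APIs. The idea is to resolve a cgroup path written
through configfs to that cgroup's io controller css and tag the bios the
bdev backend submits with it, so the regular io controller limits apply.

/*
 * Simplified sketch (not the actual diff).  Function names are
 * illustrative; cgroup_get_from_path(), cgroup_get_e_css() and
 * bio_associate_blkg_from_css() are existing kernel helpers.
 */
#include <linux/bio.h>
#include <linux/blk-cgroup.h>
#include <linux/cgroup.h>
#include <linux/err.h>

/* Resolve e.g. "/nvmet" (relative to the cgroup2 mount) to an io css. */
static struct cgroup_subsys_state *nvmet_resolve_io_css(const char *path)
{
	struct cgroup *cgrp;
	struct cgroup_subsys_state *css;

	cgrp = cgroup_get_from_path(path);
	if (IS_ERR(cgrp))
		return ERR_CAST(cgrp);

	css = cgroup_get_e_css(cgrp, &io_cgrp_subsys);
	cgroup_put(cgrp);
	return css;
}

/* Called by the bdev backend for each bio before submit_bio(). */
static void nvmet_bio_set_cgroup(struct bio *bio,
				 struct cgroup_subsys_state *css)
{
	if (css)
		bio_associate_blkg_from_css(bio, css);
}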




