[PATCH] nvmet: allow associating port to a cgroup via configfs

Sagi Grimberg sagi at grimberg.me
Mon Jul 3 06:21:40 PDT 2023


> Hey Sagi and Chaitanya,
> 
> On 28/06/2023 5:33, Chaitanya Kulkarni wrote:
>> On 6/27/23 14:13, Sagi Grimberg wrote:
>>> Hey Ofir,
>>>
>>>> Currently there is no way to throttle nvme targets with cgroup v2.
>>>
>>> How do you do it with v1?
> With v1 I would add a blkio rule at the cgroup root level. The bios
> that the nvme target submits aren't associated with a specific cgroup,
> which makes them follow the rules of the root-level cgroup.
> 
> V2 doesn't allow setting rules at the root level by design.
> 
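Concretely, the difference being described looks something like this
(device numbers are made up):

```shell
# cgroup v1: a throttle rule at the root of the blkio hierarchy applies to
# every bio, including ones submitted by kernel threads
echo "259:0 1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

# cgroup v2: io.max deliberately does not exist in the root cgroup, so I/O
# that is not associated with any child cgroup cannot be throttled
cat /sys/fs/cgroup/io.max    # fails: No such file or directory
```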
>>>> The IOs that the nvme target submits lack association with a cgroup,
>>>> which makes them act as if they were in the root cgroup. The root
>>>> cgroup can't be throttled with the cgroup v2 mechanism.
>>>
>>> What happens to file or passthru backends? You paid attention just to
>>> bdev. I don't see how this is sanely supported with files. It's possible
>>> if you convert nvmet to use its own dedicated kthreads and infer the
>>> cg from the kthread. That presents a whole other set of issues.
>>>
>>
>> if we are doing it for one back-end we cannot leave other back-ends out ...
>>
>>> Maybe the cleanest way to implement something like this is to implement
>>> a full blown nvmet cgroup controller that you can apply a whole set of
>>> resources to, in addition to I/O.
> 
> Throttling files and passthru isn't possible with cgroup v1 either;
> cgroup v2 broke the ability to throttle bdevs. The purpose of the patch
> is to re-enable the broken functionality.

cgroup v2 didn't break anything; this was never an intended feature of
the Linux nvme target, so it couldn't have been broken. Did anyone
know that people were doing this with nvmet?

I'm pretty sure others on the list are treating this as a suggested
new feature for nvmet, and designing this feature as something that
is only supported for blkdevs is undesirable.


> There was an attempt to re-enable the functionality by allowing I/O
> throttling on the root cgroup, but it's against the cgroup v2 design.
> Reference:
> https://lore.kernel.org/r/20220114093000.3323470-1-yukuai3@huawei.com/
> 
> A full blown nvmet cgroup controller may be a complete solution, but it
> may take some time to achieve,

I don't see any other sane solution here.

Maybe Tejun/others think differently here?

> while the feature is still broken.

Again, this is not a breakage.

> 
>>>
>>>> Signed-off-by: Ofir Gal <ofir.gal at volumez.com>
>>>> ---
>>>>    drivers/nvme/target/configfs.c    | 77 +++++++++++++++++++++++++++++++
>>>>    drivers/nvme/target/core.c        |  3 ++
>>>>    drivers/nvme/target/io-cmd-bdev.c | 13 ++++++
>>>>    drivers/nvme/target/nvmet.h       |  3 ++
>>>>    include/linux/cgroup.h            |  5 ++
>>>>    kernel/cgroup/cgroup-internal.h   |  5 --
>>>
>>> Don't mix cgroup and nvmet changes in the same patch.
> 
> Thanks for clarifying, I wasn't sure if it was necessary. I will split
> the patch for v2.
> 
>>>
>>>>    6 files changed, 101 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/nvme/target/configfs.c
>>>> b/drivers/nvme/target/configfs.c
>>>> index 907143870da5..2e8f93a07498 100644
>>>> --- a/drivers/nvme/target/configfs.c
>>>> +++ b/drivers/nvme/target/configfs.c
>>>> @@ -12,6 +12,7 @@
>>>>    #include <linux/ctype.h>
>>>>    #include <linux/pci.h>
>>>>    #include <linux/pci-p2pdma.h>
>>>> +#include <linux/cgroup.h>
>>>>    #ifdef CONFIG_NVME_TARGET_AUTH
>>>>    #include <linux/nvme-auth.h>
>>>>    #endif
>>>> @@ -281,6 +282,81 @@ static ssize_t
>>>> nvmet_param_pi_enable_store(struct config_item *item,
>>>>    CONFIGFS_ATTR(nvmet_, param_pi_enable);
>>>>    #endif
>>>>    +static ssize_t nvmet_param_associated_cgroup_show(struct
>>>> config_item *item,
>>>> +        char *page)
>>>> +{
>>>> +    struct nvmet_port *port = to_nvmet_port(item);
>>>> +    ssize_t len = 0;
>>>> +    ssize_t retval;
>>>> +    char *suffix;
>>>> +
>>>> +    /* No cgroup has been set means the IOs are associated with the
>>>> root cgroup */
>>>> +    if (!port->cgrp)
>>>> +        goto root_cgroup;
>>>> +
>>>> +    retval = cgroup_path_ns(port->cgrp, page, PAGE_SIZE,
>>>> +                         current->nsproxy->cgroup_ns);
>>>> +    if (retval >= PATH_MAX || retval >= PAGE_SIZE)
>>>> +        return -ENAMETOOLONG;
>>>> +
>>>> +    /* No cgroup found means the IOs are associated with the root
>>>> cgroup */
>>>> +    if (retval < 0)
>>>> +        goto root_cgroup;
>>>> +
>>>> +    len += retval;
>>>> +
>>>> +    suffix = cgroup_is_dead(port->cgrp) ? " (deleted)\n" : "\n";
>>>> +    len += snprintf(page + len, PAGE_SIZE - len, suffix);
>>>> +
>>>> +    return len;
>>>> +
>>>> +root_cgroup:
>>>> +    return snprintf(page, PAGE_SIZE, "/\n");
>>>> +}
>>>> +
>>>> +static ssize_t nvmet_param_associated_cgroup_store(struct
>>>> config_item *item,
>>>> +        const char *page, size_t count)
>>>> +{
>>>> +    struct nvmet_port *port = to_nvmet_port(item);
>>>> +    struct cgroup_subsys_state *blkcg;
>>>> +    ssize_t retval = -EINVAL;
>>>> +    struct cgroup *cgrp;
>>>> +    char *path;
>>>> +    int len;
>>>> +
>>>> +    len = strcspn(page, "\n");
>>>> +    if (!len)
>>>> +        return -EINVAL;
>>>> +
>>>> +    path = kmemdup_nul(page, len, GFP_KERNEL);
>>>> +    if (!path)
>>>> +        return -ENOMEM;
>>>> +
>>>> +    cgrp = cgroup_get_from_path(path);
>>>> +    kfree(path);
>>>> +    if (IS_ERR(cgrp))
>>>> +        return -ENOENT;
>>>> +
>>>> +    blkcg = cgroup_get_e_css(cgrp, &io_cgrp_subsys);
>>>> +    if (!blkcg)
>>>> +        goto out_put_cgroup;
>>>> +
>>>> +    /* Put old cgroup */
>>>> +    if (port->cgrp)
>>>> +        cgroup_put(port->cgrp);
>>>> +
>>>> +    port->cgrp = cgrp;
>>>> +    port->blkcg = blkcg;
>>>> +
>>>> +    return count;
>>>> +
>>>> +out_put_cgroup:
>>>> +    cgroup_put(cgrp);
>>>> +    return retval;
>>>> +}
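For reference, the proposed attribute would presumably be driven like
this through nvmet's configfs tree (the port number and cgroup path here
are made up):

```shell
# associate port 1 with an existing cgroup, then read the association back;
# per the patch, an unset or stale association reads back as "/"
echo "/app-a" > /sys/kernel/config/nvmet/ports/1/param_associated_cgroup
cat /sys/kernel/config/nvmet/ports/1/param_associated_cgroup
```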
>>>
>>> I'm not at all convinced that nvmet ratelimiting does not
>>> require a dedicated cgroup controller... Regardless, this doesn't
>>> look like a port attribute, it's a subsystem attribute.
>>
>> +1 here, can you please explain the choice of port ?
> 
> In cgroup, threads/processes are associated with a specific control group.
> Each control group may have different rules to throttle various devices.
> For example, we may have 2 applications both using the same bdev.
> By associating the apps with different cgroups, we can create a different
> throttling rule for each app.
> Throttling is done by echoing "MAJOR:MINOR rbps=X wiops=Y" to the
> "io.max" file of the cgroup.
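To make the quoted interface concrete, a minimal cgroup v2 setup would
look like this (the cgroup name and device numbers are made up):

```shell
# create a control group and cap one bdev for everything associated with it:
# reads at 1 MiB/s, writes at 100 IOPS
mkdir /sys/fs/cgroup/app-a
echo "259:0 rbps=1048576 wiops=100" > /sys/fs/cgroup/app-a/io.max
```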
> 
> Associating a subsystem with a cgroup will only allow us to create a
> single rule for each namespace (bdev) in this subsystem.
> When associating the nvme port with the cgroup, the port acts as the
> "thread" that handles the IO for the target, which aligns with the
> cgroup design.

The port does not own the thread, though; rdma/loop inherit the
"thread" from something entirely different, and tcp uses the same
worker threads for all ports (a global workqueue).

IIUC, a cgroup would be associated with a specific host, and the entity
that is host aware is a subsystem, not a port.

> Regardless if the attribute is part of the port or the subsystem, the
> user needs to specify constraints per namespace. I see no clear value in
> setting the cgroup attribute on the subsystem.

Maybe it would be worth explaining what you are trying to achieve here,
because my understanding is that you want to allow different hosts to
get different I/O service levels. The right place to do this is the
subsystem AFAICT.

> On the other hand, by associating a port with a cgroup we could have
> multiple constraints per namespace. It will allow the user to have more
> control over the behavior of the system.

Can you please clarify what you mean by "multiple constraints per
namespace"?

> For example, a system with an RDMA port and a TCP port that are connected
> to the same subsystem can apply different limits for each port.
> 
> This could be complementary to NVMe ANA, for example, where the target
> could apply different constraints for optimized and non-optimized paths.

I don't understand what you are suggesting here. nvme already has
asymmetry semantics between controllers.

I think you are mixing two different things. There is asymmetric access
in nvmet (which could translate to tcp vs. rdma or to something else)
and there is providing different I/O service levels to different hosts.
I don't see how these two are connected.



More information about the Linux-nvme mailing list