[PATCH 08/18] nvme-tcp: enable TLS handshake upcall

Hannes Reinecke hare at suse.de
Wed Mar 22 02:12:53 PDT 2023


On 3/22/23 09:45, Sagi Grimberg wrote:
> 
>> Select possible PSK identities and call the TLS handshake upcall
>> for each identity.
>> The TLS 1.3 RFC allows to send multiple identities with each ClientHello
>> request, but none of the SSL libraries implement it. As the connection
>> is established when the association is created we send only a single
>> identity for each upcall, and close the connection to restart with
>> the next identity if the handshake fails.
> 
> Can't this loop be done in userspace? In other words, how can
> we get rid of this when SSL libs would decide to support it?
> 
Well, that is something I've been thinking about, but I really haven't 
arrived at a good solution.

The crux of the matter is that we have to close the connection after a 
failed TLS handshake:

 > A TLS 1.3 client implementation that only supports sending a single
 > PSK identity during connection setup may be required to connect
 > multiple times in order to negotiate cipher suites with different hash
 > functions.

and it's quite unclear what state the connection is in after the 
userspace library has failed the handshake.
So the only good way to recover is to close the connection and restart 
with a different identity.

While we can move the identity selection to userspace (e.g. by providing 
a 'tls_psk' fabrics option holding the key serial of the PSK), that 
would allow us to pass only a _single_ PSK for each attempt.

And the other problem is that in its current form the spec allows for 
_different_ identities for each connection; by passing a single key from 
userspace we would not be able to support that.
(Not saying that it's useful, mind.)

We could allow for several 'tls_psk' options, though; maybe that would 
be a way out.
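To make the retry policy concrete, here is a rough standalone sketch 
(not kernel code; the function and type names are made up for 
illustration) of the "close and restart with the next identity" loop 
described above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handshake callback: returns 0 on success, negative on
 * failure.  Stand-in for the real netlink handshake upcall. */
typedef int (*handshake_fn)(int psk_serial);

/*
 * Try each candidate PSK identity in turn.  After every failed
 * handshake the connection has to be torn down and re-established,
 * because the state of the socket after a failed TLS handshake is
 * undefined.  Returns the serial that succeeded, or -1 if none did.
 */
static int try_psk_identities(const int *serials, size_t num,
                              handshake_fn handshake)
{
	for (size_t i = 0; i < num; i++) {
		/* (re)connecting the socket would happen here */
		if (handshake(serials[i]) == 0)
			return serials[i];
		/* close the connection, loop to the next identity */
	}
	return -1;
}
```
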

>>
>> Signed-off-by: Hannes Reinecke <hare at suse.de>
>> ---
>>   drivers/nvme/host/tcp.c | 157 +++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 148 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index 0438d42f4179..bcf24e9a08e1 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -8,9 +8,12 @@
>>   #include <linux/init.h>
>>   #include <linux/slab.h>
>>   #include <linux/err.h>
>> +#include <linux/key.h>
>>   #include <linux/nvme-tcp.h>
>> +#include <linux/nvme-keyring.h>
>>   #include <net/sock.h>
>>   #include <net/tcp.h>
>> +#include <net/handshake.h>
>>   #include <linux/blk-mq.h>
>>   #include <crypto/hash.h>
>>   #include <net/busy_poll.h>
>> @@ -31,6 +34,14 @@ static int so_priority;
>>   module_param(so_priority, int, 0644);
>>   MODULE_PARM_DESC(so_priority, "nvme tcp socket optimize priority");
>> +/*
>> + * TLS handshake timeout
>> + */
>> +static int tls_handshake_timeout = 10;
>> +module_param(tls_handshake_timeout, int, 0644);
>> +MODULE_PARM_DESC(tls_handshake_timeout,
>> +         "nvme TLS handshake timeout in seconds (default 10)");
> 
> Can you share what is the normal time of an upcall?
> 
That really depends on the network latency and/or reachability of the 
server. It might just have been started up, the switches' MAC tables 
might not be updated yet, STP might still be ongoing, what do I know.
So 10 seconds seemed to be a good compromise.
But that's also why I made it configurable.

>> +
>>   #ifdef CONFIG_DEBUG_LOCK_ALLOC
>>   /* lockdep can detect a circular dependency of the form
>>    *   sk_lock -> mmap_lock (page fault) -> fs locks -> sk_lock
>> @@ -104,6 +115,7 @@ enum nvme_tcp_queue_flags {
>>       NVME_TCP_Q_ALLOCATED    = 0,
>>       NVME_TCP_Q_LIVE        = 1,
>>       NVME_TCP_Q_POLLING    = 2,
>> +    NVME_TCP_Q_TLS        = 3,
>>   };
>>   enum nvme_tcp_recv_state {
>> @@ -148,6 +160,9 @@ struct nvme_tcp_queue {
>>       __le32            exp_ddgst;
>>       __le32            recv_ddgst;
>> +    struct completion       *tls_complete;
>> +    int                     tls_err;
>> +
>>       struct page_frag_cache    pf_cache;
>>       void (*state_change)(struct sock *);
>> @@ -1505,7 +1520,102 @@ static void nvme_tcp_set_queue_io_cpu(struct nvme_tcp_queue *queue)
>>       queue->io_cpu = cpumask_next_wrap(n - 1, cpu_online_mask, -1, false);
>>   }
>> -static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>> +/*
>> + * nvme_tcp_lookup_psk - Look up PSKs to use for TLS
>> + *
>> + */
>> +static int nvme_tcp_lookup_psks(struct nvme_ctrl *nctrl,
>> +                   key_serial_t *keylist, int num_keys)
> 
> Where is num_keys used?
> 
Ah, indeed, I need to check this in the loop.

>> +{
>> +    enum nvme_tcp_tls_cipher cipher = NVME_TCP_TLS_CIPHER_SHA384;
>> +    struct key *tls_key;
>> +    int num = 0;
>> +    bool generated = false;
>> +
>> +    /* Check for pre-provisioned keys; retained keys first */
>> +    do {
>> +        tls_key = nvme_tls_psk_lookup(NULL, nctrl->opts->host->nqn,
>> +                          nctrl->opts->subsysnqn,
>> +                          cipher, generated);
>> +        if (!IS_ERR(tls_key)) {
>> +            keylist[num] = tls_key->serial;
>> +            num++;
>> +            key_put(tls_key);
>> +        }
>> +        if (cipher == NVME_TCP_TLS_CIPHER_SHA384)
>> +            cipher = NVME_TCP_TLS_CIPHER_SHA256;
>> +        else {
>> +            if (generated)
>> +                cipher = NVME_TCP_TLS_CIPHER_INVALID;
>> +            else {
>> +                cipher = NVME_TCP_TLS_CIPHER_SHA384;
>> +                generated = true;
>> +            }
>> +        }
>> +    } while(cipher != NVME_TCP_TLS_CIPHER_INVALID);
> 
> I'm unclear about a few things here:
> 1. what is the meaning of pre-provisioned vs. retained vs. generated?
> 2. Can this loop be reorganized in a nested for loop with a break?
>     I'm wandering if it will make it simpler to read.
> 
'pre-provisioned' means that the admin has stored the keys in the 
keyring prior to calling 'nvme connect'.
'generated' means a key derived from the key material generated by a 
previous DH-HMAC-CHAP transaction.
As for the loop: I am going back and forth between having a loop
(which is executed exactly four times) and unrolling it into four 
distinct calls to nvme_tls_psk_lookup().
It probably doesn't matter for the actual assembler code (as the 
compiler will be doing a loop unroll anyway), but the unrolled code 
would allow for better documentation. The code might be slightly 
longer, though, with lots of repetition.
So really, I don't know which is best.
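For comparison, the nested-for-loop shape of the selection could look 
roughly like this (a standalone sketch, not the real code; the enum and 
the lookup callback are stand-ins for nvme_tls_psk_lookup() and its 
cipher argument). It keeps the order retained-before-generated and 
SHA-384-before-SHA-256, and also honours the num_keys bound:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tls_cipher { CIPHER_SHA384, CIPHER_SHA256 };

/* Hypothetical lookup: returns a key serial > 0 if a PSK for this
 * cipher/class exists in the keyring, 0 otherwise. */
typedef int (*psk_lookup_fn)(enum tls_cipher cipher, bool generated);

/*
 * Nested-for-loop variant of the four-way selection: retained
 * (pre-provisioned) keys before generated ones, and within each class
 * SHA-384 before SHA-256.  Fills keylist with at most num_keys
 * entries and returns the number found.
 */
static int lookup_psks(int *keylist, int num_keys, psk_lookup_fn lookup)
{
	static const bool classes[] = { false, true }; /* retained, generated */
	static const enum tls_cipher ciphers[] = {
		CIPHER_SHA384, CIPHER_SHA256
	};
	int num = 0;

	for (size_t g = 0; g < 2; g++) {
		for (size_t c = 0; c < 2; c++) {
			int serial;

			if (num >= num_keys)
				return num;
			serial = lookup(ciphers[c], classes[g]);
			if (serial > 0)
				keylist[num++] = serial;
		}
	}
	return num;
}
```
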

>> +    return num;
>> +}
>> +
>> +static void nvme_tcp_tls_done(void *data, int status, key_serial_t peerid)
>> +{
>> +    struct nvme_tcp_queue *queue = data;
>> +    struct nvme_tcp_ctrl *ctrl = queue->ctrl;
>> +    int qid = nvme_tcp_queue_id(queue);
>> +
>> +    dev_dbg(ctrl->ctrl.device, "queue %d: TLS handshake done, key %x, status %d\n",
>> +        qid, peerid, status);
>> +
>> +    queue->tls_err = -status;
>> +    if (queue->tls_complete)
>> +        complete(queue->tls_complete);
>> +}
>> +
>> +static int nvme_tcp_start_tls(struct nvme_ctrl *nctrl,
>> +                  struct nvme_tcp_queue *queue,
>> +                  key_serial_t peerid)
>> +{
>> +    int qid = nvme_tcp_queue_id(queue);
>> +    int ret;
>> +    struct tls_handshake_args args;
>> +    unsigned long tmo = tls_handshake_timeout * HZ;
>> +    DECLARE_COMPLETION_ONSTACK(tls_complete);
>> +
>> +    dev_dbg(nctrl->device, "queue %d: start TLS with key %x\n",
>> +        qid, peerid);
>> +    args.ta_sock = queue->sock;
>> +    args.ta_done = nvme_tcp_tls_done;
>> +    args.ta_data = queue;
>> +    args.ta_my_peerids[0] = peerid;
>> +    args.ta_num_peerids = 1;
>> +    args.ta_keyring = nvme_keyring_id();
>> +    args.ta_timeout_ms = tls_handshake_timeout * 2 * 1000;
>> +    queue->tls_err = -EOPNOTSUPP;
>> +    queue->tls_complete = &tls_complete;
>> +    ret = tls_client_hello_psk(&args, GFP_KERNEL);
>> +    if (ret) {
>> +        dev_dbg(nctrl->device, "queue %d: failed to start TLS: %d\n",
>> +            qid, ret);
>> +        return ret;
>> +    }
>> +    if (wait_for_completion_timeout(queue->tls_complete, tmo) == 0) {
>> +        dev_dbg(nctrl->device,
>> +            "queue %d: TLS handshake timeout\n", qid);
>> +        queue->tls_complete = NULL;
>> +        ret = -ETIMEDOUT;
>> +    } else {
>> +        dev_dbg(nctrl->device,
>> +            "queue %d: TLS handshake complete, error %d\n",
>> +            qid, queue->tls_err);
>> +        ret = queue->tls_err;
>> +    }
>> +    queue->tls_complete = NULL;
>> +    if (!ret)
>> +        set_bit(NVME_TCP_Q_TLS, &queue->flags);
>> +    return ret;
>> +}
>> +
>> +static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid,
>> +                key_serial_t peerid)
>>   {
>>       struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>>       struct nvme_tcp_queue *queue = &ctrl->queues[qid];
>> @@ -1628,6 +1738,13 @@ static int nvme_tcp_alloc_queue(struct nvme_ctrl *nctrl, int qid)
>>           goto err_rcv_pdu;
>>       }
>> +    /* If PSKs are configured try to start TLS */
>> +    if (peerid) {
> 
> Where is peerid being initialized? Not to mention that peerid is
> a rather cryptic name (at least to me). Is this the ClientHello
> identity?
> 
'peerid' is the term used in the netlink handshake protocol.
It actually is the key serial number of the PSK to use.
Maybe 'psk_id' would be more appropriate here.

>> +        ret = nvme_tcp_start_tls(nctrl, queue, peerid);
>> +        if (ret)
>> +            goto err_init_connect;
>> +    }
>> +
>>       ret = nvme_tcp_init_connection(queue);
>>       if (ret)
>>           goto err_init_connect;
>> @@ -1774,11 +1891,22 @@ static int nvme_tcp_start_io_queues(struct nvme_ctrl *ctrl,
>>   static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>>   {
>> -    int ret;
>> +    int ret = -EINVAL, num_keys, k;
>> +    key_serial_t keylist[4];
>> -    ret = nvme_tcp_alloc_queue(ctrl, 0);
>> -    if (ret)
>> -        return ret;
>> +    memset(keylist, 0, sizeof(key_serial_t));
>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>> +    for (k = 0; k < num_keys; k++) {
>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, keylist[k]);
>> +        if (!ret)
>> +            break;
>> +    }
>> +    if (ret) {
>> +        /* Try without TLS */
> 
> Why? this is trying to always connect with tls and fallback to no-tls?
> Why not simply do what userspace is asking us to do?
> 
> Seems backwards to me. Unless there is a statement in the spec
> that I'm not aware of which tells us to do so.
> 
This is an implication of the chosen method of selecting the PSK in 
kernel code.
If we moved PSK selection to userspace we clearly wouldn't need this.
But moving PSK selection to userspace requires an updated nvme-cli, for 
a) selecting the PSK from the keystore and b) passing in the new option.
So for development it was easier to run with the in-kernel selection, 
as I don't need to modify nvme-cli.
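Spelled out, the policy the patch implements for the admin queue is 
roughly the following (standalone sketch, not the real code; 
alloc_queue is a stand-in for nvme_tcp_alloc_queue(), and a psk serial 
of 0 means "no TLS"):

```c
#include <assert.h>
#include <errno.h>

/*
 * Try every discovered PSK in turn; only when all TLS attempts have
 * failed (or no PSKs were found at all) retry once without TLS.
 * Returns 0 on success or the last error.
 */
static int connect_admin_queue(const int *keylist, int num_keys,
                               int (*alloc_queue)(int qid, int psk_serial))
{
	int ret = -EINVAL;
	int k;

	for (k = 0; k < num_keys; k++) {
		ret = alloc_queue(0, keylist[k]);
		if (!ret)
			return 0;
	}
	/* all TLS attempts failed: fall back to a cleartext connection */
	return alloc_queue(0, 0);
}
```
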

>> +        ret = nvme_tcp_alloc_queue(ctrl, 0, 0);
>> +        if (ret)
>> +            goto out_free_queue;
>> +    }
>>       ret = nvme_tcp_alloc_async_req(to_tcp_ctrl(ctrl));
>>       if (ret)
>> @@ -1793,12 +1921,23 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl)
>>   static int __nvme_tcp_alloc_io_queues(struct nvme_ctrl *ctrl)
>>   {
>> -    int i, ret;
>> +    int i, ret, num_keys = 0, k;
>> +    key_serial_t keylist[4];
>> +    memset(keylist, 0, sizeof(key_serial_t));
>> +    num_keys = nvme_tcp_lookup_psks(ctrl, keylist, 4);
>>       for (i = 1; i < ctrl->queue_count; i++) {
>> -        ret = nvme_tcp_alloc_queue(ctrl, i);
>> -        if (ret)
>> -            goto out_free_queues;
>> +        ret = -EINVAL;
>> +        for (k = 0; k < num_keys; k++) {
>> +            ret = nvme_tcp_alloc_queue(ctrl, i, keylist[k]);
>> +            if (!ret)
>> +                break;
> 
> What is going on here. are you establishing queue_count x num_keys nvme 
> queues?
> 
No, I am _trying_ to establish a connection, breaking out if the attempt
_succeeded_.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman



