[RFC] mipi-i3c-hci: Support for DMA Ring Pipelining / High-throughput Streaming

Wed Jun 3 11:23:37 PDT 2026

On Wed, Jun 03, 2026 at 12:43:40AM -0700, Sam Agazaryan wrote:
> On Sat, May 23, 2026 at 7:48 AM Frank Li <Frank.li at nxp.com> wrote:
> >
> > On Fri, May 22, 2026 at 03:37:35PM -0700, Sam Agazaryan wrote:
> > > Hello all,
> > >
> > > I am working on a project using the mipi-i3c-hci driver that involves
> > > large packet bursts (exceeding the physical hardware DMA ring size,
> > > such as large MCTP-over-I3C payloads).
> >
> > How large is the transfer that it exceeds the DMA ring size? Can you
> > increase the ring size?
> >
> > Also, a large transfer will defer IBI handling for a while, because
> > IBIs are only checked at each START phase.
> >
>
> Right now we are working with the DMA ring maxed out at 255 entries.
> The MIPI I3C HCI controller I'm working with allows up to 128 bytes per
> transaction due to hardware limitations of the SoC.
>
> To my understanding, with the current driver implementation, if I send
> down 255 entries, TOC gets set on the final (255th) transaction, and
> according to the MIPI HCI standard v1.2, section 6.12, this means that
> the STOP signal will be sent on the bus after that final transaction. My
> design proposes a change to this:
>
> For transfers with no more descriptors than the maximum ring size, the
> driver behaves exactly as it does today.
>
> For transfers with more descriptors than the ring size (say the ring
> size is 255 and we send 500 descriptors down to the driver), we set
> TOC = 1 on the 127th descriptor and on the last descriptor in the
> buffer. This lets us wake up the thread responsible for sending data
> down the bus so that it can queue up another 127

Assume 128 entries are left and each entry carries only 1 byte. That is
about 128 * (10 + 8) = 2304 SCL cycles.

If SCL is 12 MHz, you only have 192 us to fill in new descriptors. The time
to wake up a thread may be longer than that.

Frank
> descriptors and then go back to waiting for completion, until the 127
> descriptors the DMA controller is currently handling are done. This is
> repeated until all 500 descriptors are transferred. In theory, we should
> still have relatively good IBI responsiveness compared to the current
> state of the driver.
>
> > >
> > > I noticed that the current driver implementation treats these as
> > > discrete batches.
> > >
> > > I am considering implementing a ring pipelining or DMA streaming
> > > mechanism to allow for asynchronous refills while the ring is running.
> > > This would leverage the standard ENQ_PTR doorbell mechanism (per MIPI
> > > HCI v1.2, Section 6.8.2) to continuously feed the hardware. I figured
> > > it may be worthwhile to see how the upstream community feels about
> > > this feature.
> > >
> > > Before I dive into the implementation for upstream, I wanted to check:
> > > 1. Is there any existing work or a roadmap for DMA
> > > streaming/pipelining in the HCI driver?
> > > 2. Is a generic dma streaming mechanism for large transfers something
> > > you would be interested in seeing as a contribution to the mainline
> > > driver?
> >
> > USB, network, and storage are async. The I3C framework is sync.
> >
>
> I think I made this sound like an asynchronous operation where we
> return before finishing all
> transactions; the driver would still be synchronous in this case.
>
> > >
> > > Currently, my proof-of-concept handles the pipelining at the core
> > > transfer level, but I
> >
> > Can you send your patch as RFC to check what you already did?
> >
>
> Yes, below is an implementation of the streaming logic in core.c.
>
> I initially considered moving this into dma.c to support multiple Ring
> Bundles. However, because dma.c expects pre-translated hci_xfer structs,
> moving the logic there would require core.c to allocate the entire
> transfer array upfront (e.g., a massive kmalloc for 500+ structs, or
> however big the transfers end up being). This leads to heap exhaustion
> and "Sleep while Atomic" crashes if an IBI triggers a readback
> concurrently. By keeping the sliding window in core.c, we can reuse a
> single, pre-allocated 255-entry hci_xfer array.
>
>
> Signed-off-by: Sam Agazaryan <samagazaryan at google.com>
> ---
> drivers/i3c/master/mipi-i3c-hci/core.c | 133 +++++++++++++++++++------
> drivers/i3c/master/mipi-i3c-hci/dma.c | 1 -
> drivers/i3c/master/mipi-i3c-hci/hci.h | 6 ++
> 3 files changed, 106 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/i3c/master/mipi-i3c-hci/core.c
> b/drivers/i3c/master/mipi-i3c-hci/core.c
> index b781dbed2165..77e04585150b 100644
> --- a/drivers/i3c/master/mipi-i3c-hci/core.c
> +++ b/drivers/i3c/master/mipi-i3c-hci/core.c
> @@ -290,9 +290,12 @@ static int i3c_hci_send_ccc_cmd(struct
> i3c_master_controller *m,
> dev_dbg(&hci->master.dev, "cmd=%#x rnw=%d ndests=%d data[0].len=%d",
> ccc->id, ccc->rnw, ccc->ndests, ccc->dests[0].payload.len);
> + mutex_lock(&hci->xfer_lock);
> xfer = hci_alloc_xfer(nxfers);
> - if (!xfer)
> + if (!xfer) {
> + mutex_unlock(&hci->xfer_lock);
> return -ENOMEM;
> + }
> if (prefixed) {
> xfer->data = NULL;
> @@ -346,6 +349,7 @@ static int i3c_hci_send_ccc_cmd(struct
> i3c_master_controller *m,
> ccc->dests[0].payload.len, ccc->dests[0].payload.data);
> out:
> + mutex_unlock(&hci->xfer_lock);
> hci_free_xfer(xfer, nxfers);
> return ret;
> }
> @@ -363,53 +367,100 @@ static int i3c_hci_i3c_xfers(struct i3c_dev_desc *dev,
> {
> struct i3c_master_controller *m = i3c_dev_get_master(dev);
> struct i3c_hci *hci = to_i3c_hci(m);
> - struct hci_xfer *xfer;
> - DECLARE_COMPLETION_ONSTACK(done);
> + struct hci_xfer *xfer = hci->xfer_ring;
> unsigned int size_limit;
> - int i, last, ret = 0;
> + int i, ret = 0;
> + int processed = 0;
> + int queued = 0;
> + int ring_size = XFER_RING_ENTRIES;
> + int chunk_limit = (ring_size - 1) / 2;
> + int sw_ring_size = chunk_limit * 2;
> dev_dbg(&hci->master.dev, "nxfers = %d", nxfers);
> - xfer = hci_alloc_xfer(nxfers);
> if (!xfer)
> - return -ENOMEM;
> + return -ENODEV;
> +
> + mutex_lock(&hci->xfer_lock);
> + reinit_completion(&hci->xfer_done);
> size_limit = 1U << (16 + FIELD_GET(HC_CAP_MAX_DATA_LENGTH, hci->caps));
> - for (i = 0; i < nxfers; i++) {
> - xfer[i].data_len = i3c_xfers[i].len;
> - ret = -EFBIG;
> - if (xfer[i].data_len >= size_limit)
> - goto out;
> - xfer[i].rnw = i3c_xfers[i].rnw;
> - if (i3c_xfers[i].rnw) {
> - xfer[i].data = i3c_xfers[i].data.in;
> - } else {
> - /* silence the const qualifier warning with a cast */
> - xfer[i].data = (void *) i3c_xfers[i].data.out;
> + /* 1. Prime the ring */
> + while (queued < nxfers && (queued - processed) < sw_ring_size) {
> + int n = min(nxfers - queued, chunk_limit);
> + struct hci_xfer *chunk_start = xfer + (queued % sw_ring_size);
> +
> + for (i = 0; i < n; i++) {
> + int idx = queued + i;
> + chunk_start[i].data_len = i3c_xfers[idx].len;
> + chunk_start[i].rnw = i3c_xfers[idx].rnw;
> + chunk_start[i].data = i3c_xfers[idx].rnw ?
> + i3c_xfers[idx].data.in :
> + (void *)i3c_xfers[idx].data.out;
> + hci->cmd->prep_i3c_xfer(hci, dev, &chunk_start[i]);
> + chunk_start[i].cmd_desc[0] |= CMD_0_ROC;
> + chunk_start[i].completion = NULL;
> }
> - hci->cmd->prep_i3c_xfer(hci, dev, &xfer[i]);
> - xfer[i].cmd_desc[0] |= CMD_0_ROC;
> + chunk_start[n - 1].cmd_desc[0] |= CMD_0_TOC;
> + chunk_start[n - 1].completion = &hci->xfer_done;
> +
> + ret = hci->io->queue_xfer(hci, chunk_start, n);
> + if (ret)
> + goto out;
> + queued += n;
> }
> - last = i - 1;
> - xfer[last].cmd_desc[0] |= CMD_0_TOC;
> - xfer[last].completion = &done;
> - xfer[last].timeout = HZ;
> - ret = i3c_hci_process_xfer(hci, xfer, nxfers);
> - if (ret)
> - goto out;
> - for (i = 0; i < nxfers; i++) {
> - if (i3c_xfers[i].rnw)
> - i3c_xfers[i].len = RESP_DATA_LENGTH(xfer[i].response);
> - if (RESP_STATUS(xfer[i].response) != RESP_SUCCESS) {
> - ret = -EIO;
> + /* 2. Sliding Window Loop (Counting Semaphore) */
> + while (processed < nxfers) {
> + if (!wait_for_completion_timeout(&hci->xfer_done, HZ)) {
> + hci->io->dequeue_xfer(hci, xfer, ring_size);
> + ret = -ETIME;
> goto out;
> }
> +
> + int n_done = min(nxfers - processed, chunk_limit);
> + struct hci_xfer *done_chunk = xfer + (processed % sw_ring_size);
> +
> + for (i = 0; i < n_done; i++) {
> + int idx = processed + i;
> + if (i3c_xfers[idx].rnw)
> + i3c_xfers[idx].len = RESP_DATA_LENGTH(done_chunk[i].response);
> + if (RESP_STATUS(done_chunk[i].response) != RESP_SUCCESS) {
> + ret = -EIO;
> + goto out;
> + }
> + }
> + processed += n_done;
> +
> + /* 3. Refill the ring */
> + if (queued < nxfers) {
> + int n_next = min(nxfers - queued, chunk_limit);
> + struct hci_xfer *next_chunk = xfer + (queued % sw_ring_size);
> +
> + for (i = 0; i < n_next; i++) {
> + int idx = queued + i;
> + next_chunk[i].data_len = i3c_xfers[idx].len;
> + next_chunk[i].rnw = i3c_xfers[idx].rnw;
> + next_chunk[i].data = i3c_xfers[idx].rnw ?
> + i3c_xfers[idx].data.in :
> + (void *)i3c_xfers[idx].data.out;
> + hci->cmd->prep_i3c_xfer(hci, dev, &next_chunk[i]);
> + next_chunk[i].cmd_desc[0] |= CMD_0_ROC;
> + next_chunk[i].completion = NULL;
> + }
> + next_chunk[n_next - 1].cmd_desc[0] |= CMD_0_TOC;
> + next_chunk[n_next - 1].completion = &hci->xfer_done;
> +
> + ret = hci->io->queue_xfer(hci, next_chunk, n_next);
> + if (ret)
> + goto out;
> + queued += n_next;
> + }
> }
> out:
> - hci_free_xfer(xfer, nxfers);
> + mutex_unlock(&hci->xfer_lock);
> return ret;
> }
> @@ -424,9 +475,12 @@ static int i3c_hci_i2c_xfers(struct i2c_dev_desc *dev,
> dev_dbg(&hci->master.dev, "nxfers = %d", nxfers);
> + mutex_lock(&hci->xfer_lock);
> xfer = hci_alloc_xfer(nxfers);
> - if (!xfer)
> + if (!xfer) {
> + mutex_unlock(&hci->xfer_lock);
> return -ENOMEM;
> + }
> for (i = 0; i < nxfers; i++) {
> xfer[i].data = i2c_xfers[i].buf;
> @@ -451,6 +505,7 @@ static int i3c_hci_i2c_xfers(struct i2c_dev_desc *dev,
> }
> out:
> + mutex_unlock(&hci->xfer_lock);
> hci_free_xfer(xfer, nxfers);
> return ret;
> }
> @@ -1019,6 +1074,18 @@ static int i3c_hci_probe(struct platform_device *pdev)
> if (hci->quirks & HCI_QUIRK_RPM_IBI_ALLOWED)
> hci->master.rpm_ibi_allowed = true;
> + /* Pre-allocate ring for high-throughput sliding window transfers */
> +
> + hci->xfer_ring = devm_kcalloc(&pdev->dev, XFER_RING_ENTRIES,
> sizeof(struct hci_xfer), GFP_KERNEL);
> +
> + if (!hci->xfer_ring)
> +
> + return -ENOMEM;
> +
> + init_completion(&hci->xfer_done);
> +
> + mutex_init(&hci->xfer_lock);
> +
> return i3c_master_register(&hci->master, &pdev->dev, &i3c_hci_ops, false);
> }
> diff --git a/drivers/i3c/master/mipi-i3c-hci/dma.c
> b/drivers/i3c/master/mipi-i3c-hci/dma.c
> index e4daaa612055..2cfd6ff25040 100644
> --- a/drivers/i3c/master/mipi-i3c-hci/dma.c
> +++ b/drivers/i3c/master/mipi-i3c-hci/dma.c
> @@ -26,7 +26,6 @@
> */
> #define XFER_RINGS 1 /* max: 8 */
> -#define XFER_RING_ENTRIES 16 /* max: 255 */
> #define IBI_RINGS 1 /* max: 8 */
> #define IBI_STATUS_RING_ENTRIES 32 /* max: 255 */
> diff --git a/drivers/i3c/master/mipi-i3c-hci/hci.h
> b/drivers/i3c/master/mipi-i3c-hci/hci.h
> index f17f43494c1b..c6c8eabbcba8 100644
> --- a/drivers/i3c/master/mipi-i3c-hci/hci.h
> +++ b/drivers/i3c/master/mipi-i3c-hci/hci.h
> @@ -37,6 +37,8 @@ struct dat_words {
> };
> /* Our main structure */
> +#define XFER_RING_ENTRIES 255 /* max: 255 */
> +
> struct i3c_hci {
> struct i3c_master_controller master;
> void __iomem *base_regs;
> @@ -70,6 +72,10 @@ struct i3c_hci {
> u32 vendor_version_id;
> u32 vendor_product_id;
> void *vendor_data;
> + /* High-throughput sliding window support */
> + struct hci_xfer *xfer_ring;
> + struct completion xfer_done;
> + struct mutex xfer_lock;
> };
> /*
> --
>
> --
> linux-i3c mailing list
> linux-i3c at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-i3c