[PATCH 7/8] serial: imx: use readl() to optimize FIFO reading loop

Uwe Kleine-König u.kleine-koenig at pengutronix.de
Wed Jan 18 23:01:14 PST 2023


Hello Sergey,

On Wed, Jan 18, 2023 at 06:40:17PM +0300, Sergey Organov wrote:
> Uwe Kleine-König <u.kleine-koenig at pengutronix.de> writes:
> > On Tue, Jan 17, 2023 at 04:22:51PM +0300, Sergey Organov wrote:
> >> Uwe Kleine-König <u.kleine-koenig at pengutronix.de> writes:
> >> > On Fri, Jan 13, 2023 at 09:43:33PM +0300, Sergey Organov wrote:
> >> >> Use readl() instead of heavier imx_uart_readl() in the Rx ISR, as we know
> >> >> we read registers that must not be cached.
> >> >> 
> >> >> Signed-off-by: Sergey Organov <sorganov at gmail.com>
> >> >> ---
> >> >>  drivers/tty/serial/imx.c | 5 +++--
> >> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >> >> 
> >> >> diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
> >> >> index be00362b8b67..f4236e8995fa 100644
> >> >> --- a/drivers/tty/serial/imx.c
> >> >> +++ b/drivers/tty/serial/imx.c
> >> >> @@ -890,14 +890,15 @@ static irqreturn_t __imx_uart_rxint(int irq, void *dev_id)
> >> >>  	struct imx_port *sport = dev_id;
> >> >>  	unsigned int rx, flg;
> >> >>  	struct tty_port *port = &sport->port.state->port;
> >> >> +	typeof(sport->port.membase) membase = sport->port.membase;
> >> >>  	u32 usr2;
> >> >>  
> >> >>  	/* If we received something, check for 0xff flood */
> >> >> -	usr2 = imx_uart_readl(sport, USR2);
> >> >> +	usr2 = readl(membase + USR2);
> >> >>  	if (usr2 & USR2_RDR)
> >> >>  		imx_uart_check_flood(sport, usr2);
> >> >>  
> >> >> -	while ((rx = imx_uart_readl(sport, URXD0)) & URXD_CHARRDY) {
> >> >> +	while ((rx = readl(membase + URXD0)) & URXD_CHARRDY) {
> >> >>  		flg = TTY_NORMAL;
> >> >>  		sport->port.icount.rx++;
> >> >
> >> > One of the motivations to introduce imx_uart_readl was to have a single
> >> > place to add a debug output to be able to inspect what the driver is
> >> > doing.
> >> >
> >> > I wonder where your need for higher speed comes from and if the compiler
> >> > really generates more effective code with your change.
> >> 
> >> Mostly it's because I'm obviously slowing things down a bit with the
> >> patch to fight the flood, so I feel obliged to get things back on par
> >> with the origin. Then, higher speed, let alone the time spent with
> >> interrupts disabled and/or spinlocks taken, is always one of generic
> >> goals for me.
> >> 
> >> As for the generated code, with this patch I don't aim to affect code
> >> generation, I rather avoid execution of part of existing code while
> >> being on the most critical path. It should be quite obvious that not
> >> executing some code is at least not slower than executing it.
> >
> > That's true, but I think it doesn't apply here.
> 
> Well, "at least not slower" still applies ;-)
> 
> >
> > I would expect that the compiler "sees" for the call
> >
> > 	imx_uart_readl(sport, USR2)
> >
> > that the 2nd argument is constant and that for that value of offset the
> > call is equivalent to readl(sport->port.membase + offset);
> >
> > So I doubt you're making anything quicker here.
> 
> Yep, it's nice compiler is clever enough to optimize-out the switch for
> constant argument, though I still typically prefer to avoid over-relying
> on optimizations. That said, I now tend to agree with your POV in this
> particular case.
> 
> >
> > I tried the following patch on mainline (that is without the preceding
> > patches in this series):
> >
> > diff --git a/drivers/tty/serial/imx.c b/drivers/tty/serial/imx.c
> > index 757825edb0cd..cfc2f7057345 100644
> > --- a/drivers/tty/serial/imx.c
> > +++ b/drivers/tty/serial/imx.c
> > @@ -807,7 +807,7 @@ static irqreturn_t __imx_uart_rxint(int irq, void *dev_id)
> >  	unsigned int rx, flg, ignored = 0;
> >  	struct tty_port *port = &sport->port.state->port;
> >  
> > -	while (imx_uart_readl(sport, USR2) & USR2_RDR) {
> > +	while (readl(sport->port.membase + USR2) & USR2_RDR) {
> >  		u32 usr2;
> >  
> >  		flg = TTY_NORMAL;
> >
> > and the resulting code didn't change at all. For a bigger change (i.e.
> > adding a variable for sport->port.membase and replacing two
> > imx_uart_readl) the code changed quite a bit (it got 28 bytes bigger for
> > imx_v6_v7_defconfig) and in the short time I tried I couldn't judge if
> > the resulting code is better or not.
> >
> > So a change that explicitly doesn't execute the code that the compiler
> > optimizes away anyhow isn't a win. Together with the fact that your
> > patch makes register access use different idioms and so makes it harder
> > to understand for a human I'd say the net benefit of your patch is
> > negative.
> 
> OK, you convinced me to drop it.
> 
> >
> >> > Please either drop the patch from your series or provide the differences
> >> > the compiler produces and a benchmark.
> >> 
> >> If your only objection against this patch is the desire to keep a single
> >> place to add debug output, I'll be happy to tune the resulting code to
> >> still have one.
> >
> > I don't see the need to optimize it.
> >
> >> That said, before we make a decision, could you please tell why register
> >> shadows that the imx_uart_readl/writel are dealing with are needed in
> >> the first place? It looks like all the registers that are shadowed are
> >> readable as well. What's going on here, and if it happens to be a
> >> speed-up, do we have any benchmarks?
> >
> > Not sure I did benchmarks back then, probably not. The main motivation
> > was really to have that single access function. So I admit being guilty
> > to have implemented an optimization without hard numbers just assuming
> > that access to (cached) RAM is quicker than the register space.
> 
> Well, even if it is quicker, we still spend time writing to both RAM and
> register, and then there is no gain for the data Tx/Rx registers that
> aren't cached, yet are on most critical paths.

Well, assuming we're saving some time for the ctrl registers, it's worth
keeping it even though there is no gain for RX/TX, right? There is no
overhead for RX/TX.

> So, if this is just caching and doesn't change behavior, I'd suggest to
> get rid of the shadowing altogether, making code simpler to follow.

Knowing it's subjective I don't think the shadowing is complicated.
Functions are using the driver specific readl and writel functions and
shadowing is limited to these two functions.

in sum today I wouldn't change if the code does shadow the registers or
not if there isn't at least a strong hint that the one or the other
variant is better. So if you still want to work on that you're welcome,
but I invite you to do some benchmarks first and not only assume one or
the other variant is better.

My (unproved) assumption is that for console usage there is hardly a
difference and with a workflow that needs more changing of control
settings (like half duplex rs485) shadowing is slightly better.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20230119/1a8d937f/attachment-0001.sig>


More information about the linux-arm-kernel mailing list