[PATCH] usb: ehci: make HC see up-to-date qh/qtd descriptor ASAP

Wed Aug 31 14:49:20 EDT 2011

On 08/31/2011 01:35 PM, Mark Salter wrote:
> On Wed, 2011-08-31 at 13:19 -0500, Rob Herring wrote:
>> On 08/31/2011 12:51 PM, Will Deacon wrote:
>>> On Wed, Aug 31, 2011 at 06:46:50PM +0100, Nicolas Pitre wrote:
>>>> On Wed, 31 Aug 2011, Will Deacon wrote:
>>>>
>>>>> On Wed, Aug 31, 2011 at 02:43:33PM +0100, Mark Salter wrote:
>>>>>> On Wed, 2011-08-31 at 09:49 +0100, Will Deacon wrote:
>>>>>>> On Wed, Aug 31, 2011 at 01:23:47AM +0100, Chen Peter-B29397 wrote:
>>>>>>>> One question: why did this write buffer issue not happen on a UP ARMv7 platform, whose DMA buffer
>>>>>>>> is also uncached but bufferable?
>>>>>>>
>>>>>>> Which CPU was on this platform?
>>>>>>
>>>>>> Using a 3.1.0-rc4+ kernel on a Pandaboard, and running 'hdparm -t' on a
>>>>>> usb disk drive, I see ~5.8MB/s read speed. Same kernel, but passing
>>>>>> nosmp on the commandline, I see 20.3MB/s.
>>>>>>
>>>>>> Can someone explain why nosmp would make such a difference?
>>>>>
>>>>> Oh gawd, that's horrible. I have a feeling it's probably a separate issue
>>>>> though, caused by:
>>>>>
>>>>> omap_modify_auxcoreboot0(0x200, 0xfffffdff);
>>>>>
>>>>> in boot_secondary for OMAP. Unfortunately I have no idea what that line is
>>>>> doing because it ends up talking to the secure monitor.
>>>>
>>>> Well, this issue is apparently affecting other ARMv7 implementations
>>>> too.  In which case this code in arch/arm/mm/mmu.c could be responsible:
>>>>
>>>>                 if (is_smp()) {
>>>>                         /*
>>>>                          * Mark memory with the "shared" attribute
>>>>                          * for SMP systems
>>>>                          */
>>>>                         user_pgprot |= L_PTE_SHARED;
>>>>                         kern_pgprot |= L_PTE_SHARED;
>>>>                         vecs_pgprot |= L_PTE_SHARED;
>>>>                         mem_types[MT_DEVICE_WC].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_DEVICE_WC].prot_pte |= L_PTE_SHARED;
>>>>                         mem_types[MT_DEVICE_CACHED].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_DEVICE_CACHED].prot_pte |= L_PTE_SHARED;
>>>>                         mem_types[MT_MEMORY].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_MEMORY].prot_pte |= L_PTE_SHARED;
>>>>                         mem_types[MT_MEMORY_NONCACHED].prot_sect |= PMD_SECT_S;
>>>>                         mem_types[MT_MEMORY_NONCACHED].prot_pte |= L_PTE_SHARED;
>>>>                 }
>>>>
>>>> However I don't see the nosmp kernel argument having any effect on the 
>>>> result from is_smp().
>>>
>>> Yes, the first thing that sprung to mind was the shared attribute, but like
>>> you say, that doesn't seem to be affected by the nosmp command line
>>> argument.
>>>
>>> Another thing that Marc and I tried on OMAP4 was not bringing up the secondary
>>> CPU during boot (by commenting out most of smp_init). In this case, I/O
>>> performance was good until we tried to online the secondary CPU. The online
>>> failed but after that the I/O performance was certainly degraded.
>>>
>>
>> Was the SCU enabled at that point? One diff between nosmp boot and
>> offlining the 2nd core would be that the SCU remains enabled in the
>> latter case. I think the SCU does not get enabled for nosmp.
>>
>> Do we really know which write buffer the data is sitting in? Some
>> experiments to only flush the L1 write buffer would be interesting.
>> Perhaps something executed on the 2nd core has a mb which doesn't help
>> for SMP because the other core's L1 write buffer is not flushed, but it
>> helps for nosmp because everything runs on 1 core and any occurrence of
>> a mb will flush all data out. I wouldn't expect the behavior to be so
>> consistent though. Could it be something is not visible to the other
>> core rather than not visible to the EHCI controller?
> 
> One experiment I did a few days ago was to pin processes and interrupts
> to core#0 (except IPI and local timer). This didn't make any noticeable
> difference.
> 
> My current understanding is that the writes are getting hung up in a
> cache and not a write buffer. I am seeing delays of 10-15ms between
> queuing the urb and getting an interrupt for urb completion. That
> drops to a few hundred microseconds with the explicit flushing added
> to the ehci driver. I don't see how any write buffer could hold data
> that long without draining out on its own. What I see seems to suggest
> that the memory is only coherent among the cores and not coherent for
> CPU writes/device reads. Adding just a dsb() for the ehci flush does
> not help. An outer_sync() is also necessary.
> 
An outer_sync will only drain the write buffer of the L2. It does not
flush the cache though. If the write buffer does in fact keep data as
long as possible (until it needs a free slot or the line is full), then
long delays to write out data are certainly possible. The exact
operation is not documented AFAIR.

Rob
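
The drain policy speculated about above (a write buffer that holds data
until it needs a free slot or a line is full) can be illustrated with a
small user-space toy model. This is only a sketch under stated
assumptions: it is not kernel code and does not describe how any real L2
controller behaves. Every name in it (toy_mem, toy_write, toy_sync,
toy_device_read) and both size constants are hypothetical. toy_sync()
stands in for an explicit drain such as the one outer_sync() is said to
perform.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RAM_WORDS     64   /* hypothetical size of device-visible memory */
#define WB_SLOTS      2    /* hypothetical number of write buffer slots */
#define WB_LINE_WORDS 4    /* hypothetical number of words per line */
#define WB_FULL_MASK  ((1u << WB_LINE_WORDS) - 1)

struct wb_slot {
    bool used;
    uint32_t line;                  /* word address / WB_LINE_WORDS */
    uint32_t data[WB_LINE_WORDS];
    uint8_t valid;                  /* bitmask of words written so far */
};

struct toy_mem {
    uint32_t ram[RAM_WORDS];        /* what a DMA device would read */
    struct wb_slot wb[WB_SLOTS];
    unsigned next_victim;           /* round-robin eviction pointer */
};

static void toy_init(struct toy_mem *m)
{
    memset(m, 0, sizeof(*m));
}

/* Copy the valid words of one slot out to memory and free the slot. */
static void wb_drain_slot(struct toy_mem *m, struct wb_slot *s)
{
    for (unsigned i = 0; i < WB_LINE_WORDS; i++)
        if (s->valid & (1u << i))
            m->ram[s->line * WB_LINE_WORDS + i] = s->data[i];
    s->used = false;
    s->valid = 0;
}

/* CPU write: merged into the buffer, drained only when forced. */
static void toy_write(struct toy_mem *m, uint32_t addr, uint32_t val)
{
    assert(addr < RAM_WORDS);
    uint32_t line = addr / WB_LINE_WORDS;
    struct wb_slot *s = NULL;

    for (unsigned i = 0; i < WB_SLOTS; i++)      /* merge into same line */
        if (m->wb[i].used && m->wb[i].line == line)
            s = &m->wb[i];
    for (unsigned i = 0; !s && i < WB_SLOTS; i++) /* else take a free slot */
        if (!m->wb[i].used)
            s = &m->wb[i];
    if (!s) {                                     /* else evict a victim */
        s = &m->wb[m->next_victim];
        m->next_victim = (m->next_victim + 1) % WB_SLOTS;
        wb_drain_slot(m, s);
    }

    s->used = true;
    s->line = line;
    s->data[addr % WB_LINE_WORDS] = val;
    s->valid |= 1u << (addr % WB_LINE_WORDS);
    if (s->valid == WB_FULL_MASK)                 /* full line drains itself */
        wb_drain_slot(m, s);
}

/* Explicit drain of every slot (stand-in for a sync operation). */
static void toy_sync(struct toy_mem *m)
{
    for (unsigned i = 0; i < WB_SLOTS; i++)
        if (m->wb[i].used)
            wb_drain_slot(m, &m->wb[i]);
}

/* Device read: sees memory only, never the buffer contents. */
static uint32_t toy_device_read(const struct toy_mem *m, uint32_t addr)
{
    assert(addr < RAM_WORDS);
    return m->ram[addr];
}
```

In this model a single-word write (a descriptor update, say) stays
invisible to the device for as long as no other traffic evicts it and
nobody calls toy_sync(). That is the kind of unbounded delay the policy
would allow, if a real write buffer actually behaved this way.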