ARM926EJ-S TLB lockdown
Johannes Stezenbach
js at sig21.net
Wed Sep 1 11:19:20 EDT 2010
Hi,
this is just an FYI in case someone is interested, but comments
are of course welcome.
The ARM926EJ-S has two TLBs: one is a 64-entry, 2-way set-associative
TLB, the other is an 8-entry, fully associative TLB for lockdown entries.
The lockdown TLB is currently unused in Linux. I thought maybe
I could get a performance win so I added the following to
the MACHINE_START's .map_io function of my platform:
#define tlb_lockdown(addr) \
    __asm__ volatile ( \
    "   ldr r1, =" #addr "      @ virtual address\n" \
    "   mrc p15,0,r0,c10,c0,0   @ read lockdown register\n" \
    "   orr r0,r0,#1            @ set preserve bit\n" \
    "   mcr p15,0,r0,c10,c0,0   @ write lockdown register\n" \
    "   mcr p15,0,r1,c8,c7,1    @ invalidate TLB single entry\n" \
    "   ldr r1,[r1]             @ cause TLB miss to load TLB entry\n" \
    "   mrc p15,0,r0,c10,c0,0   @ read lockdown register\n" \
    "   bic r0,r0,#1            @ clear preserve bit\n" \
    "   mcr p15,0,r0,c10,c0,0   @ write lockdown register\n" \
    : : : "r0", "r1")
tlb_lockdown(0xffff0000); // exception vectors
tlb_lockdown(0xc0000000); // kernel code / data
tlb_lockdown(0xc0100000); // kernel code / data
tlb_lockdown(0xc0200000); // kernel code / data
tlb_lockdown(0xc0300000); // kernel code / data
tlb_lockdown(0xc0400000); // kernel code / data
tlb_lockdown(0xc0500000); // kernel code / data
tlb_lockdown(0xc0600000); // kernel code / data
#undef tlb_lockdown
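To make it easier to see which virtual addresses the eight calls above pin, here is a minimal host-side sketch that models the locked ranges. It rests on assumptions that are not stated in the post: that each kernel entry is a 1 MB section mapping (suggested by the 0x100000 stride) and that the exception vectors are mapped with a single 4 KB page. The names `locked_range`, `locked` and `is_locked` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOCKDOWN_ENTRIES 8u         /* size of the lockdown TLB, per the post */
#define SECTION_SIZE     0x100000u  /* assumption: 1 MB section mappings */
#define VECTOR_PAGE_SIZE 0x1000u    /* assumption: vectors use one 4 KB page */

struct locked_range {
    uint32_t base;  /* virtual address passed to tlb_lockdown() */
    uint32_t size;  /* assumed size of the mapping behind that entry */
};

/* Hypothetical model of the entries locked by the tlb_lockdown() calls. */
static const struct locked_range locked[] = {
    { 0xffff0000u, VECTOR_PAGE_SIZE },  /* exception vectors */
    { 0xc0000000u, SECTION_SIZE },      /* kernel code / data */
    { 0xc0100000u, SECTION_SIZE },
    { 0xc0200000u, SECTION_SIZE },
    { 0xc0300000u, SECTION_SIZE },
    { 0xc0400000u, SECTION_SIZE },
    { 0xc0500000u, SECTION_SIZE },
    { 0xc0600000u, SECTION_SIZE },
};

/* The lockdown TLB cannot hold more than eight entries. */
static_assert(sizeof(locked) / sizeof(locked[0]) <= LOCKDOWN_ENTRIES,
              "too many lockdown entries");

/* Return true if vaddr falls inside one of the modelled locked ranges. */
static bool is_locked(uint32_t vaddr)
{
    for (size_t i = 0; i < sizeof(locked) / sizeof(locked[0]); i++) {
        if (vaddr >= locked[i].base &&
            vaddr - locked[i].base < locked[i].size)
            return true;
    }
    return false;
}
```

Under these assumptions the seven kernel entries cover 0xc0000000 through 0xc06fffff, so `is_locked(0xc06fffff)` is true while `is_locked(0xc0700000)` is false.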
I used a JTAG debugger to dump the TLB to confirm the lockdown entries
are correct and stay in the TLB during run time.
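Besides a JTAG dump, the value read back from the lockdown register (c10, c0, 0) can also give a hint about how many entries were locked. The sketch below decodes such a value on the host. The preserve bit at bit 0 matches the code above; the victim field at bits [28:26] is an assumption taken from my reading of the ARM926EJ-S TRM and should be checked there. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCKDOWN_ENTRIES      8u
#define LOCKDOWN_P_BIT        (1u << 0)   /* preserve bit, as used above */
#define LOCKDOWN_VICTIM_SHIFT 26          /* assumption: victim in [28:26] */
#define LOCKDOWN_VICTIM_MASK  (0x7u << LOCKDOWN_VICTIM_SHIFT)

/* Index of the lockdown TLB entry that the next locked load will fill. */
static unsigned lockdown_victim(uint32_t reg)
{
    return (reg & LOCKDOWN_VICTIM_MASK) >> LOCKDOWN_VICTIM_SHIFT;
}

/* True while new TLB entries are being placed in the lockdown TLB. */
static bool lockdown_preserve(uint32_t reg)
{
    return (reg & LOCKDOWN_P_BIT) != 0;
}

/* Return reg with the victim field replaced; other bits are untouched. */
static uint32_t lockdown_with_victim(uint32_t reg, unsigned victim)
{
    assert(victim < LOCKDOWN_ENTRIES);
    return (reg & ~LOCKDOWN_VICTIM_MASK) |
           ((uint32_t)victim << LOCKDOWN_VICTIM_SHIFT);
}
```

For example, a register value of 0x0C000001 would decode to victim 3 with the preserve bit set, assuming the layout above.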
Then I compared lmbench results (with init=/bin/sh):
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host                 OS  Mhz null null      open slct sig  sig  fork exec sh
                             call  I/O stat clos TCP  inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
plain     Linux 2.6.32.  330 1.15 2.72 14.9 21.5 89.7 5.33 12.5 2497 9497 15.K
tlb       Linux 2.6.32.  330 1.11 1.96 14.8 21.1 89.3 3.90 12.4 2461 9392 15.K
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
plain     Linux 2.6.32.  139.2  221.6  144.0  237.4  161.3   241.0   162.8
tlb       Linux 2.6.32.  134.3  216.0  139.6  228.2  158.4   234.1   158.6
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host                 OS    0K File      10K File       Mmap  Prot    Page 100fd
                        Create Delete Create Delete Latency Fault   Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
plain     Linux 2.6.32.   56.0   30.0  262.1   69.6  2764.0 2.817    21.9  43.4
tlb       Linux 2.6.32.   53.7   28.9  266.8   65.7  2806.0 2.500    21.9  44.3
*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host                 OS Pipe   AF  TCP   File   Mmap  Bcopy  Bcopy  Mem   Mem
                             UNIX      reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
plain     Linux 2.6.32. 33.6 36.3 30.6   44.3  115.1   95.5   83.9 113. 212.2
tlb       Linux 2.6.32. 34.0 34.6 30.9   45.7  117.9   95.5   83.9 115. 212.3
It seems syscall-heavy micro benchmarks like "null I/O" (2.72 -> 1.96
microseconds) and "sig inst" (5.33 -> 3.90 microseconds) benefit, but
most of the other result changes are within the measurement noise.
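To put the differences between the "plain" and "tlb" rows into proportion, the relative change can be computed as (tlb - plain) / plain. A minimal sketch, with a hypothetical helper name, assuming the lmbench figures are simply copied by hand from the tables above:

```c
#include <assert.h>
#include <stdio.h>

/* Relative change in percent; negative means the "tlb" run was smaller. */
static double percent_change(double plain, double tlb)
{
    assert(plain != 0.0);
    return (tlb - plain) / plain * 100.0;
}

/* Print one comparison line for a named lmbench column. */
static void report(const char *name, double plain, double tlb)
{
    printf("%-10s %8.2f -> %8.2f  (%+.1f%%)\n",
           name, plain, tlb, percent_change(plain, tlb));
}
```

Calling `report("null I/O", 2.72, 1.96)` and `report("sig inst", 5.33, 3.90)` gives changes of roughly -28% and -27%, while `report("null call", 1.15, 1.11)` comes out at only a few percent, which is the range the post treats as noise.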
I also ran an iperf TCP benchmark and got no improvement.
BTW, I updated the elinux.org wiki page about lmbench:
http://elinux.org/Benchmark_Programs
Cheers,
Johannes