Query about floating-point operations on Cortex-A9
bill4carson
bill4carson at gmail.com
Thu Nov 3 04:21:28 EDT 2011
Hi all,
I ran the STREAM benchmark on an ARM Versatile Express Cortex-A9x4 tile
running Linux 3.1.0+ with hugetlb support added. With huge pages, the
performance improvement surprisingly shows up *only* in the "Copy" function
(2.7% to 10.7%), while "Scale", "Add" and "Triad" barely beat the 4k-page results.
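For reference, the four STREAM kernels can be sketched as plain loops. This is only an illustration of the standard STREAM definitions, not the benchmark source itself; the stream_* names and signatures are made up for this sketch. It shows that Copy does no floating-point arithmetic, while Scale, Add and Triad each perform a double-precision multiply and/or add per element, which a soft-float build turns into library calls.

```c
#include <stddef.h>

/* Sketch of the four STREAM kernels (hypothetical names; the real
 * benchmark adds timing, repetition and result validation). */

/* Copy: pure memory traffic, no floating-point arithmetic. */
void stream_copy(double *c, const double *a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Scale: one double-precision multiply per element. */
void stream_scale(double *b, const double *c, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = scalar * c[j];
}

/* Add: one double-precision add per element. */
void stream_add(double *c, const double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Triad: one multiply and one add per element. */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

If the arithmetic in the last three loops is done in software, its cost can dominate the loop, so page-size effects on memory access would be expected to matter much less there than in Copy.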
Profiling the code with OProfile shows that the __adddf3/__muldf3
operations consume most of the CPU cycles, as can be seen in the logs below.
Apparently these two operations wipe out the benefit brought by hugetlb.
To make the hugetlb benchmarking more convincing for the community,
I compiled the source code with -mfloat-abi=softfp; unfortunately, I now get
illegal instruction errors on every run.
Is there anything else that needs to be done in order to execute instructions
such as vldr/vmul.f64/vadd.f64?
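One thing that can be checked from user space is whether the running kernel reports VFP at all. On ARM Linux, /proc/cpuinfo has a "Features" line (for example "swp half thumb fastmult vfp edsp neon vfpv3"); if "vfp" is missing there, VFP instructions would be expected to trap as illegal instructions. The following is a minimal sketch of such a check, not a confirmed diagnosis of this problem; the function names features_has and cpuinfo_has_feature are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole word in `line`, else 0.
 * Whole-word matching keeps "vfp" from matching inside "vfpv3". */
int features_has(const char *line, const char *flag)
{
    size_t len = strlen(flag);
    const char *p = line;

    if (len == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[len];
        int ends = (end == '\0' || end == ' ' || end == '\t' || end == '\n');

        if (starts && ends)
            return 1;
        p += len;
    }
    return 0;
}

/* Return 1 if the "Features" line of /proc/cpuinfo lists `flag`,
 * 0 if it does not, -1 if the file cannot be opened. */
int cpuinfo_has_feature(const char *flag)
{
    char line[1024];
    FILE *f = fopen("/proc/cpuinfo", "r");
    int found = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "Features", 8) == 0 && features_has(line, flag)) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling cpuinfo_has_feature("vfp") on the board would show whether the kernel advertises VFP to user space; a result of 0 would point at the kernel configuration rather than at the compiler flags.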
root@localhost:/root> ./run_huge.sh
Profiler running.
hugectl: WARNING: data and bss remapped together in the default hugepage
size
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2097152, Offset = 0
Total memory required = 48.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 9 microseconds.
Each test below will take on the order of 267867 microseconds.
(= 29763 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:          209.0475       0.1613       0.1605       0.1624
Scale:          95.3991       0.3819       0.3517       0.4015
Add:           125.4612       0.4245       0.4012       0.4696
Triad:          62.7207       0.8052       0.8025       0.8094
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
Stopping profiling.
Overflow stats not available
CPU: ARM Cortex-A9, speed 0 MHz (estimated)
Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of
0x00 (No unit mask) count 100000
warning: the last modified time of the binary file does not match that
of the sample file for /root/stream_c_2M.exe
Either this is the wrong binary or the binary has been modified since
the sample file was created.
samples  %        symbol name
235526   46.7135  __adddf3
185916   36.8740  __muldf3
37467     7.4311  tuned_STREAM_Copy
14374     2.8509  tuned_STREAM_Triad
11627     2.3061  tuned_STREAM_Add
10815     2.1450  tuned_STREAM_Scale
6066      1.2031  main
2399      0.4758  checkSTREAMresults
1        2.0e-04  __divdf3
1        2.0e-04  __floatsidf
1        2.0e-04  mysecond
root@localhost:/root> ./run_4k.sh
Profiler running.
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 2097152, Offset = 0
Total memory required = 48.0 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 9 microseconds.
Each test below will take on the order of 268604 microseconds.
(= 29844 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:          265.7375       0.1269       0.1263       0.1275
Scale:          99.3782       0.3583       0.3376       0.3796
Add:           132.0995       0.3940       0.3810       0.4173
Triad:          71.8116       0.7040       0.7009       0.7082
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
Stopping profiling.
Overflow stats not available
CPU: ARM Cortex-A9, speed 0 MHz (estimated)
Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of
0x00 (No unit mask) count 100000
warning: the last modified time of the binary file does not match that
of the sample file for /root/stream_c_2M.exe
Either this is the wrong binary or the binary has been modified since
the sample file was created.
samples  %        symbol name
274767   46.7235  __adddf3
216873   36.8788  __muldf3
43709     7.4326  tuned_STREAM_Copy
16682     2.8367  tuned_STREAM_Triad
13579     2.3091  tuned_STREAM_Add
12587     2.1404  tuned_STREAM_Scale
7079      1.2038  main
2791      0.4746  checkSTREAMresults
1        1.7e-04  __divdf3
1        1.7e-04  __floatsidf
1        1.7e-04  mysecond
--
I am a slow learner
but I will keep trying to fight for my dreams!
--bill