[PATCH 01/23] all: syscall wrappers: add documentation
Yury Norov
ynorov at caviumnetworks.com
Tue Jun 14 16:08:38 PDT 2016
Hi Catalin, David, all
> COMPAT_SYSCALL_WRAP2(creat, ...):
> mov w0, w0
> b <sys_creat>
>
> > > Cost wise, this seems like it all cancels out in the end, but what
> > > do I know?
> >
> > I think you know something, and I also think Heiko and other s390 guys
> > know something as well. So I'd like to listen to their arguments here.
> >
> > For me the sparc64 way looks reasonable only because it's really simple
> > and takes less coding. I'll try it on some branch and share here what happened.
>
> The kernel code will definitely look simpler ;). It would be good to see
> if there actually is any performance impact. Even with 16 more cycles on
> syscall entry, would they be lost in the noise? You don't need a full
> implementation, just some dummy mov x0, x0 on the entry path.
>
> --
> Catalin
I wrote a simple test:
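#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/time.h>

struct timeval start, end;
unsigned long long ut;

int main()
{
	gettimeofday(&start, NULL);
	/* who = 100 is invalid, so every call returns -EINVAL right away */
	for (int i = 1000000; i; i--)
		syscall(__NR_getrusage, 100 /* EINVAL */, NULL);
	gettimeofday(&end, NULL);
	/* elapsed time in microseconds */
	ut = (end.tv_sec - start.tv_sec) * 1000000ULL
		+ end.tv_usec - start.tv_usec;
	printf("%llu\n", ut);
	exit(EXIT_SUCCESS);
}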
In the kernel, the added overhead is minimal:
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be4..003d5ad 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1634,6 +1634,17 @@ COMPAT_SYSCALL_DEFINE2(getrusage, int, who,
 		struct compat_rusage __user *, ru)
 {
 	struct rusage r;
+	asm volatile (
+	"	mov w0, w0	\n"
+	"	mov w1, w1	\n"
+	"	mov w2, w2	\n"
+	"	mov w3, w3	\n"
+	"	mov w4, w4	\n"
+	"	mov w5, w5	\n"
+	"	mov w6, w6	\n"
+	"	mov w7, w7	\n"
+	);
+
 	if (who != RUSAGE_SELF && who != RUSAGE_CHILDREN &&
 	    who != RUSAGE_THREAD)
 		return -EINVAL;
On QEMU (microseconds per 1000000 calls):
With MOVs:      W/O MOVs:
832015          814564
840639          803165
830482          813116
832895          802928
832083          832658
834461          802993
829405          812465
846677          822651
828409          803393
836845          821470
828716          801044
831620          821301
825423          800278
829946          821476
On average that is about 833 ms vs 812 ms per million calls, ~2.5% of
performance degradation. The relative difference would be bigger if I
issued svc from inline asm instead of going through the syscall()
wrapper, because the wrapper adds time of its own.
It's definitely more than 0, but not that big either. For syscalls
with a heavy payload it will not be measurable. So the choice
is still open: should we use wrappers and save ~2.5% of syscall
performance, or clear the top halves unconditionally and win in simplicity?
If QEMU looks non-representative, I can run the test on real
hardware, but that takes time, and I think it will end up with similar
results.
The latest kernel (with wrappers) and library are here:
https://github.com/norov/linux/commits/ilp32
https://github.com/norov/glibc/commits/ilp32-dev
BTW, note the change in ABI: syscalls that take stat and statfs
structures are now routed to (wrapped) native handlers, after switching
userspace to use 64-bit off_t, ino_t, blkcnt_t, fsblkcnt_t and
fsfilcnt_t types.
Yury.