[RFC PATCH 1/2] ARM: deprecate old APCS frame format

Jean-Philippe Brucker Jean-Philippe.Brucker at arm.com
Thu Feb 4 04:28:05 PST 2016


Russell,

On Mon, Feb 01, 2016 at 05:59:07PM +0000, Russell King - ARM Linux wrote:
> On Mon, Jan 25, 2016 at 05:05:20PM +0000, Jean-Philippe Brucker wrote:
> > GCC 5 deprecated option -mapcs-frame, which generated frames following
> > the old ABI format. In preparation for a possible removal in future
> > versions of GCC, deprecate use of this flag in Linux as well.
> > 
> > Instead of completely removing the bits that depend on APCS frame, we
> > move them behind a DEPRECATED config option to smoothen the transition.
> > 
> > Neither AAPCS nor EABI guarantees any frame format, so only ARM_UNWIND
> > should now be used for stack traces and introspection. It is already
> > selected by most defconfigs.
> > Furthermore, frames currently generated by GCC are different for leaf
> > and non-leaf functions, which would make adaptation quite tricky.
> 
> As I mentioned on the call last week, I'm not happy with deprecating
> APCS frames.  The unwinder does not always work, and each time the
> unwinder fails to work, it creates more work.
> 
> The latest scenario can be found in the "[PATCH 1/2] ARM: make
> virt_to_idmap() return unsigned long" thread, where we have a
> failure to unwind the first oops - unwinding stops for apparently
> no reason after the first frame.

I spent some time trying to break the unwinder this week. The only
possibility I found so far is that the oops is triggered before the
function stack is stable.
I don't think it happens in your example, since lr couldn't be on
spin_unlock, but I will still describe that case because it is a bit
worrying.

My GCC 4.9 generates the following prologue for driver_unregister in a
Thumb2 kernel:

    801fd194 <driver_unregister>
    cbz     r0, 801fd1b0        <driver_unregister+0x1c>
    ldr     r3, [r0, #60]
    cbz     r3, 801fd1b0        <driver_unregister+0x1c>
    push    {r4, lr}

If the second instruction faults, we won't get a proper trace, because
the unwinder will find the following entry in the unwind table, and
assume r4 and lr are already on the stack:

    $ readelf -u vmlinux
    ...
    0x801fd194 <driver_unregister>: 0x80a8b0b0
    Compact model 0
    0xa8      pop {r4, r14}
    0xb0      finish
    ...

I might have missed something in the ABI, but if there's no possible way
to handle this case properly in unwind.c, we need to sort it out with
the GCC people.
I found similar constructions in kernels generated with GCC 4.6 and 5.1.

> So, we have no idea how tasket_action() got called.
> 
> My current idea on that oops is... I have no idea, and the only way
> to get more of an idea would be to request a rebuild with thumb
> disabled and frame pointers enabled, and for the problem to be
> reproduced.
> 
> This is the issue: APCS frame pointer based backtracing works every
> time.  The unwinder is and always has been unreliable.
> 
> GCC people need to think about this, because if they persue, we're
> going to end up having to tell people to rebuild their kernels with
> older GCC versions just to get debug information out of them, unless
> the quality of the unwinder and unwind information can be improved.
> 
> However, having experienced the "quality" of gdb over the last five
> years in userspace, I'm really not hopeful that there is a solution
> that works using the unwind information: I suspect there's a great
> number of cases where the unwind information is basically wrong or
> otherwise broken.
> 
> The problem with unwind based solutions is the only time you know
> that the unwinding information is broken is when the code goes wrong
> and the unwind information has to be used.  The nice thing about the
> frame-pointer based solution is that it's correctness is an inherent
> requirement for the program to work, so any problem with a frame-
> pointer based solution shows up as a program crash.
> 
> This is probably the root cause why the unwind based backtracing
> tends to fail.
> 
[...]
> 
> If we can't get a good backtrace, it limits those who can diagnose
> the case of the failure, which really isn't good.
> 

The current unwinder code also doesn't handle recursive calls (e.g.
of_platform_bus_create). There is room for improvement, and we probably
need to come up with a good self-test infrastructure before that
EXPERIMENTAL qualifier can be removed.
I agree that unwinding is not ready, which is one of the main reasons we
can't get rid of the -mapcs flag quite yet. However, I believe most
kernels out there already use ARM_UNWIND by default.

The other reason we can't remove -mapcs is that FUNCTION_GRAPH_TRACER
depends solely on the APCS frame at the moment, and is not easy to
change.
During my tests, I hacked function_graph to work with unwinding info.
It is a terrible solution because of the massive overhead it introduces
for each function call.
But the resulting graphs are the same as the ones generated using APCS
frames, which means that unwinding info are not too far from being
reliable.
I think we need to gather a comprehensive list of broken cases before we
can argue against unwinding to GCC people and ask for something more
reliable, but it does seem like we'll need a better solution for traces.

The problem is that -mapcs is not coming back as is, since it relies on
deprecated instructions. The motivation behind my series (apart from
starting this discussion) is to prevent people from introducing more
features that rely on it, while we work with GCC people to improve the
backtrace.

> What we _might_ end up having to do is to walk every word on the
> kernel stack, check whether it's in a module .text or kernel .text,
> and print it as a "best guess" at a backtrace, but that will be
> very misleading, as registers contain pointers to literal data,
> which can be located not only at the end of a function, but also
> the middle of a function, and that can end up on the stack too.

I really hope it won't come to that...

Thanks,
Jean-Philippe



More information about the linux-arm-kernel mailing list