[RFC PATCH v2 5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable
Madhavan T. Venkataraman
madvenka at linux.microsoft.com
Tue Mar 23 17:27:36 GMT 2021
On 3/23/21 12:23 PM, Madhavan T. Venkataraman wrote:
>
>
> On 3/23/21 12:02 PM, Mark Rutland wrote:
>> On Tue, Mar 23, 2021 at 11:20:44AM -0500, Madhavan T. Venkataraman wrote:
>>> On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
>>>> On 3/23/21 9:57 AM, Mark Rutland wrote:
>>>>> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>>>> So, my next question is - can we define a practical limit for the
>>>> nesting so that any nesting beyond that is fatal? The reason I ask
>>>> is - if there is a max, then we can allocate an array of stack
>>>> frames out of band for the special frames so they are not part of
>>>> the stack and will not likely get corrupted.
>>>>
>>>> Also, we don't have to do any special detection. If the number of
>>>> out of band frames used is one or more then we have exceptions and
>>>> the stack trace is unreliable.
>>>
>>> Alternatively, if we can just increment a counter in the task
>>> structure when an exception is entered and decrement it when an
>>> exception returns, that counter will tell us that the stack trace is
>>> unreliable.
>>
>> As I noted earlier, we must treat *any* EL1 exception boundary needs to
>> be treated as unreliable for unwinding, and per my other comments w.r.t.
>> corrupting the call chain I don't think we need additional protection on
>> exception boundaries specifically.
>>
>>> Is this feasible?
>>>
>>> I think I have enough for v3 at this point. If you think that the
>>> counter idea is OK, I can implement it in v3. Once you confirm, I will
>>> start working on v3.
>>
>> Currently, I don't see a compelling reason to need this, and would
>> prefer to avoid it.
>>
>
> I think that I did a bad job of explaining what I wanted to do. It is not
> for any additional protection at all.
>
> So, let us say we create a field in the task structure:
>
> u64 unreliable_stack;
>
> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
> On exiting the above, decrement unreliable_stack.
>
> In arch_stack_walk_reliable(), simply do this check upfront:
>
> if (task->unreliable_stack)
> return -EINVAL;
>
> This way, the function does not even bother unwinding the stack to find
> exception frames or checking for different return addresses or anything.
> We also don't have to worry about code being reorganized, functions
> being renamed, etc. It also may help in debugging to know if a task is
> experiencing an exception and the level of nesting, etc.
>
>> More generally, could we please break this work into smaller steps? I
>> reckon we can break this down into the following chunks:
>>
>> 1. Add the explicit final frame and associated handling. I suspect that
>> this is complicated enough on its own to be an independent series,
>> and it's something that we can merge without all the bits and pieces
>> necessary for truly reliable stacktracing.
>>
>
> OK. I can do that.
>
>> 2. Figure out how we must handle kprobes and ftrace. That probably means
>> rejecting unwinds from specific places, but we might also want to
>> adjust the trampolines if that makes this easier.
>>
>
> I think I am already doing all the checks except the one you mentioned
> earlier. Yes, I can do this separately.
>
>> 3. Figure out exception boundary handling. I'm currently working to
>> simplify the entry assembly down to a uniform set of stubs, and I'd
>> prefer to get that sorted before we teach the unwinder about
>> exception boundaries, as it'll be significantly simpler to reason
>> about and won't end up clashing with the rework.
>>
>
> So, here is where I still have a question. Is it necessary for the unwinder
> to know the exception boundaries? Is it not enough if it knows if there are
> exceptions present? For instance, using something like num_special_frames
Typo - num_special_frames should be unreliable_stack. That is the name of
the counter I used above.
Sorry about that.
Madhavan
More information about the linux-arm-kernel
mailing list