Queries on ARM SDEI Linux kernel code

Wed Oct 21 13:31:41 EDT 2020

Hi James,

Sorry for the late reply. Thanks for your comments!

On 10/16/2020 9:57 PM, James Morse wrote:
> Hi Neeraj,
> 
> On 15/10/2020 07:07, Neeraj Upadhyay wrote:
>> 1. Looks like interrupt bind interface (SDEI_1_0_FN_SDEI_INTERRUPT_BIND) is not available
>> for clients to use; can you please share information on
>> why it is not provided?
> 
> There is no compelling use-case for it, and it's very complex to support as the driver can
> no longer hide things like hibernate.
> 
> Last time I looked, it looked like the SDEI driver would need to ask the irqchip to
> prevent modification while firmware re-configures the irq. I couldn't work out how this
> would work if the irq is in-progress on another CPU.
> 

Got it. I will think about how this could be achieved.

> The reasons to use bound-interrupts can equally be supported with an event provided by
> firmware.
> 
> 
Ok, I will explore in that direction.

>> While trying to dig up information on this, I saw that [1] says:
>>    Now the hotplug callbacks save nothing, and restore the OS-view of registered/enabled.
>> This makes bound-interrupts harder to work with.
> 
>> Based on this comment, what I could understand of the changes from v4 [2] is that the cpu
>> down path does not save the current event enable status. Instead we rely on the enable
>> status `event->reenable', which is set when register/unregister and enable/disable calls
>> are made; the cpu up path uses it to decide whether to reenable the interrupt.
> 
>> Does this make bound-interrupts harder to work with? How? Can you please explain? Or is
>> the above save/restore not the reason, and you meant something else?
> 
> If you bind a level-triggered interrupt, how does firmware know how to clear the interrupt
> from whatever is generating it?
> 
> What happens if the OS can't do this either, as it needs to allocate memory, or take a
> lock, which it can't do in nmi context?
> 
Ok, makes sense.

> The people that wrote the SDEI spec's answer to this was that the handler can disable the
> event from inside the handler... and firmware will do, something, to stop the interrupt
> screaming.
> 
> So now an event can become disabled any time it's registered, which makes it more
> complicated to save/restore.
> 
> 
>> Also, does shared bound interrupts
> 
> Shared-interrupts as an NMI made me jump. But I think you mean a bound interrupt as a
> shared event, i.e. an SPI, not a PPI.
> 
> 
Sorry, I should have worded it properly; yes, I meant an SPI as a shared event.

>> also have the same problem, as save/restore behavior
>> was only for private events?
> 
> See above, the problem is the event disabling itself.
> 
This makes sense now.
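
To check my understanding, here is a small user-space model of that
bookkeeping. It is only a sketch: the struct and the model_*() helpers
are names I made up for illustration, not the driver's code, and
firmware is reduced to a single flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one event on one CPU; not the kernel's struct. */
struct model_event {
	bool os_reenable;	/* OS view, updated by enable/disable calls */
	bool fw_enabled;	/* what firmware currently has for this CPU */
};

static void model_enable(struct model_event *ev)
{
	ev->os_reenable = true;
	ev->fw_enabled = true;
}

static void model_disable(struct model_event *ev)
{
	ev->os_reenable = false;
	ev->fw_enabled = false;
}

/* cpu down: nothing is saved, the per-CPU firmware state just goes away. */
static void model_cpu_down(struct model_event *ev)
{
	ev->fw_enabled = false;
}

/* cpu up: restore the OS view, whatever firmware had before. */
static void model_cpu_up(struct model_event *ev)
{
	ev->fw_enabled = ev->os_reenable;
}

/* The handler disables the event itself; the OS view is not updated. */
static void model_self_disable(struct model_event *ev)
{
	ev->fw_enabled = false;
}

/* Returns true if a down/up cycle changed what firmware has enabled. */
static bool model_hotplug_changes_state(struct model_event *ev)
{
	bool before;

	assert(ev != NULL);
	before = ev->fw_enabled;
	model_cpu_down(ev);
	model_cpu_up(ev);
	return ev->fw_enabled != before;
}
```

In this model a hotplug cycle leaves a normally enabled or disabled
event unchanged, but an event that disabled itself from its handler
comes back enabled after cpu up, because the OS view still says
"enabled".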

> Additionally those changes to unregister the private-event mean the code can't tell the
> difference between cpuhp and hibernate... only hibernate additionally loses the state in
> firmware.
> 
> 
Got it!

>> 2. The SDEI_EVENT_SIGNAL API is not provided either. What is the reason for that? Does its
>> handling have the same problems as bound interrupts?
> 
> It's not supported as no-one showed up with a use-case.
> While firmware is expected to back it with a PPI, it doesn't have the same problems as
> bound-interrupts as it's not an interrupt the OS ever knows about.
> 
> 
>> Also, if it is provided, do clients need to register event 0? Vendor events or other event
>> numbers are not supported, as per the spec.
> 
> Ideally the driver would register the event, and provide a call_on_cpu() helper to trigger
> it. This should fit in with however the GIC's PMR based NMI does its PPI based
> crash/stacktrace call so that the caller doesn't need to know if it's backed by IRQ, pNMI or
> SDEI.
> 
> 
Ok, I will explore how PMR-based NMIs work. I thought it was SGI based,
but I will recheck.

>> 3. Can kernel panic() be triggered from sdei event handler?
> 
> Yes,
> 
> 
>> Is it a safe operation?
> 
> panic() wipes out the machine... did you expect it to keep running?

I wanted to check the case where panic triggers the kexec/kdump path into
the capture kernel.

> What does safe mean here?
> 
I think I didn't put it correctly; I meant: what possible scenarios can
happen in this case? You explained one below, thanks!

> You should probably call nmi_panic() if there is the risk that the event occurred during
> panic() on the same CPU, as it would otherwise just block.
> 
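For my own reference, this is how I understand the decision nmi_panic()
makes. It is a user-space model, not the kernel source:
model_nmi_panic(), model_panic_cpu and the enum are made-up names, and
the function only returns which of the three paths would be taken
instead of actually panicking.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_CPU_INVALID	(-1)

/* Which path the caller would take; illustrative names only. */
enum model_action {
	MODEL_DO_PANIC,		/* this CPU is first, go on to panic() */
	MODEL_RETURN,		/* this CPU is already panicking, just return */
	MODEL_SELF_STOP,	/* another CPU is panicking, park this one */
};

/* Models the "which CPU owns the panic" slot. */
static atomic_int model_panic_cpu = MODEL_CPU_INVALID;

static enum model_action model_nmi_panic(int this_cpu)
{
	int old_cpu = MODEL_CPU_INVALID;

	assert(this_cpu >= 0);

	/* Try to claim the slot; only the first caller succeeds. */
	if (atomic_compare_exchange_strong(&model_panic_cpu, &old_cpu, this_cpu))
		return MODEL_DO_PANIC;

	/* On failure, old_cpu now holds the CPU that owns the panic. */
	if (old_cpu == this_cpu)
		return MODEL_RETURN;

	return MODEL_SELF_STOP;
}
```

So if the SDEI event lands while the same CPU is already inside panic(),
the handler's call returns and the interrupted panic() can carry on.
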
> 
>> The spec says, synchronous exceptions should not be triggered; I think panic
>> won't do it; but anything which triggers a WARN
>> or other sync exception in that path can cause undefined behavior. Can you share your
>> thoughts on this?
> 
> What do you mean by undefined behaviour?
> 
I was thinking: if an SDEI event preempts EL1 at the point where EL1 has
just entered an exception and hasn't yet captured registers like
spsr_el1 and elr_el1, what will the behavior be?

> SDEI was originally to report external abort to the OS in regions where the OS can't take
> an exception because the exception-registers are live, just after an exception and just
> before eret.
> 
> If you take another exception from the NMI handler, chances are you're going to go back
> round the loop again, only this time firmware can't inject the SDEI event, so it has to
> reboot.
> 
Got it.

> If you know it might cause an exception, you shouldn't do it in NMI context.
> 
> 
Ok, I understand now.

>> "The handler code should not enable asynchronous exceptions by clearing any of the
>> PSTATE.DAIF bits, and should not cause synchronous exceptions to the client Exception level."
> 
> 
> What are you using this thing for?
> 
> 
The use case is a watchdog SPI interrupt, which we want to bind to an SDEI
event. Below is the flow:

wdog expiry -> SDEI event -> HLOS panic -> trigger kexec/kdump

Thanks
Neeraj

> Thanks,
> 
> James
> 

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a 
member of the Code Aurora Forum, hosted by The Linux Foundation