Why is the ARM SMMU v1/v2 put into bypass mode on kexec?

Tue Mar 19 11:17:39 PDT 2024

On 2024-03-19 3:47 pm, Will Deacon wrote:
> On Tue, Mar 19, 2024 at 12:57:52PM +0000, Robin Murphy wrote:
>> On 2024-03-14 7:06 pm, Tyler Hicks wrote:
>>> On 2024-03-14 09:55:46, Tyler Hicks wrote:
>>>> Given that drivers are only optionally asked to implement the .shutdown
>>>> hook, which is required to properly quiesce devices before a kexec, why
>>>> is it that we put the ARM SMMU v1/v2 into bypass mode in the arm-smmu
>>>> driver's own .shutdown hook?
>>>>
>>>>    arm_smmu_device_shutdown() -> set SMMU_sCR0.CLIENTPD bit to 1
>>>>
>>>> Driver authors often forget to even implement a .shutdown hook, which
>>>> results in some hard-to-debug memory corruption issues in the kexec'ed
>>>> target kernel due to pending DMA operations happening on untranslated
>>>> addresses. Why not leave the SMMU in translate mode but clear the stream
>>>> mapping table (or maybe even call arm_smmu_device_reset()) in the SMMU's
>>>> .shutdown hook to prevent the memory corruption from happening in the
>>>> first place?
>>>>
>>>> Fully acknowledging that the proper fix is to quiesce the devices, I
>>>> feel like resetting the SMMU and leaving it in translate mode across
>>>> kexec would be more consistent with the intent behind v5.2 commit
>>>> 954a03be033c ("iommu/arm-smmu: Break insecure users by disabling bypass
>>>> by default"). The incoming transactions of devices, that weren't
>>>> properly quiesced during a kexec, would be blocked until their drivers
>>>> have a chance to reinitialize the devices in the new kernel.
>>>>
>>>> I appreciate any help understanding why bypass mode is utilized here as
>>>> I'm sure there are nuances that I haven't considered. Thank you!
>>>
>>> I now see that Will has previously mentioned that he'd be open to such a
>>> change:
>>>
>>>    One thing I would be in favour of is changing the ->shutdown() code to
>>>    honour disable_bypass=1 so that we put the SMMU in an aborting state
>>>    instead of passthrough. Would that help at all? It would at least
>>>    avoid the memory corruption on missing shutdown callback.
>>>
>>>    - https://lore.kernel.org/linux-arm-kernel/20200608113852.GA3108@willie-the-truck/
>>>
>>> Robin mentions the need to support kexec into a non-SMMU-aware OS. I
>>> hadn't considered that bit of complexity:
>>>
>>>    ... consider if the first kernel *did* leave it enabled with whatever
>>>    left-over translations in place, and kexec'ed into another OS that
>>>    wasn't SMMU-aware...
>>>
>>>    - https://lore.kernel.org/linux-arm-kernel/e072f61a-d6cf-2e34-efd5-c1db38c5c622@arm.com/
>>>
>>> Now that we're 3-4 years removed from that conversation, has anything
>>> changed? Will, is there anything we'd need to watch out for if we were
>>> to prototype this sort of change? For example, would it be wise to
>>> disable fault interrupts when putting the SMMU in an aborting state
>>> before kexec'ing?
> 
> I've grown older and wiser in those four years and no longer think that's
> a good idea :) Well, older maybe, but the reality is that the code around
> the driver has evolved and 'disable_bypass' is even more of a hack now
> than it used to be.
> 
>> Fundamentally, we expect the SMMU to be disabled at initial boot, so per the
>> intent of kexec we put it back in that state. That also seems the most
>> likely expectation of anything we could kexec into, given that it is the
>> natural state of an untouched SMMU after a hard reset, and thus comes out as
>> the least-worst option.
> 
> Heh, that sounded too good to be true when I read it so I went and looked at
> the code:
> 
> SMMUv3: arm_smmu_device_shutdown() -> clears CR0 but doesn't touch GBPA
> SMMUv2: arm_smmu_device_shutdown() -> writes CLIENTPD to CR0
> 
> So it's a bit of a muddle afaict; SMMUv2 explicitly goes into bypass
> whereas SMMUv3 probably does honour disable_bypass=false! Did I miss
> something?

I think so, namely the utter madness around how and when we do actually 
touch GBPA - if we found SMMUEN set at the start of probe, then we set 
GBPA to abort before initially clearing SMMUEN; if the DT is broken then 
we set GBPA to bypass instead of enabling SMMUEN at the end of probe, 
*unless* disable_bypass was set. Thus by the time we get to shutdown, 
SMMUEN may or may not already be 0 and GPBA may or may not have been 
changed from its initial value to either one of bypass or abort.

> As discussed elsewhere, if we remove disable_bypass from SMMUv3, then
> we should be able to be consistent here. The question is, what's the
> right behaviour?

"Not that", at the very least ;)

In terms of the shutdown behaviour, I think it actually works out as-is. 
For the normal case we haven't touched GBPA, so we are truly returning 
to the boot-time condition; in the unexpected case where SMMUEN was 
already enabled then we'll go into an explicit GPBA abort state, but 
that seems a not-unreasonable compromise for not preserving the entire 
boot-time Stream Table etc., whose presence kind of implies it wouldn't 
have been bypassing everything anyway.

The more I look at the remaining aspect of disable_bypass for 
controlling broken-DT behaviour the more I suspect it can't actually be 
useful either way, especially not since default domains. I have no 
memory of what my original reasoning might have been, so I'm inclined to 
just rip that all out and let probe fail. I see no reason these days not 
to expect a broken DT to leads to a broken system, especially not now 
with DTSchema validation.

Then there's just the kdump warning it suppresses, of which I also have 
no idea why it's there either, but apparently that one's on you :P

Cheers,
Robin.