ath11k and vfio-pci support
Jeff Johnson
quic_jjohnson at quicinc.com
Tue Jan 16 13:10:56 PST 2024
On 1/16/2024 10:28 AM, Alex Williamson wrote:
> On Tue, 16 Jan 2024 11:41:19 +0100
> David Woodhouse <dwmw2 at infradead.org> wrote:
>
>> On Tue, 2024-01-16 at 18:08 +0800, Baochen Qiang wrote:
>>>
>>>
>>> On 1/16/2024 1:46 AM, Alex Williamson wrote:
>>>> On Sun, 14 Jan 2024 16:36:02 +0200
>>>> Kalle Valo <kvalo at kernel.org> wrote:
>>>>
>>>>> Baochen Qiang <quic_bqiang at quicinc.com> writes:
>>>>>
>>>>>>>> Strange that it still fails. Are you now seeing this error on your
>>>>>>>> host or in your Qemu? Or both?
>>>>>>>> Could you share your test steps? And if you can share, please be as
>>>>>>>> detailed as possible since I'm not familiar with passing WLAN
>>>>>>>> hardware to a VM using vfio-pci.
>>>>>>>
>>>>>>> Just in Qemu; the hardware works fine on my host machine.
>>>>>>> I basically follow this guide to set it up; it's written in the
>>>>>>> context of GPUs/libvirt but the host setup is exactly the same. By
>>>>>>> no means do you need to read it all; once you set the vfio-pci.ids
>>>>>>> and see your unclaimed adapter you can stop:
>>>>>>> https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF
>>>>>>> In short you should be able to set the following host kernel options
>>>>>>> and reboot (assuming your motherboard/hardware is compatible):
>>>>>>> intel_iommu=on iommu=pt vfio-pci.ids=17cb:1103
>>>>>>> Obviously change the device/vendor IDs to whatever ath11k hw you
>>>>>>> have. Once the host is rebooted you should see your wlan adapter as
>>>>>>> UNCLAIMED, showing the driver in use as vfio-pci. If not, it's likely
>>>>>>> your motherboard just isn't compatible; the device has to be in its
>>>>>>> own IOMMU group (you could try switching PCI ports if this is the
>>>>>>> case).
>>>>>>> I then build a "kvm_guest.config" kernel with the driver/firmware
>>>>>>> for ath11k and boot into that with the following Qemu options:
>>>>>>> -enable-kvm -device vfio-pci,host=<PCI address>
>>>>>>> If it seems easier you could also utilize IWD's test-runner, which
>>>>>>> handles launching the Qemu kernel automatically, detects any
>>>>>>> vfio devices, passes them through, and mounts some useful host
>>>>>>> folders into the VM. It's actually a very good general purpose tool
>>>>>>> for kernel testing, not just for IWD:
>>>>>>> https://git.kernel.org/pub/scm/network/wireless/iwd.git/tree/doc/test-runner.txt
>>>>>>> Once set up you can just run test-runner with a few flags and you'll
>>>>>>> boot into a shell:
>>>>>>> ./tools/test-runner -k <kernel-image> --hw --start /bin/bash
>>>>>>> Please reach out if you have questions, thanks for looking into
>>>>>>> this.
>>>>>>
>>>>>> Thanks for these details. I reproduced this issue by following your guide.
>>>>>>
>>>>>> Seems the root cause is that the MSI vector assigned to WCN6855 in
>>>>>> qemu is different from the one in the host. In my case the MSI vector
>>>>>> in qemu is [Address: fee00000 Data: 0020] while in the host it is
>>>>>> [Address: fee00578 Data: 0000]. So in qemu ath11k configures the MSI
>>>>>> vector [Address: fee00000 Data: 0020] on the WCN6855
>>>>>> hardware/firmware, and the firmware uses that vector to fire
>>>>>> interrupts at the host/qemu. However the host IOMMU doesn't know that
>>>>>> vector, because the real vector is [Address: fee00578 Data: 0000]; as
>>>>>> a result the host blocks that interrupt and reports an error, see the
>>>>>> log below:
>>>>>>
>>>>>> [ 1414.206069] DMAR: DRHD: handling fault status reg 2
>>>>>> [ 1414.206081] DMAR: [INTR-REMAP] Request device [02:00.0] fault index 0x0 [fault reason 0x25] Blocked a compatibility format interrupt request
>>>>>> [ 1414.210334] DMAR: DRHD: handling fault status reg 2
>>>>>> [ 1414.210342] DMAR: [INTR-REMAP] Request device [02:00.0] fault index 0x0 [fault reason 0x25] Blocked a compatibility format interrupt request
>>>>>> [ 1414.212496] DMAR: DRHD: handling fault status reg 2
>>>>>> [ 1414.212503] DMAR: [INTR-REMAP] Request device [02:00.0] fault index 0x0 [fault reason 0x25] Blocked a compatibility format interrupt request
>>>>>> [ 1414.214600] DMAR: DRHD: handling fault status reg 2
>>>>>>
>>>>>> While I don't think there is a way for qemu/ath11k to get the real
>>>>>> MSI vector from the host, I will try to read the vfio code to check
>>>>>> further. Before that, to unblock you, a possible hack is to hard-code
>>>>>> the MSI vector in qemu to the same value as in the host, provided the
>>>>>> MSI vector doesn't change.
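>>>>>>
>>>>>> For context, this is roughly how the driver learns the vector in the
>>>>>> first place: it reads back the standard MSI capability that the
>>>>>> (guest) kernel just programmed, so inside a VM it naturally sees the
>>>>>> virtualized values. A minimal sketch modelled on the ath11k pci code
>>>>>> (the function name here is simplified):
>>>>>>
>>>>>>   #include <linux/pci.h>
>>>>>>
>>>>>>   /* Read the MSI address back from the standard capability. In a
>>>>>>    * VM, config space is emulated, so this returns the *guest*
>>>>>>    * address (e.g. fee00000), not what the host programmed.
>>>>>>    */
>>>>>>   static void wlan_pci_get_msi_address(struct pci_dev *pdev,
>>>>>>                                        u32 *addr_lo, u32 *addr_hi)
>>>>>>   {
>>>>>>           pci_read_config_dword(pdev,
>>>>>>                                 pdev->msi_cap + PCI_MSI_ADDRESS_LO,
>>>>>>                                 addr_lo);
>>>>>>           pci_read_config_dword(pdev,
>>>>>>                                 pdev->msi_cap + PCI_MSI_ADDRESS_HI,
>>>>>>                                 addr_hi);
>>>>>>   }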
>>>>>
>>>>> Baochen, awesome that you were able to debug this further. Now at
>>>>> least we know what the problem is.
>>>>
>>>> It's an interesting problem, I don't think we've seen another device
>>>> where the driver reads the MSI register in order to program another
>>>> hardware entity to match the MSI address and data configuration.
>>>>
>>>> When assigning a device, the host and guest use entirely separate
>>>> address spaces for MSI interrupts. When the guest enables MSI, the
>>>> operation is trapped by the VMM and triggers an ioctl to the host to
>>>> perform an equivalent configuration. Generally the physical device
>>>> will interrupt within the host, where it may be directly attached to
>>>> KVM to signal the interrupt, be routed through the VMM, or, where
>>>> virtualization hardware supports it, directly trigger the vCPU. From
>>>> the VM perspective, the guest address/data pair is used
>>>> to signal the interrupt, which is why it makes sense to virtualize the
>>>> MSI registers.
>>>
>>> Hi Alex, could you elaborate? Why is MSI virtualization necessary from
>>> the VM perspective?
>>
>> An MSI is just a write to physical memory space. You can even use it
>> like that; configure the device to just write 4 bytes to some address
>> in a struct in memory to show that it needs attention, and you then
>> poll that memory.
>>
>> But mostly we don't (ab)use it like that, of course. We tell the device
>> to write to a special range of the physical address space where the
>> interrupt controller lives — the range from 0xfee00000 to 0xfeefffff.
>> The low 20 bits of the address, and the 32 bits of data written to that
>> address, tell the interrupt controller which CPU to interrupt, and
>> which vector to raise on the CPU (as well as some other details and
>> weird interrupt modes which are theoretically encodable).
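>>
>> To make that concrete, here's a quick user-space sketch that decodes
>> the two interesting fields of a compatibility-format MSI (layout per
>> the Intel SDM; the other mode bits are ignored, and the function is
>> purely illustrative):
>>
>>   #include <stdint.h>
>>   #include <stdio.h>
>>
>>   static void decode_compat_msi(uint32_t addr, uint32_t data)
>>   {
>>           uint32_t dest   = (addr >> 12) & 0xff; /* address bits 19:12 */
>>           uint32_t vector = data & 0xff;         /* data bits 7:0 */
>>
>>           printf("APIC ID %u, vector 0x%02x\n", dest, vector);
>>   }
>>
>>   int main(void)
>>   {
>>           /* the guest value from the report above */
>>           decode_compat_msi(0xfee00000, 0x0020); /* APIC ID 0, 0x20 */
>>           return 0;
>>   }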
>>
>> So in your example, the guest writes [Address: fee00000 Data: 0020]
>> which means it wants vector 0x20 on CPU#0 (well, the CPU with APICID
>> 0). But that's what the *guest* wants. If we just blindly programmed
>> that into the hardware, the hardware would deliver vector 0x20 to the
>> host's CPU0... which would be very confused by it.
>>
>> The host has a driver for that device, probably the VFIO driver. The
>> host registers its own interrupt handlers for the real hardware and
>> decides which *host* CPU (and vector) should be notified when something
>> happens. When that happens, the VFIO driver will raise an event on an
>> eventfd, which will notify QEMU to inject the appropriate interrupt
>> into the guest.
>>
>> So... when the guest enables the MSI, that's trapped by QEMU which
>> remembers which *guest* CPU/vector the interrupt should go to. QEMU
>> tells VFIO to enable the corresponding interrupt, and what gets
>> programmed into the actual hardware is up to the *host* operating
>> system; nothing to do with the guest's information at all.
>>
>> Then when the actual hardware raises the interrupt, the VFIO interrupt
>> handler runs in the guest, signals an event on the eventfd, and QEMU
>
> s/guest/host/
>
>> receives that and injects the event into the appropriate guest vCPU.
>>
>> (In practice QEMU doesn't do it these days; there's actually a shortcut
>> which improves latency by allowing the kernel to deliver the event to
>> the guest directly, connecting the eventfd directly to the KVM irq
>> routing table.)
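>>
>> In VFIO terms, the "tells VFIO to enable the corresponding interrupt"
>> step is the VFIO_DEVICE_SET_IRQS ioctl. A minimal user-space sketch
>> (error handling omitted; device_fd is an already-open vfio device fd):
>>
>>   #include <stdlib.h>
>>   #include <string.h>
>>   #include <sys/eventfd.h>
>>   #include <sys/ioctl.h>
>>   #include <linux/vfio.h>
>>
>>   static int enable_msi_trigger(int device_fd)
>>   {
>>           int efd = eventfd(0, EFD_CLOEXEC);
>>           size_t sz = sizeof(struct vfio_irq_set) + sizeof(int);
>>           struct vfio_irq_set *irq_set = malloc(sz);
>>
>>           irq_set->argsz = sz;
>>           irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
>>                            VFIO_IRQ_SET_ACTION_TRIGGER;
>>           irq_set->index = VFIO_PCI_MSI_IRQ_INDEX;
>>           irq_set->start = 0;
>>           irq_set->count = 1;           /* one MSI vector */
>>           memcpy(irq_set->data, &efd, sizeof(int));
>>
>>           /* The host kernel picks the physical address/data pair;
>>            * the guest's values never reach the hardware. */
>>           ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
>>           free(irq_set);
>>           return efd;   /* poll it, or hand it to KVM as an irqfd */
>>   }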
>>
>>
>> Interrupt remapping is probably not important here, but I'll explain it
>> briefly anyway. With interrupt remapping, the IOMMU handles the
>> 'memory' write from the device, just as it handles all other memory
>> transactions. One of the reasons for interrupt remapping is that the
>> original definitions of the bits in the MSI (the low 20 bits of the
>> address and the 32 bits of what's written) only had 8 bits for the
>> target CPU APICID. And we have bigger systems than that now.
>>
>> So by using one of the spare bits in the MSI message, we can indicate
>> that this isn't just a directly-encoded cpu/vector in "Compatibility
>> Format", but is a "Remappable Format" interrupt. Instead of the
>> cpu/vector it just contains an index into the IOMMU's Interrupt
>> Redirection Table, which *does* have a full 32 bits for the target APIC
>> ID. That's why x2apic support (which gives us support for >254 CPUs)
>> depends on interrupt remapping.
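>>
>> With that layout, the two values from this thread click into place.
>> A sketch using names in the style of the old arch/x86 msidef.h
>> (x86/VT-d specific, simplified; bit 2 carrying index[15] is omitted):
>>
>>   #define MSI_ADDR_DEST_ID(a)   (((a) >> 12) & 0xff)  /* compat fmt */
>>   #define MSI_ADDR_IR_FORMAT    (1u << 4)  /* 1 = Remappable Format */
>>   #define MSI_ADDR_IR_SHV       (1u << 3)  /* subhandle (data) valid */
>>   #define MSI_ADDR_IR_INDEX(a)  (((a) >> 5) & 0x7fff) /* IRTE index */
>>
>>   /* Guest 0xfee00000: bit 4 clear, Compatibility Format;
>>    *   APIC ID 0, vector in the data (0x20).
>>    * Host  0xfee00578: bit 4 set, Remappable Format;
>>    *   IRTE index 0x2b, and the real cpu/vector live in the
>>    *   IOMMU's Interrupt Redirection Table.
>>    */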
>>
>> The other thing that the IOMMU can do in modern systems is *posted*
>> interrupts, where the entry in the IOMMU's IRT doesn't just specify the
>> host's CPU/vector, but actually specifies a *vCPU* to deliver the
>> interrupt to.
>>
>> All of which is mostly irrelevant as it's just another bypass
>> optimisation to improve latency. The key here is that what the guest
>> writes to its emulated MSI table and what the host writes to the real
>> hardware are not at all related.
>>
>> If we had had this posted interrupt support from the beginning, perhaps
>> we could have had a much simpler model — we just let the guest write
>> its intended (v)CPU#/vector *directly* to the MSI table in the device,
>> and let the IOMMU fix it up by having a table pointing to the
>> appropriate set of vCPUs. But that isn't how it happened. The model we
>> have is that the VMM has to *emulate* the config space and handle the
>> interrupts as described above.
>>
>> This means that whenever a device has a non-standard way of configuring
>> MSIs, the VMM has to understand and intercept that. I believe we've
>> even seen some Atheros devices with the MSI target in some weird MMIO
>> registers instead of the standard location, so we've had to hack QEMU
>> to handle those too?
>>
>>> And, maybe a stupid question: is it possible for VM/KVM or vfio to
>>> virtualize only write operations to the MSI register but leave read
>>> operations un-virtualized? I am asking because that way ath11k might
>>> get a chance to run in a VM after reading the real vector.
>>
>> That might confuse a number of operating systems. Especially if they
>> mask/unmask by reading the register, flipping the mask bit and writing
>> back again.
>>
>> How exactly is the content of this register then given back to the
>> firmware? Is that communication snoopable by the VMM?
>>
>>
>>>>
>>>> Off hand I don't have a good solution for this; the hardware is
>>>> essentially imposing a unique requirement for MSI programming: the
>>>> driver needs visibility of the physical MSI address and data.
>>>>
>>
>> Strictly, the driver doesn't need visibility of the actual values used
>> by the hardware. Another way of looking at it would be to say that the
>> driver programs the MSI through this non-standard method; it just needs
>> the VMM to trap and handle that, just as the VMM does for the standard
>> MSI table.
>>
>> Which is what I thought we'd already seen on some Atheros devices.
>>
>>>> It's
>>>> conceivable that device specific code could either make the physical
>>>> address/data pair visible to the VM or trap the firmware programming to
>>>> inject the correct physical values. Is there somewhere other than the
>>>> standard MSI capability in config space that the driver could learn the
>>>> physical values, ie. somewhere that isn't virtualized? Thanks,
>>>
>>> I don't think we have such capability in configuration space.
>>
>> Configuration space is a complete fiction though; it's all emulated. We
>> can do anything we like. Or we can have a PV hypercall which will
>> report it. I don't know that we'd *want* to, but all things are
>> possible.
>
> RTL8169 has a back door to the MSI-X vector table, maybe that's the one
> you're thinking of. Alternate methods for the driver to access config
> space are common on GPUs, presumably because they require extensive
> vBIOS support, and IO port and MMIO windows through which pre-boot code
> can interact with config space are faster and easier than standard
> config accesses. Much of the work of assigning a GPU to a VM is to
> wrap those alternate methods in virtualization to keep the driver
> working within the guest address space.
>
> The fictitious config space was my thought too: an ath11k vfio-pci
> variant driver could insert a vendor-defined capability into config
> space to expose the physical MSI address/data. The driver would know
> by the presence of the capability that it's running in a VM and would
> prefer that mechanism for retrieving the MSI address and data.
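>
> Purely to illustrate the shape such a capability could take (this is
> hypothetical, no such capability exists today; 0x09 is the standard
> vendor-specific capability ID, and everything after the three-byte
> header is vendor-defined):
>
>   #include <linux/types.h>
>
>   /* Hypothetical vendor-defined capability a vfio-pci variant
>    * driver could synthesize in the guest's config space. */
>   struct ath11k_msi_vcap {
>           u8  cap_id;       /* PCI_CAP_ID_VNDR (0x09) */
>           u8  cap_next;     /* next capability pointer */
>           u8  cap_len;      /* total length of this capability */
>           u8  reserved;
>           u32 msi_addr_lo;  /* physical MSI address, low 32 bits */
>           u32 msi_addr_hi;  /* physical MSI address, high 32 bits */
>           u32 msi_data;     /* physical MSI data */
>   } __packed;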
>
> Alternatively as also suggested here, if programming of the firmware
> with the MSI address/data is something that a hypervisor could trap,
> then we might be able to make it transparent to the guest. For example
> if it were programmed via MMIO, the guest address/data values could be
> auto-magically replaced with physical values. Since QEMU doesn't know
> the physical values, this would also likely be through a device
> specific extension to vfio-pci through a variant driver, or maybe some
> combination of variant driver and QEMU if we need to make trapping
> conditional in order to avoid a performance penalty.
>
> This is essentially device specific interrupt programming, which either
> needs to be virtualized (performed by the VMM) or paravirtualized
> (performed in cooperation with the guest). This is also something to
> keep in mind relative to the initial source of this issue, ie. testing
> device drivers and hardware under device assignment. There can be
> subtle differences. Thanks,
>
> Alex
>
>
+ kernel at quicinc.com for added visibility and advice
Full thread:
<https://lore.kernel.org/all/adcb785e-4dc7-4c4a-b341-d53b72e13467@gmail.com/>