Channel: Ubuntu Forums - Virtualisation

GPU passthrough quality of life improvements

Ubuntu 16.04.1 Server KVM
Code:

Linux vm03 4.4.0-34-generic #53 SMP Sun Aug 21 22:05:05 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
guest running 16.04.1 Desktop VM
Code:

Linux u1604dt05 4.7.0-040700rc6-generic #201607040332 SMP Mon Jul 4 07:34:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
I've been running two independent GPUs, with one of them passed through to a guest, for a while now. It runs nicely, but I'm having some issues and I'm having trouble capturing them in the logs and figuring them out.

Issue #1
I have the ACS patch applied to my kernel, but it seems to revert on each reboot, which bunches up my IOMMU groups. To fix this I have to re-run dpkg -i on the patched kernel and reboot. Basically, on boot I check my IOMMU groups:
Code:

find /sys/kernel/iommu_groups/ -type l
And every time I boot they're wrong until I reinstall the kernel and reboot. How do I make the patched version permanent?
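
A minimal sketch for comparing boots, not a fix. It prints the same symlinks grouped by IOMMU group number, so the output from a good boot and a bad boot can be saved and compared with diff. The function name list_iommu_groups and its optional root-directory argument are illustrative; only the /sys/kernel/iommu_groups layout comes from the command above.

```shell
#!/bin/bash
# Print one "group <n> <pci-address>" line per device, sorted by group number.
# The optional argument is the sysfs root; it defaults to the real location.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local link group
    root="${root%/}"    # tolerate a trailing slash
    find "$root" -type l | while read -r link; do
        # Path layout: <root>/<group>/devices/<pci-address>
        group="${link#"$root"/}"
        group="${group%%/*}"
        printf 'group %s %s\n' "$group" "${link##*/}"
    done | sort -k2,2n -k3,3
}

# Example (run on the host after each boot):
#   uname -r; cat /proc/cmdline; list_iommu_groups
```

Saving this next to the output of uname -r and /proc/cmdline would show whether a different kernel image booted or only the boot options changed. If the patch in use is the common out-of-tree ACS override patch, it normally also needs a pcie_acs_override= option on the kernel command line; which patch is applied here is an assumption.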

Issue #2
When I pass through the keyboard and mouse from the hypervisor, I have to unplug them and plug them back in. syslog on the hypervisor seems to indicate this is an AppArmor issue? I thought I had read that this had something to do with the USB devices being initialized before KVM starts.
Code:

Dec 10 14:26:34 vm03 kernel: [  135.400479] audit: type=1400 audit(1481397994.380:23): apparmor="DENIED" operation="open" profile="libvirt-2f647e4b-aff2-4228-bbcd-f0484fe2bf77" name="/run/udev/data/c189:1" pid=1631 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=111 ouid=0
Again after unplug/replug:
Code:

Dec 10 14:26:53 vm03 kernel: [  154.476791] audit: type=1400 audit(1481398013.457:61): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-2f647e4b-aff2-4228-bbcd-f0484fe2bf77//qemu_bridge_helper" pid=1827 comm="apparmor_parser"
My attach script if it matters:
Code:

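#!/bin/bash
# Re-attach the USB mouse and keyboard to the guest u1604dt05.
# dev.xml is a template in which VID and PID are replaced per device.

#mouse
# lsusb prints e.g. "Bus 003 Device 007: ID 045e:075c Microsoft Corp."
# After replacing ':' with spaces, field 6 is the vendor ID, field 7 the product ID.
MVID=$(lsusb | sed 's/:/ /g' | gawk '/075c/ {print $6}')
MPID=$(lsusb | sed 's/:/ /g' | gawk '/075c/ {print $7}')

echo "mouse using $MVID $MPID"

# Write the generated XML to the same path that virsh reads below.
sed 's/VID/'"$MVID"'/;s/PID/'"$MPID"'/' ./dev.xml > /root/m.xml

# Detach, then attach. The detach reports an error if the device is not attached.
virsh detach-device u1604dt05 /root/m.xml
virsh attach-device u1604dt05 /root/m.xml

#kb
# lsusb prints e.g. "Bus 003 Device 006: ID 046d:c31c Logitech, Inc. Keyboard K120"
KVID=$(lsusb | sed 's/:/ /g' | gawk '/c31c/ {print $6}')
KPID=$(lsusb | sed 's/:/ /g' | gawk '/c31c/ {print $7}')

echo "keyboard using $KVID $KPID"

sed 's/VID/'"$KVID"'/;s/PID/'"$KPID"'/' ./dev.xml > /root/kb.xml

virsh detach-device u1604dt05 /root/kb.xml
virsh attach-device u1604dt05 /root/kb.xml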

How do I permit this to work the first time?
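
A rough sketch for seeing exactly what AppArmor is refusing, assuming the audit lines keep the format shown above. It reads kernel log lines on stdin and prints each distinct profile, denied path, and requested mask. The function name denied_paths is illustrative.

```shell
#!/bin/bash
# Read syslog/kern.log lines on stdin and print "<profile> <path> <mask>"
# once for every distinct AppArmor denial found.
denied_paths() {
    grep 'apparmor="DENIED"' |
        sed -n 's/.*profile="\([^"]*\)".* name="\([^"]*\)".*requested_mask="\([^"]*\)".*/\1 \2 \3/p' |
        sort -u
}

# Example (run on the host):
#   denied_paths < /var/log/syslog
```

For the denial logged above this prints the libvirt profile, /run/udev/data/c189:1, and r. Character major 189 is the one Linux uses for USB device nodes, so the denied read looks like qemu querying the udev database entry for a USB device; whether that denial is what actually blocks the first attach is not something the log alone confirms.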

Issue #3
My GPU occasionally (twice a week?) crashes. I can restart the VM once; if it crashes again, I am unable to restart the VM a second time. I'm unable to determine the cause from the logs on either the guest or the host.

Per:
Code:

lspci -vvvs 2:00.0
lspci -vvvs 2:00.1

Normally my GPU looks like this (from KVM):
Code:

02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev cf) (prog-if 00 [VGA controller])
        Subsystem: PC Partner Limited / Sapphire Technology Device e344
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 33
        Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at d000 [size=256]
        Region 5: Memory at d0200000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at d0240000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap:        MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl:        Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta:        CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap:        Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl:        ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta:        Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                        Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                        Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                        EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee003d8  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:        DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:        DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt:        DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:        RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:        RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap:        First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [200 v1] #15
        Capabilities: [270 v1] #19
        Capabilities: [2b0 v1] Address Translation Service (ATS)
                ATSCap:        Invalidate Queue Depth: 00
                ATSCtl:        Enable-, Smallest Translation Unit: 00
        Capabilities: [2c0 v1] #13
        Capabilities: [2d0 v1] #1b
        Capabilities: [320 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap:        MFVC- ACS-, Next Function: 1
                ARICtl:        MFVC- ACS-, Function Group: 0
        Capabilities: [370 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
        Kernel driver in use: vfio-pci

02:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae0
        Subsystem: PC Partner Limited / Sapphire Technology Device aae0
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin B routed to IRQ 5
        Region 0: Memory at d0260000 (64-bit, non-prefetchable) [disabled] [size=16K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap:        MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl:        Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta:        CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap:        Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl:        ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta:        Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                        EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
                Address: 0000000000000000  Data: 0000
        Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
        Capabilities: [150 v2] Advanced Error Reporting
                UESta:        DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:        DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt:        DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:        RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
                CEMsk:        RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap:        First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
        Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap:        MFVC- ACS-, Next Function: 0
                ARICtl:        MFVC- ACS-, Function Group: 0
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

And like this from the guest:
Code:

lspci -vvvs 00:02.0
00:02.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev cf) (prog-if 00 [VGA controller])
        Subsystem: PC Partner Limited / Sapphire Technology Device e344
        Physical Slot: 2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 29
        Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
        Region 4: I/O ports at c000 [size=256]
        Region 5: Memory at d0200000 (32-bit, non-prefetchable) [size=256K]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [48] Vendor Specific Information: Len=08 <?>
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
                DevCap:        MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl:        Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta:        CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap:        Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl:        ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta:        Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
                        EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
        Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
                Address: 00000000fee03000  Data: 4022
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu

After the second crash it looks like this:
Code:

02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci

02:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae0 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

I have no idea what it looks like from the guest because the machine has hung. In the logs I see this a few times:
Code:

Dec 10 14:21:36 vm03 kernel: [141155.373612] vfio-pci 0000:02:00.0: Refused to change power state, currently in D3
This a couple dozen times:
Code:

Dec 10 14:21:37 vm03 kernel: [141156.223634] vfio_ecap_init: 0000:02:00.0 hiding ecap 0xffff@0xffc
This a couple hundred times:
Code:

Dec 10 14:21:36 vm03 kernel: [141155.373612] vfio-pci 0000:02:00.0: Refused to change power state, currently in D3
And then, after the crash, this every time I try (and fail) to restart the guest VM:
Code:

Dec 10 14:21:38 vm03 kernel: [141156.957651] vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
However, I have no idea how to capture info about the actual crash event. All the logs seem to show nothing at that moment. It seems to occur around these USB messages, but I have no idea if they're related:
Code:

Dec 10 14:20:11 vm03 kernel: [141070.630621] usb 3-14: ep 0x81 - rounding interval to 64 microframes, ep desc says 80 microframes
Dec 10 14:20:11 vm03 kernel: [141070.630624] usb 3-14: ep 0x82 - rounding interval to 1024 microframes, ep desc says 2040 microframes
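
One possible way to at least timestamp the crash, sketched under the assumption that the failed state always shows up in lspci as in the post-crash output above ("rev ff" and "Unknown header type 7f", which is what lspci reports when the device's config space reads back as all ones). The function names gpu_state and watch_gpu and the 10 second default interval are illustrative.

```shell
#!/bin/bash
# Classify `lspci -vvv` output read from stdin.
# Prints "unreachable" when the device shows the all-ones symptoms, else "ok".
gpu_state() {
    if grep -q -e 'Unknown header type 7f' -e '(rev ff)'; then
        echo unreachable
    else
        echo ok
    fi
}

# Poll the device and print a timestamped line on every state change.
# Arguments: PCI address (e.g. 02:00.0) and poll interval in seconds (default 10).
watch_gpu() {
    local dev="$1" interval="${2:-10}" last="" now
    while :; do
        now=$(lspci -vvv -s "$dev" | gpu_state)
        if [ "$now" != "$last" ]; then
            echo "$(date '+%b %d %H:%M:%S') $dev is $now"
            last="$now"
        fi
        sleep "$interval"
    done
}

# Example (run as root on the host, left running in the background):
#   watch_gpu 02:00.0 >> /root/gpu-state.log &
```

The timestamp format matches syslog, so the moment the device drops out can be lined up against the vfio-pci and USB messages to see which came first. This only narrows down when the failure happens, not why.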

Any observations / suggestions / feedback welcome. None of the above is really a show stopper, but it's *nix and not Windows, so rebooting the hypervisor every couple of days feels wrong. :P
