Host: Ubuntu 16.04.1 Server running KVM
Guest: Ubuntu 16.04.1 Desktop VM
I've been running two independent GPUs, with one of them passed through to a guest, for a while now. This runs nicely, but I'm having some issues, and I'm having trouble capturing them in the logs and figuring them out.
Issue #1
I have the ACS patch applied to my kernel, but it seems to revert on each reboot, which bunches up my IOMMU groups. To fix this I have to reinstall the patched kernel with dpkg -i and reboot again. On every boot I check my IOMMU groups:
And every time I boot they're wrong until I reinstall the kernel and reboot. How do I make the patched version permanent?
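One way to narrow this down is to record which kernel actually booted and whether the override option was passed to it. The sketch below is only a diagnostic and rests on an assumption: that the patch in question is the commonly circulated ACS override patch, which is switched on with a `pcie_acs_override=` boot parameter. The function name `acs_override_requested` is made up for illustration.

```shell
#!/bin/bash
# Return 0 if the given kernel command line carries the pcie_acs_override
# option. Assumption: the ACS patch in use is the one enabled by that option.
acs_override_requested() {
    local cmdline="$1"
    [[ " $cmdline " == *" pcie_acs_override="* ]]
}

# Kernel release and build string of the kernel that is running right now.
echo "running kernel: $(uname -r)"
echo "build string:   $(uname -v)"

if [ -r /proc/cmdline ] && acs_override_requested "$(cat /proc/cmdline)"; then
    echo "pcie_acs_override is on the kernel command line"
else
    echo "pcie_acs_override is NOT on the kernel command line"
fi
```

Comparing this output from a "bad" boot with the output right after reinstalling the patched package would show whether a different kernel image or a different command line is being used.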
Issue #2
When I pass through the keyboard and mouse from the hypervisor, I have to unplug them and plug them back in. syslog on the hypervisor seems to indicate this is an AppArmor issue? I thought I had read that this had something to do with the USB devices being initialized before KVM starts.
Again after unplug/replug:
My attach script if it matters:
How do I get this to work the first time?
Issue #3
My GPU occasionally (twice a week?) crashes. I can restart the VM once, but if it crashes again I am unable to restart the VM a second time. I'm unable to determine the cause from the logs on either the guest or the host.
Per:
Normally my GPU looks like this (from KVM):
And like this from the guest:
After the second crash it looks like this:
I have no idea what it looks like from the guest because the machine has hung. In the logs I see this a few times:
This a couple dozen times:
This a couple hundred times:
and then after the crash, this every time I try (and fail) to restart the guest vm:
However, I have no idea how to capture info about the actual crash event. All the logs seem to show nothing at that moment. It seems to occur around these USB messages, but I have no idea if they're related:
Any observations / suggestions / feedback welcome. None of the above are really show stoppers, but it's *nix and not Windows, so rebooting the hypervisor every couple of days feels wrong. :P
Code:
Linux vm03 4.4.0-34-generic #53 SMP Sun Aug 21 22:05:05 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Code:
Linux u1604dt05 4.7.0-040700rc6-generic #201607040332 SMP Mon Jul 4 07:34:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Issue #1
Code:
find /sys/kernel/iommu_groups/ -type l
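The raw `find` output is hard to compare between boots. A minimal sketch that turns the same symlinks into sorted "group N: device" lines, so a good boot and a bad boot can be diffed. The function name `list_iommu_groups` and its optional root-directory argument are made up for illustration.

```shell
#!/bin/bash
# Print one "group <n>: <pci address>" line per device, sorted by group number.
# The optional first argument overrides the sysfs root (handy for testing).
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    find "$root" -type l |
        sed -E 's|.*/([0-9]+)/devices/(.*)$|group \1: \2|' |
        sort -k2,2n -k3,3
}

# Only query the live system when the sysfs directory exists.
if [ -d /sys/kernel/iommu_groups ]; then
    list_iommu_groups
fi
```

Saving the output after each boot (for example `list_iommu_groups > /root/iommu-$(uname -r)-$(date +%s).txt`) would give something concrete to diff.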
Issue #2
Code:
Dec 10 14:26:34 vm03 kernel: [ 135.400479] audit: type=1400 audit(1481397994.380:23): apparmor="DENIED" operation="open" profile="libvirt-2f647e4b-aff2-4228-bbcd-f0484fe2bf77" name="/run/udev/data/c189:1" pid=1631 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=111 ouid=0
Code:
Dec 10 14:26:53 vm03 kernel: [ 154.476791] audit: type=1400 audit(1481398013.457:61): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-2f647e4b-aff2-4228-bbcd-f0484fe2bf77//qemu_bridge_helper" pid=1827 comm="apparmor_parser"
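To see exactly which paths AppArmor refuses, the audit lines can be boiled down to profile, operation, path and denied mask. This is a rough sketch that only assumes the log format shown in the lines above; `summarize_denials` is a made-up name.

```shell
#!/bin/bash
# Read syslog lines on stdin and print the unique AppArmor denials as
# "<profile> <operation> <path> <denied_mask>".
summarize_denials() {
    grep 'apparmor="DENIED"' |
        sed -E 's/.*operation="([^"]*)".*profile="([^"]*)".*name="([^"]*)".*denied_mask="([^"]*)".*/\2 \1 \3 \4/' |
        sort -u
}

# Run against the host syslog when it is readable.
if [ -r /var/log/syslog ]; then
    summarize_denials < /var/log/syslog
fi
```

For the first log line above this prints the libvirt profile name followed by `open /run/udev/data/c189:1 r`, i.e. a denied read of a udev data file.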
Code:
#!/bin/bash
# Re-attach the USB mouse and keyboard to the guest u1604dt05.
# dev.xml is a template containing the placeholders VID and PID.
# Absolute paths are used throughout so the script works from any directory.

#mouse
# Split "ID vvvv:pppp" on the colon so vendor and product become fields 6 and 7.
MVID=$(lsusb | sed 's/:/ /g' | gawk '/075c/ {print $6}')
MPID=$(lsusb | sed 's/:/ /g' | gawk '/075c/ {print $7}')
echo "mouse using $MVID $MPID"
sed 's/VID/'"$MVID"'/;s/PID/'"$MPID"'/' /root/dev.xml > /root/m.xml
virsh detach-device u1604dt05 /root/m.xml
virsh attach-device u1604dt05 /root/m.xml

#kb
KVID=$(lsusb | sed 's/:/ /g' | gawk '/c31c/ {print $6}')
KPID=$(lsusb | sed 's/:/ /g' | gawk '/c31c/ {print $7}')
echo "keyboard using $KVID $KPID"
sed 's/VID/'"$KVID"'/;s/PID/'"$KPID"'/' /root/dev.xml > /root/kb.xml
virsh detach-device u1604dt05 /root/kb.xml
virsh attach-device u1604dt05 /root/kb.xml

# lsusb output for the two devices, for reference:
#Bus 003 Device 007: ID 045e:075c Microsoft Corp.
#Bus 003 Device 006: ID 046d:c31c Logitech, Inc. Keyboard K120
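The dev.xml template itself is not shown in this post. For readers following along, here is a guess at what the script ends up feeding to virsh, assuming the template is a standard libvirt USB hostdev element with `0xVID` and `0xPID` placeholders. Both the layout and the helper name `make_usb_hostdev_xml` are assumptions, not the actual file.

```shell
#!/bin/bash
# Hypothetical helper: print a libvirt USB <hostdev> element for one device.
# The real dev.xml is not shown in the post; this layout is an assumption.
make_usb_hostdev_xml() {
    local vid="$1" pid="$2"
    cat <<EOF
<hostdev mode='subsystem' type='usb' managed='yes'>
  <source>
    <vendor id='0x${vid}'/>
    <product id='0x${pid}'/>
  </source>
</hostdev>
EOF
}

# Example: the Logitech K120 keyboard from the lsusb output (046d:c31c).
make_usb_hostdev_xml 046d c31c
```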
Issue #3
Code:
lspci -vvvs 2:00.0
lspci -vvvs 2:00.1
Code:
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev cf) (prog-if 00 [VGA controller])
Subsystem: PC Partner Limited / Sapphire Technology Device e344
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 33
Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at d000 [size=256]
Region 5: Memory at d0200000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at d0240000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee003d8 Data: 0000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [200 v1] #15
Capabilities: [270 v1] #19
Capabilities: [2b0 v1] Address Translation Service (ATS)
ATSCap: Invalidate Queue Depth: 00
ATSCtl: Enable-, Smallest Translation Unit: 00
Capabilities: [2c0 v1] #13
Capabilities: [2d0 v1] #1b
Capabilities: [320 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [370 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=0us PortTPowerOnTime=170us
Kernel driver in use: vfio-pci
02:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae0
Subsystem: PC Partner Limited / Sapphire Technology Device aae0
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin B routed to IRQ 5
Region 0: Memory at d0260000 (64-bit, non-prefetchable) [disabled] [size=16K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [328 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
Code:
lspci -vvvs 00:02.0
00:02.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev cf) (prog-if 00 [VGA controller])
Subsystem: PC Partner Limited / Sapphire Technology Device e344
Physical Slot: 2
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 29
Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
Region 4: I/O ports at c000 [size=256]
Region 5: Memory at d0200000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: [48] Vendor Specific Information: Len=08 <?>
Capabilities: [50] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #1, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L0s <64ns, L1 <1us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR+, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee03000 Data: 4022
Kernel driver in use: amdgpu
Kernel modules: amdgpu
Code:
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: vfio-pci
02:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae0 (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
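The `rev ff` and `Unknown header type 7f` markers usually mean that reads of the device's PCI config space return all ones, i.e. the card is no longer answering. A small sketch that checks lspci output for those markers, so the state can be detected (and timestamped) from a cron job or before starting the guest. `device_unresponsive` is a made-up name and the check is only a heuristic.

```shell
#!/bin/bash
# Return 0 if lspci output on stdin contains the markers of a device whose
# config space reads back as all ones ("rev ff" / "Unknown header type 7f").
device_unresponsive() {
    grep -Eq '\(rev ff\)|Unknown header type 7f'
}

# Check the passed-through GPU at 02:00.0; skipped when lspci is not installed.
if command -v lspci >/dev/null 2>&1; then
    if lspci -v -s 02:00.0 2>/dev/null | device_unresponsive; then
        echo "$(date): 02:00.0 is not answering config reads"
    else
        echo "$(date): no unresponsive-device markers found for 02:00.0"
    fi
fi
```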
Code:
Dec 10 14:21:36 vm03 kernel: [141155.373612] vfio-pci 0000:02:00.0: Refused to change power state, currently in D3
Code:
Dec 10 14:21:37 vm03 kernel: [141156.223634] vfio_ecap_init: 0000:02:00.0 hiding ecap 0xffff@0xffc
Code:
Dec 10 14:21:36 vm03 kernel: [141155.373612] vfio-pci 0000:02:00.0: Refused to change power state, currently in D3
Code:
Dec 10 14:21:38 vm03 kernel: [141156.957651] vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
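Rather than estimating "a few" or "a couple hundred", the vfio messages can be counted once the timestamps are stripped. A minimal sketch, assuming the syslog line format shown above; `count_vfio_messages` is a made-up name.

```shell
#!/bin/bash
# Read syslog lines on stdin, keep the vfio ones, drop the date and uptime
# prefix, and print "<count> <message>" with the most frequent message first.
count_vfio_messages() {
    grep -E 'vfio[-_]' |
        sed -E 's/^.*kernel: \[ *[0-9.]+\] //' |
        awk '{ seen[$0]++ } END { for (m in seen) printf "%d %s\n", seen[m], m }' |
        sort -rn
}

# Run against the host syslog when it is readable.
if [ -r /var/log/syslog ]; then
    count_vfio_messages < /var/log/syslog
fi
```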
Code:
Dec 10 14:20:11 vm03 kernel: [141070.630621] usb 3-14: ep 0x81 - rounding interval to 64 microframes, ep desc says 80 microframes
Dec 10 14:20:11 vm03 kernel: [141070.630624] usb 3-14: ep 0x82 - rounding interval to 1024 microframes, ep desc says 2040 microframes
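To look at everything the kernel logged around a suspected crash, the bracketed uptime stamps can be used to cut out a window. A rough sketch, assuming the `[seconds.microseconds]` format shown in the lines above; `kernel_log_window` and its two arguments (start and end, in seconds of uptime) are made up for illustration.

```shell
#!/bin/bash
# Print the log lines on stdin whose kernel uptime stamp lies between
# <from> and <to> seconds (inclusive). Lines without a stamp are skipped.
kernel_log_window() {
    local from="$1" to="$2"
    awk -v from="$from" -v to="$to" '
        match($0, /\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the brackets; adding 0 forces a numeric value.
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            if (ts >= from && ts <= to) print
        }'
}

# Example: the 90 seconds after the USB messages above (uptime 141070).
if [ -r /var/log/syslog ]; then
    kernel_log_window 141070 141160 < /var/log/syslog
fi
```

Note that uptime stamps restart at zero on every host boot, so the window only makes sense within a single boot's worth of log.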