Hi,
Sorry if I've left anything out. I'll add more information wherever I can.
Problem:
Running Windows VMs with MSSQL on KVM causes SQL log corruption when the host runs kernel 4.10 or higher.
Setup:
- Host OS: Ubuntu 16.04 LTS
- Host kernel: 4.13.0-39-generic #44~16.04.1-Ubuntu SMP
- QEMU/KVM version: 1:2.5+dfsg-5ubuntu10.25
- Virtual Disks in LVM
- Guest OS: Windows 2016 + SQL 2016
VM xml:
<domain type='kvm'>
  <name>Windows2016</name>
  <uuid>9be3ad53-55d5-43a1-bcc9-8cf31aeb382e</uuid>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <vcpu placement='static' cpuset='1,3,5,7,9,11,13,15,17,19,21,23'>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-i440fx-xenial'>hvm</type>
    <boot dev='cdrom'/>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>Westmere</model>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/vg/cdisk'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/vg/ddisk'/>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='bridge'>
      <mac address='56:16:f5:b3:67:c1'/>
      <source bridge='br_0'/>
      <target dev='hl_if1'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='file'>
      <source path='/var/log/libvirt/vm.serial.log'/>
      <target port='0'/>
    </serial>
    <console type='file'>
      <source path='/var/log/libvirt/vm.serial.log'/>
      <target type='serial' port='0'/>
    </console>
    <input type='tablet' bus='usb'/>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='54749' autoport='no' listen='0.0.0.0' passwd=''>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='cirrus' vram='16384' heads='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </memballoon>
  </devices>
</domain>
Reproduce:
Run the setup above (default Ubuntu 16.04 with the HWE kernel installed), start a Windows VM, and run an SQL benchmark (we used the Benchmark Factory 8.0.1 trial). After a couple of minutes the SQL logs become corrupt.
Workaround:
- Change the disk cache from 'none' to 'writethrough'
- Revert to a 4.4 or 4.8 kernel
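For reference, the first workaround comes down to changing one attribute on the driver line of each virtio disk. A sketch based on the first disk from the VM XML in this post (the second disk, /dev/vg/ddisk, would be changed the same way; everything except the cache value is copied unchanged):

```xml
<disk type='block' device='disk'>
  <!-- only change: cache='none' becomes cache='writethrough' -->
  <driver name='qemu' type='raw' cache='writethrough'/>
  <source dev='/dev/vg/cdisk'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
```

As far as I know, the guest has to be fully shut down and started again for the new cache mode to take effect.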
Background information:
We've done some testing: with the 4.8 kernel from the Ubuntu repo all is fine and we can run large tests on the SQL servers. When we upgrade to a 4.10, 4.13 or 4.15 kernel, it breaks after generating roughly 500 MB of logs. The SQL servers then start to report that the ldf files are corrupt, and SQL mirrors start to fail. We also see SQL servers putting databases in Suspect mode more often. That is harder to reproduce and could be something else, but we believe the problems are related.
Changing the disk cache from none to writethrough seems to solve the problem, but the side effect is that live migrations are no longer safe. Reverting to a 4.4 (LTS) or 4.8 (16.10) kernel also seems to solve the issue, but because of some hardware choices we can't stay on older kernel versions for that long.
We've done some testing on 18.04 with the 4.15 kernel as well, and the problem seems to be present there too.
Questions:
- Am I on the right list? ;)
- Does anyone else experience the same behaviour?
- Better yet, does anyone have a better solution?