Live Migration Causes Performance Issues

Bug #1100843 reported by Mark Petersen
This bug affects 11 people
Affects                      Status        Importance  Assigned to     Milestone
QEMU                         Fix Released  Undecided   Unassigned
qemu-kvm (Ubuntu)            Fix Released  High        Unassigned
qemu-kvm (Ubuntu Precise)    Fix Released  High        Chris J Arges
qemu-kvm (Ubuntu Quantal)    Invalid       High        Unassigned
qemu-kvm (Ubuntu Raring)     Invalid       High        Unassigned
qemu-kvm (Ubuntu Saucy)      Fix Released  High        Unassigned

Bug Description

SRU Justification
[Impact]
 * Users of QEMU who save their memory state using savevm/loadvm, or who migrate, experience worse performance after the migration/loadvm. To work around the issue, VMs must be completely rebooted. Optimally, we should be able to restore a VM's memory state and expect no performance issues.

[Test Case]

 * savevm/loadvm:
   - Create a VM and install a test suite such as lmbench.
   - Get numbers right after boot and record them.
   - Open up the qemu monitor and type the following:
     stop
     savevm 0
     loadvm 0
     c
   - Measure performance and record numbers.
   - Check that the numbers agree within the margin of error.
 * migrate:
   - Create VM, install lmbench, get numbers.
   - Open up qemu monitor and type the following:
     stop
     migrate "exec:dd of=~/save.vm"
     quit
   - Start a new VM using qemu but add the following argument:
     -incoming "exec:dd if=~/save.vm"
   - Run performance test and compare.

 If the measured performance is similar, the test case passes (a scripted sketch of this flow follows).
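
As an illustration only, here is a minimal sketch of scripting the savevm/loadvm half of this test; it assumes a libvirt-managed guest named "testvm" whose disk image supports internal snapshots (e.g. qcow2), and lmbench built inside the guest at the example path shown.

  # Inside the guest: record baseline syscall latencies with lmbench
  cd ~/lmbench-3.0-a9/bin/x86_64-linux-gnu && touch tmpfile
  for i in null read write open; do ./lat_syscall -N 1 $i tmpfile; done > /tmp/before.txt

  # On the host: snapshot and resume the running guest through the HMP monitor
  virsh qemu-monitor-command --hmp testvm 'stop'
  virsh qemu-monitor-command --hmp testvm 'savevm 0'
  virsh qemu-monitor-command --hmp testvm 'loadvm 0'
  virsh qemu-monitor-command --hmp testvm 'c'

  # Inside the guest again: repeat the measurement and compare against the baseline
  for i in null read write open; do ./lat_syscall -N 1 $i tmpfile; done > /tmp/after.txt
  diff /tmp/before.txt /tmp/after.txt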

[Regression Potential]

 * The fix is a backport of two upstream patches:
ad0b5321f1f797274603ebbe20108b0750baee94
211ea74022f51164a7729030b28eec90b6c99a08

One patch allows QEMU to use transparent huge pages (THP) if the feature is enabled.
The other patch changes the logic so that pages are not memset to zero when loading memory for the VM (on an incoming migration).
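
One hedged way to check that the THP change actually takes effect after an incoming migration is to look at AnonHugePages in the QEMU process's smaps; the process name matched below is only an example taken from the command line later in this report.

  # Sum AnonHugePages over the guest's mappings; a near-zero total after an
  # incoming migration suggests guest RAM is not backed by transparent huge pages.
  pid=$(pgrep -f 'kvm -name one-2' | head -n1)   # adjust the -name to your guest
  grep AnonHugePages /proc/$pid/smaps | awk '{sum += $2} END {print sum " kB"}'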

 * I've also run the qa-regression-testing test-qemu.py script and it passes all tests.

[Additional Information]

Kernels from 3.2 onwards are affected, and all ship with CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y, so enabling THP is applicable.
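
For reference, a quick way to confirm both the kernel config option and the current THP setting on a host (the paths below are the standard /boot and sysfs locations):

  grep TRANSPARENT_HUGEPAGE /boot/config-$(uname -r)
  cat /sys/kernel/mm/transparent_hugepage/enabled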

--

I have two physical hosts running Ubuntu Precise, with qemu-kvm 1.0+noroms-0ubuntu14.7 and qemu-kvm 1.2.0+noroms-0ubuntu7 (source from Quantal, built for Precise with pbuilder). I attempted to build qemu 1.3.0 debs from source to test, but libvirt seems to have an issue with it that I haven't been able to track down yet.

 I'm seeing a performance degradation after live migration on Precise, but not on Lucid. These hosts are managed by libvirt (tested with both 0.9.8-2ubuntu17 and 1.0.0-0ubuntu4) in conjunction with OpenNebula. I don't seem to have this problem with Lucid guests (running a number of standard kernels, 3.2.5 mainline, and the backported linux-image-3.2.0-35-generic as well).

I first noticed this problem with Phoronix running compilation tests, and then tried lmbench, where even simple calls experienced performance degradation.

I've posted to the KVM mailing list, but so far the only suggestion was that it may be related to transparent huge pages not being used after migration, which didn't pan out. Someone else has a similar problem here: http://thread.gmane.org/gmane.comp.emulators.kvm.devel/100592

qemu command line example: /usr/bin/kvm -name one-2 -S -M pc-1.2 -cpu Westmere -enable-kvm -m 73728 -smp 16,sockets=2,cores=8,threads=1 -uuid f89e31a4-4945-c12c-6544-149ba0746c2f -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-2.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-kvm-pit-reinjection -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/one//datastores/0/2/disk.0,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/one//datastores/0/2/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:0a:64:02:fe,bus=pci.0,addr=0x3 -vnc 0.0.0.0:2,password -vga cirrus -incoming tcp:0.0.0.0:49155 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

Disk backend is LVM running on SAN via FC connection (using symlink from /var/lib/one/datastores/0/2/disk.0 above)

ubuntu-12.04 - first boot
==========================================
Simple syscall: 0.0527 microseconds
Simple read: 0.1143 microseconds
Simple write: 0.0953 microseconds
Simple open/close: 1.0432 microseconds

Using Phoronix pts/compilation
ImageMagick - 31.54s
Linux Kernel 3.1 - 43.91s
Mplayer - 30.49s
PHP - 22.25s

ubuntu-12.04 - post live migration
==========================================
Simple syscall: 0.0621 microseconds
Simple read: 0.2485 microseconds
Simple write: 0.2252 microseconds
Simple open/close: 1.4626 microseconds

Using Phoronix pts/compilation
ImageMagick - 43.29s
Linux Kernel 3.1 - 76.67s
Mplayer - 45.41s
PHP - 29.1s

I don't have Phoronix results for 10.04 handy, but they were within 1% of each other...

ubuntu-10.04 - first boot
==========================================
Simple syscall: 0.0524 microseconds
Simple read: 0.1135 microseconds
Simple write: 0.0972 microseconds
Simple open/close: 1.1261 microseconds

ubuntu-10.04 - post live migration
==========================================
Simple syscall: 0.0526 microseconds
Simple read: 0.1075 microseconds
Simple write: 0.0951 microseconds
Simple open/close: 1.0413 microseconds

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu-kvm (Ubuntu):
status: New → Confirmed
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Thanks, that is very interesting. I'll be pushing qemu 1.3.0 to Ubuntu Raring hopefully tomorrow; it would be interesting to know if this still happens there.

Changed in qemu-kvm (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Revision history for this message
Mark Petersen (mpetersen-peak6) wrote :

I don't see qemu-kvm 1.3.0 yet. Will test when you get it pushed, hopefully Tuesday (01/22/2013) if you've pushed by then.

Revision history for this message
Mark Petersen (mpetersen-peak6) wrote :

I tested with qemu-kvm 1.3.0. It seems the issue still exists, but it also occurs without a live migration if you wait long enough.

That is, if you start a VM on one node, run phoronix batch-run pts/compilation, wait 4 hours (with the VM and physical host doing nothing else), and re-run the test, you'll get results on the VM similar to running the test after a live migration to a new host. I have no idea what's causing this behavior, but it seems to be reproducible.

For now this can probably be closed. I'll submit a new bug (possibly upstream) if I can figure out how to get more details to help properly diagnose how/when/why the VMs slow down over time.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

It doesn't sound like this bug should be closed; rather, it should be re-titled.

Are there any messages in syslog that would explain the performance degradation? Have you tried to reproduce this with and without huge pages?

Can you reproduce the same thing with a simple local raw file or LVM backend?

Can you give the precise commands you use to start the tests, so it can be reproduced by others?

Changed in qemu-kvm (Ubuntu):
status: Triaged → Incomplete
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

(Marking incomplete pending a response. In addition to retitling, I think the bug should also be targeted to the QEMU project and the qemu Ubuntu source package.)

Revision history for this message
Mark Petersen (mpetersen-peak6) wrote :

There's nothing in syslog for the VM or host that would imply performance degradation.

I have done this with hugepages and made sure huge page use was consistent. Previously I disabled hugepages and didn't see a difference but I haven't tested again.

I'm using (C)LVM backed by FCoE/SAN, but I haven't tried local LVM/qcow2-type backing.

I'm using libvirt, but the command line ends up looking like this; if you would like, I can provide the libvirt XML as well.

/usr/bin/kvm -name one-10 -S -M pc-1.3 -cpu Westmere -enable-kvm -m 73728 -smp 16,sockets=2,cores=8,threads=1 -uuid 5ee0afd3-df3f-fb1f-02bd-7cde2bc4ee95 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/one-10.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/one//datastores/0/10/disk.0,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/one//datastores/0/10/disk.1,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=23,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=02:00:0a:64:02:fd,bus=pci.0,addr=0x3 -vnc 0.0.0.0:10,password -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

I'll try experimenting more with THP disabled and different IO backends.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Launchpad had previously marked this confirmed for affecting several users. I'm curious who else has seen this behavior, and under what circumstances?

Changed in qemu-kvm (Ubuntu):
status: Incomplete → New
status: New → Confirmed
Revision history for this message
KS (khsh) wrote :

The same thing happened to us running Ubuntu Precise KVM for our hypervisors. VMs that are live-migrated suffer noticeable performance degradation. We ran a few performance tests against live-migrated vs. non-migrated VMs and the problem is easily reproducible (we used JMeter).

Ubuntu (hypervisor): 12.04.1 LTS
libvirt:0.9.8-2ubuntu17.4
qemu-kvm:1.0+noroms-0ubuntu14.3
storage: NFS exports.

Guest OS: Ubuntu 12.04.1 & RHEL 5.5

Revision history for this message
Javi Fontan (jfontan) wrote :

We are also getting reports from some OpenNebula users with this very same issue. Is there any extra information about what is causing the slowdown, or a fix?

Revision history for this message
Mark Petersen (mpetersen-peak6) wrote :

Can anyone confirm whether you see similar slowdowns if you leave the VM running for a few days? I thought it was related to live migration, but I saw my performance degrade if the VM/physical host was up and idle for a couple of days.

Revision history for this message
C Cormier (ccormier) wrote :

I've been able to confirm the same as the OP regarding the different Ubuntu distributions as guests. These tests should help rule out/pinpoint the kernel and modules.

Using the same 12.04 hypervisors for all tests.

Testing different guests, I was able to determine:
-Lucid with the default kernel is NOT affected.
-Precise with the default kernel is affected.

The test...
I tarred up each guest's kernel and dropped it into the opposing distribution.

Results
-Lucid with Precise Kernel is NOT affected.
-Precise with Lucid Kernel is affected.

Conclusion
A 12.04 Precise guest is affected regardless of the kernel used.

Revision history for this message
Mark Petersen (mpetersen-peak6) wrote :

@ccormier

I've thought all along it might be a libc issue, but testing libc 2.13 on Precise would be rather difficult. To some extent I feel this rules out the kernel as the issue, though, since the same kernel on Precise/Lucid yields different results.

Have you tried letting a Precise VM idle without a live migration to see if performance degrades? If not, perhaps you could leave one idle over the weekend and run the performance test on Monday?

I assume you're testing with qemu-kvm 1.0.0? I've been testing with qemu-kvm 1.2.0, as the performance is remarkably better for me. This would seem to indicate it's not qemu-kvm at fault either.

Revision history for this message
KS (khsh) wrote :

Few updates from a few tests we ran:

Setup:
hypervisor ==> Ubuntu 12.04.2 LTS
libvirt ===> 0.9.8-2ubuntu17.7
qemu-kvm ===> 1.0+noroms-0ubuntu14.7
storage: NFS exports
Guest VM OS: Ubuntu 12.04.1 LTS

Test 1: Created a new VM and kept it Idle for 60+ hours, then ran Unix benchmark test against it. (NO Live-migration/migration)
Result 1: The performance was NOT affected and the result was the same as a fresh VM.

Test 2: Created a new VM, and perform a "virsh save <domain> <file>" followed by a "virsh restore <file>" on the SAME hypervisor. Then run unix benchmark against it.
Result 2: The VM is AFFECTED (cost to perform system calls degraded by ~21% )

Test 3: Rebooted the affected VM from test (2) and re-ran unix-benchmark tests against it.
Result 3: Performance is back to normal and is identical to a newly created and non-migrated/not-saved VM.

Thus, it seems the problem is also exposed by "saving" the running Precise domain (memory pages) to a state file followed by a "restore", even on the same hypervisor; live migration to a different host is not required.

Changed in qemu-kvm (Ubuntu):
importance: Medium → High
status: Confirmed → Triaged
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

So far I've run the equivalent of test 1 in comment #14, and also didn't find any performance degradation. I left the host crunching a few other VMs for several hours, but performance on a kernel compilation stayed the same.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I couldn't reproduce this with a VM whose XML includes:

  <memory>524288</memory>
  <currentMemory>524288</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-1.2'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/serge/machines/clean-precise-mini-amd64/disk0.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

I started the VM, ran 6 kernel compilations, did 'virsh save/virsh restore', then did 6 more kernel compilations. All completed in 272 or 273 seconds.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

I guess the final thing to test is virsh save/restore with an NFS backend.

Can you tell me the configuration of the NFS server?

Revision history for this message
C Cormier (ccormier) wrote :

@serge-hallyn
I'm using a NetApp filer in C-mode for NFS storage; mount options are: (rw,nosuid,nodev,noatime,hard,nfsvers=3,tcp,intr,rsize=32768,wsize=32768,addr=x.x.x.x).

However, I can reproduce this on a host with or without NFS, using local disk, qcow2, or raw images, and the OP was using an FC SAN.

I've tried to reproduce your success by building a new VM with the latest Precise 12.04.02 with updates, but I have been unsuccessful. It happens immediately after the restore 100% of the time.

The quickest test I've found is this; it helps aid your reproduction as it gives an immediate indication (a save/restore wrapper for this check is sketched below).
-Get lmbench
-build it
-run
# touch tmpfile
# for i in null read write open ; do sleep 1; ./lat_syscall -N 1 $i tmpfile; done
- the latencies for read/write generally increase up to 2x, and the open/close by about 30% post restore.
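
One hedged way to drive that quick check around a save/restore cycle from the host side (the domain name "precise-guest" and the state-file path are placeholders):

  virsh save precise-guest /var/tmp/precise-guest.state   # stops the domain and writes its memory to the file
  virsh restore /var/tmp/precise-guest.state              # restores it on the same host
  # then re-run the lat_syscall loop above inside the guest and compare the latencies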

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1100843] Re: Live Migration Causes Performance Issues

Quoting Chris Cormier (<email address hidden>):
> [reproduction steps quoted from the previous comment]

I did this and got the attached results (.1 before save, .2 after restore). This is with a Precise guest on a Precise host over NFS (from a Raring NFS server).

results.1 (before save)
Simple syscall: 0.1468 microseconds
Simple read: 0.3555 microseconds
Simple write: 0.2785 microseconds
Simple open/close: 3.3682 microseconds
serge@p1:~/lmbench3$ ./sergetest
Simple syscall: 0.1466 microseconds
Simple read: 0.3412 microseconds
Simple write: 0.2813 microseconds
Simple open/close: 3.3175 microseconds
serge@p1:~/lmbench3$ ./sergetest
Simple syscall: 0.1582 microseconds
Simple read: 0.3403 microseconds
Simple write: 0.2871 microseconds
Simple open/close: 2.7587 microseconds
serge@p1:~/lmbench3$ ./sergetest
Simple syscall: 0.1453 microseconds
Simple read: 0.3371 microseconds
Simple write: 0.2790 microseconds
Simple open/close: 3.3391 microseconds

results.2 (after restore)
Simple syscall: 0.1457 microseconds
Simple read: 0.3370 microseconds
Simple write: 0.2832 microseconds
Simple open/close: 3.1675 microseconds
Simple syscall: 0.1470 microseconds
Simple read: 0.3436 microseconds
Simple write: 0.2812 microseconds
Simple open/close: 2.8002 microseconds
Simple syscall: 0.1452 microseconds
Simple read: 0.3428 microseconds
Simple write: 0.2817 microseconds
Simple open/close: 2.7974 microseconds
Simple syscall: 0.1456 microseconds
Simple read: 0.3722 microseconds
Simple write: 0.2798 microseconds
Simple open/close: 2.7494 microseconds
Simple syscall: 0.1470 microseconds
Simple read: 0.3362 microseconds
Simple write: 0.2830 microseconds
Simple open/close: 2.7640 microseconds

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Can you reproduce this with 1 or 2 cores and 1 GB of RAM?

Revision history for this message
C Cormier (ccormier) wrote :

Could you confirm that your .1 tests were on a freshly booted guest OS? Our hardware is likely different... but your latencies are close to my post-restore times.

I just reproduced this with 1 GB RAM and a single CPU.

-Pre Save-
Simple syscall: 0.0519 microseconds
Simple read: 0.1356 microseconds
Simple write: 0.1086 microseconds
Simple open/close: 1.0265 microseconds
local@cc-precise120402-raw:~/lmbench-3.0-a9/bin/x86_64-linux-gnu$

-Post Restore-
Simple syscall: 0.0500 microseconds
Simple read: 0.2125 microseconds
Simple write: 0.1913 microseconds
Simple open/close: 1.4482 microseconds
local@cc-precise120402-raw:~/lmbench-3.0-a9/bin/x86_64-linux-gnu$

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting C Cormier (<email address hidden>):
> Could you confirm that your .1 tests were on a freshly booted Guest OS?

Yup, it was a fresh boot.

Since the VMs are on shared storage, do you have a box you could wire up running Raring, to test a more recent qemu?

Revision history for this message
C Cormier (ccormier) wrote :

I did some tests using Raring Server Beta 2. There are some interesting results; the results for this test are mixed, and using different machine types produces different results. At this time I've only run the simple "lat_syscall" tests from lmbench and haven't run the more exhaustive benchmarks or compilations.

Summary of the "lat_syscall" results using different machine types of qemu-kvm:
pc-1.0 - affected
pc-1.1 - normal
pc-1.2 - affected
pc-1.3 - affected
pc-i440fx-1.4 - normal

Test Setup

Hypervisor
OS: Raring Beta2
qemu: 1.4.0+dfsg-1expubuntu4
Kernel: 3.8.0-16-generic

Guest:
OS: Ubuntu 12.04.02 Minimal install using ubuntu-vmbuilder

LMBench results for lat syscall tests
---machine type pc-1.0---
-PRESAVE-
Simple read: 0.1301 microseconds
Simple write: 0.1052 microseconds
Simple open/close: 1.3819 microseconds
-POSTRESTORE-
Simple read: 0.2699 microseconds
Simple write: 0.2401 microseconds
Simple open/close: 1.4112 microseconds

---machine type pc-1.1---
-PRESAVE-
Simple read: 0.1265 microseconds
Simple write: 0.1033 microseconds
Simple open/close: 1.2974 microseconds
-POSTRESTORE-
Simple read: 0.1266 microseconds
Simple write: 0.1042 microseconds
Simple open/close: 1.2093 microseconds

---machine type pc-1.2---
-PRESAVE-
Simple syscall: 0.0484 microseconds
Simple read: 0.1291 microseconds
Simple write: 0.1042 microseconds
Simple open/close: 1.2501 microseconds
-POSTRESTORE-
Simple syscall: 0.0485 microseconds
Simple read: 0.2737 microseconds
Simple write: 0.2414 microseconds
Simple open/close: 1.3953 microseconds

---machine type pc-1.3---
-PRESAVE-
Simple read: 0.1276 microseconds
Simple write: 0.1101 microseconds
Simple open/close: 1.1501 microseconds
-POSTRESTORE-
Simple read: 0.2717 microseconds
Simple write: 0.2415 microseconds
Simple open/close: 1.3997 microseconds

---machine type pc-i440fx-1.4---
-PRESAVE-
Simple read: 0.1291 microseconds
Simple write: 0.1039 microseconds
Simple open/close: 1.2344 microseconds
-POSTRESTORE-
Simple read: 0.1335 microseconds
Simple write: 0.1080 microseconds
Simple open/close: 1.0453 microseconds

Revision history for this message
Paolo Bonzini (bonzini) wrote :

The results of comment 23 suggest that the issue is not 100% reproducible. Can you please run the benchmark 3-4 times (pre-save/post-restore) and show all results? One benchmark only, e.g. "simple read", will do.

Also please try putting a big file on disk (something like "dd if=/dev/zero of=bigfile count=64K bs=64K") and then doing "cat bigfile > /dev/null" after restoring. Please check if that makes performance more consistent.

Revision history for this message
C Cormier (ccormier) wrote :

Can you clarify what's not 100% reproducible? The only time that it is not reproducible on my system is between different qemu machine types, as I listed. If tests are performed on the same machine type, they are reproducible 100% of the time on the same host and guest VM, as shown in comment #23.

I have re-run what you're requesting for machine type pc-1.0.

---machine type pc-1.0---
-Presave-
Simple read: 0.1273 microseconds
Simple read: 0.1259 microseconds
Simple read: 0.1270 microseconds
Simple read: 0.1268 microseconds

-postrestore-
performing: dd if=/dev/zero of=bigfile count=32K bs=64K
32768+0 records in
32768+0 records out
2147483648 bytes (2.1 GB) copied, 15.2912 s, 140 MB/s
performing: cat bigfile > /dev/null
Simple read: 0.2700 microseconds
Simple read: 0.2736 microseconds
Simple read: 0.2713 microseconds
Simple read: 0.2747 microseconds

Revision history for this message
Jonathan Jefferson (jonjefferson) wrote :

I have a few VMs (Precise) that process high-volume transaction jobs each night. After I performed a live-migrate operation to replace a faulty power supply on a bare-metal server, we encountered sluggish performance on the migrated VMs; in particular, significantly higher CPU usage was recorded, where the same nightly job would consume far more CPU and take more time to finish on identical hardware.

Upon investigation, we noticed that the only change introduced was the "live migrate" operation. Upon rebooting the guest OS of the VMs, performance went back to normal. I suspect we're hitting the same problem as the one filed here. I will attempt to run lmbench next to see if I notice behavior in system call costs similar to that recorded in comments #19..21,23.

------
Latest KVM is used from Ubuntu 12.04 LTS :: qemu-kvm (1.0+noroms-0ubuntu14.8)

Revision history for this message
Jonathan Jefferson (jonjefferson) wrote :

I used this handy tool to run preliminary system call benchmarks: http://code.google.com/p/byte-unixbench/

 In a nutshell, what I found confirms that live migration does indeed degrade performance on Precise KVM.
 I hope the results below help narrow down this critical problem so it can eventually be resolved in the 12.04 LTS release.

Detailed results:
Compiled the benchmarking tool and then ran:
root@sample-vm:~/UnixBench# ./Run syscall

Output:

** before live-migration **
------------------------------------------------------------------------
Benchmark Run: Wed May 01 2013 20:29:54 - 20:32:04
1 CPU in system; running 1 parallel copy of tests
System Call Overhead 4177612.4 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 4177612.4 2785.1
                                                                   ========
System Benchmarks Index Score (Partial Only) 2785.1
------------------------------------------------------------------------

** after live-migration **
------------------------------------------------------------------------
Benchmark Run: Wed May 01 2013 20:35:16 - 20:37:26
1 CPU in system; running 1 parallel copy of tests
System Call Overhead 3065118.3 lps (10.0 s, 7 samples)
System Benchmarks Partial Index BASELINE RESULT INDEX
System Call Overhead 15000.0 3065118.3 2043.4
                                                                   ========
System Benchmarks Index Score (Partial Only) 2043.4
------------------------------------------------------------------------

XML domain dump:

  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu>1</vcpu>
  <cputune>
    <shares>1024</shares>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-1.0'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='HIDEME'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='HIDEME'/>
      <target dev='hda' bus='ide'/>
      <readonly/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>

Revision history for this message
C Cormier (ccormier) wrote :

Update:

From our testing, this bug affects KVM hypervisors on Intel processors that have the EPT feature enabled, with kernels 3.0 and greater. A list of Intel CPUs supporting EPT is here (http://ark.intel.com/Products/VirtualizationTechnology).

When using a KVM hypervisor host with a Linux 3.0 or newer kernel and Intel EPT, this bug shows itself. If the kvm_intel module is loaded with the option "ept=N", guest performance is significantly lower than with EPT enabled, but it does remain consistent before and after restoration/migration.

Exceptions:
-A KVM host with a 2.6.32 or 2.6.39 kernel and EPT enabled: the bug is not triggered.
-A KVM host whose Intel CPU does not have the EPT feature: the bug is not triggered.
-A KVM host with a 3.0+ kernel and the EPT kvm_intel module option disabled: the bug is not triggered.

A KVM hypervisor with EPT enabled on a Linux kernel of 3.0 or newer appears to be the key here.
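
A hedged way to confirm how EPT is set on a given host, and to reload kvm_intel with EPT disabled for a comparison run (all guests must be shut down before the module can be removed):

  cat /sys/module/kvm_intel/parameters/ept   # Y means EPT is in use
  sudo modprobe -r kvm_intel                 # fails if any guest is still running
  sudo modprobe kvm_intel ept=0              # reload with EPT disabled for the comparison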

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1100843

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Paolo Bonzini (bonzini) wrote :

Can you please check if you have EPT enabled? This could be https://bugzilla.kernel.org/show_bug.cgi?id=58771

Revision history for this message
Paolo Bonzini (bonzini) wrote :

Oops, I missed Chris's comment #28. Thanks.

From comment #23, the 1.4 machine type seems to be "fast", while 1.3 is slow. This doesn't make much sense, given the differences between the two machine types:

    enable_compat_apic_id_mode();

            .driver = "usb-tablet",\
            .property = "usb_version",\
            .value = stringify(1),\

            .driver = "virtio-net-pci",\
            .property = "ctrl_mac_addr",\
            .value = "off", \

            .driver = "virtio-net-pci", \
            .property = "mq", \
            .value = "off", \

            .driver = "e1000",\
            .property = "autonegotiation",\
            .value = "off",\

This is why I suspected the issue was not 100% reproducible.

Revision history for this message
C Cormier (ccormier) wrote :

@Paolo yes, when I was doing that testing I was able to consistently reproduce those results in #23, but it was a red herring; as of now I cannot reproduce the results in #23 consistently (I suspect it may have had something to do with the order I was executing tests, but I didn't chase it any further).

Yes, EPT is enabled; I submitted the kernel bug referenced in #30.

Revision history for this message
Fletcher Kubota (fletcherkubota) wrote :

My HyperDex cluster nodes' performance dropped significantly after migrating them (virsh migrate --live ...). They are hosted on Precise KVM (12.04.2 Precise Pangolin). The first Google search result landed me on this page, so it seems I'm not the only one encountering this problem. I hope this gets resolved soon, as live migration is a major feature for any hypervisor solution in my opinion.
cheers

Revision history for this message
Stephen Gran (sgran) wrote :

We are reliably seeing this after live migration on an OpenStack platform.

Setup:
hypervisor ==> Ubuntu 12.04.3 LTS
libvirt ===> 1.0.2-0ubuntu11.13.04.2~cloud0
qemu-kvm ===> 1.0+noroms-0ubuntu14.10
storage: NFS exports
Guest VM OS: Ubuntu 12.04.1 LTS and CentOS 6.4

We have ept enabled.

Sample instance:

<domain type="kvm">
  <uuid>f3c16d27-2586-44c8-b9d9-84b74b42b5d3</uuid>
  <name>instance-00000508</name>
  <memory>4194304</memory>
  <vcpu>2</vcpu>
  <os>
    <type>hvm</type>
    <boot dev="hd"/>
  </os>
  <features>
    <acpi/>
  </features>
  <clock offset="utc">
    <timer name="pit" tickpolicy="delay"/>
    <timer name="rtc" tickpolicy="catchup"/>
  </clock>
  <cpu mode="host-model" match="exact"/>
  <devices>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2" cache="none"/>
      <source file="/var/lib/nova/instances/instance-00000508/disk"/>
      <target bus="virtio" dev="vda"/>
    </disk>
    <interface type="bridge">
      <mac address="fa:16:3e:5d:0e:6a"/>
      <model type="virtio"/>
      <source bridge="qbrf43e9d83-56"/>
      <filterref filter="nova-instance-instance-00000508-fa163e5d0e6a">
        <parameter name="IP" value="10.253.138.156"/>
        <parameter name="DHCPSERVER" value="10.253.138.51"/>
      </filterref>
    </interface>
    <serial type="file">
      <source path="/var/lib/nova/instances/instance-00000508/console.log"/>
    </serial>
    <serial type="pty"/>
    <input type="tablet" bus="usb"/>
    <graphics type="vnc" autoport="yes" keymap="en-us" listen="0.0.0.0"/>
  </devices>
</domain>

We have a test environment and are willing to assist in debugging. Please let us know what we can do to help.

Cheers,

Revision history for this message
Stephen Gran (sgran) wrote :

This is being looked at in an upstream thread at http://lists.gnu.org/archive/html/qemu-devel/2013-07/msg01850.html

Cheers,

Chris J Arges (arges)
Changed in qemu-kvm (Ubuntu):
assignee: nobody → Chris J Arges (arges)
Chris J Arges (arges)
Changed in qemu-kvm (Ubuntu):
status: Triaged → In Progress
Chris J Arges (arges)
no longer affects: linux (Ubuntu)
Changed in qemu-kvm (Ubuntu Precise):
assignee: nobody → Chris J Arges (arges)
Changed in qemu-kvm (Ubuntu Quantal):
assignee: nobody → Chris J Arges (arges)
Changed in qemu-kvm (Ubuntu Raring):
assignee: nobody → Chris J Arges (arges)
Changed in qemu-kvm (Ubuntu Precise):
importance: Undecided → High
Changed in qemu-kvm (Ubuntu Quantal):
importance: Undecided → High
Changed in qemu-kvm (Ubuntu Raring):
importance: Undecided → High
Changed in qemu-kvm (Ubuntu Saucy):
assignee: Chris J Arges (arges) → nobody
status: In Progress → Fix Released
Changed in qemu-kvm (Ubuntu Raring):
status: New → Triaged
Changed in qemu-kvm (Ubuntu Quantal):
status: New → Triaged
Changed in qemu-kvm (Ubuntu Precise):
status: New → In Progress
Revision history for this message
Chris J Arges (arges) wrote :

From my testing this has been fixed in the Saucy version (1.5.0) of qemu. It is fixed by this patch:
f1c72795af573b24a7da5eb52375c9aba8a37972

However, later in the history this commit was reverted, which broke this again. The other commit that fixes this is:
211ea74022f51164a7729030b28eec90b6c99a08

So 211ea740 needs to be backported to P/Q/R to fix this issue. I have v1 packages of a Precise backport here; I've confirmed performance differences between savevm/loadvm cycles:
http://people.canonical.com/~arges/lp1100843/precise/

Revision history for this message
Chris J Arges (arges) wrote :

I found that two patches need to be backported to solve this issue:

ad0b5321f1f797274603ebbe20108b0750baee94
211ea74022f51164a7729030b28eec90b6c99a08

I've added the necessary bits into Precise and tried a few tests:
1) Measure performance before and after savevm/loadvm.
2) Measure performance before and after a migrate to the same host.

In both cases the performance measured by something like lmbench was the same as the previous run.
A test build is available here:
http://people.canonical.com/~arges/lp1100843/precise_v2/
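
For anyone preparing a similar backport, a minimal sketch of cherry-picking the two commits named above onto a local QEMU source tree; the clone URL and base tag are assumptions, and conflicts may still need hand resolution on the 1.0-era codebase:

  git clone git://git.qemu.org/qemu.git && cd qemu           # upstream tree containing both commits (URL assumed)
  git checkout -b lp1100843-backport v1.0                    # example base; use your packaging branch instead
  git cherry-pick ad0b5321f1f797274603ebbe20108b0750baee94   # madvise hugepage for guest RAM allocations
  git cherry-pick 211ea74022f51164a7729030b28eec90b6c99a08   # do not overwrite zero pages on incoming migration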

Chris J Arges (arges)
description: updated
Chris J Arges (arges)
description: updated
Chris J Arges (arges)
description: updated
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Mark, or anyone else affected,

Accepted qemu-kvm into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/qemu-kvm/1.0+noroms-0ubuntu14.12 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
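
A minimal sketch of one way to pull the proposed build onto a 12.04 test host (the wiki page above describes the full, recommended procedure; apt pinning is omitted here):

  echo "deb http://archive.ubuntu.com/ubuntu precise-proposed main restricted universe multiverse" | \
      sudo tee /etc/apt/sources.list.d/precise-proposed.list
  sudo apt-get update
  sudo apt-get install qemu-kvm=1.0+noroms-0ubuntu14.12   # version from this SRU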

Changed in qemu-kvm (Ubuntu Precise):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
Chris J Arges (arges) wrote :

I have verified this on my local machine using virt-manager's memory save, savevm/loadvm via the qemu monitor, and migrate via the qemu monitor.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu-kvm - 1.0+noroms-0ubuntu14.12

---------------
qemu-kvm (1.0+noroms-0ubuntu14.12) precise-proposed; urgency=low

  * migration-do-not-overwrite-zero-pages.patch,
    call-madv-hugepage-for-guest-ram-allocations.patch:
    Fix performance degradation after migrations, and savevm/loadvm.
    (LP: #1100843)
 -- Chris J Arges <email address hidden> Wed, 02 Oct 2013 16:26:27 -0500

Changed in qemu-kvm (Ubuntu Precise):
status: Fix Committed → Fix Released
Revision history for this message
Brian Murray (brian-murray) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Chris J Arges (arges)
Changed in qemu-kvm (Ubuntu Quantal):
assignee: Chris J Arges (arges) → nobody
Changed in qemu-kvm (Ubuntu Raring):
assignee: Chris J Arges (arges) → nobody
Revision history for this message
Paolo Bonzini (bonzini) wrote :

Fix will be part of QEMU 1.7.0 (commit fc1c4a5, migration: drop MADVISE_DONT_NEED for incoming zero pages, 2013-10-24).

Changed in qemu:
status: New → Fix Committed
Revision history for this message
Bryan Quigley (bryanquigley) wrote :

QEMU 1.7 was released, Quantal has 10ish days left of support, and Raring is EOL

Changed in qemu:
status: Fix Committed → Fix Released
Changed in qemu-kvm (Ubuntu Quantal):
status: Triaged → Invalid
Changed in qemu-kvm (Ubuntu Raring):
status: Triaged → Invalid