Fix KVM live volume migration command payload #13319
Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
@vanquyen020920, did you check the final VM XML?
bernardodemarco
left a comment
@vanquyen020920, thanks for the PR. I have tested it locally:
- Created a VM with multiple disks:
  virsh domblklist i-2-11-VM --details
   Type   Device   Target   Source
  ----------------------------------------------------------------------------------------------------------
   file   disk     hda      /mnt/2944d4ce-30c9-3c5f-8379-e49054e7c768/19390437-1503-4034-9033-6f72b094384e
   file   cdrom    hdc      -
   file   disk     vdb      /mnt/e6f01a81-688b-384a-83e6-7fa6d4605d62/e13e4f6b-0a48-4001-9da8-c1525deda620
- Migrated the hda device from pri-stor-02-cluster-01-zn-xap-01 (NFS) to pri-stor-01-cluster-01-zn-xap-01 (NFS).
- Verified that the VM's domain XML was not updated:
  virsh domblklist i-2-11-VM --details
   Type   Device   Target   Source
  ----------------------------------------------------------------------------------------------------------
   file   disk     hda      /mnt/2944d4ce-30c9-3c5f-8379-e49054e7c768/19390437-1503-4034-9033-6f72b094384e
   file   cdrom    hdc      -
   file   disk     vdb      /mnt/e6f01a81-688b-384a-83e6-7fa6d4605d62/e13e4f6b-0a48-4001-9da8-c1525deda620
- Migrated the vdb device from pri-stor-01-cluster-01-zn-xap-01 (NFS) to pri-stor-02-cluster-02-zn-xap-01 (NFS).
- Verified that the VM's domain XML was not updated:
  virsh domblklist i-2-11-VM --details
   Type   Device   Target   Source
  ----------------------------------------------------------------------------------------------------------
   file   disk     hda      /mnt/2944d4ce-30c9-3c5f-8379-e49054e7c768/19390437-1503-4034-9033-6f72b094384e
   file   cdrom    hdc      -
   file   disk     vdb      /mnt/e6f01a81-688b-384a-83e6-7fa6d4605d62/e13e4f6b-0a48-4001-9da8-c1525deda620
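The before/after comparison in the steps above could be automated. The following is only a minimal sketch, assuming the four-column `virsh domblklist --details` layout shown in the listings and source paths without whitespace; the class and method names (`DomBlkListDiff`, `parse`, `sourceChanged`) are hypothetical and not part of CloudStack.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not CloudStack code: compares two domblklist listings.
class DomBlkListDiff {

    /** Parses `virsh domblklist <domain> --details` output into a target -> source map. */
    static Map<String, String> parse(String output) {
        Map<String, String> sources = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            // Skip blank lines, the header row, and the dashed separator.
            if (trimmed.isEmpty() || trimmed.startsWith("Type") || trimmed.startsWith("-")) {
                continue;
            }
            // Columns: Type, Device, Target, Source (assumes no whitespace in Source).
            String[] cols = trimmed.split("\\s+", 4);
            if (cols.length == 4) {
                sources.put(cols[2], cols[3]);
            }
        }
        return sources;
    }

    /** Returns true if the source of the given target differs between the two listings. */
    static boolean sourceChanged(String before, String after, String target) {
        String oldSource = parse(before).get(target);
        String newSource = parse(after).get(target);
        return oldSource != null && !oldSource.equals(newSource);
    }

    public static void main(String[] args) {
        // Same listing before and after, as observed in the test above.
        String listing = String.join("\n",
                " Type   Device   Target   Source",
                "----------------------------------",
                " file   disk     hda      /mnt/2944d4ce-30c9-3c5f-8379-e49054e7c768/19390437-1503-4034-9033-6f72b094384e",
                " file   cdrom    hdc      -",
                " file   disk     vdb      /mnt/e6f01a81-688b-384a-83e6-7fa6d4605d62/e13e4f6b-0a48-4001-9da8-c1525deda620");
        System.out.println("hda source changed: " + sourceChanged(listing, listing, "hda"));
    }
}
```

With identical listings, as in the test above, the check reports that the `hda` source did not change, which matches the observation that only the metadata moved.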
During tests, I also verified that only the metadata of the volumes was updated. This happens because, as indicated by the error message below, Apache CloudStack currently does not support live migration of volumes when using KVM with file-based storage pools (NFS or SharedMountPoint) or Ceph RBD.
cloudstack/server/src/main/java/com/cloud/storage/VolumeApiServiceImpl.java
Lines 3534 to 3536 in 7308dad
Implementing this feature involves addressing a number of requirements and edge cases, including:
- Updating VM domain XML definitions on the fly;
- Handling volume backing chains with multiple deltas;
- Handling volumes with incremental snapshots;
- Handling VMs with incremental snapshots;
- Supporting migrations between file-based storage pools and RBD;
- And other related scenarios.
I am currently working on designing and implementing this functionality. The current plan is to have it available during the third or fourth quarter of this year.
  StoragePool srcPool = (StoragePool)dataStoreMgr.getDataStore(srcData.getDataStore().getId(), DataStoreRole.Primary);
  StoragePool destPool = (StoragePool)dataStoreMgr.getDataStore(destData.getDataStore().getId(), DataStoreRole.Primary);
- MigrateVolumeCommand command = new MigrateVolumeCommand(volume.getId(), volume.getPath(), destPool, volume.getAttachedVmName(), volume.getVolumeType(), waitInterval, volume.getChainInfo());
+ MigrateVolumeCommand command = new MigrateVolumeCommand(srcData.getTO(), destData.getTO(), null, null, waitInterval);
this will have huge impact
- throw new InvalidParameterValueException("KVM does not support volume live migration due to the limited possibility to refresh VM XML domain. " +
-         "Therefore, to live migrate a volume between storage pools, one must migrate the VM to a different host as well to force the VM XML domain update. " +
-         "Use 'migrateVirtualMachineWithVolumes' instead.");
+ logger.debug("Allowing KVM live volume migration between different storage pools. VM [{}], volume [{}], source pool [{}], destination pool [{}].",
this will have huge impact too
Thanks @bernardodemarco for testing and for the detailed feedback. You were right. My initial patch was incomplete: it allowed the API/data-motion flow to continue and fixed the
I reworked the KVM agent implementation locally to handle the regular live migration path using libvirt block copy + pivot instead of only relying on
The updated flow I tested is:
Test environment:
Test 1: system VM / virtual router root disk migration
Before migration, the running VM was using the NFS/file-based source:
During migration, the agent created the destination RBD volume and completed the live block copy:
After migration, I also verified both active and inactive libvirt XML, and both now point to the RBD destination:
<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='rbd' name='CMC-CLOUDSTACK/cbe98e70-0ec8-4dfa-8888-42b38a763672'>
    <host name='10.14.5.55'/>
    <host name='10.14.5.56'/>
    <host name='10.14.5.57'/>
    <auth username='admin'>
      <secret type='ceph' uuid='a15210c6-c858-3174-a390-183f4ed25096'/>
    </auth>
  </source>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>
Test 2: user VM root disk migration
I also tested with a normal user VM. Before migration, the VM root disk was on NFS/file-based storage:
The guest had a test file before migration:
After live migration to Ceph RBD, CloudStack UI shows the root volume on
The agent log shows the destination RBD volume was created and block copy completed:
After migration, I verified the guest remained running and the test file was still readable:
This addresses the specific issue you found where only CloudStack metadata changed while the VM domain XML remained unchanged. With the updated implementation, the running libvirt domain is pivoted to the destination storage. I agree that the broader feature still needs careful validation for additional edge cases, including:
I will push the updated implementation and include this validation evidence so it can be reviewed and tested further.

Description
This PR fixes KVM live volume migration for storage pools where storage motion is supported.
Test evidence: failure before the fix and successful live migration after applying the code changes.
Environment tested
- migrateVolume with livemigrate=true
Problem
Currently, KVM live volume migration is rejected early in VolumeApiServiceImpl for non-PowerFlex pools with the following error:
KVM does not support volume live migration due to the limited possibility to refresh VM XML domain. Therefore, to live migrate a volume between storage pools, one must migrate the VM to a different host as well to force the VM XML domain update. Use 'migrateVirtualMachineWithVolumes' instead.
When this early guard is bypassed, AncientDataMotionStrategy still creates MigrateVolumeCommand using the legacy constructor. That constructor does not populate srcData and destData.
As a result, the KVM agent receives a MigrateVolumeCommand with srcData == null, and LibvirtMigrateVolumeCommandWrapper fails with a NullPointerException while accessing srcVolumeObjectTO.getDataStore().
Functional change
This PR changes the migration flow to pass the source and destination data objects to the KVM agent and adds defensive null handling in the KVM wrapper.
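The failure mode and the defensive handling described above can be illustrated with a simplified sketch. Everything in it is a hypothetical stand-in: `Command` and `Answer` are not the real `MigrateVolumeCommand` or agent answer classes, plain strings replace the data objects, and returning a failed answer is only one possible way to handle the null case, not necessarily what this PR does.

```java
// Hypothetical stand-ins for illustration only; not the real CloudStack classes.
class MigrateVolumeGuardSketch {

    static final class Command {
        final String volumePath; // set by the legacy-style constructor
        final String srcData;    // set only by the srcData/destData-aware constructor
        final String destData;

        // Mirrors the legacy shape: source and destination data stay null.
        Command(String volumePath) {
            this(volumePath, null, null);
        }

        // Mirrors the srcData/destData-aware shape.
        Command(String srcData, String destData) {
            this(null, srcData, destData);
        }

        private Command(String volumePath, String srcData, String destData) {
            this.volumePath = volumePath;
            this.srcData = srcData;
            this.destData = destData;
        }
    }

    static final class Answer {
        final boolean success;
        final String details;

        Answer(boolean success, String details) {
            this.success = success;
            this.details = details;
        }
    }

    /** Checks the payload before dereferencing it, so a legacy command fails cleanly. */
    static Answer execute(Command cmd) {
        if (cmd.srcData == null || cmd.destData == null) {
            return new Answer(false, "Migrate volume command is missing source or destination data");
        }
        return new Answer(true, "Would migrate " + cmd.srcData + " to " + cmd.destData);
    }

    public static void main(String[] args) {
        Answer legacy = execute(new Command("/mnt/pool/volume"));
        Answer aware = execute(new Command("src-volume", "dest-volume"));
        System.out.println(legacy.success + ": " + legacy.details);
        System.out.println(aware.success + ": " + aware.details);
    }
}
```

The guard turns a legacy-shaped command into an explicit failure instead of a `NullPointerException`, while a command built with both data objects proceeds normally.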
Changes included:
- srcData/destData aware MigrateVolumeCommand constructor in AncientDataMotionStrategy.
- LibvirtMigrateVolumeCommandWrapper.execute() to avoid an NPE if a legacy command path still sends a command without srcData.
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
Screenshots were added above showing the failure before the fix and successful live migration after applying the code changes.
How Has This Been Tested?
Tested manually on Apache CloudStack 4.20.0.0 with Ubuntu 22.04 management and KVM hosts.
Test scenario:
- migrateVolume with livemigrate=true.
Before this change:
- migrateVolume with livemigrate=true failed at the API layer with the KVM live volume migration restriction.
- NullPointerException because srcVolumeObjectTO was null.
After this change:
- MigrateVolumeCommand is created with srcData and destData.
- LibvirtMigrateVolumeCommandWrapper no longer throws a NullPointerException.
I also verified the bytecode before and after the change:
- AncientDataMotionStrategy called the legacy MigrateVolumeCommand(long, String, StoragePool, String, Volume.Type, int, String) constructor.
- AncientDataMotionStrategy calls the MigrateVolumeCommand(DataTO, DataTO, Map, Map, int) constructor.
How did you try to break this feature and the system with this change?
I tested the failure path and the fixed path by rolling the staging environment back to the original behavior and then applying the fix again step by step.
Validation performed:
- srcData directly and could fail with srcVolumeObjectTO == null.
- NullPointerException if a legacy command path is still used.