Fix KVM live volume migration command payload#13319

Open
vanquyen020920 wants to merge 3 commits into
apache:mainfrom
vanquyen020920:main

@vanquyen020920

@vanquyen020920 vanquyen020920 commented Jun 2, 2026

Description

This PR fixes KVM live volume migration for storage pools where storage motion is supported.

[Screenshots: 1failed, 2failed, 3oke]

Test evidence: failure before the fix and successful live migration after applying the code changes.

Environment tested

  • Apache CloudStack: 4.20.0.0
  • Management server OS: Ubuntu 22.04
  • KVM host OS: Ubuntu 22.04
  • Hypervisor: KVM
  • VM state: Running
  • Source primary storage: NetworkFilesystem
  • Destination primary storage: RBD / Ceph
  • API: migrateVolume with livemigrate=true

Problem

Currently, KVM live volume migration is rejected early in VolumeApiServiceImpl for non-PowerFlex pools with the following error:

KVM does not support volume live migration due to the limited possibility to refresh VM XML domain. Therefore, to live migrate a volume between storage pools, one must migrate the VM to a different host as well to force the VM XML domain update. Use 'migrateVirtualMachineWithVolumes' instead.

When this early guard is bypassed, AncientDataMotionStrategy still creates MigrateVolumeCommand using the legacy constructor. That constructor does not populate srcData and destData.

As a result, the KVM agent receives a MigrateVolumeCommand with srcData == null, and LibvirtMigrateVolumeCommandWrapper fails with a NullPointerException while accessing srcVolumeObjectTO.getDataStore().
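The failure mode described above can be reproduced in isolation. The following is a minimal sketch, assuming heavily simplified stand-ins for the real classes: `MigrateVolumeCommandSketch`, `VolumeTOSketch`, and `DataStoreTOSketch` are hypothetical names, and the real constructors take more parameters than shown here.

```java
// Hypothetical, simplified stand-ins for the CloudStack types named in the
// description; the real classes carry many more fields and parameters.
class DataStoreTOSketch {
    final String uuid;
    DataStoreTOSketch(String uuid) { this.uuid = uuid; }
}

class VolumeTOSketch {
    private final DataStoreTOSketch dataStore;
    VolumeTOSketch(DataStoreTOSketch dataStore) { this.dataStore = dataStore; }
    DataStoreTOSketch getDataStore() { return dataStore; }
}

class MigrateVolumeCommandSketch {
    private final String volumePath;
    private final VolumeTOSketch srcData;
    private final VolumeTOSketch destData;

    // Legacy-style constructor: only scalar fields are set, srcData/destData stay null.
    MigrateVolumeCommandSketch(String volumePath) {
        this.volumePath = volumePath;
        this.srcData = null;
        this.destData = null;
    }

    // srcData/destData-aware constructor, as used after the fix.
    MigrateVolumeCommandSketch(VolumeTOSketch srcData, VolumeTOSketch destData) {
        this.volumePath = null;
        this.srcData = srcData;
        this.destData = destData;
    }

    String getVolumePath() { return volumePath; }
    VolumeTOSketch getSrcData() { return srcData; }
    VolumeTOSketch getDestData() { return destData; }
}

class LegacyConstructorDemo {
    // Mirrors the unguarded dereference: fails when srcData is null.
    static String sourcePoolUuid(MigrateVolumeCommandSketch cmd) {
        return cmd.getSrcData().getDataStore().uuid;
    }

    public static void main(String[] args) {
        VolumeTOSketch src = new VolumeTOSketch(new DataStoreTOSketch("src-pool"));
        VolumeTOSketch dest = new VolumeTOSketch(new DataStoreTOSketch("dest-pool"));
        System.out.println(sourcePoolUuid(new MigrateVolumeCommandSketch(src, dest)));
        try {
            sourcePoolUuid(new MigrateVolumeCommandSketch("/mnt/pool/volume"));
        } catch (NullPointerException e) {
            System.out.println("legacy command: srcData is null");
        }
    }
}
```

The sketch only shows why the choice of constructor matters; it says nothing about what else the agent needs in order to migrate a disk that is in use.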

Functional change

This PR changes the migration flow to pass the source and destination data objects to the KVM agent and adds defensive null handling in the KVM wrapper.

Changes included:

  • Allows the KVM live volume migration flow to continue instead of rejecting it early for non-PowerFlex pools.
  • Uses the srcData / destData aware MigrateVolumeCommand constructor in AncientDataMotionStrategy.
  • Adds defensive null handling in LibvirtMigrateVolumeCommandWrapper.execute() to avoid an NPE if a legacy command path still sends a command without srcData.
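The defensive null handling in the last item could take several shapes. The description does not say whether the wrapper fails fast or falls back to the legacy fields, so the following is only one possible sketch, assuming a fail-fast answer. `GuardCommand`, `GuardAnswer`, and `NullGuardSketch` are hypothetical names, not CloudStack classes.

```java
// Hypothetical minimal types; the real command and answer classes differ.
class GuardSourceVolume {
    final String poolUuid;
    GuardSourceVolume(String poolUuid) { this.poolUuid = poolUuid; }
}

class GuardCommand {
    final GuardSourceVolume srcData; // null when built through a legacy path
    GuardCommand(GuardSourceVolume srcData) { this.srcData = srcData; }
}

class GuardAnswer {
    final boolean result;
    final String details;
    GuardAnswer(boolean result, String details) {
        this.result = result;
        this.details = details;
    }
}

class NullGuardSketch {
    // Returns a failed answer with a readable reason instead of letting a
    // NullPointerException escape from the wrapper.
    static GuardAnswer execute(GuardCommand cmd) {
        if (cmd == null || cmd.srcData == null) {
            return new GuardAnswer(false,
                    "Command carries no source volume data; it was probably built by a legacy code path.");
        }
        return new GuardAnswer(true, "Source pool: " + cmd.srcData.poolUuid);
    }

    public static void main(String[] args) {
        System.out.println(execute(new GuardCommand(null)).details);
        System.out.println(execute(new GuardCommand(new GuardSourceVolume("src-pool"))).details);
    }
}
```

Returning a failed answer keeps the error visible to the management server, whereas silently continuing with missing data could hide the problem.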

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • Build/CI
  • Test (unit or integration test code)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

Screenshots were added above showing the failure before the fix and successful live migration after applying the code changes.

How Has This Been Tested?

Tested manually on Apache CloudStack 4.20.0.0 with Ubuntu 22.04 management and KVM hosts.

Test scenario:

  • VM was running on KVM.
  • Source primary storage was NetworkFilesystem.
  • Destination primary storage was RBD / Ceph.
  • Live volume migration was triggered using migrateVolume with livemigrate=true.

Before this change:

  1. migrateVolume with livemigrate=true failed at the API layer with the KVM live volume migration restriction.
  2. After bypassing the API guard, the KVM agent failed with a NullPointerException because srcVolumeObjectTO was null.

After this change:

  1. MigrateVolumeCommand is created with srcData and destData.
  2. The KVM agent receives the source and destination volume objects.
  3. LibvirtMigrateVolumeCommandWrapper no longer throws a NullPointerException.
  4. Live volume migration from NetworkFilesystem to RBD completed successfully for a running VM.

I also verified the bytecode before and after the change:

  • Before: AncientDataMotionStrategy called the legacy MigrateVolumeCommand(long, String, StoragePool, String, Volume.Type, int, String) constructor.
  • After: AncientDataMotionStrategy calls the MigrateVolumeCommand(DataTO, DataTO, Map, Map, int) constructor.

How did you try to break this feature and the system with this change?

I tested the failure path and the fixed path by rolling the staging environment back to the original behavior and then applying the fix again step by step.

Validation performed:

  • Confirmed the original API-level KVM restriction returned before applying the fix.
  • Confirmed the original KVM agent wrapper dereferenced srcData directly and could fail with srcVolumeObjectTO == null.
  • Applied the changes and retested the same live volume migration operation.
  • Verified that the command sent to the agent included source and destination data.
  • Verified that the migration completed successfully.
  • Verified that the fallback null guard in the KVM wrapper prevents a NullPointerException if a legacy command path is still used.

@boring-cyborg

boring-cyborg Bot commented Jun 2, 2026

Congratulations on your first Pull Request and welcome to the Apache CloudStack community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/cloudstack/blob/main/CONTRIBUTING.md)
Here are some useful points:

@GutoVeronezi
Contributor

@vanquyen020920, did you check the final VM XML?

Member

@bernardodemarco bernardodemarco left a comment

@vanquyen020920, thanks for the PR. I have tested it locally:

  • Created a VM with multiple disks;

    virsh domblklist i-2-11-VM --details
    Type   Device   Target   Source
    ----------------------------------------------------------------------------------------------------------
    file   disk     hda      /mnt/2944d4ce-30c9-3c5f-8379-e49054e7c768/19390437-1503-4034-9033-6f72b094384e
    file   cdrom    hdc      -
    file   disk     vdb      /mnt/e6f01a81-688b-384a-83e6-7fa6d4605d62/e13e4f6b-0a48-4001-9da8-c1525deda620
  • Migrated the hda device from pri-stor-02-cluster-01-zn-xap-01 (NFS) to pri-stor-01-cluster-01-zn-xap-01 (NFS)

  • Verified that the VM's domain XML was not updated:

    virsh domblklist i-2-11-VM --details
    Type   Device   Target   Source
    ----------------------------------------------------------------------------------------------------------
    file   disk     hda      /mnt/2944d4ce-30c9-3c5f-8379-e49054e7c768/19390437-1503-4034-9033-6f72b094384e
    file   cdrom    hdc      -
    file   disk     vdb      /mnt/e6f01a81-688b-384a-83e6-7fa6d4605d62/e13e4f6b-0a48-4001-9da8-c1525deda620
  • Migrated the vdb device from pri-stor-01-cluster-01-zn-xap-01 (NFS) to pri-stor-02-cluster-02-zn-xap-01 (NFS)

  • Verified that the VM's domain XML was not updated:

    virsh domblklist i-2-11-VM --details
    Type   Device   Target   Source
    ----------------------------------------------------------------------------------------------------------
    file   disk     hda      /mnt/2944d4ce-30c9-3c5f-8379-e49054e7c768/19390437-1503-4034-9033-6f72b094384e
    file   cdrom    hdc      -
    file   disk     vdb      /mnt/e6f01a81-688b-384a-83e6-7fa6d4605d62/e13e4f6b-0a48-4001-9da8-c1525deda620
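The before/after comparison of `virsh domblklist --details` output shown above can be automated. A minimal sketch, assuming the four-column layout shown (Type, Device, Target, Source) and source paths without whitespace; `DomblklistSketch` is a hypothetical helper, not part of CloudStack.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DomblklistSketch {
    // Parses `virsh domblklist <vm> --details` output into target -> source.
    static Map<String, String> parse(String output) {
        Map<String, String> disks = new LinkedHashMap<>();
        for (String line : output.split("\\R")) {
            String[] cols = line.trim().split("\\s+");
            // Data rows have exactly four columns; skip the header, the
            // dashed separator, and blank lines.
            if (cols.length != 4 || cols[0].equals("Type")) {
                continue;
            }
            disks.put(cols[2], cols[3]);
        }
        return disks;
    }

    // True only when the target exists in both listings and its source differs.
    static boolean sourceChanged(String before, String after, String target) {
        String oldSource = parse(before).get(target);
        String newSource = parse(after).get(target);
        return oldSource != null && newSource != null && !oldSource.equals(newSource);
    }

    public static void main(String[] args) {
        String listing = String.join("\n",
                "Type   Device   Target   Source",
                "--------------------------------",
                "file   disk     hda      /mnt/pool-a/volume-1",
                "file   cdrom    hdc      -");
        System.out.println(parse(listing));
        System.out.println(sourceChanged(listing, listing, "hda"));
    }
}
```

If `sourceChanged` returns false for the migrated target, as with the listings above, the running domain is still using the original disk.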

During tests, I also verified that only the metadata of the volumes was updated. This happens because, as indicated by the error message below, Apache CloudStack currently does not support live migration of volumes when using KVM with file-based storage pools (NFS or SharedMountPoint) or Ceph RBD.

throw new InvalidParameterValueException("KVM does not support volume live migration due to the limited possibility to refresh VM XML domain. " +
"Therefore, to live migrate a volume between storage pools, one must migrate the VM to a different host as well to force the VM XML domain update. " +
"Use 'migrateVirtualMachineWithVolumes' instead.");

Implementing this feature involves addressing a number of requirements and edge cases, including:

  • Updating VM domain XML definitions on the fly;
  • Handling volume backing chains with multiple deltas;
  • Handling volumes with incremental snapshots;
  • Handling VMs with incremental snapshots;
  • Supporting migrations between file-based storage pools and RBD;
  • And other related scenarios.

I am currently working on designing and implementing this functionality. The current plan is to have it available during the third or fourth quarter of this year.

  StoragePool srcPool = (StoragePool)dataStoreMgr.getDataStore(srcData.getDataStore().getId(), DataStoreRole.Primary);
  StoragePool destPool = (StoragePool)dataStoreMgr.getDataStore(destData.getDataStore().getId(), DataStoreRole.Primary);
- MigrateVolumeCommand command = new MigrateVolumeCommand(volume.getId(), volume.getPath(), destPool, volume.getAttachedVmName(), volume.getVolumeType(), waitInterval, volume.getChainInfo());
+ MigrateVolumeCommand command = new MigrateVolumeCommand(srcData.getTO(), destData.getTO(), null, null, waitInterval);
Member

This will have a huge impact.

- throw new InvalidParameterValueException("KVM does not support volume live migration due to the limited possibility to refresh VM XML domain. " +
- "Therefore, to live migrate a volume between storage pools, one must migrate the VM to a different host as well to force the VM XML domain update. " +
- "Use 'migrateVirtualMachineWithVolumes' instead.");
+ logger.debug("Allowing KVM live volume migration between different storage pools. VM [{}], volume [{}], source pool [{}], destination pool [{}].",
Member

This will have a huge impact too.

@vanquyen020920
Author

Thanks @bernardodemarco for testing and for the detailed feedback.

You were right. My initial patch was incomplete: it allowed the API/data-motion flow to continue and fixed the srcData == null issue, but it did not update the active libvirt domain disk source. As a result, CloudStack volume metadata could be updated while the running VM was still using the old source disk.

I reworked the KVM agent implementation locally so that the regular live migration path uses libvirt block copy + pivot instead of relying only on copyPhysicalDisk().

The updated flow I tested is:

  1. Detect regular live migration for an attached/running KVM volume.
  2. Prepare the destination disk on the destination storage pool.
  3. Create the destination physical disk if it does not already exist.
  4. Generate the destination libvirt disk XML.
  5. Run libvirt blockCopy.
  6. Wait for the block job to complete.
  7. Pivot the running disk to the destination.
  8. Verify that the active and inactive libvirt domain XML point to the destination disk before returning success.
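Steps 2 to 8 above can be sketched as a small control flow. This is not the code from the PR: `BlockCopyDriver` is a hypothetical interface that stands in for the libvirt and storage calls the agent would make, and none of its method names come from CloudStack or the libvirt Java bindings. In libvirt itself, the copy is started with virDomainBlockCopy and the pivot is requested through virDomainBlockJobAbort with the pivot flag.

```java
// Hypothetical abstraction over the libvirt and storage operations.
interface BlockCopyDriver {
    boolean destinationExists(String destPath);
    void createDestination(String destPath, long sizeBytes);
    void startBlockCopy(String diskTarget, String destDiskXml);
    int progressPercent(String diskTarget);
    void pivot(String diskTarget);
    void abort(String diskTarget);
    String activeSource(String diskTarget);
    String inactiveSource(String diskTarget);
}

class LiveVolumeMigrationSketch {
    static boolean migrate(BlockCopyDriver driver, String diskTarget, String destPath,
                           String destDiskXml, long sizeBytes, int maxPolls) {
        // Steps 2-3: make sure the destination disk exists.
        if (!driver.destinationExists(destPath)) {
            driver.createDestination(destPath, sizeBytes);
        }
        // Steps 4-5: the caller supplies the destination disk XML; start the copy.
        driver.startBlockCopy(diskTarget, destDiskXml);
        // Step 6: poll until the mirror is complete (real code would sleep between polls).
        boolean mirrored = false;
        for (int i = 0; i < maxPolls && !mirrored; i++) {
            mirrored = driver.progressPercent(diskTarget) >= 100;
        }
        if (!mirrored) {
            driver.abort(diskTarget); // leave the VM on its original disk
            return false;
        }
        // Step 7: switch the running disk to the destination.
        driver.pivot(diskTarget);
        // Step 8: report success only if both definitions point to the destination.
        return destPath.equals(driver.activeSource(diskTarget))
                && destPath.equals(driver.inactiveSource(diskTarget));
    }
}

// In-memory fake used only to exercise the sketch.
class FakeBlockCopyDriver implements BlockCopyDriver {
    private final int stepPercent;
    private String source;
    private String pendingDest;
    private boolean created;
    private int progress;

    FakeBlockCopyDriver(String initialSource, int stepPercent) {
        this.source = initialSource;
        this.stepPercent = stepPercent;
    }

    public boolean destinationExists(String destPath) { return created; }
    public void createDestination(String destPath, long sizeBytes) { created = true; pendingDest = destPath; }
    public void startBlockCopy(String diskTarget, String destDiskXml) { progress = 0; }
    public int progressPercent(String diskTarget) {
        progress = Math.min(100, progress + stepPercent);
        return progress;
    }
    public void pivot(String diskTarget) { source = pendingDest; }
    public void abort(String diskTarget) { pendingDest = null; }
    public String activeSource(String diskTarget) { return source; }
    public String inactiveSource(String diskTarget) { return source; }
}
```

Step 1 (detecting that the volume is attached to a running VM) is left to the caller. The sketch aborts the job on timeout so that a failed migration leaves the VM on its original disk, which is an assumption about the desired behavior rather than something stated in the thread.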

Test environment:

  • Apache CloudStack 4.20.0.0
  • Ubuntu 22.04
  • KVM
  • Source primary storage: NFS / file-based primary storage
  • Destination primary storage: Ceph RBD

Test 1: system VM / virtual router root disk migration

Before migration, the running VM was using the NFS/file-based source:

file disk vda /mnt/0d464e3f-5176-3ba8-8c5f-01b8f4da5a2d/cbe98e70-0ec8-4dfa-8888-42b38a763672

During migration, the agent created the destination RBD volume and completed the live block copy:

Destination disk [cbe98e70-0ec8-4dfa-8888-42b38a763672] does not exist on pool [a15210c6-c858-3174-a390-183f4ed25096]. Creating it before live block copy.
Attempting to create volume cbe98e70-0ec8-4dfa-8888-42b38a763672 (RBD) in pool a15210c6-c858-3174-a390-183f4ed25096 with size (4.88 GB) 5242880000
Block copy has started for regular volume vda : cbe98e70-0ec8-4dfa-8888-42b38a763672
Block copy completed for the volume vda : cbe98e70-0ec8-4dfa-8888-42b38a763672

After migration, virsh domblklist shows that the active disk source was pivoted to Ceph RBD:

network disk vda CMC-CLOUDSTACK/cbe98e70-0ec8-4dfa-8888-42b38a763672

I also verified both active and inactive libvirt XML, and both now point to the RBD destination:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='none'/>
  <source protocol='rbd' name='CMC-CLOUDSTACK/cbe98e70-0ec8-4dfa-8888-42b38a763672'>
    <host name='10.14.5.55'/>
    <host name='10.14.5.56'/>
    <host name='10.14.5.57'/>
    <auth username='admin'>
      <secret type='ceph' uuid='a15210c6-c858-3174-a390-183f4ed25096'/>
    </auth>
  </source>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
</disk>
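The check that a disk definition points to the destination can be done programmatically on XML like the above. A minimal sketch using the JDK's built-in DOM parser; `DiskSourceCheckSketch` is a hypothetical helper, and it only handles `name=` (network disks) and `file=` (file-backed disks) source attributes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DiskSourceCheckSketch {
    // Returns the source of the disk whose <target dev='...'> matches, or null
    // when no such disk (or no <source>) is present in the given XML.
    static String sourceOf(String xml, String targetDev) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse domain XML", e);
        }
        NodeList disks = doc.getElementsByTagName("disk");
        for (int i = 0; i < disks.getLength(); i++) {
            Element disk = (Element) disks.item(i);
            Element target = firstDescendant(disk, "target");
            Element source = firstDescendant(disk, "source");
            if (target == null || source == null || !targetDev.equals(target.getAttribute("dev"))) {
                continue;
            }
            // Network (e.g. RBD) disks use name=, file-backed disks use file=.
            String name = source.getAttribute("name");
            return name.isEmpty() ? source.getAttribute("file") : name;
        }
        return null;
    }

    private static Element firstDescendant(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : (Element) nodes.item(0);
    }

    public static void main(String[] args) {
        String xml = "<disk type='network' device='disk'>"
                + "<source protocol='rbd' name='EXAMPLE-POOL/volume-1'/>"
                + "<target dev='vda' bus='virtio'/></disk>";
        System.out.println(sourceOf(xml, "vda"));
    }
}
```

Running the same check against both the live and the persistent domain XML would cover the active and inactive definitions.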

Test 2: user VM root disk migration

I also tested with a normal user VM.

Before migration, the VM root disk was on NFS/file-based storage:

file disk vda /mnt/0d464e3f-5176-3ba8-8c5f-01b8f4da5a2d/65ca9af4-ccb8-43a2-96f1-c91202cd192f

The guest had a test file before migration:

cat test.txt
Test Migrate Disk

After live migration to Ceph RBD, CloudStack UI shows the root volume on CPM-CEPH.

The agent log shows the destination RBD volume was created and block copy completed:

Using live block copy path for regular volume migration. VM [i-2-213-VM], source path [65ca9af4-ccb8-43a2-96f1-c91202cd192f], destination path [65ca9af4-ccb8-43a2-96f1-c91202cd192f], destination pool [a15210c6-c858-3174-a390-183f4ed25096].
Preparing destination disk for regular live volume migration. Destination path [65ca9af4-ccb8-43a2-96f1-c91202cd192f], destination pool [a15210c6-c858-3174-a390-183f4ed25096].
Destination disk [65ca9af4-ccb8-43a2-96f1-c91202cd192f] was not found on pool [a15210c6-c858-3174-a390-183f4ed25096]. It will be created before live block copy.
Destination disk [65ca9af4-ccb8-43a2-96f1-c91202cd192f] does not exist on pool [a15210c6-c858-3174-a390-183f4ed25096]. Creating it before live block copy.
Attempting to create volume 65ca9af4-ccb8-43a2-96f1-c91202cd192f (RBD) in pool a15210c6-c858-3174-a390-183f4ed25096 with size (10.00 GB) 10737418240
Block copy has started for regular volume vda : 65ca9af4-ccb8-43a2-96f1-c91202cd192f
Block copy completed for the volume vda : 65ca9af4-ccb8-43a2-96f1-c91202cd192f

After migration, I verified the guest remained running and the test file was still readable:

cat test.txt
Test Migrate Disk
Final

This addresses the specific issue you found where only CloudStack metadata changed while the VM domain XML remained unchanged. With the updated implementation, the running libvirt domain is pivoted to the destination storage.

I agree that the broader feature still needs careful validation for additional edge cases, including:

  • backing chains with multiple deltas;
  • incremental volume snapshots;
  • VM snapshots;
  • file-based to file-based migrations;
  • RBD to file-based migrations;
  • multi-disk VMs;
  • reboot validation after migration.

I will push the updated implementation and include this validation evidence so it can be reviewed and tested further.

