Subject: Replacing a SAN in a Microsoft Hyper-V cluster
Technology: Microsoft Hyper-V, Windows 2008 R2 cluster, iSCSI SAN
Problem description:
One of our clients had mistakenly purchased a single-controller iSCSI SAN device, which we wanted to replace. The client needed more storage anyway, so purchasing an additional small dual-controller iSCSI device was not a problem. The challenge was to add the dual-controller SAN to the Hyper-V cluster and then retire the old single-controller SAN.
The new SAN was configured and exposed to the Hyper-V cluster in parallel with the old single-controller SAN. The new storage was also configured as Cluster Shared Storage. The next step was to move the VM instances from the old storage to the new storage.
First we tried to move them with VMM, and that worked fine for small VM instances. We hit a problem when VMM started moving a 300 GB instance.
The operation started but then got stuck, and the engineer had to cancel it. The VM could not be brought up after the cancellation. Then we noticed that the status of the VM in the cluster manager was “Destroying”. That was an unpleasant surprise.
When the instance was finally “Destroyed”, we were able to start it after fixing the broken VM configuration in the cluster.
It looked as if VMM had made a backup and then tried to push that backup to the new storage, but timed out.
Next we tried simply stopping the VM instance and copying it with the xcopy or robocopy utility. The copy ran fast at the beginning, but then the I/O rate dropped dramatically. At that rate it would have taken days to copy the 300 GB instance.
After investigating the I/O with Microsoft performance tools, we found the cause of the performance problem.
Solution:
Hyper-V uses Cluster Shared Storage (Microsoft's Cluster Shared Volumes feature), which relies on sharing techniques that allow Hyper-V running on different physical nodes to access the shared storage directly. Before Cluster Shared Storage, MS Cluster locked shared storage to a single cluster node and blocked the other nodes from accessing it. With Cluster Shared Storage, one node is still the owner of the storage, but the other nodes can access it under the C:\ClusterStorage folder.
MS Hyper-V is optimized to access that location; utilities like xcopy and robocopy are not. If the source storage is owned by Node2, the target storage is owned by Node3, and xcopy or robocopy is started on Node1, additional locking requests are made to Node2 and Node3 to coordinate the work. Those locking requests travel over the IP network and slow down the copy operation.
To resolve the problem, move ownership of both the source and the target storage to a single node, then run xcopy, robocopy, or a regular Windows Explorer copy on that node to move the VM instance's VHD file. The locking is then performed on the local host without any network traffic.
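The steps above can be sketched as a short script. This is only an illustration under stated assumptions, not the exact procedure we ran: the node name, volume names, folder paths and VHD file name are hypothetical placeholders, and the sketch assumes the FailoverClusters PowerShell module (which provides Move-ClusterSharedVolume) is available on the node. It builds the commands in order: first move ownership of each shared volume to the local node, then copy the VHD with robocopy in restartable mode. The VM should be stopped before the copy starts.

```python
import subprocess
from typing import List


def build_copy_plan(local_node: str, csv_names: List[str],
                    source_dir: str, target_dir: str,
                    vhd_file: str) -> List[List[str]]:
    """Return the commands to run on local_node, in order, as argument lists."""
    plan: List[List[str]] = []
    for name in csv_names:
        # Move ownership of each shared volume to the node doing the copy,
        # so that locking stays on the local host.
        ps_command = (
            "Import-Module FailoverClusters; "
            f"Move-ClusterSharedVolume -Name '{name}' -Node '{local_node}'"
        )
        plan.append(["powershell.exe", "-NoProfile", "-Command", ps_command])
    # /Z = restartable mode, /R = retry count, /W = wait between retries (seconds)
    plan.append(["robocopy", source_dir, target_dir, vhd_file,
                 "/Z", "/R:2", "/W:5"])
    return plan


def run_plan(plan: List[List[str]]) -> None:
    """Run each command and stop at the first failure."""
    for command in plan:
        result = subprocess.run(command)
        # robocopy reports success with exit codes below 8;
        # for the other commands any non-zero code is treated as a failure.
        failure_threshold = 8 if command[0].lower() == "robocopy" else 1
        if result.returncode >= failure_threshold:
            raise RuntimeError(
                f"Command failed with code {result.returncode}: {command}")


if __name__ == "__main__":
    # Hypothetical names and paths, for illustration only.
    example = build_copy_plan(
        "NODE1",
        ["Cluster Disk 1", "Cluster Disk 2"],
        r"C:\ClusterStorage\Volume1\VM1",
        r"C:\ClusterStorage\Volume2\VM1",
        "disk.vhd",
    )
    for command in example:
        print(command)
```

Calling run_plan(build_copy_plan(...)) on the node that should own both volumes would execute the plan; printing it first, as in the example, makes it easy to review the commands before anything is moved.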