vCloud Director Clone Wars Part 2 (Deep Dive)

Chris Colotti October 28, 2011 vCloud, VMware, vSphere 7 Comments

In part one we looked at the simple challenge of how to balance the I/O clone traffic across more than one host. Although the fix is not ideal, it does the job as best we can. I decided to take a deeper look at this with a different scenario: how does the cloning traffic flow when you are dealing with not only multiple clusters, but also multiple vCenter Servers and multiple datastores? Let’s review the assumptions from part one:

vApp templates in vCloud Director are not stored in vSphere as a “Template”; they are simply powered off Virtual Machines.
Powered off Virtual Machines are registered to whichever host was last running them.
DRS only moves powered on Virtual Machines based on its load-based algorithm; the exception is a Maintenance Mode process, where you have the option to move powered off Virtual Machines.
You can manually migrate powered off Virtual Machines.
When a clone happens on a given host and the target location is a datastore seen by that host, it will be a block level copy. If the target is a host without access to the source files, this will usually result in a network copy over the management interface.
The host that owns the registered Virtual Machine will be the one that performs the I/O workload of the clone.

I wanted to think about this from the standpoint of a provider that will have a Published Catalog, where the vApp templates all reside in the same Org vDC but are being deployed to completely different Provider vDCs, and possibly to different vCenter Servers. I set out just to see what happens, for the sake of understanding whether this would impact any particular design. Below is a diagram of what I have set up in my lab. All the storage is presented only to the individual clusters, so there is no cross population and I could isolate the storage for each Provider vDC.

Now there were a few things I wanted to look at to examine this clone function under vCloud Director. What I set out to do was to see what happens in each of the following scenarios and how the operation may impact a potential design:

Deploying a vApp hosted on Bronze to the Bronze VDC (Same vCenter)
Deploying a vApp from Catalog on Bronze to Silver or Gold (Separate vCenter)
Deploying a vApp from Catalog on Silver to Gold (Same vCenter separate clusters and vDC’s)

Deploying a vApp hosted on Bronze to the Bronze VDC (Same vCenter/vDC)

This was of course the easy test, simply to show that the clone operation does in fact happen on the host the vApp in the catalog is registered to. I imported a vApp into vCloud Director and it was placed arbitrarily on Host 1. When I deployed the vApp, as expected, Host 1 performed the clone operation and the I/O load was not seen at all on Host 2. When the vApp was powered on, DRS decided which host was best suited to run the workload. This was pretty simple to see in vCenter, as the network interfaces saw the load of the copy.

Deploying a vApp from Catalog on Bronze to Silver or Gold (Separate vCenters/vDC’s)

This is where things started to get a little interesting. I was not sure exactly what would happen when I tried to deploy a vApp on one Cluster/vDC to a completely different vCenter altogether. What I discovered is that vCloud Director first issues an Export OVF command on the first vCenter. Where does that export land, you ask? It gets saved to the vCloud Director /opt/vmware/vcloud-director/data/transfer folder on the Cell. Better yet, in a dual Cell setup like mine, that location is an NFS mount on the IX-400d. Once the export is complete, it deploys the vApp into vCenter 2 using an import command. So what does this mean for the copy process? Well, to be honest, this got me a little bit confused! The transfer space is only seen by the Cells, but the export is issued by vCenter Server and executed on the host where the Virtual Machine is registered. How, then, is it getting to the transfer space so vCloud Director can import it? I can show you that when the process kicked off, BOTH the ESX host and the vCloud Cell got very busy. As best I can tell without more details, the vCloud Director Cell was getting the data directly from the host through the vCloud Director Agent. The task was issued through vCenter, but the only thing that makes sense is that this is a case where the Cell was talking directly to a host to pass the data to the NFS transfer space.

Traffic on vCloud Cell During cross vCenter vApp Deployment

What this tells me is that network speed and communication between the hosts, the Cells, and the NFS transfer space is key to any design for this particular operation. It is also worth pointing out that this took a long time to complete in my lab. This operation is not unlike what vCloud Connector uses to move Virtual Machines between clouds, except that here it is moving Virtual Machines WITHIN the same cloud. What you can see is that network object 4000 (Eth0) was very busy on receive while object 4002 (Eth2, NFS) was busy on transmit. This means that the ESX host was pushing the traffic over the Management network to the Cell, and the Cell was then dropping it on NFS. The process then reverses as the deploy OVF to the other vCenter begins and the Cell transmits the data to the host. So for this use case, what I can honestly say is that the vCloud Director Cell handling the task, the source ESX host where the vApp is registered, and a random destination host will all be busy, and the affected interfaces are:

HTTP Interface on the Cell
NFS Interface on the Cell (NFS Storage as well)
Management interface on the Source and destination ESX Hosts
Storage interfaces on the source and destination ESX hosts.

If your hosts are using the software iSCSI initiator, that is a consideration as well. Interesting so far, right?
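To get a feel for why this took so long, here is a rough back-of-the-envelope sketch in Python. It assumes the export and the import run one after the other (as observed above) and that each stage is limited by the slowest network link it crosses. The function names, the 40 GB vApp size, and the link speeds are hypothetical illustration values, not measurements from my lab, and the sketch ignores storage throughput and protocol overhead.

```python
from typing import Sequence


def stage_seconds(size_gb: float, link_gbps: Sequence[float]) -> float:
    """Time for one copy stage, limited by the slowest link it crosses."""
    bottleneck_gbps = min(link_gbps)
    # 1 GB = 8 Gb, so seconds = gigabits / (gigabits per second)
    return size_gb * 8 / bottleneck_gbps


def cross_vcenter_seconds(
    size_gb: float,
    host_mgmt_gbps: float,
    cell_http_gbps: float,
    cell_nfs_gbps: float,
) -> float:
    """Estimate a cross-vCenter deploy as export followed by import."""
    # Export: source host management interface -> Cell -> NFS transfer space
    export = stage_seconds(size_gb, [host_mgmt_gbps, cell_http_gbps, cell_nfs_gbps])
    # Import: NFS transfer space -> Cell -> destination host management interface
    import_ = stage_seconds(size_gb, [cell_nfs_gbps, cell_http_gbps, host_mgmt_gbps])
    return export + import_


if __name__ == "__main__":
    # Hypothetical 40 GB vApp with 1 Gbps links everywhere
    print(cross_vcenter_seconds(40, 1, 1, 1))
```

Because the slowest link sets the pace, upgrading only the Cell's interfaces does not help in this model if the host management interfaces stay at 1 Gbps.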

Deploying a vApp from Catalog on Silver to Gold (Same vCenter separate clusters/vDC’s)

Now that we also have a vApp template in the master catalog in the same vCenter, but simply hosted on a different Provider vDC, I wanted to see if the same export/import process was used to make that move, or if something else was in play. My original hypothesis was that the host where the VM is registered would handle the load, but I wanted to see if the export/deploy was used. The good news is this does NOT require an export. However, since the source and destination datastores are not shared between the two clusters, this does revert to a standard network copy from the host where the Virtual Machine is registered to the target host. This means that the management interfaces of those hosts will handle the copy traffic between the two clusters. The respective storage interfaces will also be used for the read from the source and the write to the destination. This is the standard process that has always been used in vCenter for cross cluster copies.
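The three scenarios boil down to a simple decision, which can be sketched in Python as below. This is only a summary of what I observed in the lab, not how vCloud Director is actually implemented; the `copy_path` function, the `CopyPath` names, and the vCenter and datastore labels are all hypothetical.

```python
from enum import Enum


class CopyPath(Enum):
    BLOCK_COPY = "block level copy on the registered host"
    NETWORK_COPY = "network copy over the host management interfaces"
    OVF_EXPORT_IMPORT = "OVF export/import through the Cell transfer space"


def copy_path(src_vcenter, src_host_datastores, dst_vcenter, dst_datastore):
    """Classify the copy path observed for a vApp deployment.

    src_host_datastores: datastores visible to the host where the
    catalog vApp is registered (hypothetical labels).
    """
    if src_vcenter != dst_vcenter:
        # Separate vCenters: export to the transfer space, then import
        return CopyPath.OVF_EXPORT_IMPORT
    if dst_datastore in src_host_datastores:
        # Registered host can see the target datastore
        return CopyPath.BLOCK_COPY
    # Same vCenter, but no shared datastore between the clusters
    return CopyPath.NETWORK_COPY


if __name__ == "__main__":
    print(copy_path("vc1", {"bronze-ds"}, "vc1", "bronze-ds").value)
    print(copy_path("vc1", {"bronze-ds"}, "vc2", "gold-ds").value)
    print(copy_path("vc2", {"silver-ds"}, "vc2", "gold-ds").value)
```

In my lab layout, the three calls correspond to Bronze to Bronze, Bronze to Gold, and Silver to Gold respectively.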

Other Processes That Involve the vCloud Director Cell (Added 11/3/11)

Some other things came to mind after writing this. There are other functions that use the vCD Cell’s transfer space, adding not only to the size requirement but also to the bandwidth needed between the Cells and the NFS storage. Some of these other functions include, but are not limited to:

Importing a vApp into vCloud Director from vCenter
Importing using an OVF format
Importing via vCloud Connector
Downloading a vApp from vCloud Director
Copying a vApp in your cloud to the Catalog (depending on datastore locale)

I am sure there are others, but this is just what I have seen in the testing for this series. I hope you see that the Cell’s transfer space can be a busy animal in large scale deployments. Part three has some recommendations on how to deal with this, but you definitely need to be aware of what else is passing through the vCloud Director Cell and its shared transfer space.
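As a very rough way to think about the size requirement, here is a small Python sketch. It assumes that every in-flight operation holds one full copy of its vApp in the transfer space at the same time, which is a simplifying assumption on my part rather than documented behavior. The function name, the headroom factor, and the numbers are hypothetical.

```python
def transfer_space_gb(concurrent_ops: int, avg_vapp_gb: float, headroom: float = 0.25) -> float:
    """Rough transfer space estimate.

    Assumes each concurrent operation stages one full vApp copy,
    plus a fractional headroom on top (hypothetical default of 25%).
    """
    return concurrent_ops * avg_vapp_gb * (1 + headroom)


if __name__ == "__main__":
    # Hypothetical: 10 simultaneous operations on 40 GB vApps
    print(transfer_space_gb(10, 40))
```

Treat the result as a starting point for a conversation about sizing, not as a recommendation.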

Conclusions

These will come in detail in Part Three of this little experiment, but there are a couple of ways I would think about designing a large scale, multi-vCenter based vCloud Director deployment. While we cannot change the way vSphere handles the clone process on a host today, we could design creatively around it. However, we have also seen that separate clusters and vCenters will create network load on the vCD Cells, as well as on the shared transfer space. Although some of this is new, the copy process within the same vCenter is not. That being said, with potentially hundreds or thousands of users deploying vApps, we could try to design for it. Some of these vSphere specific considerations are exactly what I have spoken about in my VMworld sessions.

Jump to Part Three

7 comments

cwjking
November 21, 2011 at 8:38 pm

Hey Chris,
Good write up indeed. This makes me consider a lot.

I guess I should say we are very fortunate running CiscoUCS 10GB FCoE Interfaces and on top of that our design is only one cluster.

HOWEVER, this makes me consider our multi site… We don’t use two sites for one cloud… but we use two separate sites for location options to our end clients. If we do plan on merging these two sites at a later time this is going to be important. Right now with the 10GB FCoE pipe I think we shouldn’t really see an issue?

Your thoughts?