Disaster Recovery in vCloud Director

Chris Colotti, February 13, 2012. Tags: Site Recovery Manager, vCloud, VMware, vSphere

Note: This is a re-post of the article written for the VMware vCloud Blogs.

This article assumes the reader has knowledge of vCloud Director, Site Recovery Manager, and vSphere. It does not go into depth on every topic; for more detail on the concepts, refer to the Site Recovery Manager, vCloud Director, and vSphere documentation. This solution is very much about Disaster Recovery OF the cloud infrastructure itself, using Site Recovery Manager and vCloud Director. It is not intended to be a fully integrated solution, but rather a way to use both products together to achieve BCDR of the cloud.

Creating DR solutions for vCloud Director poses multiple challenges, all of which share a common theme: the automatic creation of objects by VMware vCloud Director, such as resource pools, virtual machines, folders, and portgroups. vCloud Director and vCenter Server both rely heavily on managed object reference identifiers (MoRef IDs) for these objects. Any unplanned change to these identifiers could, and often will, result in loss of functionality. vSphere Site Recovery Manager currently does not support protection of virtual machines managed by vCloud Director.

The following vCloud Director and vCenter objects are referenced by both products and have been identified as causing problems when their identifiers change:

Folders
Virtual machines
Resource Pools
Portgroups

Besides automatically created objects, the following pre-created static objects are also often used and referenced by vCloud Director:

Clusters
Datastores

Over the last few months we have worked on and validated a solution that avoids changes to any of these objects. This solution simplifies the recovery of a vCloud infrastructure and increases management infrastructure resiliency. The amazing thing is that it can be implemented today with current products.

In this blog post we will give an overview of the developed solution and the basic concepts. For more details, implementation guidance, or information about possible automation points, we recommend contacting your VMware representative and engaging VMware Professional Services.

Logical Architecture Overview

vCloud Director disaster recovery can be achieved through various scenarios and configurations. This blog post is focused on a single scenario to allow for a simple explanation of the concept. A white paper explaining some of the basic concepts is also currently being developed and will be released soon. The concept can easily be adapted for other scenarios; however, you should inquire first to ensure supportability. This scenario uses a so-called “Active / Standby” approach where hosts in the recovery site are not in use for regular workloads.

In order to ensure all management components are restarted in the correct order and in the least amount of time, vSphere Site Recovery Manager will be used to orchestrate the fail-over. At the time of writing, vSphere Site Recovery Manager does not support the protection of VMware vCloud Director workloads. Due to this limitation, these workloads will be failed over through several manual steps. All of these steps can be automated using tools like vSphere PowerCLI or vCenter Orchestrator.

The following diagram depicts a logical overview of the management clusters for both the protected and the recovery site.

In this scenario Site Recovery Manager will be leveraged to fail over all vCloud Director management components. Each site is required to have a management vCenter Server and an SRM Server, which aligns with standard SRM design concepts.

Since SRM cannot be used for vCloud Director workloads there is no requirement to have an SRM environment connecting to the vCloud resource cluster’s vCenter Server. In order to facilitate a fail-over of the VMware vCloud Director workloads a standard disaster recovery concept is used. This concept leverages common replication technology and vSphere features to allow for a fail-over. This will be described below.

The below diagram depicts the VMware vCloud Director infrastructure architecture used for this case study.

Both the Protected and the Recovery Sites have a management cluster. Each of these contains a vCenter Server and an SRM Server, which are used to facilitate the disaster recovery procedures. The vCloud Director management virtual machines are protected by SRM. Within SRM a protection group and recovery plan will be created to allow for a fail-over to the Recovery Site.

Please note that storage is not stretched in this environment: hosts in the Recovery Site cannot see storage in the Protected Site and therefore cannot run vCloud Director workloads in a normal situation. It is also important to note that these hosts are attached to the cluster’s DVSwitch, to allow for quick access to the vCloud configured port groups, and are pre-prepared by vCloud Director.

These hosts are depicted as hosts placed in maintenance mode. They can also be stand-alone hosts that are added to the vCloud Director resource cluster during the fail-over. For simplification and visualization purposes, this scenario describes the situation where the hosts are part of the cluster and placed in maintenance mode.

Storage replication technology is used to replicate LUNs from the Protected Site to the Recovery Site. This can be done using asynchronous or synchronous replication; typically this depends on the Recovery Point Objective (RPO) determined in the service level agreement (SLA) as well as the distance between the two sites. In our scenario synchronous replication was used.

Fail-Over Procedure

In this section the basic steps required for a successful fail-over of a VMware vCloud Director environment are described. These steps are pertinent to the described scenario.

It is essential that each component of the vCloud Director management stack be booted in the correct order. The order in which the components should be restarted is configured in an SRM recovery plan and can be initiated by SRM with a single button. The following order was used to power-on the vCloud Director management virtual machines:

Database Server (providing vCloud Director, vCenter Server, vCenter Orchestrator, and Chargeback Databases)
vCenter Server
vShield Manager
vCenter Chargeback (if in use)
vCenter Orchestrator (if in use)
vCloud Director Cell 1
vCloud Director Cell 2
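
As a rough sketch of how this ordering might be captured outside of SRM (for documentation or a pre-flight check, for instance), the Python snippet below encodes the list above and drops the "if in use" components that are not deployed. The names `BOOT_ORDER` and `power_on_sequence` are hypothetical and not part of any VMware tool; in the scenario described here the actual ordering lives in the SRM recovery plan.

```python
# Power-on order taken from the list above.
# The boolean marks components listed as "(if in use)".
BOOT_ORDER = [
    ("Database Server", False),
    ("vCenter Server", False),
    ("vShield Manager", False),
    ("vCenter Chargeback", True),
    ("vCenter Orchestrator", True),
    ("vCloud Director Cell 1", False),
    ("vCloud Director Cell 2", False),
]


def power_on_sequence(deployed_optional=()):
    """Return component names in boot order, skipping optional ones not deployed."""
    deployed = set(deployed_optional)
    known_optional = {name for name, optional in BOOT_ORDER if optional}
    unknown = deployed - known_optional
    if unknown:
        raise ValueError(f"not an optional component: {sorted(unknown)}")
    return [name for name, optional in BOOT_ORDER if not optional or name in deployed]


# Example: an environment that runs Orchestrator but not Chargeback.
for step, name in enumerate(power_on_sequence(["vCenter Orchestrator"]), start=1):
    print(step, name)
```

Keeping the order as data rather than as a hard-coded script makes it easy to compare against the recovery plan priorities whenever a management component is added or retired.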

When the fail-over of the vCloud Director management virtual machines in the management cluster has succeeded, multiple steps are required to recover the vCloud Director workloads. These are described in a manual fashion but can be automated using PowerCLI or vCenter Orchestrator.

Validate all vCloud Director management virtual machines are powered on
Using your storage management utility break replication for the datastores connected to the vCloud Director resource cluster and make the datastores read/write (if required by storage platform)
Mask the datastores to the recovery site (if required by storage platform)
Using ESXi command line tools mount the volumes of the vCloud Director resource cluster on each host of the cluster
- esxcfg-volume -m <VMFS UUID or label>
Using vCenter Server rescan the storage and validate all volumes are available
Take the hosts out of maintenance mode for the vCloud Director resource cluster (or add the hosts to your cluster, depending on the chosen strategy)
In our tests the virtual machines were automatically powered on by vSphere HA. vSphere HA is aware of the situation before the fail-over and will power on the virtual machines according to the last known state
- Alternatively, virtual machines can be powered on manually, leveraging the vCloud API so that they are booted in the correct order as defined in their vApp metadata. It should be noted that this could result in vApps being powered on that were powered off before the fail-over, as there is currently no way of determining their previous state.
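
The mount step in the list above has to be repeated for every volume on every host, which makes it a natural first candidate for automation. The Python sketch below only builds the command lines as a dry-run plan; it does not connect to any host. The host names and volume labels are hypothetical, and it assumes the argument to `esxcfg-volume -m` is the VMFS UUID or label of the replicated volume. How the commands reach each host (SSH, PowerCLI, vCenter Orchestrator) is left open.

```python
import shlex


def mount_commands(hosts, volume_ids):
    """Build a dry-run plan: one mount command per volume, for every host.

    Nothing is executed here; the result is a list of (host, command) pairs
    that an operator or a separate automation layer could review and run.
    """
    return [
        (host, f"esxcfg-volume -m {shlex.quote(volume)}")
        for host in hosts
        for volume in volume_ids
    ]


# Hypothetical recovery-site hosts and replicated volume labels.
recovery_hosts = ["esx-dr-01", "esx-dr-02"]
replicated_volumes = ["vcd-ds-01", "vcd-ds-02"]

plan = mount_commands(recovery_hosts, replicated_volumes)
for host, command in plan:
    print(f"{host}: {command}")
```

Generating the plan separately from executing it lets the list be checked against the storage team's replication inventory before anything is mounted.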

Using this vCloud Director infrastructure resiliency concept, a fail-over of a vCloud Director environment has been successfully completed and the “cloud” moved from one site to another.

As all vCloud Director management components are virtualized, the virtual machines are moved over to the Recovery Site while maintaining all current managed object reference identifiers (MoRef IDs). Re-signaturing the datastore (giving it a new unique ID) has also been avoided to ensure the relationship between the virtual machines / vApps within vCloud Director and the datastore remained intact.

Chris Colotti and Duncan Epping
VMware Center of Excellence & Technical Marketing

Note: Although we have not specifically validated it, this solution/concept would also apply to VMware View.


4 comments

chadwick king
February 14, 2012 at 4:20 pm

This definitely does assume a lot. I would refer to this as being a very high level overview. It covers some things like the MGMT which for many should be practical enough. I am working on something similar but I am not sure about what DR type solution I want to try. VEEAM, Falconstor, HP Lefthand, and Vmware VSA. This makes it seem somewhat straightforward and simple. I think we both know it gets a lot more complicated. I know in some cases clients are running an active/active multi site VCD deployment. I know for me I think an active/active type of setup is a better option. I don’t know many large companies that want Host just sitting there doing nothing. Things like multi site load balancing, back-end database mirroring, are a few of the things I have liked reading here. In the event of a site going down that is active/active should any DR be needed? Even so, I still enjoy the fact someone is writing some relevant stuff for us vCloud guys to think on. 🙂

- Chris Colotti
  February 14, 2012 at 7:18 pm
  
  Chadwick, the whitepaper will go into a bit more detail in a couple weeks, so yes this is more of the overview. I agree there is more to it and there is a split between DR vs “availability”. We were asked to come up with a pure site based DR solution to failover “the cloud” from one site to another. This makes that happen. There will be more to come I am sure now that we have some basis.
  
Loay
February 25, 2012 at 2:41 am

How would you handle the failover of the Nexus 1Kv switch VSM VMs? It seems that we have to use standard vSwitch port groups for this procedure to work.

- Chris Colotti
  February 25, 2012 at 7:15 am
  
  Since the VSMs are virtual machines, they could simply be added into the SRM failover, the same as the other supporting virtual machines like vCO. The simplicity of the solution is that anything that is part of the “Management” layer can be failed over like the other appliances.