Chris Colotti's Blog

Gotcha: The Importance of NTP to VMware vCloud Director

So here I am again after spending a couple of days troubleshooting an issue for a customer during a VMware vCloud upgrade, which we finally got fixed.  I seem to have become the resident expert on getting the entire stack of VMware vCloud products upgraded in the right order.  I wanted to dig into the importance of Network Time Protocol (NTP) as it relates to VMware vCloud Director and share what we learned.  We all know VMware has long stressed the importance of timekeeping, and the issue of Guest OS timekeeping has long been debated.  I am not going to deal with that here whatsoever; that horse has been beaten to a bloody pulp.  Instead I want to point out a few things that are called out in the installation documents but, if not validated, can cause you to, well, lose a lot of TIME troubleshooting.  Let’s start with the symptoms we saw.

Symptoms:

We were following the detailed document I developed for these end-to-end upgrades (not available yet, but hopefully soon), which is based on the high-level procedures I wrote about previously.  We finished the vShield Manager upgrades, and when we tried to validate that a NAT routed network could be deployed, we started getting an error that vCloud Director “Could not find a host to place the Appliance”.  Generally this error is seen if not all the hosts see the same storage or if one is not connected to the Distributed Switch.  It can also be seen if the hosts are not prepared and available.  All of this was checked over and over, and nothing turned up.  Ultimately we tried to “Reconnect” to the vCenter, and that is where we saw new errors: one of the four cells always failed when trying to connect.

Troubleshooting:

So as a good VMware VCDX we began to methodically troubleshoot with the help of Bangalore, Palo Alto, and GSS.  I pretty much called in the full house blitz for a number of reasons.  We dumped the database before and after the upgrade for review, we dumped vCenter logs, and we dumped vCloud Director logs.  There were very few errors other than a few port group errors and some odds and ends.  We verified that the vShield Manager could create a vShield without vCloud Director, and that worked fine too.  Literally we went around in circles.  Now early on we did happen to notice in passing that one cell’s time was way off, but we did not think anything of it at the time... Now you see the moral of my story coming.  We stopped the cell with the bad time and started testing cell by cell for both vCenter reconnection and deploying vShield Edge devices.  To our amazement everything started working... until we got to the cell with the wrong time.
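Had we compared the cell clocks side by side at the start, the odd one out would have jumped off the page. Here is a minimal sketch of that comparison in Python. The cell names, the sample readings, and the 2-second tolerance are my assumptions for illustration; check the installation guide for the drift your version actually allows. How you collect the readings (SSH, a monitoring agent, whatever) is left out on purpose.

```python
from statistics import median


def find_skewed_cells(cell_times, tolerance_seconds=2.0):
    """Return {cell: offset} for cells whose clock differs from the
    median of all cells by more than tolerance_seconds."""
    reference = median(cell_times.values())
    return {
        name: ts - reference
        for name, ts in cell_times.items()
        if abs(ts - reference) > tolerance_seconds
    }


# Hypothetical readings (Unix epoch seconds) taken at the same moment
# from four cells; cell4 plays the part of our troublemaker.
readings = {
    "cell1": 1_700_000_000.2,
    "cell2": 1_700_000_000.0,
    "cell3": 1_700_000_000.4,
    "cell4": 1_700_000_312.0,
}

skewed = find_skewed_cells(readings)
for name, offset in skewed.items():
    print(f"{name} is off by {offset:+.1f} seconds")
```

Using the median as the reference means a single bad cell cannot drag the baseline toward itself, which is exactly the situation we were in.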

We then inspected all four cells and discovered that NTP was not running on any of them... okay, what’s the deal there, we thought.  We got the Linux team to get NTP configured, checked that the time was dead on across all the cells, and rebooted them all for giggles.  Upon reboot the cell with the wrong time... STILL SHOWED WRONG!!  Now we were curious what was going on.  Then after about 5 minutes the time showed correct!  I am not one to go down without a fight against a machine, so we kept digging, and here is what we found.
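If you want to see for yourself how far a cell's clock is from an NTP source, rather than eyeballing `date` output, a bare-bones SNTP query does the trick. This is a rough sketch, not what we ran at the customer: the function names are mine, the offset math ignores network delay asymmetry, and the server you point it at is up to you.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_TO_UNIX_EPOCH = 2_208_988_800


def transmit_time_from_packet(packet):
    """Extract the server's transmit timestamp from an NTP reply
    and return it as Unix epoch seconds."""
    if len(packet) < 48:
        raise ValueError("NTP reply must be at least 48 bytes")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_TO_UNIX_EPOCH + fraction / 2**32


def clock_offset(server, timeout=5.0):
    """Return the approximate offset in seconds between the server clock
    and the local clock (positive means the local clock is behind)."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent = time.time()
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
        received = time.time()
    # Compare the server time with the midpoint of our send/receive window.
    return transmit_time_from_packet(reply) - (sent + received) / 2


# Example (needs network access; the server name is a placeholder):
# print(clock_offset("pool.ntp.org"))
```

Run something like this right after a reboot and again a few minutes later, and a jump like the one we saw shows up as a large offset that shrinks once NTP steps in.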

Ultimate Root Cause:

Interesting, right?  So what was happening originally is that the bad cell was ALWAYS getting bad time from its host on reboot.  Since NTP was not running, the cell's clock was never corrected afterward.  Even once we enabled NTP, on reboot VMware Tools still did its nice little one-time sync with the host, throwing off the time until NTP reset it.  As we say in Boston... WICKED MESSY!

We ended up putting the bad HOST in maintenance mode for someone to look at later, migrated the VM to a good host with good NTP updates, ensured all CELLS had NTP running, and all was good with life.  The moral of this whole story is actually a couple of things:

If this helps just one person not spin their wheels chasing this down, then I have succeeded for the night.  It would have been easy to just tell you all to set the time (trust me, you need to) and RTFM on the installation guides, but where is the fun in that?  Enjoy. As always, comments are welcome, good, bad, or indifferent.