{"id":591,"date":"2011-09-13T21:09:46","date_gmt":"2011-09-14T01:09:46","guid":{"rendered":"http:\/\/chriscolotti.us\/?p=591"},"modified":"2011-09-13T21:09:46","modified_gmt":"2011-09-14T01:09:46","slug":"gotcha-the-importance-of-ntp-to-vmware-vcloud-director","status":"publish","type":"post","link":"https:\/\/chriscolotti.us\/vmware\/gotcha-the-importance-of-ntp-to-vmware-vcloud-director\/","title":{"rendered":"Gotcha: The Importance of NTP to VMware vCloud Director"},"content":{"rendered":"

So here I am again after a couple of days spent troubleshooting an issue for a customer during a VMware vCloud upgrade, which we finally got fixed. I seem to have become the resident expert on getting the entire stack of VMware vCloud products upgraded in the right order. I want to dig into the importance of Network Time Protocol (NTP) as it relates to VMware vCloud Director and share a few things we learned. We all know that VMware has long stressed the importance of timekeeping, and Guest OS timekeeping has long been debated. I am not going to deal with that here; that horse has been beaten to a bloody pulp. Instead, I want to point out a few things that are called out in the installation documents but that, if not validated, can cause you to lose a lot of TIME troubleshooting. Let’s start with the symptoms we saw.<\/p>\n

Symptoms:<\/h4>\n

We were following the detailed document I developed for these end-to-end upgrades (not available yet, but hopefully soon), which is based on the high-level procedures I wrote about previously<\/a>. We finished the vShield Manager upgrades, and when we tried to validate that a NAT-routed network could be deployed, we started getting an error that vCloud Director “Could not find a host to place the Appliance”. Generally this error is seen if not all the hosts see the same storage or if one is not connected to the Distributed Switch. It can also be seen if the hosts are not prepared and available. We checked all of this over and over and found nothing wrong. Ultimately we tried to “Reconnect” to the vCenter, and that is where we saw new errors. One of the four cells always failed when trying to connect.<\/p>\n

Troubleshooting:<\/h4>\n

So, like a good VMware VCDX, I began to troubleshoot methodically with the help of Bangalore, Palo Alto, and GSS. I pretty much called in the full house blitz for a number of reasons. We dumped the database before and after the upgrade for review, we dumped vCenter logs, and we dumped vCloud Director logs. There were very few errors other than a few port group errors and some odds and ends. We verified that the vShield Manager could create a vShield without vCloud Director, and that worked fine too. We literally went around in circles. Early on we did happen to notice in passing that one cell’s time was way off, but we did not think anything of it at the time… Now you see the moral of my story coming. We stopped the cell with the bad time and started testing cell by cell, both for vCenter reconnection and for deploying vShield Edge devices. To our amazement everything started working… until we got to the cell with the wrong time.<\/p>\n
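A skew like the one we glanced past is easy to catch mechanically once you have a clock reading from each cell. Here is a minimal Python sketch of that check. The function name, the cell names, the sample timestamps, and the 2-second default tolerance are all assumptions for illustration (check the installation guide for the drift your version actually allows), and it assumes the readings were all collected at effectively the same instant.

```python
from statistics import median


def find_skewed_cells(readings, tolerance_seconds=2.0):
    """Return {cell: offset_seconds} for cells whose clock differs from the
    median of all readings by more than tolerance_seconds.

    readings maps a cell name to its clock reading in Unix epoch seconds.
    """
    reference = median(readings.values())
    return {
        cell: timestamp - reference
        for cell, timestamp in readings.items()
        if abs(timestamp - reference) > tolerance_seconds
    }


# Hypothetical readings: three cells agree, one is about five minutes ahead.
sample_readings = {
    "cell1": 1315962586.0,
    "cell2": 1315962586.4,
    "cell3": 1315962585.9,
    "cell4": 1315962891.0,
}

skewed = find_skewed_cells(sample_readings)
for cell, offset in skewed.items():
    print(f"{cell} is off by {offset:+.1f} seconds")
```

Using the median as the reference keeps a single badly wrong cell from dragging the baseline along with it, which is the situation we were in.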

We then inspected all four cells and discovered that NTP was not running on any of them… okay, what’s the deal there, we thought. We got the Linux team to configure NTP, checked that the time was dead on across all the cells, and then rebooted them all for giggles. Upon reboot the cell with the wrong time… STILL SHOWED WRONG!! Now we were curious about what was going on. Then, after about 5 minutes, the time showed correct! I am not one to give up a fight with a machine, so we kept digging, and here is what we found.<\/p>\n
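If you want to validate a cell’s clock against an NTP server yourself rather than eyeballing it, a rough check fits in a few lines of Python using only the standard library. This is a minimal SNTP-style sketch, not a replacement for a real NTP daemon: the function names are my own, the server you point it at is up to you, and the offset it reports is only an approximation because it compares the server’s transmit timestamp with the midpoint of the local send and receive times.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def build_request():
    """Build a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\x00"


def parse_transmit_time(packet):
    """Extract the server transmit timestamp as Unix epoch seconds."""
    if len(packet) < 48:
        raise ValueError("NTP response is shorter than 48 bytes")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_DELTA + fraction / 2**32


def measure_offset(server, timeout=5.0):
    """Return an approximate (server time - local time) offset in seconds.

    A positive value means the local clock is behind the server.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent_at = time.time()
        sock.sendto(build_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
        received_at = time.time()
    return parse_transmit_time(packet) - (sent_at + received_at) / 2
```

Calling `measure_offset("pool.ntp.org")` (or your own internal time source) on each cell gives you a number to compare. Anything more than a second or two is worth chasing before you spend days in the logs.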

Ultimate Root Cause:<\/h4>\n