Gabe recently wrote a really nice article about HA Admission Control that you should read. It made me think of a very recent conversation I had with some folks about HA Admission Control and vCloud Director specifically: namely, the fact that they had disabled it entirely. When I was a vSphere administrator, before I was better educated, I made the same mistake of disabling HA Admission Control. When I talk to people now, at least 9 times out of 10 those who have disabled HA Admission Control did it because “No more virtual machines would power on”. Yeah, of course! It was doing its job, ensuring that what was running would be able to power on in the event of a failure. The other reason, I think, is that people felt they could be smarter than the feature itself. Whatever the reason, you are basically turning off the “Virtual Bouncer”, as my friend Frank Denneman so eloquently put it the other day on a Skype call.
I have seen it get disabled so many times to “Power on Just one more Machine”, but that excuse gets repeated 100 times over without ever addressing the real problem, which is usually cluster capacity. Or worse yet, HA Admission Control gets disabled on a cluster and is never re-enabled, because the vSphere administrators got too busy and forgot. Then months later, when the cluster is falling down with no more capacity, OR a host actually fails and things don’t power on, they blame vSphere. After you investigate, you see that HA Admission Control was in fact disabled, most times without a proper change request. Okay, can you tell I have seen this one too many times?
Now what about vCloud Director and HA Admission Control? Well, the same rules still apply, but it’s even more important in my mind. This is because you don’t only have vSphere administrators adding virtual machines, you have consumers doing it too. Without HA Admission Control, vCloud Director has no idea whether the cluster can support a failure. It is a vSphere function to tell vCloud Director when the tank is full. If you go under the covers and mess with HA Admission Control underneath vCloud Director, you add the potential to oversubscribe your provider vDC cluster.
Don’t forget, in reading Gabe’s article, that whether you knew it or not, HA Admission Control uses per-virtual-machine reservations. In vCloud Director, there are TWO allocation models that assign per-virtual-machine reservations, specifically Pay As You Go and Allocation Pool, and you can read more about that in this article. Per my and Duncan’s comments on Gabe’s article, you also do not want to forget about the overhead reservation, with or without vCloud Director.
The real trick is to make sure you configure the HA Admission Control policy that best fits your needs. Many people still use N+1, but more people are starting to use the percentage-based policy. Then be sure to add capacity to the clusters that are the provider vDCs behind vCloud Director. Simply disabling it will do nothing for you but get you into a situation that will be hard to level back out. I treat HA Admission Control with vCloud Director like DRS…Don’t Disable DRS….or HA Admission Control. If someone can provide me with some valid, useful reasons why you would want to disable it, I’m all ears.
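To make the percentage-based policy a bit more concrete, here is a rough sketch in Python of the kind of check it performs. This is a simplified, memory-only model and not VMware’s implementation: the VM class, the function names and all of the numbers are made up for illustration. It assumes each powered-on VM counts for its memory reservation plus its memory overhead, and that a power-on is refused when the unreserved capacity left over would drop below the percentage you set aside for failover.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VM:
    """Hypothetical VM record; field names are illustrative only."""
    name: str
    mem_reservation_mb: int  # per-VM memory reservation
    mem_overhead_mb: int     # per-VM memory overhead reservation


def failover_capacity_pct(cluster_mem_mb: int, powered_on: List[VM]) -> float:
    """Percentage of cluster memory NOT claimed by reservations plus overhead."""
    reserved = sum(vm.mem_reservation_mb + vm.mem_overhead_mb for vm in powered_on)
    return 100.0 * (cluster_mem_mb - reserved) / cluster_mem_mb


def can_power_on(cluster_mem_mb: int, powered_on: List[VM],
                 candidate: VM, reserved_for_failover_pct: float) -> bool:
    """Admit the candidate only if enough capacity stays free for failover."""
    after = failover_capacity_pct(cluster_mem_mb, powered_on + [candidate])
    return after >= reserved_for_failover_pct


# Assumed example: 100,000 MB cluster with 25% reserved for failover.
CLUSTER_MB = 100_000
FAILOVER_PCT = 25.0
running = [VM(f"vm{i}", 10_000, 500) for i in range(6)]
new_vm = VM("vm-new", 10_000, 500)

seventh_ok = can_power_on(CLUSTER_MB, running, new_vm, FAILOVER_PCT)
eighth_ok = can_power_on(CLUSTER_MB, running + [new_vm], new_vm, FAILOVER_PCT)
print(seventh_ok, eighth_ok)
```

In this toy model the seventh VM is admitted and the eighth is refused. That refusal is the “Virtual Bouncer” doing its job, and the fix is more capacity, not turning the check off.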
Even in my LAB, HA:AC is enabled! That gives me some exposure to it before I start doing some crazy stuff on production clusters.
#DontDisableHA:AC
I agree that it’s more important with vCD since you’re typically allowing users to create their own VMs, and so you have less control over how resources are consumed. However, in a typical vSphere environment, it’s a tradeoff. You’re deciding whether you want sufficient resources for all of your powered-on VMs, with some VMs possibly not powering on after a host failure, or whether you want everything to power back on with some resource contention after the fact. It’s a valid decision to say I want everything to come back up after a failure and live with the resource contention until the failed host is brought back online.
On top of that, you can typically see new requirements coming toward you, and if you’re doing any little bit of capacity planning, you’re not going to be put in a position of having to make this distinction anyway. It’s certainly not as cut and dried as you put it in the article.
Jason, thanks for the thoughts. I agree it is not as cut and dried; however, the part of the equation that I see missing in most cases is the capacity planning. I have not been the one to make it cut and dried. I have seen people turn it off to make up for their lack of capacity planning and to feed the influx of new virtual machines.
We both have to agree that is the wrong reason to disable it, for sure. If organizations need that many virtual machines that fast, disabling HA Admission Control is not the answer; capacity planning is, and unfortunately that sometimes means saying no to a new VM until there is capacity. Many organizations do not want to say no.
I have said many times, and have been quoted as saying, “If your way of capacity planning with vSphere was to disable HA Admission Control and you want to do vCloud, you better start doing proper capacity planning first.”
We’re definitely in agreement about capacity planning. My point with regard to non-vCD environments was that if someone has negative HA slots available, that doesn’t mean they shouldn’t want all of their VMs to power up, even with degraded performance. It’s a design consideration, to be sure.
Jason, I know where you are coming from. There is definitely the right way to do it which is capacity planning but there are a lot of real world scenarios where management forces your hand and demands service availability over service usability.
Totally agree, Admission control should always be enabled.
Sometimes us poor engineers are left with no option if we are not provided with the capacity required to meet customer demands. I agree also. Admission Control should always be enabled, but if you have no choice, only make it a temporary solution while you get that check signed for more hardware 🙂
I can agree, and relate to that as well. I would simply submit that when you do, a proper change control process should be in place to track that it was disabled, for how long, and that it was RE-enabled at some point. I worked on an environment that had it disabled for so long we calculated it needed almost 6 hosts just to get it back to NORMAL, not counting extra capacity.