
Thursday, December 02, 2010

Why you SHOULD NOT deploy an AD domain controller using Azure Connect with VM Role

I’ve heard a lot of talk recently about the forthcoming Windows Azure Connect service which, combined with the soon-to-be-released CTP of VM Role, makes it possible to host an Active Directory Domain Controller in the cloud. Although this is technically feasible, this post is designed to tell you why you shouldn’t do it.

The Web Role, Worker Role and VM Role all include local storage. Even a Windows Server at idle is actually doing quite a lot and making constant updates to its disks, and the same is true of an instance deployed to the cloud. In Windows Azure, the state of a virtual machine (an instance) is not guaranteed to survive a restart caused by some sort of failure.

Web and Worker Roles do persist the OS state across restarts generated by automated updates, that is, restarts initiated by the fabric. With the VM Role there is no automated update process. It’s down to the owner of the VM Role to keep it up to date. Yes – there is a process for this, which involves a differencing disk that is added to the base VHD you supply when you first start up a VM Role in Windows Azure. However, the real problem lies with failures.

A failure of a VM Role can be caused by any number of things: power supply failure to the rack, a hard drive head crash, failure of the hardware server the VM is hosted on, plus a decent range of other hardware problems. The same goes for software. As they say, “there’s no such thing as bug-free software”. Every so often the host OS, or even the guest OS (the OS running in the VM you created), could happen across an unusual set of conditions while in kernel mode for which there is no handler.

Well, if there is no handler, it means nobody thought such a condition would ever occur. The default behaviour of the kernel is to assume something has gone wrong with the kernel itself: it’d be dangerous to continue with a kernel in an unknown state. Think of the damage that could be caused. And so control is handed to a special handler – the one that causes the blue-screen fatal bugcheck. The resulting dump file may be useful in debugging what caused the problem after the event, after the operating system was stopped in its tracks by the bugcheck code. But this could happen to either the host OS or your VM. 

When it does happen, the heartbeat that is emitted to the fabric by a special Windows Azure Agent installed in every instance managed by the cloud will stop. Eventually the fabric will recognise that a timeout has occurred. Its first concern is to get a new, responsive instance up and running. It is very likely that the new instance won’t even be on a host in the same rack, let alone on exactly the same host. Therefore, no guarantee is ever given that state will be preserved in these situations.
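
The detection step can be pictured roughly as follows. This is only a sketch to illustrate the idea of a heartbeat timeout: the `Instance` type, the `needs_replacement` function and the 60-second timeout are hypothetical names and values chosen for illustration, not the real Windows Azure Agent or fabric API.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical record the fabric might keep for each running instance."""
    name: str
    last_heartbeat: float  # time of the last heartbeat, in seconds


def needs_replacement(instance: Instance, now: float, timeout: float = 60.0) -> bool:
    """Return True once no heartbeat has arrived within `timeout` seconds.

    The 60-second default is an assumed value, not a documented one.
    """
    return (now - instance.last_heartbeat) > timeout


dc = Instance(name="cloud-dc", last_heartbeat=100.0)
print(needs_replacement(dc, now=130.0))  # False: last heartbeat was 30 s ago
print(needs_replacement(dc, now=200.0))  # True: last heartbeat was 100 s ago
```

Note that nothing in this check looks at what was on the failed instance’s disk. The only question asked is whether the instance is still responding, which is why the replacement carries no promise about state.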

The fabric will take the base VHD, plus the collection of differencing disks (the ones that contain your OS updates), and “boot” that back into the configuration specified in your service model. This diagram explains the problem.

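[Diagram: lifecycle of a Domain Controller deployed as a Windows Azure VM Role, annotated with the numbered points 1 to 7 described below]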

Use the numbered points in the diagram to follow along:
  1. The base VHD plus the differencing VHD is used to create…
  2. …a running instance of a Domain Controller as a Windows Azure VM Role.
  3. The downward pointing green arrow represents the life of this Domain Controller. Let’s assume the life between instantiating the VM and the catastrophic failure (at step 5) is 61 days (or longer).
  4. As time advances, more and more changes are written to the Domain Controller. In the diagram I have shown this as being performed by a series of administrators. In reality, though, it doesn’t matter how the changes get to the DC, whether directly or through AD replication, say from an on-premise DC. The rules for hanging on to objects are the same. Deleted objects are tombstoned.
  5. A catastrophic failure of some description occurs and the instance immediately goes offline.
  6. The Windows Azure Fabric recognizes the absence of the heartbeat and builds a new instance from the base and differencing VHDs. These VHDs are used to create a new instance…
  7. …and the result is that all the changes that have accrued in the intervening 61 days are now lost. If there is another online DC, say in an on-premise environment, it will refuse to speak to this “imposter”. The DC’s machine account password will have changed twice in the intervening 61 days and the tombstone lifetime will have expired. You therefore cannot rely on replication to get this DC back into the state it was in before the failure.
Essentially, having the fabric fire up a new DC based on an out-of-date image is a bit like the not-recommended practice of running DCPromo on a Virtual Machine to get a copy of the domain database on to the machine, then taking it offline and storing the VHD as the “backup” of AD. Re-introducing that DC to the network after a time will cause it to be ignored in the same way, for all the same reasons.
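
The arithmetic behind the 61-day example can be sketched in a few lines. This is an illustration only: the function name `rebuild_risks` is hypothetical, and the 60-day tombstone lifetime and 30-day machine account password age are assumed values that match the figures used in this post. Check the actual settings in your own forest.

```python
from datetime import date, timedelta

# Assumed values matching the figures in this post; verify them in your forest.
TOMBSTONE_LIFETIME = timedelta(days=60)
MACHINE_PASSWORD_MAX_AGE = timedelta(days=30)


def rebuild_risks(image_captured: date, rebuilt: date) -> dict:
    """Summarise how stale a DC is when rebuilt from an old base + differencing VHD."""
    age = rebuilt - image_captured
    return {
        "image_age_days": age.days,
        # Whole password-change intervals that elapsed since the image was taken.
        "password_changes_missed": age // MACHINE_PASSWORD_MAX_AGE,
        # True once the image is older than the tombstone lifetime.
        "past_tombstone_lifetime": age > TOMBSTONE_LIFETIME,
    }


captured = date(2010, 10, 1)  # hypothetical date the VHDs were last updated
print(rebuild_risks(captured, captured + timedelta(days=61)))
```

For the 61-day case this reports two missed password changes and an image older than the tombstone lifetime, which is the situation described in step 7. An image only a few days old would report neither, which is why the danger grows the longer the instance runs without a refreshed VHD.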

The risk with having applications that can use Windows Integrated Authentication in the cloud is that if the network between your on-premise Domain Controllers and the apps you have in the cloud goes down, the apps can’t be used. 

It therefore appears that deploying a VM Role as a Domain Controller up in the cloud, and using Windows Azure Connect to give it full domain connectivity, is a good idea. And indeed it is – until a failure occurs.

But remember, you can domain-join your Azure based apps to a local DC, and on that point, Windows Azure Connect is a great way to be able to quickly deploy AD-integrated apps to Azure without massive re-engineering effort.  Because your local DC is part of your infrastructure it will be managed as such and won’t be subject to the service model that is the basis for anything running in Azure. If anything this scenario is a very practical demonstration of why it is that VM Role !== IaaS.

Of course if somebody could come up with a way for the DC to store its directory data in blob storage, which is persistent across instance reboots, then we’d have a neat solution. Maybe that’s an opportunity for a clever ISV partner to exploit. In the meantime – take the opposite sentiment to Nike’s strapline – “Just Don’t Do It”.
