What's Wrong with Bridging Datacenters together for DR?

Published: 2014-12-19
Last Updated: 2014-12-19 18:59:20 UTC
by Rob VandenBrink (Version: 1)

With two stories on the topic of bridging datacenters, you'd think I was a real believer.  And, yes, I guess I am, with a couple of important caveats.

The first is encapsulation overhead.  As soon as you bridge using encapsulation, the maximum packet size you can transport will shrink, then shrink again when you encrypt.  If your server operating systems aren't smart about this, they'll assume that since everything is in the same broadcast domain, a full-sized packet is of course OK (1500 bytes in most cases, or up to 9K if you have jumbo frames enabled).  You'll need to test for this, both for replication and for the failed-over configuration, as part of your design and test phase.
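To put rough numbers on that shrinkage, here is a minimal sketch in Python.  The overhead figures are assumptions for illustration only: 24 bytes is the usual figure for GRE over IPv4, and the 60 bytes for encryption is a placeholder, since the real number depends on your cipher and mode.  The function names are mine, not from any library.

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header


def inner_mtu(link_mtu, overheads):
    """Largest packet that still fits once every overhead is subtracted."""
    return link_mtu - sum(overheads.values())


def ping_payload(mtu):
    """ICMP payload size that produces an IP packet of exactly `mtu` bytes."""
    return mtu - IP_HEADER - ICMP_HEADER


# Assumed overheads, substitute the values for your own tunnel and crypto.
overheads = {
    "tunnel": 24,      # e.g. GRE over IPv4: 20-byte outer IP + 4-byte GRE
    "encryption": 60,  # placeholder only, varies with cipher and mode
}

mtu = inner_mtu(1500, overheads)
print(mtu, ping_payload(mtu))  # 1416 1388
```

The second number is the payload size to try with a don't-fragment ping across the bridge (for instance `ping -M do -s 1388 <host>` on Linux, or `ping -f -l 1388 <host>` on Windows).  If that size gets through and the next byte up does not, your estimate matches what the path really does.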

The second issue is that if you bridge datacenters to a DR or second (active) datacenter site, you are well positioned to fail over the entire server farm, as long as you can fail over your WAN connection and Internet uplink along with it.  If you don't, you end up with what Greg Ferro calls a "network traffic trombone".  (http://etherealmind.com/vmware-vfabric-data-centre-network-design/)

If you fail one server over, or if you fail over the farm and leave the WAN links behind, you find that the data to and from the server will traverse that inter-site link multiple times for any one customer transaction.

For instance, let's say that you've moved the active instance of your mail server to the DR site (site B), while the WAN and Internet links stay at the primary site (site A).  To check an email, a packet will arrive at site A, traverse the inter-site link to the mail server at site B, then go back to site A to find the WAN link to return to the client.  Similarly, inbound email will come in on the Internet link at site A, but then have to traverse that inter-site link to find the active mail server.

Multiply that by the typical email volume in a mid-sized company, and you can see why this trombone issue can add up quickly.  Even with a 100 Mbps inter-site link, folks that were used to gigabit performance will now see their bandwidth cut to 50 Mbps or likely less than that, with a commensurate impact on response times.  If you draw this out, you do get a nice representation of a trombone - hence the name.
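The arithmetic behind that estimate can be sketched in a few lines of Python.  This is a deliberately simplified model, and the assumption is mine: every crossing of the inter-site link competes for the same capacity, so the usable share is the link speed divided by the number of crossings.  Real links are usually full duplex and real protocols behave less neatly, so treat the result as a planning figure rather than a measurement.  The function names are hypothetical.

```python
def inter_site_crossings(path):
    """Count how many times a path of site labels moves between sites."""
    return sum(1 for here, there in zip(path, path[1:]) if here != there)


def per_transaction_share(link_mbps, crossings):
    """Rough share of the inter-site link left for one transaction."""
    if crossings == 0:
        return None  # the transaction never touches the inter-site link
    return link_mbps / crossings


# Mail server moved to site B, WAN link left behind at site A:
# the request enters at A, goes to B, and the reply leaves through A.
tromboned = ["A", "B", "A"]

# Same transaction when the WAN link fails over along with the server.
failed_over_together = ["B", "B"]

print(inter_site_crossings(tromboned))             # 2
print(per_transaction_share(100, 2))               # 50.0
print(inter_site_crossings(failed_over_together))  # 0
```

Two crossings on a 100 Mbps link gives the 50 Mbps figure above.  Add other servers sharing the same link and the per-transaction share drops further, which is where the "likely less than that" comes from.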

What this means is that you can't design your DR site for replication and stop there.  You really need to design it for use during the emergency cases you are planning for.  Consider the bandwidth impacts when you fail over a small portion of your server farm, and also what happens when your main site has been taken out (short or longer term) by a fire or electrical event - will your user community be happy with the results?

Let us know in our comment section how you have designed around this "trombone" issue, or if (as I've seen at some sites) management has decided NOT to spend the money to account for it.

Rob VandenBrink
