Managing two data centers

Call it paranoia.  Call it being prepared.  Whatever your stance, we are considering using more than one data center for dealnews.com.  It is not a capacity issue.  We can keep growing our current data center without a problem.  But stories of power outages, including outages we have experienced ourselves, have us wanting to explore the idea.

Here is the problem.  No one in our company has experience with this.  And there do not seem to be many resources on the internet talking about it.  Our problem is not so much managing the data between the two.  The problem is failover and how to deal with one data center being out.  Here are some of the ideas that have been thrown at the wall.

Round Robin DNS

This was the first idea.  It seems simple enough.  We have two data centers.  We publish an A record for each data center under the same name and traffic gets split between them.  The problem here is that it is, well, random.  There is no health checking, so clients have no idea whether the data center they are handed is even up.
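To make that concrete, here is a quick Python sketch of what a client effectively does with round robin DNS.  The lookup is real, but the selection logic is only illustrative; nothing here knows or cares whether the address it picks points at a working data center.

    import random
    import socket

    # With round robin DNS, every A record published for the name comes
    # back in the answer and the client simply uses one of them.  There
    # is no health check: a dead data center still gets its share.
    def pick_address(hostname):
        # gethostbyname_ex returns (canonical_name, aliases, ip_list)
        _, _, addresses = socket.gethostbyname_ex(hostname)
        return random.choice(addresses)

    print(pick_address("dealnews.com"))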

Global Traffic Management

There are devices that “balance” traffic across multiple locations.  But I am unsure how those deal with outages at one of the locations.  It also seems like the balancing device itself is still a single point of failure.

BGP Routing

This is the biggest mystery to me.  I know what it is.  I know what it means.  I have no idea how to deploy this type of solution.  I understand that you can “move” your IP addresses with routing changes.  But that means running routers.  Where are these routers?  Does this happen at some provider?  Is there a provider that handles this?  Does that mean that all of our data centers have to be with one provider?  One more peace-of-mind feature of this is that we would not be tied to just one vendor.  So, if one vendor had major issues or ran into legal trouble (we lived through the dot com boom and bust), we would have the security of knowing we had other equipment that was not affected.

Is there something else?  Are we being way paranoid?  Maybe it is not cost effective in the end.  I/we have no idea really.  Anyone out there that has knowledge on this subject?


10 Responses to Managing two data centers

  1. OnyxRaven says:

    Global Load Balancing is something some hardware load balancers offer (and maybe some software solutions too, but I haven’t looked into those). The balancer does keepalive checks against all of its servers, whether they are local or remote. If the remotes go down, the locals get more traffic, and vice versa. This is achieved either by dynamically modifying the DNS responses it proxies or by actually forwarding traffic. It obviously relies on the load balancer itself always being available. I believe Akamai offers a solution for this as well, doing more DNS magic to balance load between datacenters.
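    Roughly, the DNS-modifying flavor boils down to something like this toy sketch (not any vendor’s actual implementation; the data center names and IPs are made up):

        import socket

        # Toy sketch of the "dynamically modified DNS" style of global
        # load balancing: health-check the front end in each data center
        # and only answer with the IPs that pass.  IPs are made up.
        DATACENTERS = {
            "dc-east": "192.0.2.10",
            "dc-west": "198.51.100.10",
        }

        def is_alive(ip, port=80, timeout=2.0):
            """Keepalive check: can we open a TCP connection to the web tier?"""
            try:
                sock = socket.create_connection((ip, port), timeout)
                sock.close()
                return True
            except socket.error:
                return False

        def healthy_records():
            """The A records a GSLB box would hand out right now."""
            alive = [ip for ip in DATACENTERS.values() if is_alive(ip)]
            # If every check fails, fail open and return everything
            # rather than serving an empty answer.
            return alive or list(DATACENTERS.values())

        print(healthy_records())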

    BGP requires all the routers that might take over the IP space to ‘own’ an AS number (as provisioned by ARIN). These would be your border routers to your upstream provider. The problem here is that, I believe, the smallest block you can announce is a /24? In any case, the BGP-enabled routers advertise themselves with a priority – as one goes down, the other gets re-prioritized by other routers on the internet, modifying routing tables on the fly. That can obviously take a while, but it’s really seamless as long as the upstream routers listen to your advertisements. I believe some of the providers that do more ‘management’, like Internap, might provide this as a service.
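    A toy model of that re-prioritization, just to show the selection logic (this is not a router config; the prefix and AS numbers are invented for illustration):

        # Both data centers advertise the same /24, but the backup
        # prepends its AS path so it only wins when the primary
        # withdraws its announcement.
        advertisements = [
            {"prefix": "203.0.113.0/24", "site": "primary", "as_path": [64500]},
            {"prefix": "203.0.113.0/24", "site": "backup",  "as_path": [64501, 64501, 64501]},
        ]

        def best_route(routes):
            # Real BGP best-path selection has many tie-breakers;
            # shortest AS path is the one that matters here.
            return min(routes, key=lambda r: len(r["as_path"]))

        print("normal operation:", best_route(advertisements)["site"])

        # Primary data center loses power: its advertisement is withdrawn
        # and the rest of the internet converges on the backup's route.
        remaining = [r for r in advertisements if r["site"] != "primary"]
        print("after withdrawal:", best_route(remaining)["site"])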

    I was involved in a lot of this discussion at my last job. We had datacenters in Denver and San Francisco, and wanted to be able to fail over just the financial transactions from one to the other if a database or server went bad, and to move everything via BGP if it all went downhill. Really, I believe a complete solution that covers all failure points is some combination of all three.

  2. gslin says:

    Try GeoDNS (it’s a patch for ISC BIND 9), and maybe write some scripts to transfer traffic to the other DC after detecting that a DC is down.

  3. Marc Gear says:

    First, read Theo Schlossnagle’s book “Scalable Internet Architectures”.

    Then read up on IP Anycast. It’s essentially where more than one machine on the internet shares the same IP, and it sounds like it might be what you need. I have no experience with it, though.

  4. Gaylord Aulke says:

    Theo’s book covers a lot of topics regarding LAN failover and load balancing, especially using mod_backhand and Wackamole. AFAIK these concepts only work within a LAN, on the same switch. I did not see much in there about load distribution and failover across different datacenters.

    I think the Akamai direction is the right way to go…

  5. till says:

    Akamai is probably the most expensive route, and probably the easiest since you outsource the hassle to them.

    I am not so sure, though, that two DCs are really necessary. I think you just need one provider you are really comfortable with – not two in the mix.
    I have no idea who you host with or how availability is in your area, but for example I generally prefer the more expensive providers because they let me sleep at night. :-) I’ve had very good experiences with TeleCity, TeleHouse etc. in Europe. Not a real bargain, but also never any issues. In the U.S. we are with Peer1 (in NYC), which sits in the middle between cheap and expensive. We have been with them since November 2007 and it’s been a fun ride, because there have been zero issues so far.

    But in regard to your setup – IMO you should be able to do the following. E.g. assign http://www.dealnews.com two IPs in DNS – this works as a really inexpensive round-robin DNS setup. Each of those IPs is an nginx in each datacenter doing the rest – e.g. distributing across the local servers, or maybe even the other DC (as backup).

    Problems I see:
    1) DNS could fail ;-)
    2) What happens if an nginx at a DC fails?
    3) One of the locations could have an outage.

    Well, for 1 – “just” outsource it to a provider who gives you an SLA etc. IMHO there is not much more you can do.

    And in regard to 2 – set up failover for the nginx servers locally (in each DC) across two or three servers and set up heartbeat. I’m sure you have heard of that before.

    For 3 – I think the key would be to have your own IP space (PI) and re-route it when a location becomes unavailable. I am almost sure that Cisco and Foundry have something for this as well. Also, speaking of load balancers and good deals – check out Coyote Point. We got one of those and it’s been great.

    If you run your own DNS, or your ISP offers something like an API, you should also be able to serve your zones with a really low TTL and tie in Nagios checks to change the IPs, etc.
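    For example, with a really low TTL and a nameserver that accepts RFC 2136 dynamic updates, the Nagios event handler could be as small as this sketch using the dnspython library (the zone is yours, but the TSIG key, nameserver address and data center IPs here are placeholders):

        import dns.query
        import dns.tsigkeyring
        import dns.update

        # When the check for one data center goes critical, repoint www
        # at the surviving data center via a dynamic DNS update.
        keyring = dns.tsigkeyring.from_text({"failover-key.": "c2VjcmV0Cg=="})

        def point_www_at(ip):
            update = dns.update.Update("dealnews.com", keyring=keyring)
            # Replace the www A record and keep the TTL low (60s) so
            # clients pick up the change quickly.
            update.replace("www", 60, "A", ip)
            dns.query.tcp(update, "192.0.2.53")  # placeholder master nameserver

        point_www_at("198.51.100.10")  # IP of the data center that is still up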

    Most of those tricks are along the lines of: don’t spend too much money. Which is generally what I go for – if it works. :-) But if you have a budget, of course, there is always a sales rep waiting for you.

  6. till says:

    I totally forgot (and excuse all the namedropping) – before Peer1 we were with ServerCentral (because the company “was” Chicago-based). Also an awesome company and people. I think we had been with them for five or six years and never had a single problem.

    If you can live with the distance – they do the fully managed thing and also provide consoles, etc.

  7. Well, you might be looking for http://www.dnsmadeeasy.com

    very old, very good, very cheap, very easy to round robin cheap servers around the planet.

    Never rely on just one datacenter.

  8. Scott Larson says:

    First off, anycast is really not a good idea if you’re relying on sessions or some other transaction that needs to hit the same servers until it completes, since it is entirely possible that anycast could shift someone to a new set of machines mid-stream, thus breaking stuff. Of course, if you don’t do that, then you could look at putting it to use this way.

    If you’re concerned about things like proximity to servers or data center load balancing, then GeoDNS or hardware solutions like F5’s BigIP or the offerings of Coyote Point are something to look at. BigIP’s 3DNS is the one I looked into the most myself. They’re designed to be deployed in hot/cold pairs at each data center and then they talk to each other to keep up to date on service metrics. If a data center went offline they’re supposed to handle taking it out of the loop until things are back to normal.

    Another thing to keep in mind is database synchronization. Keeping DBs in different parts of the country up to date with the same information is sketchy, because if the link between the locations breaks then your DBs are out of whack. It may be that you can get away with some lag in data replication, and if that’s the case I envy you.
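    If you do go the lagging-replication route, at least watch the lag. Assuming MySQL-style replication (the host and credentials here are placeholders), a check like this can feed whatever monitoring you use:

        import MySQLdb
        import MySQLdb.cursors

        # Sketch of a replication-lag check for the replica in the
        # remote data center.  Host and credentials are placeholders.
        def replication_lag_seconds(host, user, passwd):
            db = MySQLdb.connect(host=host, user=user, passwd=passwd,
                                 cursorclass=MySQLdb.cursors.DictCursor)
            try:
                cursor = db.cursor()
                cursor.execute("SHOW SLAVE STATUS")
                status = cursor.fetchone()
                if status is None:
                    return None  # not a replica at all
                # NULL here means replication is broken, not merely behind.
                return status["Seconds_Behind_Master"]
            finally:
                db.close()

        print("replication lag:", replication_lag_seconds(
            "db-replica.example.com", "monitor", "secret"))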

    I’ve been dealing with these sorts of issues myself lately and the challenge is simultaneously fun and exhausting.

  9. spenser says:

    For many sites, once you have more than one physical location, it is wasteful to declare one of the locations as a failover location. It is usually better to split the load amongst the locations. If one of the locations fails, its DNS records are withdrawn from publication and the load is spread to the other locations. When things are going well, the site benefits from lower latency for all users, because geolocation dedicates each location to serving a specific geographical region. The difference can be dramatic.

    edgedirector.com managed DNS implements the required features – failover, failback, geolocation and server monitoring – as a third-party package.

  10. Scott,

    Several of the CDN providers use anycast for HTTP. Some of them did a study and found it to be surprisingly stable (sessions were moved around very infrequently).

    If “keeping the sessions to one datacenter” is an issue then you’ll have problems anyway though. Fix that first!

    – ask
