Good failover / high availability solutions for Linux?


Solution 1

http://linux-ha.org/ for all your high-availability needs. Like the song says, the best things in life are free.

Solution 2

Michael is correct that the community is a bit fractured right now, and documentation is a tad sparse.

Actually, it's all there, it's just impossible to understand. What you really want is the "Pacemaker Configuration Explained" ebook... (Link to PDF). You'll want to read it about a dozen times, and then try to implement it, and then read it another dozen times so that you can actually grok it.

The best-supported implementation of cluster services for Linux at this point is probably Novell's SLES11 with its High Availability Extension (HAE). It came out just a month or two ago, and it comes with a nice thick 200-page manual that describes how to set it up and get things running. Novell has also been excellent about supporting Pacemaker configurations in various forms.

Beyond that, there's RHEL5's implementation, which has the same package and decent documentation, but I think it's more expensive than SLES. At least, it is for us.

I would avoid Heartbeat right now and go with Pacemaker/OpenAIS, because they're going to be much better supported going into the future. HOWEVER, the current state of the community is such that there are a few experts, a few people who are running it in production, and a whole ton of people who are completely clueless. Join the Pacemaker mailing list and pay attention to a man named Andrew Beekhof.

Edit to provide requested details:

Pacemaker/OpenAIS uses a 'monitor' operation on a 'primitive resource' (e.g. nfs-server) to keep track of what the resource is doing. If the example NFS server goes unresponsive to the rest of the cluster for X number of seconds, then the cluster will execute a STONITH (Shoot The Other Node In The Head) operation to shut down the primary node, promoting the secondary node to active. You decide in the configuration what to bring up afterward and which associated actions to take. Implementation details from there depend on which service you're trying to make fail over and on the execution windows for certain operations (such as promoting the primary node back to master). The whole thing is pretty much as configurable as possible.
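To make the sequence above concrete, here is a minimal sketch of the decision only: compare the age of the last good monitor result against the timeout, and if it is too old, fence first and start the resource on the peer second. This is not Pacemaker code or configuration; the function name, the action labels and the node names are all made up for illustration.

```python
def plan_actions(now, last_monitor_ok, timeout, active, standby):
    """Toy model of the monitor -> STONITH -> promote sequence.

    All names here are hypothetical; a real cluster manager takes these
    decisions internally, based on its own configuration.
    """
    # The resource answered its monitor operation recently enough: do nothing.
    if now - last_monitor_ok <= timeout:
        return []
    # Otherwise fence the active node first, so that it cannot keep running
    # the resource, and only then bring the resource up on the standby node.
    return [("stonith", active), ("start", standby)]


# Last good monitor result at t=8s, timeout of 5s, checked at t=20s.
for action, node in plan_actions(20.0, 8.0, 5.0, "node1", "node2"):
    print(action, node)
```

The ordering is the point of the sketch: the fencing step comes before the start step, so the two nodes never run the resource at the same time.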

Solution 3

I have used a variety of cluster solutions on Linux. I'm also a configuration management proponent (Chef or Puppet, that is), so I'll add a bit about that in my descriptions.

Veritas Cluster Server (VCS). It's been a while, but we deployed a few Linux VCS clusters on RHEL 3.0, and I would hope it's available on RHEL 5.0. You should already be familiar with the difficulty of setting this up, as it's familiar territory. As you may be aware, VCS is expensive. Anecdotally, VCS is not well suited to being set up by configuration management.

Speaking of RHEL, Red Hat Cluster Suite has matured a lot since its original release with RHEL 2.1. The setup/configuration phase is pretty straightforward, the documentation is very complete and helpful, and, like VCS, you can purchase support from the vendor. For a commercial HA product, RHCS is reasonably priced. I would use configuration management only to install the packages, and then maintain the cluster "by hand" through the web interface. Also, I've heard of some people using it on non-Red Hat platforms, though I don't have experience with that directly.

Linux-HA (DRBD/Heartbeat) is great as well, though coming from VCS the configuration may seem simplistic yet unwieldy. It is pretty easy to automate with a configuration management tool.

As a proof of concept, I've installed a Linux cluster with IBM's HACMP, their AIX clustering software. I would not recommend this; as I recall, it is more expensive than even VCS. IBM has specific procedures for installing and maintaining HACMP, so I would not use configuration management here.

Solution 4

On Linux we have implemented clustering with Heartbeat and DRBD. Heartbeat checks the status of each server, and DRBD keeps the data in sync between the servers. We have the Oracle service running on one server and Apache on the other. When the server running Oracle fails, Heartbeat detects the failure and restores the Oracle service on the server running Apache, and vice versa. We have used this setup for many other purposes as well, and it has been reliable to date.
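The mutual-takeover arrangement described above can be sketched as a small placement function: each service has a home server, and if that server is down the surviving server runs it. This only models the outcome and is not how Heartbeat itself is configured; the host names `db1` and `web1` and the function name are assumptions made up for the example.

```python
def place_services(preferred, alive):
    """Decide where each service should run.

    preferred: dict mapping service name -> its home node (hypothetical names)
    alive:     set of nodes currently considered healthy
    """
    placement = {}
    survivors = sorted(alive)
    for service, home in preferred.items():
        if home in alive:
            # Home node is healthy: the service stays where it belongs.
            placement[service] = home
        else:
            # Home node is down: move the service to a surviving node,
            # or mark it unplaced (None) if nothing is left.
            placement[service] = survivors[0] if survivors else None
    return placement


preferred = {"oracle": "db1", "apache": "web1"}
print(place_services(preferred, {"db1", "web1"}))  # both services at home
print(place_services(preferred, {"web1"}))         # db1 failed, web1 runs both
```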

Solution 5

Red Hat Cluster Suite will do what you want for just about every possible application. In combination with GFS and Cluster LVM you can have solid shared storage.

Maintenance is not much more difficult than keeping the individual boxes running. Application migration actually makes it easier to patch the individual boxes.

RHCS comes with a web frontend (Luci) and a GTK frontend (system-config-cluster) to make configuration and migration clickable. It'll let you configure failover domains per application, recovery policies, fencing, all from one central, web-based management console.

Considering the fact that RHCS actually has a pretty solid support option, I'd go for RHCS.

Not sure how much this would cost you, but I figure it's in the range of several thousand dollars.

Author by

ericslaw

Updated on September 17, 2022

Comments

  • ericslaw almost 2 years

    I have several cases where I need applications to be migrated from one server to another in the event of a failure (server hang or crash).

    On Solaris we do this with VCS (Veritas Cluster Server). What options are available for Linux?

    Please indicate level of effort to setup/maintain or cost (if any) for each.

    -- More details added --

    To give an idea of the complexity level:

    • failing server could hang or crash without notice, and may still be 'ping-able'
    • recovery server needs to start up its applications on failover
    • once the failing server boots/power-cycles, it becomes passive so as not to interfere with the recovery server.
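The first requirement, a server that is hung but still 'ping-able', suggests probing the application itself rather than the host. A minimal sketch, assuming the application listens on a TCP port and answers some probe message; the `PING` probe, the function name and the default timeout are placeholders, not part of any of the products discussed here.

```python
import socket


def service_responds(host, port, probe=b"PING\n", timeout=2.0):
    """Return True only if the service answers the probe within the timeout.

    A host that still replies to ICMP ping, or whose kernel still accepts
    TCP connections, fails this check if the application never answers.
    The probe bytes are a hypothetical application-level request.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            conn.sendall(probe)
            # Any reply at all counts as alive; an empty read means the
            # peer closed the connection without answering.
            return len(conn.recv(1)) > 0
    except OSError:
        # Covers connection refused, unreachable host and timeouts.
        return False
```

A cluster manager's monitor action would typically run a check of this kind and report failure when it returns False.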

    This is a data collection or compute node, not a database, so simpler solutions could work.

    -- even more details (sorry) --

    Shared storage is not an option, but not much state (if any) needs to migrate from one server to the other. We keep the two servers in sync via rsync.

    Thank you very much for all the posts so far.

  • Matt Simmons about 15 years
    Could you suggest a good book or two on this subject?
  • ericslaw over 13 years
    not sure why someone marked this down... this looks like a viable solution (though there are always technical gotchas... at least this doesn't look like a 'service' which was my first impression).
  • NickW about 11 years
    Is that still updated? The website says: Copyright © 2000-2005, Horms Last Updated: Sat Mar 4 16:33:57 2006 +0900
  • slf over 10 years
    can you recommend a good book?