How good is failover of the iSCSI target on a two-node Linux SAN?

Solution 1

SCSI connections time out after 15 seconds or so by default. If your home-built solution can't complete a takeover within that window, you'll need to play with that value. Also worth considering is that normal SANs mirror their cache between controllers, so after a takeover, writes that were acknowledged but not yet committed to disk are not lost. If you can't arrange for that, you risk data corruption, or you'll have to avoid caching writes altogether.
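
On the initiator side the command timeout is exposed per block device in sysfs. A minimal sketch, assuming a Linux initiator where the iSCSI LUN shows up as /dev/sdb (a hypothetical name, adjust to your setup), of checking it and raising it before failover testing:

    # Check and raise the SCSI command timeout for one block device.
    # Writing needs root, and the change does not survive a reboot, so
    # persist it with a udev rule once you are happy with the value.
    from pathlib import Path

    DEVICE = "sdb"            # hypothetical iSCSI-backed device name
    NEW_TIMEOUT_SECONDS = 60  # long enough to ride out a takeover

    timeout_file = Path(f"/sys/block/{DEVICE}/device/timeout")
    print("current timeout:", timeout_file.read_text().strip(), "seconds")
    timeout_file.write_text(str(NEW_TIMEOUT_SECONDS))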

Solution 2

We have set up two Linux boxes as an iSCSI target cluster. We use DRBD and the SCST target and it works fine. (SCST is better than the old iscsitarget; VMware ESXi can kill that one, but not SCST.)

The timeout is a client-side setting, so you can set it lower if you wish.
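
If the initiator is open-iscsi, the client-side knob is node.session.timeo.replacement_timeout in /etc/iscsi/iscsid.conf (path assumed to be the usual default). A rough sketch to see what it is currently set to:

    # Report the open-iscsi replacement timeout: how long the initiator
    # queues I/O while the session is down before failing commands back
    # to the SCSI layer.
    import re
    from pathlib import Path

    conf = Path("/etc/iscsi/iscsid.conf").read_text()
    match = re.search(
        r"^\s*node\.session\.timeo\.replacement_timeout\s*=\s*(\d+)",
        conf, re.MULTILINE)
    if match:
        print("replacement_timeout:", match.group(1), "seconds")
    else:
        print("not set explicitly; open-iscsi defaults to 120 seconds")

Whatever value you pick, it still has to cover the worst-case takeover time you measure in testing, otherwise the initiator will fail I/O in the middle of the failover.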

Comments

  • Luke404
    Luke404 almost 2 years

    I'm evaluating the possibility of using two off-the-shelf servers to build a cheap redundant iSCSI SAN. The idea is to run Linux, Pacemaker, and an iSCSI target - something like the SAN Active-Passive example on linux-ha-examples.

    The same page scares me a little when I read:

    During the switchover of the iscsi-target one can detect a gap in the protocol of write-test.log. In our setup we observed a delay of 30s. There are problems reported in connection of ext3 and an iscsi failover. This configuration has been tested with ext2 and ext3 and worked with both filesystems.

    Has anyone put into production a redundant iSCSI SAN made out of Linux boxes? Is a failover event really that bad? A 30-second freeze in I/O sounds like a disaster to me, doesn't it?

    • Naveed Abbas
      Naveed Abbas over 12 years
      No, 30 seconds is not a disaster. Many midrange external (FC) disk arrays have a similar I/O freeze in their worst-case failover scenario. Most applications, including databases, happily survive even longer freezes. Just tune the clients' timeouts, test, and verify that SCSI commands unfreeze without being failed by the client OS.
    • pfo
      pfo over 12 years
      FYI: Commercial-grade enterprise gear specifies (guaranteed) fail-overs in the order of 180 seconds. The default SCSI timeout for the sg layer in the Linux kernel is usually somewhere between 30 and 60s (note that this varies across distros and installed drivers - check '/sys/block/<DEVICE>/device/timeout' for the current setting). If you can't tolerate 30s of I/O blocking, you are probably on the wrong platform and/or taking the wrong approach.
  • wazoox
    wazoox over 12 years
    For your information, IET has received many enhancements lately and now supports SCSI-3 reservations too. I'd say IET and SCST are still the best iSCSI targets from a stability and capability point of view.
  • pfo
    pfo over 12 years
    Good hint! What is quite often forgotten is that you need to disable write caching on your RAID controller card, since you can theoretically and practically lose its whole content: the two boxes you use for fail-over don't have cache-coherent synchronization. This has a huge performance impact.
  • Luke404
    Luke404 over 12 years
    I'm targeting a recent Ubuntu system for the nodes, maybe the next LTS (12.04), and as far as I know the best upstream-included target is IET, so I was thinking about using that one... but I still need to do some more research on the matter...
  • Luke404
    Luke404 over 12 years
    Checking for in-flight writes was already on my list. The broad idea is to use conservative settings whenever possible, including protocol A for DRBD and turning off write caches on the underlying block storage. We're targeting a stable solution and luckily we don't need super-high performance. The SAN will run over a dedicated gigabit network with jumbo frames, and the two storage nodes will have separate 2x gigabit bonded crossover links with jumbo frames dedicated to DRBD.
  • Stone
    Stone over 12 years
    IET sometimes dies under huge IO.
  • Basil
    Basil over 12 years
    You're spending a lot of config time (and hardware) on re-solving a very old problem. Why not just put in an HP LeftHand VM or something? Spend the $10k now and save yourself the hundreds of hours of head-scratching a home-built solution will cause.