ZFS over iSCSI high-availability solution


Solution 1

It's not a direct answer to your question, but a more traditional architecture for this sort of thing would be to use HAST and CARP to take care of the storage redundancy.


A basic outline (see the linked documentation for better details):

Machine A ("Master")

  • Configure the HAST daemon & create an appropriate resource for each pool-member device.
  • Create your ZFS mirrored device as you would on any single system, using the HAST devices.
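
A minimal sketch of what that could look like, with invented hostnames, addresses, device names and resource/pool names (see hast.conf(5) and the HAST chapter of the Handbook for the real details):

    # /etc/hast.conf -- the same file is used on both machines
    resource disk0 {
        on storage-a {
            local /dev/ada1
            remote 192.168.0.2
        }
        on storage-b {
            local /dev/ada1
            remote 192.168.0.1
        }
    }

    # On the Master (storage-a):
    hastctl create disk0            # initialise HAST metadata on the local disk
    service hastd onestart
    hastctl role primary disk0
    # ...repeat for a second resource (disk1), then build the pool on the
    # HAST providers:
    zpool create data mirror /dev/hast/disk0 /dev/hast/disk1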

Machine B ("Slave")

  • Configure the HAST daemon similarly to what you did on Master, but bring it up as a secondary/slave node.
    (HAST will mirror all the data from the Master to the Slave for you)
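
Roughly, using the same invented names as above (HAST then keeps the Slave's disks in sync with the Master's):

    # On the Slave (storage-b), with the identical /etc/hast.conf:
    hastctl create disk0
    service hastd onestart
    hastctl role secondary disk0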

Both Machines

  • Configure CARP plus a failover hook (e.g. via devd) so that when the Master fails, the Slave promotes its HAST resources to primary and imports the pool -- a rough sketch of such a failover action follows.
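
A very rough sketch of that failover action, using the same invented resource and pool names as above (in practice this would be triggered by a CARP state-change event):

    #!/bin/sh
    # Hypothetical promote-to-primary script run on the Slave.
    hastctl role primary disk0
    hastctl role primary disk1
    zpool import -f data        # the pool lives on the /dev/hast/* providers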

The big caveat here is that HAST only works on a Master/Slave level, so you need pairs of machines for each LUN/set of LUNs you want to export.

Another thing to be aware of is that your storage architecture won't be as flexible as it would be with the design you proposed:

  • With HAST you're limited to the number of disks you can fit in a pair of machines.
  • With the iSCSI mesh-like structure you proposed, you can theoretically add more machines exporting more LUNs and grow as much as you'd like (up to the limits of your network).

That trade-off in flexibility buys you a tested, proven, documented solution that any FreeBSD admin will understand out of the box (or be able to figure out from the Handbook) -- to me it's a worthwhile trade-off :-)

Solution 2

"zpool status -x" will output whether all pools are healthy or output the status of ones that are not. If a iSCSI LUN vdev goes offline a cron job running a script based around that command should give you a way to have cron alerts on a regular basis.

"zpool import" should be able to import the existing zpool from the iSCSI LUNs vdevs. You may have to force the import if the pool was not exported cleanly but internal metadata should keep the data in a consistent state even if writes were interrupted by the database node failing.

Author by oberstet

Updated on September 18, 2022

Comments

  • oberstet over 1 year

    I am considering a ZFS/iSCSI based architecture for a HA/scale-out/shared-nothing database platform built from wimpy nodes of plain PC hardware running FreeBSD 9.

    Will it work? What are possible drawbacks?

    Architecture

    1. Storage nodes have cheap direct-attached SATA/SAS drives. Each disk is exported as a separate iSCSI LUN. Note that no RAID (neither HW nor SW), partitioning, volume management or anything like that is involved at this layer -- just one LUN per physical disk.

    2. Database nodes run ZFS. A ZFS mirrored vdev is created from iSCSI LUNs from 3 different storage nodes. A ZFS pool is created on top of the vdev, and within that a filesystem which in turn backs a database. (A command-level sketch follows this list.)

    3. When a disk or a storage node fails, the respective ZFS vdev will continue to operate in degraded mode (but still have 2 mirrored disks). A different (new) disk is assigned to the vdev to replace the failed disk or storage node. ZFS resilvering takes place. A failed storage node or disk is always completely recycled should it become available again.

    4. When a database node fails, the LUNs previously used by that node are free. A new database node is booted, which recreates the ZFS vdev/pool from the LUNs the failed database node left over. There is no need for database-level replication for high-availability reasons.
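
    A rough command-level sketch of steps 2-4 (the device names da1..da4 and the pool name dbpool are invented; the real names depend on how the iSCSI LUNs attach):

        # Step 2: 3-way mirror built from LUNs on three storage nodes
        zpool create dbpool mirror da1 da2 da3
        zfs create dbpool/db                  # filesystem backing the database

        # Step 3: a failed LUN leaves the vdev DEGRADED; swap in a new one
        zpool status -x
        zpool replace dbpool da2 da4          # resilvering starts automatically

        # Step 4: a replacement database node picks up the surviving LUNs
        zpool import -f dbpool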

    Possible Issues

    • How to detect degradation of the vdev? Check every 5 s? Is there any notification mechanism available with ZFS?

    • Is it even possible to recreate the pool from the existing LUNs making up a vdev? Any traps?

  • oberstet almost 12 years
    Periodically running zpool status, OK. But is there nothing "event driven"? Is there a ZFS DTrace provider that could emit events such as a failed disk? Regarding zpool import: so all the info needed to recreate a pool is stored within the vdevs (and, with a mirrored vdev, redundantly)? No info from the host (which may have died completely) is needed?
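
    One event-driven avenue worth investigating is devd(8). A sketch of such a rule -- note that the ZFS event system/type strings here mirror the style of FreeBSD's stock ZFS devd rules and may not be available on FreeBSD 9:

        # /etc/devd/zfs-notify.conf (hypothetical file)
        notify 10 {
            match "system"  "ZFS";
            match "type"    "resource.fs.zfs.statechange";
            action "logger -p daemon.warning 'ZFS vdev state change'";
        };
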
  • oberstet almost 12 years
    It's even more tricky. I am using a test setup of 2 storage VMs and 2 database VMs, all FreeBSD 9. I created a 2-way mirrored pool over LUNs from the 2 storage nodes. I could verify resilvering after a storage node went away and came back. However, I needed to kill the iscontrol instance on the database node for the storage node that had gone away; without doing that, iscontrol silently tries to reconnect forever. It only starts doing so once the ZFS pool is accessed, but as long as it retries, any zpool command just hangs. Need to do more experiments. Can I run iscontrol without it daemonizing itself?