Sun Cluster 3.x Quorum Issue

Peter van der Weerd

Clustering software usually consists of a collection of scripts and binaries that unconfigure an interface, bring down an application, unmount some file systems, give away a group of disks, and reverse this procedure on some other machine. This goes for all Unix cluster solutions. Of course, there are differences at various levels: vendors use different storage products and device-management software, have their own ideas about establishing and maintaining membership between the clustered machines, and so on.

So, if Unix clustering is so straightforward and common practice, why am I writing an article on it? Good question. In this article, I will not list all pros and cons of all different cluster products. I will describe one con of one Unix cluster product, elaborate on it a bit, and come up with a script that could help. Specifically, I will cover the quorum issue in Sun Microsystems' Sun Cluster 3.x product. Even though Sun, to my mind, has one of the most advanced cluster solutions in the field, there is a drawback. This drawback is the quorum device issue, or to be a little more exact, the ignored issue of losing the disk that is your quorum device.

Quorum Device

You may already know about the quorum device issue, but just to make sure, here's a short recap. Unix clusters all have some sort of heartbeat protocol that uses either a dedicated network or all available networks to communicate and establish "membership". Membership means that both nodes, or in the case of more than two nodes, all nodes have to be aware of each other at all times. You cannot have a node that is unable to communicate with its cluster buddies deciding to run applications on its own.

A key concept in Unix clusters is high availability, but you do not want your applications to be doubly available. Access and changes to your data should at all times be controlled by one source, or at least monitored by one source to prevent two individual instances from changing your data without knowing about each other. Data integrity is considered more important than availability.

If, at any time, membership can no longer be guaranteed, an important decision has to be made: which part of the "broken" cluster should continue, and which part should instantly die to avoid messed-up data?

This is where the quorum device comes in. To avoid a so-called partitioned cluster, both halves race for the quorum device, try to reserve it, and start counting the votes they hold. For example, in a two-node cluster, the number of votes present when membership is OK is three: one for each node and one for the quorum device. When membership is lost, one node will succeed in reserving the quorum device and the other will not. This leaves the slower node with only one vote, which is less than half of all possible votes. The faster node has two votes, which is more than half. It's that simple: if a node has more than half of the possible votes, it runs all applications; if it has only half or even less, it panics.
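
Just to make the arithmetic explicit, here is the majority rule expressed as a small shell function. This is not Sun Cluster code, only an illustration of the decision each node makes:

#!/bin/sh
# Illustration only -- not Sun Cluster code. A node keeps running only
# when it holds a strict majority of all possible votes.
# usage: has_quorum <votes_held> <votes_possible>
has_quorum() {
        held=$1
        possible=$2
        if [ `expr $held \* 2` -gt $possible ]; then
                echo "$held of $possible votes: majority, keep running"
        else
                echo "$held of $possible votes: no majority, panic"
        fi
}

# Two-node cluster: three possible votes (one per node, one for the quorum disk)
has_quorum 2 3          # the node that won the race for the quorum disk
has_quorum 1 3          # the node that lost the race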

This reservation process differs from product to product. Since this article deals with the Sun Microsystems solution, I will concentrate on their method. Sun Cluster quorum disk reservation is a two-step process that begins with a SCSI reservation. But this SCSI reservation only holds for as long as the node is up and running. If the "surviving" node were to be reset, the reservation would be gone for a short while. At this point, the other node, being panicked and all, would be sitting there waiting for the opportunity to reserve the quorum disk. This is not what Sun Cluster wants. Sun Cluster says "the last node to leave the cluster should be the first one to join", meaning that Sun Cluster wants to prevent a node that does not have the most recent configuration from starting a cluster on its own. So, a SCSI reservation alone is not enough.

Reservation Key

Each node joining the cluster puts its reservation key on the quorum disk. Putting reservation keys on disks is supported in SCSI-3 and is called "Persistent Group Reservation" (PGR). Once membership is established, all nodes will have their keys on the quorum disk. In the case of heartbeat/membership loss, the keys of the nodes that panic are removed from the quorum disk. If, at any time, a node that panicked out of the cluster tries to reserve the quorum disk, it will see that its key is not on the disk and politely inform you that the disk is owned by someone else. It will then do nothing and wait until the other nodes are ready to establish membership again. This is how Sun Cluster prevents older configurations from starting a cluster.
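
Conceptually, the check a joining node performs looks like the sketch below. The pgr_keys helper is purely hypothetical, as is the use of hostid as a key; Sun Cluster reads and compares the SCSI-3 keys internally and does not expose them through any command shown in this article:

# Conceptual sketch only; Sun Cluster performs this check internally at boot.
# "pgr_keys" is a hypothetical helper that would list the SCSI-3 PGR keys
# stored on the quorum disk; MYKEY merely stands in for this node's real key.
QUORUM=/dev/did/rdsk/d4s2
MYKEY=`hostid`

if pgr_keys $QUORUM | grep "$MYKEY" >/dev/null 2>&1; then
        echo "key still present: this node may take part in forming a cluster"
else
        echo "key was removed: do nothing and wait for the other nodes"
fi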

More than Half of the Votes

What if, in a three-node cluster, two nodes die? Would you want the surviving node to panic? Probably not. But under the rule of needing more than half of the possible votes, the node would definitely die. In a three-node cluster, there are four possible votes: three nodes and a quorum disk. When two of the three nodes leave, the one remaining node is left with only two out of four votes, which is not more than half. The solution: the quorum disk carries a number of votes equal to the number of nodes minus one. In a three-node cluster, the total number of votes is then five. Now, if two nodes fail, the surviving node has its own vote plus the two votes of the quorum disk, for a total of three, which is more than half. Additionally, it removes the other nodes' reservation keys from the disk to make sure that it is boss at all times.
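
Plugging the three-node numbers into the illustrative has_quorum check from the earlier sketch:

# Three node votes plus (3 - 1) = 2 quorum disk votes: five possible in total
has_quorum 3 5          # last node standing, holding the quorum disk: 1 + 2 votes
has_quorum 1 5          # an isolated node that did not get the quorum disk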

What If the Quorum Disk Dies?

Now, here we have an issue. Some call this a single point of failure, but it is not. If the quorum disk fails, nothing happens. If you set up your storage correctly, your data is mirrored, so losing a single disk will not affect your data. Where vote counts are concerned, you have no problem either: losing the quorum disk will not leave your nodes with half or less than half of the votes. A single point of failure it is definitely not. So what is the problem? The problem is that there is no daemon that monitors your quorum disk and creates a new one when the old one dies. If your quorum disk is dead and you reboot a node, the other node will not have enough votes left and will panic. This is not what you want. You want a new quorum device the minute the original one dies, because the quorum functionality is not something that can be mirrored.

The Solution

Make sure that you do not need a quorum disk. Sun Cluster has recently decided to let you create heartbeat networks the way you want to, so you can use every interface on your cluster nodes to transport heartbeat. It is not very likely that you will lose all network interfaces at the same time. And if you do lose all network connectivity, what is the availability then? No network means no clients. Unfortunately, Sun Cluster will not let you build a two-node cluster without a quorum disk at this time.

A Cluster

That was the theory bit. Now, let's have a look at a real example. Assume we have a two-node cluster with one quorum disk. But, which disk is the quorum disk? The average customer does not set up a Sun Cluster; Sun sets it up for them. Imagine then that you are the average customer and want to know which disk is your quorum disk. Do the following:

node1#/usr/cluster/bin/scstat -q

-- Quorum Summary --

Quorum votes possible: 3
Quorum votes needed: 2
Quorum votes present: 3

-- Quorum Votes by Node --

                  Node Name           Present Possible Status
                  ---------           ------- -------- ------
  Node votes:     node1               1        1       Online
  Node votes:     node2               1        1       Online

-- Quorum Votes by Device --

                  Device Name         Present Possible Status
                  -----------         ------- -------- ------
  Device votes:   /dev/did/rdsk/d4s2  1        1       Online

Don't let the odd name "d4s2" confuse you. This is not a Solaris Volume Manager device name; it is a so-called DID (device ID) name. In a Sun Cluster environment, every device has a cluster-wide unique name, so device d4 is the same device on node1 as it is on node2. It may well be that the original device file name on node1 is c1t3d0 while it is c2t3d0 on node2. This way, the controller numbering on each node becomes irrelevant.

It is advisable to have data on your quorum disk. Make sure your quorum disk is part of a mirror that is used by a clustered application. Sun Cluster will only detect that a disk is broken when it cannot be accessed anymore; SCSI errors will not occur if you do not access the disk. A quorum device that is part of an active mirror in a clustered environment is a good quorum device: you will see when it is broken, because the scstat command will tell you that it is offline.
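
For example, with Solaris Volume Manager you could put the quorum disk into a shared diskset and mirror it against another shared disk. The diskset name nfsset and the metadevice numbers below are made up for the example; your layout will differ:

# Example only; the diskset and metadevice names are arbitrary.
# Put the quorum disk (d4) and a second shared disk (d5) in a diskset and
# mirror them, so that application I/O keeps touching the quorum disk.
metaset -s nfsset -a -h node1 node2
metaset -s nfsset -a /dev/did/rdsk/d4 /dev/did/rdsk/d5

metainit -s nfsset d11 1 1 /dev/did/rdsk/d4s0   # submirror on the quorum disk
metainit -s nfsset d12 1 1 /dev/did/rdsk/d5s0   # submirror on the other shared disk
metainit -s nfsset d10 -m d11                   # one-way mirror
metattach -s nfsset d10 d12                     # attach the second submirror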

A quorum device that is offline is a failed device, and the vote count is now one failure away from disaster. One more lost vote (a node that dies) and your applications will no longer be available, because the other node will panic and will not be able to achieve membership for as long as the dead node stays dead. You will have to create a new quorum disk before that happens. To do so, first add a new device and then remove the old one. You cannot simply remove the only quorum device you have first: you might compromise quorum, even though the device is broken. So, just go with the flow and add a new quorum disk as soon as you see that the present quorum disk is offline.

Broken Quorum Disk

We break the disk...it is broken:

node1#/usr/cluster/bin/scstat -q

(output skipped)

-- Quorum Votes by Device --

                  Device Name         Present Possible Status
                  -----------         ------- -------- ------
  Device votes:   /dev/did/rdsk/d4s2  0        1       Offline

As you can see, the disk is offline and obviously not available to function as a tie breaker in the case of heartbeat loss. We double-check by running the Sun Cluster diskpath monitor "scdpm" and letting it collect all failed devices:
scdpm -p all|grep Fail
node1:/dev/did/rdsk/d4 Fail
node2:/dev/did/rdsk/d4 Fail

Select New Quorum Disk

Before you can select a new quorum disk, you must determine which disks are available as quorum disks. The new quorum disk must be a disk that is accessible by both cluster nodes. It should be a shared disk:

scdidadm -L
1 node2:/dev/rdsk/c0t2d0 /dev/did/rdsk/d1
2 node2:/dev/rdsk/c0t0d0 /dev/did/rdsk/d2
3 node2:/dev/rdsk/c1t9d0 /dev/did/rdsk/d3
3 node1:/dev/rdsk/c1t9d0 /dev/did/rdsk/d3
4 node2:/dev/rdsk/c1t10d0 /dev/did/rdsk/d4
4 node1:/dev/rdsk/c1t10d0 /dev/did/rdsk/d4
5 node2:/dev/rdsk/c1t11d0 /dev/did/rdsk/d5
5 node1:/dev/rdsk/c1t11d0 /dev/did/rdsk/d5
6 node2:/dev/rdsk/c1t12d0 /dev/did/rdsk/d6
6 node1:/dev/rdsk/c1t12d0 /dev/did/rdsk/d6
7 node2:/dev/rdsk/c1t13d0 /dev/did/rdsk/d7
7 node1:/dev/rdsk/c1t13d0 /dev/did/rdsk/d7
8 node2:/dev/rdsk/c1t14d0 /dev/did/rdsk/d8
8 node1:/dev/rdsk/c1t14d0 /dev/did/rdsk/d8
9 node1:/dev/rdsk/c0t0d0 /dev/did/rdsk/d9

This list shows all device IDs as well as the native device file names of all disks on both nodes. All drives that appear twice in the list are shared devices, which means you can pick any of them as long as the disk path monitor thinks they are OK. Drive d4 is broken, so we pick the next one in the list, provided it is still OK:

scdpm -l all:all

(output skipped)
node1:/dev/did/rdsk/d4 Fail
node1:/dev/did/rdsk/d5 Ok
node1:/dev/did/rdsk/d6 Ok
(output skipped)
Obviously, drive d5 is ok. So, we select that one:
scconf -a -q globaldev=d5
We check whether d5 is now a valid quorum disk:
scstat -q

(output skipped)
-- Quorum Votes by Device --

                  Device Name         Present Possible Status
                  -----------         ------- -------- ------
  Device votes:   /dev/did/rdsk/d4s2  0        1       Offline
  Device votes:   /dev/did/rdsk/d5s2  1        1       Online
(output skipped)
Now, we can remove the old quorum device:
scconf -r -q globaldev=d4
At this point, we are back to where we were before the disk broke. Until this new quorum disk breaks, we can rest assured that in the case of heartbeat loss, one node will survive by grabbing the quorum disk and collecting a majority of the votes. Unfortunately, we cannot go around checking the health of the quorum disk all the time, so it might be an idea to write a script to do this for us (see Listing 1).
This script checks every 5 minutes whether your quorum disk is still okay. If it is not, the script takes the first healthy shared disk from a list and makes it the new quorum device, repeating this until no disks are left. And if all disks are gone, you will very likely have bigger things to worry about than your quorum disk.
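
Listing 1 itself is not reproduced here, but a minimal sketch along those lines, built only from the commands shown above, could look like the following. Treat it as a starting point rather than a finished monitor; the output parsing is deliberately simple:

#!/bin/sh
# Sketch of a quorum disk monitor -- not the author's Listing 1.
# Every 5 minutes: if the current quorum device is Offline, promote the
# first healthy shared DID device to quorum and remove the broken one.
PATH=/usr/cluster/bin:$PATH

while true; do
        # Full DID path of a quorum device that scstat reports Offline, if any
        dev=`scstat -q | grep 'Device votes:' | grep Offline | head -1 | awk '{print $3}'`
        # Strip directory and slice: /dev/did/rdsk/d4s2 -> d4
        bad=`echo $dev | sed -e 's,.*/,,' -e 's,s[0-9]*$,,'`

        if [ -n "$bad" ]; then
                # Shared disks are the DID paths that scdidadm -L lists twice
                for did in `scdidadm -L | awk '{print $3}' | sort | uniq -d`; do
                        new=`basename $did`
                        [ "$new" = "$bad" ] && continue
                        # Skip candidates that have a failed path on either node
                        if scdpm -p all | grep Fail | grep "${did}[^0-9]" >/dev/null; then
                                continue
                        fi
                        scconf -a -q globaldev=$new &&
                        scconf -r -q globaldev=$bad &&
                        break
                done
        fi
        sleep 300
done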

Conclusion

Since virtually all Unix cluster solutions use a voting mechanism in combination with quorum disks, there is probably something about quorum disks that makes them either popular or inevitable. To avoid disappointment when such a disk fails, it is good to take precautions. I hope this article helps a bit.

Peter van der Weerd works as a freelance Solaris, HP-UX, and Linux trainer in Europe.
