High Availability for SARK

From sailpbx
Revision as of 09:56, 10 July 2011 by Adminwiki

History

HA was originally developed for a customer in South Africa, and it has since proved very popular with users who place a high value on the availability of their PBX. The goal was, and is, to provide an Asterisk-based PBX with very near 100% availability, and SARK HA gets pretty close to this goal. The original SARK Version 2 offering is called sarkha V1, but this has now been augmented, as part of the V3 development, by a new general-purpose Asterisk module called asha (Asterisk High Availability).

How it works

Asha uses the proven, standard Linux HA cluster technology to run two Asterisk servers side by side: one active, or primary, server and one passive, or standby, server. In the event of a failure of either Asterisk or Linux on the active server, the standby server will automatically assume control of all the resources and continue the service until the primary is once again ready to assume responsibility. The time it takes to fail over depends on how you set the Heartbeat parameters, but it is quite common to have large systems (250 to 300 phones) set up to fail over in under 20 seconds. That figure is measured from a hard down on the primary to being able to place outbound calls on the standby.
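As a rough illustration of the Heartbeat parameters that influence fail-over time, the sketch below prints the timing lines of a Heartbeat ha.cf file. The directive names (keepalive, deadtime, initdead) are standard Heartbeat directives; the function name render_ha_timing and the values in the usage note are assumptions made for this example, not the settings that asha or sarkha actually write.

```shell
# Hypothetical helper: print Heartbeat timing directives for ha.cf.
# Usage: render_ha_timing KEEPALIVE DEADTIME INITDEAD   (all in seconds)
render_ha_timing() {
    local keepalive="$1" deadtime="$2" initdead="$3"

    # A peer cannot sensibly be declared dead sooner than one heartbeat interval.
    if [ "$deadtime" -le "$keepalive" ]; then
        echo "deadtime must be larger than keepalive" >&2
        return 1
    fi

    printf 'keepalive %s\n' "$keepalive"   # interval between heartbeats
    printf 'deadtime %s\n' "$deadtime"     # silence after which the peer is declared dead
    printf 'initdead %s\n' "$initdead"     # longer grace period used at start-up
}
```

For example, render_ha_timing 2 10 120 prints three lines asking Heartbeat to send a heartbeat every 2 seconds and to declare the peer dead after 10 seconds of silence. A shorter deadtime gives a faster fail-over at the cost of a higher risk of false alarms on a busy network.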

There are three components to the solution:

  • The regular Linux Heartbeat modules, which you can install with rpm or yum and can usually be found in the extras repo of the CentOS 5 and SME 8.0 distribution libraries.
  • The asha rpm, which manages the synchronisation of data between nodes and adds some ancillary functions such as the Asterisk watchdog task.
  • The sarkha V3 environment rpms, which manage the SARK-specific integration with asha.

Installation

Installation is straightforward in an EL5 environment (RHEL 5, CentOS 5 or SME 8.0). Heartbeat can be installed directly from the yum repos, and the other modules can be fetched from the download site.

Here is a regular EL5 install:

yum install heartbeat --enablerepo=*
rpm -Uvh asha-1.0.0-2.noarch.rpm
rpm -Uvh el5sarkha-3.1.0-2.noarch.rpm
reboot

Here is an smeserver 8 install:

yum install heartbeat --enablerepo=*
rpm -Uvh asha-1.0.0-2.noarch.rpm
rpm -Uvh smesarkha-3.1.0-2.noarch.rpm
signal-event post-upgrade
signal-event reboot
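After either install, it can be worth confirming that the packages really are present before moving on to configuration. The following is a minimal sketch; the function name missing_packages and the RPM override variable are assumptions made for illustration, and the package names are simply those used in the commands above.

```shell
# Command used to query the rpm database (overridable, e.g. for testing).
RPM="${RPM:-rpm}"

# Hypothetical helper: print the name of every package that is NOT installed.
# Usage: missing_packages PKG...
missing_packages() {
    local pkg
    for pkg in "$@"; do
        # "rpm -q" exits non-zero when the package is not installed.
        "$RPM" -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}
```

On an EL5 system you might run missing_packages heartbeat asha el5sarkha (or smesarkha on SME 8.0). No output means that all of the named packages were found.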

Configuration

Once installed, you will find some new options and buttons on the SARK global settings panel.

SarkhaV3 globals.png


As you can see, there is a new tab, "High Availability", and an additional Start/Stop button. The extra Start/Stop button controls the high-availability engine (Heartbeat). You can decide whether or not to enable it from Globals. You also have some new variables to fill out before you can start your HA image. Be aware that you will need to fill this information out (at least initially) on BOTH systems (primary and standby).

  • HA Synch Mode (LAZY|LOOSE)

Dictates whether or not automatic synchronisation will occur between the two nodes. LAZY means that the nodes will sync automatically; LOOSE means no automatic sync.

  • HA IP Address

The Virtual IP address used by Heartbeat to hand control back and forth between the nodes. This can be any free static IP address on your subnet.

  • HA Primary Node (uname -n)

Heartbeat uses node names to find its partners. The primary node is the name given by uname -n on the primary node.

  • HA Failover Node (uname -n)

Heartbeat uses node names to find its partners. The failover node is the name given by uname -n on the standby node.

  • Provision with Cluster IP? (YES/NO)

This tells the provisioning subsystem whether it should provision the phones to register to this server's "real" IP address (NO) or to the cluster's virtual IP address (YES). It is usual to set this to YES when the cluster is active.

  • HA Auto-Failback? (off/on)

This tells the HA component whether or not to automatically fail back to the primary after an initial failover to the standby. Most users run with this off.
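Because the same values have to be entered on both systems, a quick sanity check can catch a mistyped node name or sync mode. The sketch below is illustrative only: valid_sync_mode and ha_role are hypothetical helper names, not part of asha or sarkha, and the check simply compares the configured names with the output of uname -n.

```shell
# Hypothetical helper: succeed only for the two documented sync modes.
valid_sync_mode() {
    case "$1" in
        LAZY|LOOSE) return 0 ;;
        *)          return 1 ;;
    esac
}

# Hypothetical helper: report which role this host plays in the cluster.
# Usage: ha_role PRIMARY_NAME FAILOVER_NAME [LOCAL_NAME]
# LOCAL_NAME defaults to the output of "uname -n".
ha_role() {
    local primary="$1" failover="$2" me="${3:-$(uname -n)}"

    if [ "$primary" = "$failover" ]; then
        echo "primary and failover node names must differ" >&2
        return 1
    fi

    case "$me" in
        "$primary")  echo primary ;;
        "$failover") echo standby ;;
        *)
            echo "this host ($me) is neither configured node" >&2
            return 1
            ;;
    esac
}
```

Running ha_role pbx1 pbx2 on each machine (with pbx1 and pbx2 standing in for your own node names) should print primary on one and standby on the other. An error on either machine suggests that the name entered in the panel does not match uname -n.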

Initial Start Up

Initially, both Asterisk and the HA engine should be stopped on BOTH nodes. The start-up sequence is as follows:

  • Start the HA engine on the primary node by clicking the HA engine start button ONLY.

SarkhaV3globals confirm1.png




Press confirm and you should see this screen:

SarkhaV3globals running1.png

Notice that the start buttons have disappeared and all you now have is a stop button for HA. This is because the HA engine is waiting for the standby node to come online. If the standby node comes online within 120 seconds, the primary will start Asterisk. If it does not, the primary will assume the standby isn't coming up and will start Asterisk anyway. You will then see this screen on the primary:

SarkhaV3globals up1.png

Notice that you now have the option to stop either the HA engine or Asterisk on the primary server (two STOP buttons). If you stop Asterisk, the system will fail over to the standby. This is a manual fail-over, and it uses the same logic path that an automatic fail-over takes in the event of an Asterisk crash. A watchdog daemon looks for the running Asterisk task every few seconds. If it doesn't find it, it invokes a fail-over. It doesn't matter whether Asterisk failed or was manually stopped; the process is the same.
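The watchdog behaviour described above can be sketched in a few lines of shell. This is not asha's actual watchdog, which is not shown here. The function names are hypothetical, the use of pgrep to find the process is an assumption, and so is the use of Heartbeat's hb_standby helper (whose install path varies between Heartbeat versions) to hand the resources to the peer.

```shell
# Assumed location of Heartbeat's hb_standby helper; adjust for your system.
HB_STANDBY="${HB_STANDBY:-/usr/lib/heartbeat/hb_standby}"

# Succeeds if a process named exactly "asterisk" is running.
asterisk_running() {
    pgrep -x asterisk >/dev/null 2>&1
}

# Ask Heartbeat to hand this node's resources to its peer.
request_failover() {
    "$HB_STANDBY"
}

# One watchdog pass: returns 0 if Asterisk is up; otherwise it requests
# a fail-over and returns 1.
watchdog_check() {
    if asterisk_running; then
        return 0
    fi
    request_failover
    return 1
}

# Repeat the check every INTERVAL seconds (default 5) until a fail-over
# has been requested.
watchdog_loop() {
    while watchdog_check; do
        sleep "${1:-5}"
    done
}
```

The check does not try to work out why Asterisk is missing, which mirrors the point made above: a crash and a manual stop lead to the same fail-over.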

Finally, once HA is up and Asterisk is running normally on the primary, if you look at the Globals panel on the standby node you will see that there is no longer a start/stop button for Asterisk. This is to prevent Asterisk from being started accidentally on the standby while it is also running on the primary.
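The start-up wait described earlier in this section (wait up to 120 seconds for the standby, then start Asterisk regardless) follows a common "wait with timeout" pattern. The sketch below shows that pattern only. In practice Heartbeat performs this wait itself, and the function name wait_for_peer and the idea of passing the peer check as a command are assumptions made for illustration.

```shell
# Hypothetical helper: poll a command once a second until it succeeds or
# TIMEOUT seconds have passed.
# Usage: wait_for_peer TIMEOUT COMMAND [ARGS...]
# Returns 0 if COMMAND succeeded in time, 1 if the timeout expired.
wait_for_peer() {
    local timeout="$1" waited=0
    shift

    while :; do
        if "$@"; then
            return 0            # peer is up
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1            # give up waiting
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

With a hypothetical peer_is_up check, wait_for_peer 120 peer_is_up would return as soon as the standby appears, or after roughly two minutes if it never does. In both cases the caller would go on to start Asterisk on the primary; the return value only records whether the standby joined.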

Synchronisation

By default, when the system HASYNC parameter is set to LAZY, the sync task runs every minute on the standby node and uses rsync to copy the relevant data from the primary node to the standby node. This ensures that the standby node is never more than about a minute behind the primary. The only exception to this rule is when Asterisk is running on the standby node. Asterisk should only ever run on the standby if the primary has failed, so if the sync task finds Asterisk running it will exit without attempting synchronisation. The files and directories that are synchronised by default are as follows:

  • The SARK SQLite database
  • /var/lib/asterisk/sounds
  • /etc/asterisk
  • /var/spool/asterisk/voicemail

You may wonder why /var/lib/asterisk/sounds is included; this is because SARK stores its "greeting" sound files in this directory.
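The sync rules above can be summarised in a short sketch. This is not the asha sync task itself. The function name sync_from_primary, the rsync options (-a --delete) and the use of pgrep to detect Asterisk are all assumptions, and the SARK SQLite database is left out of the usage note because its path is not given here.

```shell
# Command used to copy data (overridable, e.g. for a dry run or testing).
RSYNC="${RSYNC:-rsync}"

# Succeeds if a process named exactly "asterisk" is running on this node.
asterisk_running() {
    pgrep -x asterisk >/dev/null 2>&1
}

# Hypothetical helper, intended to run on the STANDBY node.
# Usage: sync_from_primary MODE PRIMARY_HOST DIR...
sync_from_primary() {
    local mode="$1" primary="$2" dir
    shift 2

    # LOOSE (or anything other than LAZY): no automatic sync.
    [ "$mode" = "LAZY" ] || return 0

    # Asterisk running here means the standby has taken over; do not
    # overwrite its live data with a copy from the primary.
    if asterisk_running; then
        return 0
    fi

    for dir in "$@"; do
        # Pull each directory from the primary into the same local path.
        "$RSYNC" -a --delete "${primary}:${dir}/" "${dir}/" || return 1
    done
}
```

A call such as sync_from_primary LAZY pbx1 /etc/asterisk /var/lib/asterisk/sounds /var/spool/asterisk/voicemail, with pbx1 standing in for your primary node name, would pull the three directories from the primary. Scheduling it once a minute (for example from cron, which is also an assumption) would give the behaviour described above.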