High Availability for SARK

From sailpbx
Revision as of 21:23, 9 July 2011 by Adminwiki

History

HA was originally developed for a customer in South Africa and has since proved very popular with customers who place a high value on the availability of their PBX. The goal was, and is, to provide an Asterisk-based PBX with very near 100% availability. SARK HA gets pretty close to this goal. The original SARK Version 2 offering is called sarkha V1, but as part of the V3 development it has been augmented by a new general-purpose Asterisk module called asha (Asterisk High Availability).

How it works

Asha uses proven, standard Linux HA cluster technology to run two Asterisk servers side by side: one active (primary) server and one passive (standby) server. If either Asterisk or Linux fails on the active server, the standby server automatically assumes control of all the resources and continues the service until the primary is once again ready to assume responsibility. The time it takes to fail over varies with the Heartbeat parameters you set, but it is quite common for large systems (250 to 300 phones) to be set up to fail over in under 20 seconds. That is measured from a hard down on the primary to being able to place outbound calls on the standby.
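To illustrate the Heartbeat parameters that influence fail-over time, here is a minimal sketch of the timing section of a Heartbeat ha.cf file. The directive names are standard Heartbeat ones, but the values are illustrative assumptions only, not SARK or asha defaults, and asha may well generate or manage this file for you.

```
# Illustrative values only (assumptions, not SARK defaults)
keepalive 2     # seconds between heartbeat packets
warntime 8      # seconds of silence before a "late heartbeat" warning is logged
deadtime 15     # seconds of silence before the peer is declared dead
initdead 60     # deadtime allowance used just after boot, while the network comes up
```

A shorter deadtime gives a faster fail-over but raises the risk of a false fail-over on a busy or lossy network, so it is worth testing any change under load.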

There are three components to the solution:

  • The regular Linux Heartbeat modules, which you can install with rpm or yum and can usually be found in the extras repo of the CentOS 5 and SME 8.0 distribution libraries.
  • The asha rpm, which manages the synchronisation of data between nodes and adds some ancillary functions such as the Asterisk watchdog task.
  • The sarkha V3 environment rpms which manage the SARK specific integration with asha.

Installation

Installation is straightforward in an EL5 environment (RHEL5, C5 or SME8.0). Heartbeat can be installed directly from the yum repos and the other modules can be fetched from the download site.

Here is a regular EL5 install:

yum install heartbeat --enablerepo=*
rpm -Uvh asha-1.0.0-2.noarch.rpm
rpm -Uvh el5sarkha-3.1.0-2.noarch.rpm
reboot

Here is an smeserver 8 install:

yum install heartbeat --enablerepo=*
rpm -Uvh asha-1.0.0-2.noarch.rpm
rpm -Uvh smesarkha-3.1.0-2.noarch.rpm
signal-event post-upgrade
signal-event reboot

Configuration

Once installed, you will find some new options and buttons on the SARK global settings panel.

[Image: SarkhaV3 globals.png]

As you can see, there is a new tab, "High Availability", and an additional Start/Stop button. The extra Start/Stop button controls the high-availability engine (Heartbeat). You can decide whether to enable it or not from Globals. You also have some new variables to fill out before you can start your HA image. Be aware that you will need to fill this information out (at least initially) on BOTH systems (primary and standby).

  • HA Synch Mode (LAZY|LOOSE)

Dictates whether or not automatic synchronisation will occur between the two nodes. LAZY means that the nodes will sync; LOOSE means that they will not.

  • HA IP Address

The Virtual IP address used by Heartbeat to hand control back and forth between the nodes. This can be any free static IP address on your subnet.

  • HA Primary Node (uname -n)

Heartbeat uses node names to find its partners. The primary node is the name given by uname -n on the primary node.

  • HA Failover Node (uname -n)

Heartbeat uses node names to find its partners. The failover node is the name given by uname -n on the standby node.

  • Provision with Cluster IP? (YES/NO)

This tells the provisioning subsystem whether to provision the phones to register to this server's "real" IP address or to the cluster's virtual IP address. It is usual to set this to YES when the cluster is active.

  • HA Auto-Failback? (off/on)

This tells the HA component whether or not to automatically fail back to the primary after an initial failover to the standby. Most users run with this off.
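Because Heartbeat matches nodes by the exact output of uname -n, it can be worth checking the two node names on each machine before starting the HA engine. The following is a minimal sketch only: ha_role is a hypothetical helper, not part of asha or sarkha, and pbx1.example.com and pbx2.example.com are placeholder names standing in for whatever you entered in the HA Primary Node and HA Failover Node fields.

```shell
#!/bin/sh
# Hypothetical helper (not part of asha or sarkha): report which HA role
# this machine would hold, given the two node names entered in Globals.
ha_role() {
    primary="$1"     # value of "HA Primary Node"
    failover="$2"    # value of "HA Failover Node"
    self="$(uname -n)"
    if [ "$self" = "$primary" ]; then
        echo "primary"
    elif [ "$self" = "$failover" ]; then
        echo "failover"
    else
        # Neither name matches this machine's uname -n exactly
        echo "unknown"
    fi
}

# Placeholder names; substitute the values you entered on the panel.
echo "This node ($(uname -n)) is: $(ha_role pbx1.example.com pbx2.example.com)"
```

If the result is "unknown" on either machine, the names entered in Globals do not match that machine's uname -n output exactly.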

Initial Start Up

The initial system state should be with Asterisk and the HA engine turned off on BOTH nodes. The sequence is as follows:

  • Start the HA engine on the primary node by clicking the HA engine start button ONLY. After a few moments you should see this screen...