OmniCenter provides high availability (HA) through the deployment of a cluster of multiple, independent appliances. Although HA is technically possible using a cluster of only two OmniCenters, a three-node cluster configuration is strongly recommended, both to avoid the quorum issues associated with even-numbered clusters (“split-brain”) and because three-node OmniCenter clusters are more tolerant of network problems.
A standard OmniCenter HA setup consists of three deployed OmniCenter appliances: the master, the arbitrator, and the slave. (A two-node setup omits the arbitrator.) The master is the “main” OmniCenter appliance that normally provides production services. The slave is the “backup” OmniCenter appliance, which is activated in the event that the master fails. The arbitrator is a third OmniCenter appliance (with much lower resource requirements than the master and slave) that provides quorum for the cluster in the event of a failure. The arbitrator also helps reduce the stress of database replication on the master during initial HA data synchronization.
An OmniCenter HA setup is not the same as other network HA setups you may be familiar with. It is not a group of “equal” nodes sharing common storage and negotiating which node should be in charge and provide production services. The terms master, slave, and arbitrator are used for a reason. OmniCenter HA is built around the master being the only intended working OmniCenter, with the slave acting specifically as a backup to the master. The intention is that if the master fails, monitoring is handled by the slave temporarily, but only until the master can be brought back up to resume its role. (This is a manual process that typically requires the assistance of a Netreo support engineer; there is no automatic failback process to reinstate the master as the production server.) The slave should never remain in a position in which it is acting as the “main” OmniCenter, because it does not synchronize its SQL database to the other nodes. (The SQL database contains all of OmniCenter’s configuration data.) The arbitrator, on the other hand, never acts as an OmniCenter at all. Its only operational function is to provide third-party arbitration to determine whether the master has actually failed and whether the slave should take over monitoring duties. It should be clear, then, that the HA nodes are not equal: when the master goes down, HA capability is lost.
In order for the HA cluster nodes to communicate properly, they must be able to connect to each other using the ports in the following table.
| Protocol | Port | Purpose |
| --- | --- | --- |
| TCP | 443 | High Availability Communication |
| TCP | 48100 | Instance Backing File Replication |
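As a quick pre-flight check, the reachability of these ports between nodes can be verified with a short script. This is a sketch, not Netreo tooling, and the node addresses shown are placeholders; substitute the actual IPs of your own appliances:

```python
import socket

# Ports required between OmniCenter HA cluster nodes (per the table above)
HA_PORTS = [443, 48100]

def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def scan_cluster(nodes: list[str]) -> dict[str, bool]:
    """Check every required port on every node; True means all ports reachable."""
    return {node: all(check_port(node, p) for p in HA_PORTS) for node in nodes}

# Example with placeholder addresses (master, arbitrator, slave):
# scan_cluster(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
```

Run the scan from each node in turn, since a firewall may permit traffic in one direction but not the other.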
It is important to remember that, in order for HA to function properly, all managed devices in the customer environment must allow access from the IP addresses of both the master and the slave systems. Typical protocols to consider here are SNMP, WMI, WinRM, and SSH. However, depending upon the configuration of a particular environment, and the feature sets in use, the list may differ. Contact Netreo Support for more information.
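For example, on a Linux device monitored via SNMP with net-snmp, the access list in snmpd.conf would need an entry for each HA node. The community string and addresses below are placeholders for illustration only:

```
# /etc/snmp/snmpd.conf -- allow read-only SNMP access from both HA nodes
rocommunity mycommunity 10.0.0.11   # OmniCenter master (placeholder IP)
rocommunity mycommunity 10.0.0.13   # OmniCenter slave (placeholder IP)
```

Equivalent allowances are needed for whatever other protocols (WMI, WinRM, SSH, etc.) your devices use.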
OmniCenter High Availability configurations do not provide virtual IP (VIP) functionality. Both the master and the slave systems must have static/permanent IP addresses, and in the event of a failover, end users must access the slave system via its actual IP address. VIP-style functionality is possible; however, the setup and configuration of that architecture (e.g., DNS, load balancing) are the responsibility of the customer. Contact Netreo Support for more information about this topic.
All three appliances should be deployed and running, and the master configured for production services, before attempting to configure HA capability. Deploy the master appliance first, according to the standard instructions. Then deploy the arbitrator and slave. The arbitrator should be deployed physically near the master, and the two should share the fastest link practical. The slave may be deployed anywhere required (typically offsite for disaster prevention and recovery), but the network latency between the master and the slave must be no more than 20 ms to ensure reliable operation. To deploy an arbitrator or slave, follow the same installation instructions above until the setup wizard starts, then complete only the Network Configuration and License Activation sections (be sure to note the IP address of each appliance; you will need these to initialize HA on the master). Enter the necessary information along with the correct PIN for the arbitrator/slave. Once the appliance has been licensed, you may close the setup wizard. Once all three appliances have been deployed, you’re ready to configure HA on the master.
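The 20 ms requirement can be sanity-checked before committing to a slave location. The sketch below estimates round-trip time by timing a TCP handshake to the slave’s HTTPS port; it is an illustration, not Netreo tooling, and the slave address shown is a placeholder:

```python
import socket
import statistics
import time

LATENCY_LIMIT_MS = 20.0  # master-to-slave requirement stated above

def tcp_rtt_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Rough RTT estimate: time to complete a TCP handshake to host:port."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0

def link_qualifies(samples_ms: list[float]) -> bool:
    """True if the median of the measured samples is within the 20 ms limit."""
    return statistics.median(samples_ms) <= LATENCY_LIMIT_MS

# Example (placeholder slave address):
# samples = [tcp_rtt_ms("10.0.0.13") for _ in range(5)]
# print(link_qualifies(samples))
```

The median is used rather than a single sample so that one outlier does not disqualify an otherwise healthy link.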
Once the master, arbitrator and slave appliances have all been deployed and started, you’re ready to configure HA. In the UI of the master OmniCenter appliance, navigate to the “High Availability Configuration” page (Administration → System → High Availability). Enter the IP addresses of the arbitrator and slave into the appropriate fields and click the Add button. The HA initialization process (see below) will immediately begin.
An Initialize HA button will also appear, in the event that you need to re-initialize HA after it has already been configured (such as after a failover). Click this button after a failback has been performed on the system to restart the HA initialization process.
The HA Initialization Process
When the Add or Initialize HA buttons are pressed, the master appliance will begin synchronizing its MySQL configuration database to the arbitrator. The master appliance is typically providing production services at this point, so initial synchronization is done only with the arbitrator to reduce the load on the master. When the synchronization from master to arbitrator is finished, the arbitrator will then synchronize the database to the slave. Due to the size of the database, and because the slave appliance may be located anywhere, this synchronization can potentially take in excess of 24 hours. While the synchronization is taking place, the “# OF CLUSTER MEMBERS” display on the “High Availability Configuration” page will begin counting up. When it reaches the total number of OmniCenter appliances in the HA cluster, synchronization is complete. At that point, HA is running.
When OmniCenter is configured for HA you will see a new icon in the icon group at the top right of OmniCenter. This icon displays the current HA status of the appliance. It can be seen in the UI of both the master and the slave (arbitrators are not operational OmniCenters and have no UI to display states). The following tables show the icons and respective states for the master and slave appliances. Some states apply only to the master or the slave.
Master HA Link Status
Slave HA Link Status
While the HA icon is green, updates to the master’s SQL database are replicated synchronously to the slave and the arbitrator (keeping the configuration of the slave identical to that of the master). Updates to historical data collected by the master are written asynchronously to the RRD files of both the master and the slave. No RRDs are replicated to the slave during initialization, so historical data on the slave begins at the point that HA becomes fully operational. The basic difference between synchronous and asynchronous replication is that synchronous replication guarantees that if a change happened on one node of the cluster, it happened on all other nodes at the same time. The consequence of this is that, if the connection between the master and the other nodes (typically the slave) is slow, the performance of the master will be adversely affected, as it waits for each write operation to finish on the other appliances before continuing. This condition applies to the replication of the MySQL database only, as no other data is replicated synchronously (historical data updates are written to the slave’s RRDs asynchronously). SQL updates to the arbitrator are purely for backup purposes.
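The performance consequence can be sketched with a toy model (not OmniCenter code): a synchronous commit is bounded by the slowest acknowledgement in the cluster, while an asynchronous write completes at local speed.

```python
def sync_commit_ms(replica_ack_ms: list[float]) -> float:
    """Synchronous: the write completes only after every node has acknowledged,
    so commit time is the slowest node's acknowledgement time."""
    return max(replica_ack_ms)

def async_commit_ms(local_write_ms: float) -> float:
    """Asynchronous: the write completes locally; replicas catch up later."""
    return local_write_ms

# With a fast LAN link to the arbitrator (2 ms) but a slow WAN link to the
# slave (120 ms), every synchronous SQL commit costs ~120 ms on the master,
# while an asynchronous RRD update still completes at local-disk speed.
```

This is why the master/arbitrator link should be as fast as practical and why the 20 ms master/slave latency limit matters.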
Traffic Flow Monitoring
If you are using OmniCenter in an HA configuration and you wish to collect traffic flow statistics (NetFlow, J-Flow, etc.), this must be done using a service engine running the OmniCenter Traffic Collector service. Additionally, the deployed service engine must not be within the HA cluster.
If you are using OmniCenter in an HA configuration and you wish to collect log statistics (syslog, event logs, etc.), this must be done using a service engine running the OmniCenter Log Collector service. As with traffic flow monitoring, the deployed service engine must not be within the HA cluster.
While HA is operational, the master continuously synchronizes its configuration settings to the slave so that the slave is ready to take over production services immediately (called a failover), should the master fail. A failure of the master will immediately cause the slave to check whether the master is still a member of the HA cluster. If not, the slave will wait 60 seconds and then check again. If the master is still not present, the slave will check for quorum with the arbitrator. If quorum exists, the slave enters the TAKEOVER state and takes over production services. Once a failover has occurred, even if the master rejoins the cluster, it will not resume control of production services. HA must be reinitialized manually (repair the issue, bring up the master, and click the Initialize HA button). Note: This runs the entire HA initialization process over again, so however long it took the first time, expect initialization to take at least that long now.
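The takeover decision described above can be sketched as follows. This is an illustrative reconstruction from this description, not Netreo source code; the callables and the STANDBY/NO_QUORUM state names are hypothetical (only TAKEOVER comes from the text):

```python
import time

def failover_decision(master_in_cluster, quorum_with_arbitrator, sleep=time.sleep):
    """Sketch of the slave's takeover logic.

    master_in_cluster and quorum_with_arbitrator are zero-argument callables
    so the checks can be re-run; sleep is injectable for testing.
    """
    if master_in_cluster():
        return "STANDBY"       # master still present; nothing to do
    sleep(60)                  # wait 60 seconds, then check again
    if master_in_cluster():
        return "STANDBY"       # master reappeared during the grace period
    if quorum_with_arbitrator():
        return "TAKEOVER"      # quorum exists: slave takes over production
    return "NO_QUORUM"         # slave is alone in the cluster: no takeover
```

The final branch is what makes the arbitrator essential: without it, a slave that has merely lost contact with the master cannot distinguish a master failure from its own network isolation.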
While the master is in the ACTIVE state (above), a failure of the slave won’t affect monitoring, but it will cause HA capability to be lost (since the slave is the backup appliance). A failure of the arbitrator won’t affect anything until a failure of the master also occurs, at which point there will be a total HA failure, since the master will have failed and the slave will not attempt to take over if it is the only member of the cluster.
If, during a failure, the master is not available on the network (link down, crash, etc.), the slave will not be able to write updates to the historical data files of the master. This means that if a failure occurs, there will typically be a gap in the master’s historical data for the entire time it was down. However, if the master again becomes available on the network, the slave will resume writing updates to the master’s RRDs. The master will not, however, attempt to resume control of production services.