Netreo is capable of high availability (HA) operation through the deployment of a cluster of multiple, independent appliances. Although HA capability is technically possible using a cluster of only two Netreo systems, a three-node cluster configuration is highly recommended, both to avoid the quorum issues associated with even-numbered clusters (“split-brain”) and because three-node Netreo clusters are more tolerant of network problems.
A standard Netreo HA setup consists of three deployed Netreo appliances: the master, the arbitrator, and the slave. (A two-node setup omits the arbitrator.) The master is the “main” Netreo appliance that normally provides production services. The slave is the “backup” Netreo appliance that is activated if the master fails. The arbitrator is a third Netreo appliance (with much lower resource requirements than the master and slave appliances) that provides quorum for the cluster in the event of a failure. The arbitrator also helps to reduce the stress of database replication on the master during initial HA data synchronization.
A Netreo HA setup is not the same as other network HA setups you may be familiar with. It is not a group of “equal” nodes sharing common storage and negotiating which node should be in charge and provide production services. The terms master, slave, and arbitrator are used for a reason. Netreo HA is very much focused on the master being the only intended working Netreo system, with the slave acting specifically as a backup to the master.
The intention is that if the master fails, monitoring will be handled by the slave temporarily, and only until the master can be brought back up to resume its role. (This is a manual process that typically requires the assistance of a Netreo support engineer. There is no automatic failback process to reinstate the master as the production server.) The slave should never be in a position in which it is acting as the “main” Netreo system, because it does not synchronize its SQL database with the other nodes. (The SQL database contains all of Netreo’s configuration data.)
The arbitrator, on the other hand, will never act as a Netreo system at all. Its only operational function is to provide third-party arbitration to determine whether a master has actually failed and whether the slave should take over monitoring duties. From this, it should be clear that the HA nodes are not equal. When the master goes down, HA capability is lost.
For the HA cluster nodes to communicate properly, they must be able to connect to each other using the ports in the following table.
|Protocol||Port||Purpose|
|TCP||443||High Availability Communication|
|TCP||48100||Instance Backing File Replication|
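As a rough pre-flight check, the two ports in the table can be probed with a short script before HA is configured. The following is a minimal sketch, not a Netreo tool: the function names are hypothetical, and a successful connection only shows that the port is reachable from the machine running the script, so it is most meaningful when run from a host on the same network segment as the node that will initiate the connection.

```python
import socket

# Ports taken from the table above; the descriptions are for display only.
HA_PORTS = {
    443: "High Availability Communication",
    48100: "Instance Backing File Replication",
}


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ha_ports(host: str, ports=HA_PORTS, timeout: float = 3.0) -> dict:
    """Map each port to True/False depending on whether it accepted a connection."""
    return {port: port_is_open(host, port, timeout) for port in ports}
```

For example, `check_ha_ports("192.0.2.11")` (a placeholder address) returns a dictionary such as `{443: True, 48100: False}`, which would suggest that a firewall between the nodes is blocking the replication port.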
It is important to remember that in order for HA to function properly, all managed devices in the customer environment must allow access from the IP addresses of both the master and the slave systems. Typical protocols to consider here are SNMP, WMI, WinRM and SSH. However, depending upon the configuration of particular environments—and the feature sets in use—that list could be different. Contact Netreo Support for more information.
Netreo High Availability configurations do not provide virtual IP (VIP) functionality. Both the master and the slave systems must have static/permanent IP addresses. In the event of a failover, end users must access the slave system via its own IP address. VIP functionality is possible; however, the setup and configuration of that architecture (e.g., DNS, load balancing) are the responsibility of the customer. Contact Netreo Support for more information about this topic.
All three appliances should be deployed and running, and the master configured for production services, before attempting to configure HA capability. Deploy the master appliance first, according to the standard instructions. Then deploy the arbitrator and slave. The arbitrator should be deployed physically near the master, and the two should share the fastest link practical. The slave may be deployed anywhere required (typically offsite for disaster prevention and recovery), but the network latency between the master and the slave must be no more than 20 ms to ensure reliable operation. To deploy an arbitrator or slave, follow the same installation instructions as for the master until the setup wizard starts. Then complete only the Network Configuration and License Activation sections. Be sure to note the IP address of each appliance; you will need these to initialize HA in the master. Enter the necessary information along with the correct PIN for the arbitrator/slave. Once the appliance has been licensed, you may close the setup wizard.
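The 20 ms latency limit between master and slave can be sanity-checked before committing to a slave location. The sketch below is one possible approach under stated assumptions: it times TCP handshakes to port 443 as a rough stand-in for round-trip latency and compares the median against the limit. The function names, the use of the median, and the sample count are illustrative choices, not something Netreo specifies.

```python
import socket
import statistics
import time

MAX_LATENCY_MS = 20.0  # limit stated in the text for the master-slave link


def tcp_connect_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Time one TCP handshake to host:port in milliseconds (a rough round-trip estimate)."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def measure(host: str, port: int = 443, count: int = 10) -> list:
    """Collect `count` handshake timings against host:port."""
    return [tcp_connect_ms(host, port) for _ in range(count)]


def latency_within_limit(samples_ms, limit_ms: float = MAX_LATENCY_MS) -> bool:
    """Return True if the median of the samples does not exceed the limit."""
    return statistics.median(samples_ms) <= limit_ms
```

Run from a host near the master, `latency_within_limit(measure("192.0.2.12"))` (a placeholder slave address) gives a quick yes/no answer. Handshake timing includes some connection overhead, so a result close to the limit deserves a closer look with proper network tooling.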
Once the master, arbitrator and slave appliances have all been deployed and started, you’re ready to configure HA. In the UI of the master Netreo appliance, navigate to the “High Availability Configuration” page (Administration > System > High Availability). Enter the IP addresses of the arbitrator and slave into the appropriate fields and click the Add button. The HA initialization process (see below) will immediately begin.
An Initialize HA button will also appear, for use when you need to re-initialize HA after it has already been configured (such as after a failover). Click this button after a failback has been performed on the system to restart the HA initialization process.
The HA Initialization Process
When the Add or Initialize HA button is pressed, the master appliance begins synchronizing its MySQL configuration database to the arbitrator. The master appliance is typically providing production services at this point, so initial synchronization is done only with the arbitrator to reduce the load on the master. When the synchronization from master to arbitrator is finished, the arbitrator then synchronizes the database to the slave. Due to the size of the database, and because the slave appliance may be located anywhere, this synchronization can potentially take in excess of 24 hours. While the synchronization is taking place, the “# OF CLUSTER MEMBERS” display on the “High Availability Configuration” page will count up. When it reaches the total number of Netreo appliances in the HA cluster, synchronization is complete and HA is running.
When Netreo is configured for HA you will see a new icon in the icon group at the top right of Netreo. This icon displays the current HA status of the appliance. It can be seen in the UI of both the master and the slave (arbitrators are not operational Netreo systems and have no UI to display states). The following tables show the icons and respective states for the master and slave appliances. Some states apply only to the master or the slave.
Master HA Link Status
|Icon Color||State||Description|
|Green||ACTIVE||Master is polling and its MySQL database is correctly synchronizing with the slave.|
|Gray||INACTIVE||(Two-node HA setup only.) SQL synchronization failure, probably causing split brain. Master is still polling but cannot sync with the slave, which is likely now also polling.|
|Red||FAILED||Master has failed, causing it to stop polling. It will remain in the FAILED HA state until HA is re-initialized.|
Slave HA Link Status
|Icon Color||State||Description|
|Green||ACTIVE||Slave is passively running and its SQL database is correctly being synchronized with the master.|
|Gray||INACTIVE||Slave is licensed for HA but hasn’t been configured, or HA has been stopped from the master’s UI.|
|Red||FAILED||(Two-node HA setup only.) SQL synchronization failure, probably causing split brain. Slave is polling. Master may also still be polling.|
|Red||TAKEOVER||Master has (at some point) failed. Slave is now polling, authorized by the arbitrator. There is at least a 60-second delay while the slave confirms the loss of the master before it takes over polling.|
While the HA icon is green, updates to the SQL database of the master are replicated synchronously to the slave and the arbitrator (keeping the configuration of the slave identical to that of the master). Updates to historical data collected by the master are written asynchronously to the RRD files of both the master and the slave. No RRDs are replicated to the slave during initialization, so historical data on the slave begins at the point that HA becomes fully operational. The basic difference between synchronous and asynchronous replication is that synchronous replication guarantees that if a change happened on one node of the cluster, it happened on all other nodes at the same time. The consequence is that if the connection between the master and the other nodes (typically the slave) is slow, the performance of the master will be adversely affected, because it waits for the write operation to finish on the other appliances before continuing. This applies to replication of the MySQL database only; no other data is replicated synchronously. SQL updates to the arbitrator are purely for backup purposes.
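The cost of synchronous replication over a slow link can be illustrated with a deliberately simplified model. In this sketch a synchronous write finishes only after the slowest replica has confirmed it, while an asynchronous write returns as soon as the local write is done. The function names and timings are hypothetical and are not measurements of a Netreo system.

```python
def sync_write_ms(local_ms: float, replica_rtts_ms: list) -> float:
    """Synchronous replication: the write completes only after the slowest replica confirms."""
    return local_ms + max(replica_rtts_ms, default=0.0)


def async_write_ms(local_ms: float, replica_rtts_ms: list) -> float:
    """Asynchronous replication: the writer does not wait for any replica."""
    return local_ms


# Hypothetical figures: 2 ms local write, arbitrator 1 ms away, slave 18 ms away.
print(sync_write_ms(2.0, [1.0, 18.0]))   # 20.0
print(async_write_ms(2.0, [1.0, 18.0]))  # 2.0
```

In this model the nearby arbitrator adds almost nothing, while the distant slave dominates every configuration write on the master, which is why the latency of the master-to-slave link matters.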
Traffic Flow Monitoring
If you are using Netreo in an HA configuration and you wish to collect traffic flow statistics (NetFlow, etc.), this must be done using a service engine running the Netreo Traffic Collector service. Additionally, the deployed service engine must not be within the HA cluster.
If you are using Netreo in an HA configuration and you wish to collect log statistics (syslog, event logs, etc.), this must be done using a service engine running the Netreo Log Collector service. As with traffic flow monitoring, the deployed service engine must not be within the HA cluster.
While HA is operational, the master continuously synchronizes its configuration settings to the slave so that the slave is ready to take over production services immediately should the master fail (called a failover). When the master fails, the slave immediately checks whether the master is still a member of the HA cluster. If not, the slave waits 60 seconds and then checks again. If the master is still not present, the slave checks for quorum with the arbitrator. If quorum exists, the slave enters the TAKEOVER state and takes over production services. Once a failover has occurred, the master will not resume control of production services, even if it rejoins the cluster. HA must be re-initialized manually (repair the issue, bring up the master, and click the Initialize HA button). Note: This runs the entire HA initialization process over again, so expect it to take at least as long as it did the first time.
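The failover sequence described above (check, wait 60 seconds, check again, then confirm quorum) can be written as a small decision function. This is an illustrative sketch of the described logic only, not Netreo’s implementation: `master_in_cluster` and `has_quorum` are hypothetical callables standing in for the slave’s internal checks, and `sleep` is injectable so the logic can be exercised without waiting.

```python
import time

CONFIRM_DELAY_S = 60  # wait between the first and second membership checks


def should_take_over(master_in_cluster, has_quorum,
                     sleep=time.sleep, delay_s=CONFIRM_DELAY_S) -> bool:
    """Return True if the slave should enter TAKEOVER, following the sequence in the text.

    master_in_cluster and has_quorum are hypothetical callables returning bool.
    """
    if master_in_cluster():
        return False          # master still present, nothing to do
    sleep(delay_s)            # give the master time to reappear
    if master_in_cluster():
        return False          # transient problem, no failover
    # Master is confirmed missing: take over only if the arbitrator provides quorum.
    return bool(has_quorum())
```

Calling `should_take_over(lambda: False, lambda: True, sleep=lambda s: None)` returns `True`, while the same call with `has_quorum` returning `False` returns `False`, matching the case in which the slave is the only remaining member of the cluster.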
While the master is in the ACTIVE state (above), a failure of the slave won’t affect monitoring, but will cause HA capability to be lost (since the slave is the backup appliance). A failure of the arbitrator won’t affect anything until the master also fails, at which point there will be a total HA failure, since the master will have failed and the slave will not attempt to take over if it is the only member of the cluster.
If, during a failure, the master is not available on the network (link down, crash, etc.), the slave will not be able to write updates to the historical data files of the master. This means that the master’s historical data will typically have a gap covering the entire time it was down. However, if the master becomes available on the network again, the slave will try to write updates to the master’s RRDs. The master will not, however, attempt to resume control of production services.