The Short Version
A “host check” is the out-of-band execution of an existing availability service check used to check host status before sending alert notifications for service failures.
The term “host check” simply refers to the unscheduled execution of a host’s existing availability service check. This availability check—typically added to a host through the “Default” device template and made using the connection method specified in its configuration (ping, check TCP, etc.)—is normally executed on a regular schedule to monitor the availability of that host. But, when some other service check on a given host fails, that failure could potentially be caused by the host actually being down, and not because of a problem with the monitored service itself.
In that case, you don’t want to be alerted about the failed service; you want to be alerted about the downed host. Since service checks don’t necessarily all execute at the same time, it’s possible for the check monitoring that service to fail, open an incident and send out an alert notification before the downed host is detected. This could even happen for multiple individual service checks on that host before the host’s downed condition is detected, resulting in any number of redundant alert notifications.
To prevent this from happening, any time a service check fails, it must first trigger a “host check” (to check the availability of the host) before it’s allowed to open an incident. This determines whether or not the host of a failed service is even reachable on the network. If the host is then determined to be down, all of the resulting alarms from any failed service checks on that host can then be bundled into a single incident (the host down incident) as “related alarms.” This way, only an alert notification about the downed host gets sent out, which simplifies the problem (aiding in root cause detection) and reduces the pressure on support personnel to respond.
If, however, the host is determined to be up, the failed service check is then allowed to open its own incident to report its failure, since in this case you do want to be alerted about the failed service.
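The routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not OmniCenter's actual implementation: the `Incident` class, the `handle_failed_service` function, the `host_is_up` callable (standing in for the host check itself) and the incident keys are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    """Hypothetical incident record: one primary alarm plus related alarms."""
    primary_alarm: str
    related_alarms: list = field(default_factory=list)


def handle_failed_service(host, service, host_is_up, open_incidents):
    """Route a failed service check alarm after running a host check.

    host_is_up is a callable standing in for the host check (assumption).
    open_incidents maps an incident key to an Incident.
    Returns the key of the incident that received the alarm.
    """
    alarm = f"{service} failed on {host}"
    if host_is_up(host):
        # Host is reachable: the service failure opens its own incident.
        key = (host, service)
        open_incidents.setdefault(key, Incident(primary_alarm=alarm))
        return key
    # Host is unreachable: bundle the alarm into the single host-down incident.
    key = (host, "HOST DOWN")
    incident = open_incidents.setdefault(
        key, Incident(primary_alarm=f"{host} is down")
    )
    incident.related_alarms.append(alarm)
    return key
```

With a host that is down, two failed service checks on that host end up in the same incident as related alarms, so only one incident exists to notify about.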
This is called “incident management” in OmniCenter.
Note that a host must already have an availability service check of some kind (ping, check TCP, etc.) configured for a host check to happen, so be sure to configure your device templates appropriately. A service check must then fail to a HARD CRITICAL state (see the entry for “Service Check” for more information about states) to trigger a host check for the host device.
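The two preconditions above amount to a simple gate. The sketch below expresses it in Python; the function name, its parameters and the string values used for the state are assumptions made for illustration and do not reflect how OmniCenter stores check states internally.

```python
def should_trigger_host_check(state_type, state, has_availability_check):
    """Return True when a failed service check should trigger a host check.

    state_type and state are hypothetical string fields, e.g. "HARD" and
    "CRITICAL"; has_availability_check says whether the host has an
    availability service check (ping, check TCP, etc.) configured.
    """
    if not has_availability_check:
        # Without an availability check there is nothing to run out of band.
        return False
    # Only a HARD CRITICAL failure triggers the host check.
    return state_type == "HARD" and state == "CRITICAL"
```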
Even More Explanation
When executing a host check, OmniCenter ignores the alarm configuration of the availability service check used (all host-down alert notifications go to the device’s configured “host alert contacts”). Only the configured connection type and host IP address settings are used. If the host does not respond, it is immediately marked as down. Individual downed hosts are indicated visually in the dashboards with the “HOST DOWN” label.
In the event that a host becomes unreachable due to a cascade of host failures in the network hierarchy, host checks have built-in logic to assist with root cause analysis and prevent excessive host-down alert notifications. When a failed host check generates an alarm, before creating its incident, OmniCenter will immediately perform additional host checks on the immediate parents of the failed host. If the parents also fail their host checks, OmniCenter will continue performing host checks up the hierarchy of parents until an operational host is found. The alarm for the last failed host in the chain of failures (that is, the highest-level unresponsive host in the hierarchy) will become the primary alarm in a newly opened incident. All the other alarms will then be bundled into that incident as related alarms (similar to what happens to failed service check alarms on a single downed host). This allows one incident to manage all the alarms related to the outage and helps to identify the device responsible. It also prevents an excess of redundant host-down alert notifications from being sent.
If an incident already exists for the top-level downed host, the additional alarms will simply be added to that incident as related alarms, with no further actions taken.
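The parent walk and bundling described above can be illustrated with a short Python sketch. It is a simplified model rather than OmniCenter's actual logic: it assumes each host has at most one parent (a real hierarchy may give a host several), and `find_root_failure`, `bundle_into_incident`, `parent_of` and `host_is_up` are hypothetical names introduced for the example.

```python
def find_root_failure(host, parent_of, host_is_up):
    """Walk up the parent chain from a failed host.

    parent_of maps a host to its single parent (simplifying assumption).
    host_is_up is a callable standing in for the host check.
    Returns (highest_failed_host, lower_failed_hosts).
    """
    chain = [host]
    seen = {host}  # guards against a misconfigured parent loop
    parent = parent_of.get(host)
    while parent is not None and parent not in seen and not host_is_up(parent):
        chain.append(parent)
        seen.add(parent)
        parent = parent_of.get(parent)
    return chain[-1], chain[:-1]


def bundle_into_incident(root, related, incidents):
    """Open an incident for the root host, or add to an existing one."""
    incident = incidents.setdefault(root, {"primary": root, "related": []})
    for failed_host in related:
        if failed_host not in incident["related"]:
            incident["related"].append(failed_host)
    return incident
```

For a chain such as server, switch, router, core, where only the core responds, the router becomes the primary alarm and the server and switch alarms are attached as related alarms. A later failure under the same router is added to the existing incident instead of opening a new one.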
Obviously, device parenting must be properly configured within OmniCenter for this system to provide maximum benefit.
Host and service check alarms can be disabled both globally and per-device. Global disabling/enabling is done from the Device Polling & Monitoring page, which is found in the main menu (Administration → Change Devices → Turn Polling & Monitoring On/Off). Per-device disabling/enabling is done on a given device’s Main device administration page (accessible from its Device Dashboard). The HOST & SERVICE ALARMS switch is located in the “Device Details” section.
Action groups for host check alert notifications and device commands are specified on each device in the “Host Alert Contacts” section of the respective device’s Alerting device administration page. Action groups can also be assigned as host alert contacts for devices automatically, through the use of device templates.