OmniCenter allows you to monitor the status of individual processes or resources (applications, interfaces, etc.) running on a host through the use of “service checks.” (Only administrators may add service checks to devices.)
A “service check” is an application-level check that queries its process or resource according to its configured schedule (the default is every 3 minutes). Each type of process or resource that you want to monitor will have its own specific service check. Service checks are categorized in OmniCenter according to their nature (web checks, system checks, interface checks, etc.). Service checks are also classified as either active or passive. Active checks create their own processes in memory while they do their work and follow their own schedule, whereas passive checks wait for some other process to update them (usually an active service check or some other active process). This means that passive service checks update according to the schedule of the process that updates them.
The response code provided to the service check dictates its status and the behavior it will follow. If the response is a success code, the check maintains an “OK” state and continues to run according to its configured schedule (again, every 3 minutes by default). If the response is a failure code, the service check enters what is called a “SOFT CRITICAL” state. While in a SOFT CRITICAL state, the service check will continue to retry its query until it reaches a specified number of failures (default is 3), at which point it enters what is called a “HARD CRITICAL” state and generates an alarm. (The number of failures and the check/retry schedule are adjustable in the check’s configuration options.) If, while still in a SOFT CRITICAL state, the check again receives a success code, it will return to an OK state. The purpose of this arrangement is to prevent service checks from generating alarms at the slightest temporary glitch.
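The state transitions described above can be sketched as a small state machine. This is only an illustrative model, not OmniCenter's implementation: the `ServiceCheck` class, its field names, and the `record` method are hypothetical, and the sketch assumes that a success code always returns the check to OK.

```python
from dataclasses import dataclass

OK = "OK"
SOFT_CRITICAL = "SOFT CRITICAL"
HARD_CRITICAL = "HARD CRITICAL"


@dataclass
class ServiceCheck:
    """Hypothetical model of a service check's OK/SOFT/HARD behavior."""
    max_failures: int = 3   # default number of failures, per the text
    state: str = OK
    failures: int = 0       # consecutive failure codes seen so far
    alarms: int = 0         # alarms generated so far

    def record(self, success: bool) -> str:
        """Apply one query result and return the resulting state."""
        if success:
            # A success code returns the check to OK and resets the count.
            self.state = OK
            self.failures = 0
            return self.state
        self.failures += 1
        if self.failures >= self.max_failures:
            # Generate the alarm only on the transition into HARD CRITICAL.
            if self.state != HARD_CRITICAL:
                self.alarms += 1
            self.state = HARD_CRITICAL
        else:
            self.state = SOFT_CRITICAL
        return self.state


check = ServiceCheck()
results = [False, True, False, False, False, False]
states = [check.record(r) for r in results]
print(states, check.alarms)
```

In this run, the single early failure is absorbed by the SOFT CRITICAL state, and only the later run of three consecutive failures produces an alarm.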
Another difference between active and passive service checks is the method they use to determine how long a failed check will wait before transitioning from a SOFT CRITICAL state to a HARD CRITICAL state. Active service checks create their own scheduling, and use a configurable timer measured in minutes. Passive service checks—since they have no control over their timing and must wait to be updated—use a configurable “number of exceptions” (failures). See the section below on configuration options for more information on how the various settings work.
Service checks are entirely dependent on the status of their host. That is to say, if a host goes down, no alarms will be generated for any of its service checks, since the services for that host would obviously not be available and the host being down would be the first priority.
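As a rough illustration of this dependency, the following sketch suppresses all service check alarms when the host is down. The function name, the dictionary layout, and the check names are hypothetical and chosen only for the example.

```python
def alarms_to_generate(host_is_up, check_states):
    """Return the names of service checks that should raise an alarm.

    check_states maps a check name to its current state string.
    """
    if not host_is_up:
        # Host is down: no service check alarms are generated for it.
        return []
    return sorted(
        name for name, state in check_states.items()
        if state == "HARD CRITICAL"
    )


states = {"TCP port 80 check": "HARD CRITICAL", "TCP port 110 check": "OK"}
print(alarms_to_generate(True, states))
print(alarms_to_generate(False, states))
```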
The status of the various OmniCenter checks is a core element of many of OmniCenter’s dashboard displays. Status is typically indicated by color or icon design (depending upon context and OmniCenter version). The following table outlines the states and meanings of the colors for service checks.
| Color | State | Meaning |
|---|---|---|
| green | OK | Indicates that the check query has successfully returned a code that is within normal operating parameters. |
| yellow | WARNING | Very rare. Indicates that the check query has returned a warning code, likely generating an alarm. |
| red | CRITICAL | Indicates that the check query has returned a value that is far outside of normal operating parameters, requiring attention. Alarm generated. |
| orange | UNKNOWN | Indicates that the check query has returned a value that it cannot understand. Likely due to a user configuration error in the check. |
| blue | ACKNOWLEDGED | A failed check has become CRITICAL and generated an alarm, which has opened an incident. That incident has since been acknowledged by a user as currently being worked on. This state is technically an incident state, not a check state, but its display in the dashboards helps to differentiate between problems that are new and problems that are already being addressed. |
Administrators can switch service check functionality on/off for devices individually or in groups. (See the entry for “Device Administration” for more information.) Additionally, if the “HOST & SERVICE ALARMS” and “POLL DEVICE” settings are both turned off for a managed device, it will no longer be considered monitored, and will stop consuming a license slot in OmniCenter. All history for the device—up to that point—will continue to be stored for the normal length of time. If a device then has those settings switched back on, it will again consume a license slot, and OmniCenter will automatically resume monitoring it. There will, however, be a gap in the device’s history equivalent to the time that it was unmonitored.
Service checks can only be configured by an administrator. They are configured in either the device administration pages (for checks added directly to devices), or the device template in which they are included (the recommended way to add service checks). See the entries for those items for further information.
The configuration options for service checks break down into three basic parts.
- Any required configuration of the query that the check performs.
- Alarm configuration and timing.
- Selection of action groups that get assigned to incidents opened as a result of an alarm from the check.
Once a service check is added to a host (by any of the above methods), it may take several minutes to show up on the dashboards and show data, as the check must be added to OmniCenter’s schedule before it gets run for the first time.
The configuration of the myriad check-specific commands available in an OmniCenter service check is currently beyond the scope of this wiki. If an entry for the individual check is available, any query configuration options will be covered there. However, one field in this part of the configuration options is universal to all service checks and will be discussed here: the “DESCRIPTION” field.
The DESCRIPTION field is for the administrator configuring the check to enter a unique name for the service check. Unlike other OmniCenter check types, it is possible to assign more than one service check of the same type to a single device (this is particularly common when using device templates). In order for OmniCenter to be able to distinguish between one service check and another of the same type when sending alert notifications (which always include the check name), it needs a way to uniquely identify each check on each device. Best practice here is to enter a descriptive name that indicates what the check is doing, along with any specifics of what it is checking. As an extremely basic example, suppose you have two TCP port checks on the same device, one checking port 80 and the other checking port 110. Best practice would be to name the first check “TCP port 80 check” and the other check “TCP port 110 check”. This way each check will be clearly identifiable in both alert notifications and incidents. There are no issues with assigning identical check names to service checks on different devices.
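The uniqueness rule can be checked with a few lines of code. The sketch below is not an OmniCenter feature: the function name, the device names, and the (device, description) pair layout are assumptions made for illustration. It reports descriptions that are reused on the same device and ignores reuse across different devices.

```python
from collections import Counter


def duplicate_descriptions(checks):
    """Return (device, description) pairs that occur more than once.

    checks is a list of (device, description) tuples.
    """
    counts = Counter(checks)
    return sorted(pair for pair, n in counts.items() if n > 1)


checks = [
    ("web01", "TCP port 80 check"),
    ("web01", "TCP port 110 check"),
    ("web02", "TCP port 80 check"),   # same name on another device: fine
    ("web01", "TCP port 80 check"),   # same name on the same device: clash
]
print(duplicate_descriptions(checks))
```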
As briefly mentioned above, unique service check names are particularly important if you intend to override service checks applied using device templates. A service check from one device template will only override a service check from another device template if the DESCRIPTION field matches exactly. So, be aware of this when configuring your device templates.
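The exact-match rule can be pictured as a dictionary keyed by the DESCRIPTION text. This is a minimal sketch under stated assumptions: the function name and the settings dictionaries are hypothetical, and the order in which OmniCenter applies templates is not described here, so the sketch simply lets later templates in the list win.

```python
def apply_templates(templates):
    """Merge template checks, keyed by their exact DESCRIPTION text.

    templates is a list of dicts mapping DESCRIPTION -> settings.
    Assumption: later templates in the list take precedence.
    """
    effective = {}
    for template in templates:
        for description, settings in template.items():
            # Only an exact key match replaces an earlier check.
            effective[description] = settings
    return effective


base = {"TCP port 80 check": {"template": "base"}}
site = {
    "TCP port 80 check": {"template": "site"},   # exact match: overrides
    "tcp port 80 check": {"template": "site"},   # differs in case: new check
}
effective = apply_templates([base, site])
print(sorted(effective))
```

Here the differently cased description does not override anything and ends up as a second, separate check on the device.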
Alarm Configuration Options
A failing service check will always eventually generate an alarm. This section allows you to control the timing of when that happens and the timing of the execution of assigned action groups. The fields and their meanings are explained below.
ALERT AFTER
This field controls how quickly an alarm will be generated after an initial failure. Do you want the check to generate an alarm at the first inkling of trouble (likely meaning more alert notifications), or do you want to be sure of trouble before generating an alarm (likely meaning fewer alert notifications)? For active service checks, this field shows a number of minutes. The check query will have to return a failure code for the specified number of minutes before an alarm is generated. Higher settings mean the check will have to fail for longer, but an alarm is then more likely to indicate a genuine failure. Lower settings will allow personnel to respond more quickly, but there will be a greater number of false alarms that were really only momentary glitches. Active service checks also have the option of selecting “Custom”, which allows you to fine-tune individual timing settings (see “Custom Alarm Timing” below). For passive service checks, since the check must wait to be updated by another process, this field shows a number of exceptions (failures). If the query returns a failure code this many consecutive times, an alarm is generated.
RENOTIFICATION INTERVAL
This field specifies the number of minutes that an incident, once opened, will wait to run its action groups again (see “Action Group Selection” below). The incident will continue to run its assigned action groups at this interval until the incident is either acknowledged or the alarms have cleared. The term “renotification” is a legacy term from when the only actions that incidents ran were to send alert notifications. For OmniCenter version 9 and up, the action groups selected in the ACTION GROUP field are run when an incident is first opened by an alarm from this check, and at each RENOTIFICATION INTERVAL. Setting a value of zero (0) will disable renotification.
ESCALATE AFTER
This field specifies the number of times the RENOTIFICATION INTERVAL must pass before the action groups selected in the ESCALATION GROUP field (see “Action Group Selection” below) are run. Once the escalation action groups are run, they also run every time the RENOTIFICATION INTERVAL passes, along with the ACTION GROUP action groups. This term, like “renotification,” also refers to the legacy system, in which only alert notifications were sent when incidents were opened. The intention was that alerts would initially be sent to first-response personnel for action, but that if the response took too long, alerts would also be sent to a group of escalation personnel (e.g. managers) to determine why action hadn’t been taken for the problem.
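To make the interaction between renotification and escalation concrete, the sketch below lists when each set of action groups would run for an incident that is never acknowledged and never clears. It is one reading of the description above, not OmniCenter's scheduler: the function and parameter names are hypothetical, and it assumes the escalation groups first run on the pass that reaches the escalation count.

```python
def action_schedule(renotification_interval, escalate_after, horizon_minutes):
    """List (minute, groups) runs for an unacknowledged incident.

    renotification_interval: minutes between runs (0 disables renotification)
    escalate_after: number of intervals that must pass before escalation
    horizon_minutes: how far ahead to list runs
    """
    # The ACTION GROUP selection runs when the incident is first opened.
    events = [(0, ["ACTION GROUP"])]
    if renotification_interval <= 0:
        return events
    intervals_passed = 0
    minute = renotification_interval
    while minute <= horizon_minutes:
        intervals_passed += 1
        groups = ["ACTION GROUP"]
        if intervals_passed >= escalate_after:
            # From this pass onward, escalation groups run as well.
            groups.append("ESCALATION GROUP")
        events.append((minute, groups))
        minute += renotification_interval
    return events


for minute, groups in action_schedule(30, 2, 90):
    print(minute, groups)
```

With a 30-minute interval and an escalation count of 2, the escalation groups join in at minute 60 and at every interval after that.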
By default (the “5 Minutes” ALERT AFTER selection), OmniCenter will execute a service check query every 3 minutes, until a failure is detected. Once a failure is detected, it will recheck the query two more times at 1-minute intervals—leading to a worst-case alarm response time of five minutes. Although you certainly can use the “Custom” selection for this field, it’s highly recommended that you do not do so without a very specific reason. The selection of choices available for the ALERT AFTER field should be adequate for most situations. However, if you choose to use “Custom”, the settings are explained below.
Custom Alarm Timing
When selecting the “Custom” option in the ALERT AFTER field above, the ADVANCED OPTIONS panel will be displayed at the bottom of the CONFIGURATION OPTIONS panel. These settings allow you to customize the timing of alarm generation for that particular service check. The fields and their meanings are explained below.
CHECK INTERVAL
This field defines how often (in minutes) this service check will be executed under normal circumstances. After every successful query, OmniCenter will wait this interval before it executes the query again. There is a significant performance consideration for this field: if you’re executing 10,000 service checks at 1-minute intervals, OmniCenter will have to execute about 167 checks per second, adding significant network traffic and system load. Use common sense, and try to select a reasonable interval. OmniCenter will try to spread the queries out so that they don’t all run at the same time, but you can still overwhelm your network by overdoing the number of configured service checks.
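The load figure above is simple arithmetic: the number of checks divided by the interval in seconds. The helper below is a hypothetical illustration that assumes the checks are spread evenly across the interval.

```python
def checks_per_second(total_checks, check_interval_minutes):
    """Average query rate if checks are spread evenly over the interval."""
    return total_checks / (check_interval_minutes * 60)


# 10,000 checks at a 1-minute interval versus the default 3-minute interval.
print(round(checks_per_second(10_000, 1)))
print(round(checks_per_second(10_000, 3)))
```

Moving the same 10,000 checks from a 1-minute to a 3-minute interval cuts the average rate to a third.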
ON FAILURE, RETRY EVERY
This field defines the amount of time (in minutes) OmniCenter will wait to retry the query after an initial failure (during the so-called SOFT state). This period should generally be considerably shorter than the CHECK INTERVAL. OmniCenter will continue to retry the query at this interval even after an alarm has been generated. If any of the retry queries return an “OK” code, the check will stop retrying, clear any current alarm and return to the normal CHECK INTERVAL schedule.
TOTAL FAILURES BEFORE ALERT
This field defines the maximum number of failed queries allowed to qualify for an alarm. When the total number of failed queries (initial failure plus retries) reaches this number, the service check enters a HARD state. At this point, an alarm will be generated and an immediate host check performed. The service check will continue to query according to the ON FAILURE, RETRY EVERY timer above. Netreo recommends that you do not set this option to 1, as that will generate a significantly higher number of false alarms.
When configuring custom alarm timing, the lowest that you’ll generally ever want to set the CHECK INTERVAL setting to is 3 minutes (especially if the system is very heavily utilized). You can go lower, but the more frequently you execute the query, the heavier the load on the system and the more network overhead required to perform the queries.
If you need to be alerted to an outage immediately, you’ll probably want to go with the following settings.
- CHECK INTERVAL = 3
- ON FAILURE, RETRY EVERY = 1
- TOTAL FAILURES BEFORE ALERT = 1
However, such a configuration means no SOFT state. This means that OmniCenter won’t do any verification to ensure that a problem is real before it generates an alarm. Users have done this in the past, and then complained that OmniCenter was spamming them with alert notifications. So be careful.
Another common configuration is as follows.
- CHECK INTERVAL = 2
- ON FAILURE, RETRY EVERY = 1
- TOTAL FAILURES BEFORE ALERT = 2
If you do the math for such a configuration, the maximum possible time between a service outage and an alarm is 3 minutes. It works well, but remember the potential load problems of setting the CHECK INTERVAL to three minutes or below.
It’s always recommended to avoid a TOTAL FAILURES BEFORE ALERT setting of 1, as any little hiccup on the network (like a lost ping packet) will immediately generate an alarm, which is probably not what you want if you’re looking to minimize false alarms.
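The worst-case figures quoted for these configurations follow from one formula: an outage can begin just after a successful check, so the first failure is detected up to one CHECK INTERVAL later, and each further required failure adds one retry interval. The function below is a hypothetical helper that encodes this arithmetic. It assumes that queries return instantly and that retries run exactly on schedule.

```python
def worst_case_minutes(check_interval, retry_every, total_failures):
    """Longest time from the start of an outage to the alarm, in minutes."""
    # First failure: up to one full check interval after the outage begins.
    # Remaining failures: one retry interval each.
    return check_interval + (total_failures - 1) * retry_every


print(worst_case_minutes(3, 1, 3))  # default "5 Minutes" selection
print(worst_case_minutes(3, 1, 1))  # immediate-alert configuration
print(worst_case_minutes(2, 1, 2))  # common configuration
```

Under these assumptions, the immediate-alert and common configurations share the same worst case, but only the second one keeps a SOFT state for verification.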
Action Group Selection
When any failed check generates an alarm, that alarm tries to open an incident. (See the entry for “Incidents” for more information about how alarms open incidents.) This section primarily allows you to select the action groups to be run by that incident. Any action groups selected in the fields below are assigned to a newly opened incident and executed at the appropriate times. This section also has options for assigning the check to a statistical group, and provides a place to add additional notes that you would want included in any alert notifications that get sent regarding this service check. The fields and their meanings are explained below.
ACTION GROUP
This field allows you to select the action group(s) that should be assigned to the incident opened by an alarm from this check. Multiple selection is allowed. The incident will run these groups when opened, at the RENOTIFICATION INTERVAL (above), and every time that incident changes state. (See the entries for “Incidents” and “Actions” for more information about incidents, actions and action groups.)
ESCALATION GROUP
This field allows you to select the action group(s) that should be run if an opened incident goes unacknowledged beyond the ESCALATE AFTER period (above). Multiple selection is allowed.
This field determines which statistical calculations this check will contribute to for reports.
If any of the contact or escalation action groups have been configured to send alert notifications, any notes entered here will be included within those notifications.