(Anomaly checks are configured within threshold checks, and they both utilize the same data set for the given statistic that they monitor.)
OmniCenter collects a wide variety of statistics (CPU, memory, bandwidth, etc.) from the managed devices on your network, depending on their device type and subtype. “Threshold checks” allow you to monitor those collected statistics for undesirable behavior and alert you when necessary.
By default, OmniCenter polls all managed devices for statistical data approximately every 5 minutes and stores that data in individual databases for each statistic. Threshold checks can monitor for both high and low values simultaneously, or either one independently.
Threshold checks can be configured to monitor collected statistics in either, or both, of two ways:
- Using static threshold values to monitor for absolute high and low values.
- Using dynamic threshold values to monitor for anomalous deviations from the norm.
Only users with Admin privileges or higher may add threshold checks to devices.
Threshold check functionality can be switched on/off for individual devices or groups of devices using the “POLL DEVICE” device setting. Each individual threshold check on each device can also be switched on/off independently. (See the Instances section of the Device Administration Dashboard for more information.)
Additionally, if the “HOST & SERVICE ALARMS” and “POLL DEVICE” settings are both turned off for a managed device, that device will no longer be considered monitored and will stop consuming a license slot in OmniCenter. All history for the device (up to that point) will continue to be stored for the normal length of time.
If a device then has both of the aforementioned settings switched back on, it will again consume a license slot and OmniCenter will automatically resume monitoring it. There will, however, be a gap in the device’s stored history equivalent to the time that it was unmonitored.
Static Threshold Checks
Configuring the static threshold settings in a threshold check causes it to query the database of polled data for its target statistic and average the collected data samples over a period of time (typically 5 minutes, but configurable in the check settings). It then compares that average against its static threshold settings.
If the averaged result does not exceed any of the configured high or low threshold limits, the check remains in an OK state.
If the averaged result exceeds the high or low WARNING threshold limit, the check enters a WARNING state. This state displays in the dashboards, but no other action is taken by OmniCenter.
If the averaged result exceeds the high or low CRITICAL threshold limit, the check enters a CRITICAL state and generates an alarm.
A threshold check will continue to average the most recently collected statistic values and compare the computed average against the static threshold limits even after an alarm has been generated. If the value of the computed average returns to “normal” at any time, the check will again enter an OK state and any alarm will be cleared.
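The state logic described above can be sketched in a few lines. This is an illustrative model only, not OmniCenter's internal implementation; the parameter names are hypothetical, and `None` stands in for an omitted limit.

```python
# Illustrative sketch of static-threshold evaluation. Any limit may be
# omitted (None), just as any field may be left empty in the check.

def classify(avg, high_warn=None, high_crit=None,
             low_warn=None, low_crit=None):
    """Return the check state for an averaged statistic value."""
    # CRITICAL limits take precedence and generate an alarm.
    if high_crit is not None and avg >= high_crit:
        return "CRITICAL"
    if low_crit is not None and avg <= low_crit:
        return "CRITICAL"
    if high_warn is not None and avg >= high_warn:
        return "WARNING"
    if low_warn is not None and avg <= low_warn:
        return "WARNING"
    # Within all configured limits: the check is (or returns to) OK,
    # which clears any open alarm.
    return "OK"

print(classify(95, high_warn=80, high_crit=90))  # CRITICAL
print(classify(85, high_warn=80, high_crit=90))  # WARNING
print(classify(50, high_warn=80, high_crit=90))  # OK
```

Because the check is re-evaluated against every new average, a value that drops back inside the limits immediately returns the check to OK, as described above.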
Dynamic (Anomaly) Threshold Checks
Configuring the anomaly threshold settings in a threshold check causes it to query the database of polled data for its target statistic and look for regular patterns over time in the averaged values, and then check for unusual deviations in those patterns.
For example, using its anomaly settings, the check might observe that every day of the past week at this time, a server was at 90% CPU utilization (which is the normal expected behavior for that particular server). But the CPU utilization at that time today was only at 40%. That could indicate that an important process on that server had stopped running, or that clients were unable to connect and utilize the server resources. The static threshold settings might not see any problem with those figures if they’re within configured limits, but the anomaly settings see those figures in terms of what’s been normal for that particular server statistic at that particular time.
When planning to use the anomaly settings in a threshold check it’s important to understand what “normal” means. A statistic with an undesirable value that never gets addressed or corrected will quickly become the new “normal” for that statistic in the eyes of the anomaly settings. So it’s important that you know what kind of an anomaly is useful to you and why—and that you’re prepared to address it promptly—before attempting to use the anomaly settings.
Both static and anomaly settings utilize the same data set for the given statistic that they monitor. However, the two checks operate completely independently of each other. For any given monitored statistic, it’s perfectly reasonable to add a threshold check without configuring anomaly detection, to add an anomaly check without configuring static threshold values, or to have both checks working simultaneously. It is important to note, however, that switching off a threshold check for a statistic, or disabling polling for a device, will also disable any anomaly checking for that statistic or device.
Threshold Check Status
The status of threshold checks is a core element of many of OmniCenter’s dashboard displays. Check status for any OmniCenter check is typically indicated by color. The following table outlines the states and meanings of the colors for threshold checks.
Threshold checks can only be configured by a user with Admin privileges or higher. They are configured either directly, in the Instances tab of the device administration dashboard (for checks added directly to devices), or indirectly, in a device template (the recommended way to add threshold checks).
Most statistics in OmniCenter are configured and monitored in pairs (with some exceptions). For example, bandwidth utilization pairs “inbound traffic” with “outbound traffic.” Threshold checks are also configured in these pairs, which are identified as VARIABLE ONE and VARIABLE TWO (the actual statistic will also be identified alongside the variable label). The settings for each variable are independent, but the pair shares a time period setting and alerting configuration (below).
The configuration options for threshold checks break down into two basic parts: alerting configuration and alarm configuration.
When any failed check generates an alarm, that alarm tries to open an incident. This section allows you to configure the alerting options for the incident. Any action groups selected in the fields below are assigned to a newly opened incident and executed at the appropriate times. You also have options for assigning the check to a statistical group and providing any additional notes that you would want included in any alert notifications that get sent regarding this check.
The fields and their meanings are explained below.
- ACTION GROUP
This field allows you to select the action group(s) that should be assigned to the incident opened by an alarm from this check. Multiple selection is allowed. The incident will run these groups when opened, at the RENOTIFICATION INTERVAL (below), and every time that incident changes state.
- ESCALATION GROUP
This field allows you to select the action group(s) that should be run if an opened incident goes unacknowledged beyond the ESCALATE AT (below) period. Multiple selection is allowed.
- RENOTIFICATION INTERVAL
This field specifies the number of minutes that an incident, once opened, will wait to run its assigned action groups again. The incident will continue to run its assigned action groups at this interval until the incident is either acknowledged or the alarms have cleared. The term “renotification” is a legacy term from when the only actions that incidents ran were to send alert notifications. For OmniCenter version 9 and up, the action groups selected in the ACTION GROUPS field are run when an incident is first opened by an alarm from this check, and at each RENOTIFICATION INTERVAL.
- ESCALATE AT
This field specifies the number of times the RENOTIFICATION INTERVAL must pass before the action groups selected in the ESCALATION GROUP field are run. Once the escalation action groups are run, they also run every time the RENOTIFICATION INTERVAL passes—along with the ACTION GROUP action groups. This term, like “renotification,” also refers to the legacy system—when only alert notifications were sent when incidents were opened. The intention was that alerts would initially be sent to first-response personnel for action, but that if the response took too long, alerts would also be sent to a group of escalation personnel (e.g. managers) to determine why action hadn’t been taken for the problem.
- STATISTICAL CATEGORY
This field determines the column in the Tactical Overview widget in which an alarm will be displayed to indicate that there’s a problem. The alarm displays as a problem with the device group (category, site or business workflow) of which the device generating the alarm is a part. It also determines which statistical calculations this check will contribute to for reports.
Optional. Only available when configuring a threshold check in a device template. This field allows you to filter the list of device interfaces that the template will apply the configured threshold check to. Filtering is based on interface descriptions. Enter a regular expression to include or exclude interfaces. If this field is left empty, the device template will attempt to add the configured threshold check to every interface of the matching type on every device it is applied to.
If any of the action groups selected above have been configured to send alert notifications, any notes entered here will be included within those notifications.
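The renotification and escalation cadence described in the fields above can be sketched as follows. This is a hypothetical model for illustration only (the function and group names are not OmniCenter identifiers): the ACTION GROUP selections run when the incident opens and at every RENOTIFICATION INTERVAL, and the ESCALATION GROUP selections join in once ESCALATE AT intervals have passed.

```python
# Hypothetical sketch of the renotification/escalation cadence.

def groups_to_run(intervals_elapsed, escalate_at):
    """Which action groups run after N renotification intervals?"""
    # ACTION GROUP runs at open (N = 0) and at every interval after.
    groups = ["ACTION GROUP"]
    # ESCALATION GROUP joins once ESCALATE AT intervals have passed,
    # and keeps running at every interval thereafter.
    if escalate_at and intervals_elapsed >= escalate_at:
        groups.append("ESCALATION GROUP")
    return groups

# RENOTIFICATION INTERVAL = 15 min, ESCALATE AT = 3:
# escalation begins 45 minutes after the incident opens.
for n in range(5):
    print(n * 15, "min:", groups_to_run(n, escalate_at=3))
```

Acknowledging the incident, or having its alarms clear, stops this cycle.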
Static Threshold Alarm Configuration
The settings for the static threshold limits are fairly simple and easy to set. They consist of fields to specify warning and critical limits for both high and low statistic values. If the target statistic value exceeds those limits (high or low), it will trigger the corresponding state in the check (warning or critical). The fields for these values are color-coded in accordance with the threshold check WARNING and CRITICAL state colors.
Any field may be omitted when configuring the check, but be aware of the effects omitting that field will have on the check results (e.g. setting a critical limit, but omitting a warning limit, means that the check will never warn of unusual resource usage until it becomes critical). Try to assign sensible and realistic values to prevent excessive false alarms.
The units for the high and low fields will automatically be appropriate for the type of statistic selected (e.g. CPU utilization would show percent, while latency would show seconds). A drop-down selector next to the value allows you to specify a prefix multiplier for the value. This allows you to configure the check using easier to comprehend values and have OmniCenter do the math for you.
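The prefix-multiplier arithmetic amounts to a simple scale factor. The sketch below assumes decimal (SI) multipliers; the actual multipliers offered in the drop-down depend on the statistic being configured.

```python
# Illustrative prefix-multiplier math (assuming SI decimal multipliers).
PREFIX = {"": 1, "K": 1e3, "M": 1e6, "G": 1e9}

def to_base_units(value, prefix=""):
    """Convert an easy-to-read value into the statistic's base unit."""
    return value * PREFIX[prefix]

# e.g. entering 100 with an "M" multiplier for a bits-per-second
# statistic stores a threshold of 100,000,000 bps:
print(to_base_units(100, "M"))  # 100000000.0
```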
The fields and their meanings are explained below.
- HIGH (Yellow)
This field represents the upper warning limit for this statistic. An averaged statistic value at or above the value set will cause the check to enter a WARNING state.
- HIGH (Red)
This field represents the upper critical limit for this statistic. An averaged statistic value at or above the value set will cause the check to enter a CRITICAL state and generate an alarm.
- LOW (Yellow)
This field represents the lower warning limit for this statistic. An averaged statistic value at or below the value set will cause the check to enter a WARNING state.
- LOW (Red)
This field represents the lower critical limit for this statistic. An averaged statistic value at or below the value set will cause the check to enter a CRITICAL state and generate an alarm.
- TIME PERIOD
This is the amount of time (in minutes) over which to average the value of a statistic before comparing it to the LOW and HIGH values. OmniCenter polls and records the statistic value every five minutes; so, selecting a TIME PERIOD of “5 Min” means that it would only take one poll that exceeded the WARNING or CRITICAL values to trigger a change in state, whereas a value of “15 Min” would take three consecutive polls (with an average value that exceeded the WARNING or CRITICAL values) to trigger a state change. This field is an important adjustment for reducing false alarms.
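The TIME PERIOD arithmetic above can be sketched as follows. This is an illustrative model, not OmniCenter code: with a 5-minute poll interval, a TIME PERIOD of t minutes averages the most recent t ÷ 5 polls, so a longer period smooths out a single spike.

```python
# Sketch of TIME PERIOD averaging (polls assumed every 5 minutes).
POLL_INTERVAL_MIN = 5

def samples_averaged(time_period_min):
    return time_period_min // POLL_INTERVAL_MIN

def averaged_value(recent_polls, time_period_min):
    n = samples_averaged(time_period_min)
    window = recent_polls[-n:]          # the most recent n polls
    return sum(window) / len(window)

polls = [40, 45, 50, 55, 95]            # one 95% spike at the latest poll
print(averaged_value(polls, 5))         # 95.0: one spike alone can trigger
print(averaged_value(polls, 15))        # ~66.7: spike smoothed over 3 polls
```

This is why lengthening the TIME PERIOD reduces false alarms: a transient spike must persist across several polls before the average crosses a limit.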
Anomaly Alarm Configuration
An anomaly check compares the most recently polled value of a polled statistic against the computed average of eight previously polled samples. Depending on the sensitivity configured for the check, the deviation from the average may be considered anomalous and trigger a WARNING or CRITICAL state in the check.
Although statistics are polled and recorded every five minutes, the eight previous samples used are not the last eight samples taken. Each sample is from at least one hour earlier than the next. The range of time between the samples is adjustable using the “Season” field. The eight samples are always taken from the same relative timestamp as the current sample. So, an anomaly check with a season setting of “Hour” that polls a resource at 8:05 p.m. will average samples from 7:05 p.m., 6:05 p.m., 5:05 p.m., 4:05 p.m., 3:05 p.m., 2:05 p.m., 1:05 p.m., and 12:05 p.m., and compare that average to the current sample. These samples are all one “Hour” apart. Five minutes later, when the resource is polled at 8:10 p.m., samples will be averaged from 7:10 p.m., 6:10 p.m., 5:10 p.m., 4:10 p.m., 3:10 p.m., 2:10 p.m., 1:10 p.m., and 12:10 p.m. This is called a rolling average. Selecting a different season simply changes the amount of time between the samples, from one hour, to one day, to one week.
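The seasonal sample selection described above can be sketched as a simple timestamp computation. This is an illustrative model only; the function name and season labels are assumptions based on the example in the text.

```python
from datetime import datetime, timedelta

# Sketch of seasonal sampling: eight previous samples, each one
# "season" apart, all at the same relative timestamp as the current poll.
SEASON = {"Hour": timedelta(hours=1),
          "Day": timedelta(days=1),
          "Week": timedelta(weeks=1)}

def seasonal_timestamps(now, season="Hour", n=8):
    step = SEASON[season]
    return [now - step * i for i in range(1, n + 1)]

now = datetime(2024, 5, 1, 20, 5)       # an 8:05 p.m. poll
for ts in seasonal_timestamps(now):
    print(ts.strftime("%I:%M %p"))      # 07:05 PM, 06:05 PM, ... 12:05 PM
```

Five minutes later the whole window shifts forward by five minutes, which is what makes the average a rolling one.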
Each sample must also meet a minimum value to be checked for anomalous behavior. This value can be specified using the “Min Value” field. Values below this setting will never be considered by the anomaly engine, even as previous data samples (they are simply dropped from the data set). This prevents a sequence of extremely low values from producing false positives out of minor deviations (changing from an average of 0.001 to 1 is a thousandfold increase, but still not likely to be any kind of problem).
Additionally, any value below this setting will cause a currently open anomaly incident to move into the ALARMS CLEARED state to begin incident recovery. This housekeeping prevents an upper-boundary anomaly check from being stuck in an alarm state because it never processes a current data point that remains below the Min Value. However, this same logic will also cause a lower-boundary anomaly alarm to recover, negating the purpose of setting a lower-boundary check. Exercise caution when using a Min Value with a lower-boundary check: values below the setting will both prevent the check from triggering an alarm and cause any existing alarm to recover. The Min Value setting only affects anomaly checks, and will not interfere with normal threshold check operation.
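The Min Value filtering described above can be sketched as follows. This is an illustrative model only: samples below the minimum are dropped before the average is computed, so a run of near-zero values cannot distort the baseline.

```python
# Sketch of Min Value filtering of the seasonal sample set.

def filtered_average(samples, min_value):
    """Average only the samples at or above min_value; None if none remain."""
    kept = [s for s in samples if s >= min_value]
    return sum(kept) / len(kept) if kept else None

samples = [0.001, 0.002, 1200, 1150, 1300]
print(filtered_average(samples, min_value=100))
# Averages only 1200, 1150, 1300: the near-zero samples are dropped
# rather than dragging the baseline toward zero.
```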
As long as a statistic is being polled (i.e. a threshold check has been added to it, even if not configured), an anomaly check can be configured for it. Keep in mind that threshold and anomaly checks share all configuration fields in the alerting section of the configuration options, and that if the threshold check for a statistic is switched off, the anomaly check will also be disabled, even though the anomaly settings function independently of the other threshold check settings and neither requires the other to be configured in order to work.
Anomalies are easy to configure, as there are only five settings. The fields and their meanings are explained below.
This field determines whether the check will look for anomalous values that are above normal (upper boundary), below normal (lower boundary), or both.
- Sensitivity (Yellow)
This field determines how sensitive the check is for the purposes of entering a WARNING state. This field should always be set to be more sensitive than the CRITICAL (red) field, so that a WARNING state occurs first (unless you plan on not using a WARNING state). This field is highly subjective, depending on circumstances. Trial and error will be required for each situation to achieve optimal performance. Hint: A higher sensitivity can detect smaller deviations, while a lower sensitivity can only detect larger deviations.
- Sensitivity (Red)
This field determines how sensitive the check is for the purposes of entering a CRITICAL state. This field should always be set to be less sensitive than the WARNING (yellow) field, so that a CRITICAL state occurs last (unless you plan on not using a CRITICAL state). An anomaly check entering a CRITICAL state will generate an alarm. This field is highly subjective, depending on circumstances. Trial and error will be required for each situation to achieve optimal performance. Hint: A higher sensitivity can detect smaller deviations, while a lower sensitivity can only detect larger deviations.
- Season
This field specifies the range of time between the eight previous data samples, against whose average the current sample value is compared. A setting of “Days” means that the current value is compared against the average of samples taken at the exact same time on each of the previous eight days.
- Min Value
This field specifies the lowest acceptable value for the current data sample that will trigger an anomaly check. When the current sample value is below this number, the anomaly check will be skipped. The value entered in this field should be specified in the same base unit displayed in the static threshold configuration (above) without the prefix (for example: bytes, not megabytes, seconds, not milliseconds). Note: For bandwidth monitoring (only), the value must be specified in bits per second, and not as a percentage.
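Putting the anomaly settings together: the sketch below is a hypothetical model, since OmniCenter's actual sensitivity scale is internal to the product. Here, sensitivity is modeled as a maximum tolerated percent deviation from the seasonal average (a more sensitive check tolerates a smaller deviation), purely for illustration.

```python
# Hypothetical model of anomaly evaluation. warn_pct/crit_pct stand in
# for the yellow/red sensitivity fields; smaller = more sensitive.

def anomaly_state(current, seasonal_avg, warn_pct, crit_pct,
                  boundary="both", min_value=0):
    if current < min_value or seasonal_avg == 0:
        return "SKIPPED"            # below Min Value: check not run
    deviation = (current - seasonal_avg) / seasonal_avg * 100
    if boundary == "upper" and deviation < 0:
        deviation = 0               # only deviations above normal count
    if boundary == "lower" and deviation > 0:
        deviation = 0               # only deviations below normal count
    if abs(deviation) >= crit_pct:
        return "CRITICAL"           # generates an alarm
    if abs(deviation) >= warn_pct:
        return "WARNING"
    return "OK"

# The earlier example: a server normally at ~90% CPU at this hour is
# suddenly at 40%, a drop of about 56% from its seasonal norm.
print(anomaly_state(40, 90, warn_pct=20, crit_pct=50))  # CRITICAL
```

Note how an upper-boundary check would ignore that same drop entirely, which is why the boundary setting matters when the failure mode you care about is unusually low usage.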