Threshold Checks

Threshold Check Description

A threshold check allows you to monitor a single device or application statistic (CPU utilization, hardware temperature, SQL deadlocks, etc.) and measure it against a set of high and/or low threshold values, providing insight into imminent failures and detection of actual failures. Threshold checks provide two modes of monitoring – static threshold evaluation and anomaly detection. Either may be used independently of the other, or both may be used simultaneously.

Threshold checks may be configured to monitor for both high and low values, only high values or only low values. A threshold check detecting unacceptable values for its monitored statistic displays as failed in OmniCenter dashboards and (typically) sends an alert notification to users. A threshold check can even be configured to run commands on the device experiencing the problem (such as reboot or restart service commands).

By default, OmniCenter automatically adds several preconfigured threshold checks to each device to provide basic monitoring services. However, there are many other statistics collected by OmniCenter to which you can add a threshold check to suit your specific monitoring needs.

Static Thresholds

The primary function of a threshold check is to measure a statistic against a set of static threshold values and determine if the value of the statistic has exceeded any of those thresholds. It is concerned with what the statistic is experiencing right now. This functionality provides basic performance monitoring of a statistic. For more advanced monitoring, see anomaly detection below.

Anomaly Detection

Anomaly detection is a more advanced function of a threshold check. It uses adaptive, dynamic threshold values to determine if a statistic is experiencing values that are inconsistent with what is normally expected from that statistic at this time, given its history.

As an example, say CPU utilization for server X has been 85% at this time of the day, for this day of the week, for the past eight weeks, but is currently showing 55%. Given its history, that’s not the expected utilization value for this time. This could indicate that an important process on that server has stopped running, or that clients are unable to connect and utilize the server resources.

Since neither of the reported values is particularly high or low, static threshold monitoring would not indicate anything unusual happening. The dynamic threshold values of anomaly detection, however, look at the collected values in the context of their history and develop an understanding of what is “typical” for that statistic at any given time.

When used together, static thresholds and anomaly detection are extremely powerful. However, you are free to use them independently of each other in any given threshold check. Either may be configured alone without requiring the other to be configured.

Threshold Check States

The Tactical Overview dashboard widget is useful for displaying threshold check status for devices and groups of devices.

Threshold checks always display one of the following states when viewed in dashboard widgets.

State: Description
OK (Green): The value of the statistic is within the user-determined acceptable operating range.
WARNING (Yellow): The value of the statistic is higher than the configured high WARNING value or lower than the configured low WARNING value, but has not yet exceeded the configured CRITICAL value for either.
CRITICAL (Red): The value of the statistic is higher than the configured high CRITICAL value or lower than the configured low CRITICAL value.
ACKNOWLEDGED (Blue): Indicates a threshold check in a CRITICAL state that has been acknowledged by a user. This state is technically an incident state, not a check state, but its display in the dashboards helps to distinguish between problems that are new and problems that are already being addressed.

(Figure: The Tactical Overview dashboard widget showing aggregated statuses of different checks for different device groups. The THRESHOLDS column for each row shows the total number of threshold checks in each state for that group. The ANOMALIES column reflects the statuses of the anomaly checks for the devices in that group.)

How Threshold Checks Work

OmniCenter collects and stores the retrieved values for each device and application statistic to which it has access, maintaining a historical record of the performance of that statistic. This record provides the data set that a threshold check uses to evaluate the current state of the statistic. This data set is used differently by static thresholds and anomaly detection, as explained below.

(The performance history of each statistic is also graphed on the Performance tab of the Device Dashboard for any given device. So, even without a threshold check, a statistic could still be monitored manually.)

A threshold check continues to retrieve and evaluate statistic values according to its configuration even after an alarm has been generated and an alert sent. If the value returns to “normal” at any time, the check immediately recovers to the OK state, clears its alarm and signals any opened incident that it has recovered.

Static Thresholds

When the latest value for a statistic is collected, the threshold check calculates an average value made up of the most recently collected values. (The number of values averaged is determined when configuring the check.) This averaged value is then compared against the configured static high and low threshold values to see if any have been exceeded.

The reason for averaging the most recent values instead of directly using the last raw value collected is to avoid generating an unnecessary alarm due to a momentary spike in the value. Averaging smooths the values into a more reliable indicator of that statistic’s current condition.

Static thresholds include independent values to trigger both WARNING and CRITICAL states for both high and low value conditions.

If the averaged value exceeds any high or low configured warning thresholds, the check enters the WARNING state. This state displays in the dashboards, but no other action is taken by OmniCenter.

If the averaged value exceeds any high or low configured critical thresholds, the check enters the CRITICAL state. This state displays in the dashboards and generates an alarm (see threshold check alarms below).
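As a rough illustration of the evaluation described above, the following Python sketch averages the most recent values and compares the result against the four static thresholds. The function name, parameter names and example numbers are hypothetical, and the strict greater-than/less-than comparisons are an assumption. It is not OmniCenter's actual implementation.

```python
from statistics import mean
from typing import Optional, Sequence


def static_state(
    recent_values: Sequence[float],
    high_warning: Optional[float] = None,
    high_critical: Optional[float] = None,
    low_warning: Optional[float] = None,
    low_critical: Optional[float] = None,
) -> str:
    """Return "OK", "WARNING" or "CRITICAL" for the average of recent_values.

    Any threshold left as None is simply not checked, mirroring a check
    configured for only high values or only low values.
    """
    average = mean(recent_values)
    # Critical thresholds are tested first so they take precedence.
    if high_critical is not None and average > high_critical:
        return "CRITICAL"
    if low_critical is not None and average < low_critical:
        return "CRITICAL"
    if high_warning is not None and average > high_warning:
        return "WARNING"
    if low_warning is not None and average < low_warning:
        return "WARNING"
    return "OK"


# A single spike to 99 is smoothed out by the two calmer values before it.
print(static_state([70, 72, 99], high_warning=85, high_critical=95))  # OK
print(static_state([90, 92, 94], high_warning=85, high_critical=95))  # WARNING
print(static_state([96, 97, 99], high_warning=85, high_critical=95))  # CRITICAL
```

The first call shows why averaging is used: the last raw value alone (99) would exceed the critical threshold, but the averaged value does not.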

Anomaly Detection

An anomaly check compares the most recently polled value of a statistic against a set of dynamic threshold values computed from a data set that samples eight previously polled values.

The check may be configured to look for upper boundary and/or lower boundary anomalies (similar to high and low static threshold values). These upper and lower boundaries represent dynamic threshold values that are computed as deviations from the mean of the eight sampled values. As each execution of the check drops older samples from the data set and adds newer samples, these boundaries are continuously recomputed to establish what should be considered “normal” for the statistic at the time of polling.

The amount that the upper and lower boundary thresholds deviate from the mean is controlled by the check’s anomaly sensitivity. A lower sensitivity places the boundary values further from the computed mean of the samples, so a polled value must be more abnormal to be considered an anomaly. A higher sensitivity places the boundary values closer to the mean, so a less abnormal polled value is enough to be considered an anomaly.

The appropriate sensitivity for anomaly detection is highly subjective, depending on circumstances. Trial and error will be required for each situation to achieve optimal performance. (Hint: A higher sensitivity can detect smaller deviations, while a lower sensitivity can only detect larger deviations.)

Anomaly detection includes independent sensitivity settings to trigger both WARNING and CRITICAL states.

If the current value of the statistic exceeds the computed warning upper or lower boundary values, the current value is considered a potential anomaly and the check enters the WARNING state. This state displays in the dashboards, but no other action is taken by OmniCenter.

If the current value of the statistic exceeds the computed critical upper or lower boundary values, the current value is considered an anomaly and the check enters the CRITICAL state. This state displays in the dashboards and generates an alarm (see threshold check alarms below).
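The boundary logic can be sketched in Python as follows. The article does not state the exact formula, so this sketch assumes each boundary sits a given number of standard deviations from the mean of the samples. The `width` values standing in for the sensitivity settings are hypothetical (a higher sensitivity corresponds to a smaller width), as are all function names.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def boundaries(samples: Sequence[float], width: float) -> Tuple[float, float]:
    """Return (lower, upper) boundaries around the mean of the samples.

    ASSUMPTION: the deviation is `width` sample standard deviations.
    The real product may compute the deviation differently.
    """
    centre = mean(samples)
    spread = stdev(samples)
    return centre - width * spread, centre + width * spread


def anomaly_state(
    current: float,
    samples: Sequence[float],
    warning_width: float,
    critical_width: float,
) -> str:
    """Classify the current value against warning and critical boundaries."""
    low_crit, high_crit = boundaries(samples, critical_width)
    low_warn, high_warn = boundaries(samples, warning_width)
    if current < low_crit or current > high_crit:
        return "CRITICAL"
    if current < low_warn or current > high_warn:
        return "WARNING"
    return "OK"


# Eight seasonal samples that hover around 85% CPU utilization.
history = [84, 86, 85, 87, 83, 85, 86, 84]

# The warning band is narrower (more sensitive) than the critical band.
print(anomaly_state(55, history, warning_width=2.0, critical_width=3.0))  # CRITICAL
print(anomaly_state(88, history, warning_width=2.0, critical_width=3.0))  # WARNING
print(anomaly_state(85, history, warning_width=2.0, critical_width=3.0))  # OK
```

With this history, a reading of 55% is flagged even though it would pass a typical static threshold, which matches the server X example earlier in the article.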

Anomaly Detection Samples

The eight previous data values sampled are not simply the eight most recent values polled. The range of time between the samples is adjustable, but each sample is always from at least one hour earlier than the next. These eight samples are always taken from the same relative timestamp as the current polled value, and are called a season.

So, an anomaly check with a season setting of Hour that polls a statistic at 8:05 p.m. samples values from 7:05 p.m., 6:05 p.m., 5:05 p.m., 4:05 p.m., 3:05 p.m., 2:05 p.m., 1:05 p.m., and 12:05 p.m. These are all exactly one hour apart.

Five minutes later, when the statistic is polled again at 8:10 p.m., values from 7:10 p.m., 6:10 p.m., 5:10 p.m., 4:10 p.m., 3:10 p.m., 2:10 p.m., 1:10 p.m., and 12:10 p.m. are used. Again, all are exactly one hour apart.

Selecting a different season simply changes the amount of time between the sampled values. Your choices are Hour, Day or Week.
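The sample selection can be sketched in Python like this. The function and dictionary names are hypothetical, and the sketch assumes that the Day and Week seasons step back exactly one day and one week per sample, in the same way the Hour season steps back one hour.

```python
from datetime import datetime, timedelta
from typing import List

# Spacing between samples for each season (assumed for Day and Week).
SEASON_STEP = {
    "Hour": timedelta(hours=1),
    "Day": timedelta(days=1),
    "Week": timedelta(weeks=1),
}


def sample_times(polled_at: datetime, season: str, count: int = 8) -> List[datetime]:
    """Return the timestamps of the previous samples, most recent first."""
    step = SEASON_STEP[season]
    return [polled_at - step * n for n in range(1, count + 1)]


poll = datetime(2020, 3, 27, 20, 5)  # 8:05 p.m.
print(", ".join(ts.strftime("%H:%M") for ts in sample_times(poll, "Hour")))
# 19:05, 18:05, 17:05, 16:05, 15:05, 14:05, 13:05, 12:05
```

Passing "Week" instead of "Hour" would return the same time of day on the same weekday for each of the previous eight weeks.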

Previous Anomaly Samples

If one of the previous eight data values being sampled was itself an anomaly, that sample is excluded from the detection calculations. However, if more than one of the previous data values was an anomaly, those values are used in the detection calculations. This is how the threshold check dynamically adapts to gradual changes in behavior that are, in fact, perfectly normal.
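A minimal Python sketch of this rule, assuming each stored sample carries a flag recording whether it was itself an anomaly (the data layout and function name are hypothetical):

```python
from typing import List, Sequence, Tuple


def usable_samples(samples: Sequence[Tuple[float, bool]]) -> List[float]:
    """Pick the sample values used for the detection calculations.

    Each sample is a (value, was_anomaly) pair. Exactly one anomalous
    sample is treated as an outlier and excluded. Two or more are kept,
    on the reasoning that the behavior of the statistic is shifting.
    """
    anomalous = [value for value, was_anomaly in samples if was_anomaly]
    if len(anomalous) == 1:
        return [value for value, was_anomaly in samples if not was_anomaly]
    return [value for value, _ in samples]


one_outlier = [(85, False)] * 7 + [(40, True)]
shifting = [(85, False)] * 6 + [(60, True), (58, True)]

print(len(usable_samples(one_outlier)))  # 7
print(len(usable_samples(shifting)))     # 8
```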

Minimum Sample Values

Each sampled data point must also meet a minimum value to be checked for anomalous behavior at all (configured in the check using the Min Value field). Values below this setting are never used by the anomaly engine, even as previous data samples (they are simply dropped from the data set). This prevents a sequence of extremely low values from producing false positives from only minor deviations. (Changing from an average of 1 to 0.001 is a thousandfold change in relative terms, but is still not likely to indicate any kind of problem.)

However, if a polled value is below the minimum setting after an anomaly is detected and an alarm generated, the current alarm is automatically cleared and the opened incident notified. This is for housekeeping purposes to prevent a detected upper boundary anomaly from causing the check to be stuck in an alarm state due to never processing the current data point if the polled values remain below the minimum.

Unfortunately, this same logic also causes a lower boundary anomaly alarm to clear, negating the value of setting a lower boundary check. It is therefore recommended to exercise caution when using a minimum value with a lower boundary check, as polled values below this setting will both prevent a lower boundary anomaly from triggering an alarm and cause any existing alarm to be cleared.

(The minimum value setting is for anomaly detection only, and does not interfere with static threshold check operations.)
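The effect of the minimum value on a single execution of an anomaly check can be sketched in Python as below. The names and the result structure are hypothetical. The sketch only models the behavior described above: samples below the minimum are dropped, and a current value below the minimum is not evaluated and clears any active alarm.

```python
from typing import List, NamedTuple, Sequence


class MinValueResult(NamedTuple):
    evaluate: bool        # should the current value be checked for anomalies?
    samples: List[float]  # previous samples that remain in the data set
    alarm_active: bool    # alarm status after applying the minimum value rule


def apply_min_value(
    current: float,
    samples: Sequence[float],
    min_value: float,
    alarm_active: bool,
) -> MinValueResult:
    """Apply the Min Value setting before anomaly detection runs."""
    kept = [s for s in samples if s >= min_value]
    if current < min_value:
        # The value is never evaluated, and any existing alarm is cleared.
        # Note that this also clears a lower boundary alarm.
        return MinValueResult(False, kept, False)
    return MinValueResult(True, kept, alarm_active)


print(apply_min_value(0.5, [40, 42, 0.2, 41], min_value=1, alarm_active=True))
# MinValueResult(evaluate=False, samples=[40, 42, 41], alarm_active=False)
```

The example shows the caution noted above: a statistic that has collapsed to 0.5 is exactly what a lower boundary check is meant to catch, yet the minimum value causes it to be skipped and the alarm to clear.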

Threshold Check Alarms

A threshold check alarm always attempts to open a new incident in OmniCenter (although this may be prevented by OmniCenter’s incident management system for housekeeping purposes, such as if an incident already exists for the current issue). Incidents typically send alert notifications to users (or alert external systems) through the use of action groups.

Unlike service checks, threshold checks generate an alarm immediately upon reaching the CRITICAL state.

Threshold Check Management

Only users with Admin access level or higher may manage threshold checks.

In OmniCenter, many collected statistics are configured and monitored as pairs (for example, bandwidth utilization pairs inbound traffic with outbound traffic). Threshold checks are also configured in these same pairs, with the opposing statistics identified as VARIABLE ONE and VARIABLE TWO in the check (the actual statistic name is also identified alongside the variable label). The static threshold and anomaly detection settings for each variable are independent, but all other configuration settings for the check are shared between the two.

When configuring static threshold values, the units for the high and low fields will automatically be appropriate for the type of statistic selected (e.g. CPU utilization would show percent, while latency would show seconds). A pull-down selector next to the value allows you to specify a multiplier prefix for the entered value. This allows you to configure the check using values with which you are comfortable and have OmniCenter do the math for you.

Errors Per Second Configuration

OmniCenter always measures errors per second values as milli-errors per second. This allows for err/sec measurements of less than one. See Understanding Errors per Second for more information on calculating err/sec values that can be used in threshold checks. When entering err/sec values remember to select the milli (m) multiplier prefix.
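The unit conversion involved is a simple factor of 1000. The helper names in this Python sketch are hypothetical, and it covers only the conversion itself, not the calculation of err/sec values described in Understanding Errors per Second.

```python
def to_milli_errors(errors_per_second: float) -> float:
    """Convert err/sec into the milli-err/sec figure entered with the m prefix."""
    return errors_per_second * 1000


def from_milli_errors(milli_errors_per_second: float) -> float:
    """Convert a milli-err/sec figure back into err/sec."""
    return milli_errors_per_second / 1000


# A threshold of half an error per second is entered as 500 with the m prefix.
print(to_milli_errors(0.5))    # 500.0
print(from_milli_errors(250))  # 0.25
```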

Add a threshold check to a device template

  1. Go to the OmniCenter main menu and select Administration > Templates to open the Device Templates Administration page.
  2. Locate the device template to which you would like to add a threshold check and select its edit icon in the ACTIONS column, or create a new device template.
  3. In the Threshold Checks section of the Template Components panel select the add threshold check button (+).
    1. Select the general group type of statistic that you would like to monitor.
    2. Select the specific statistic that you would like to monitor.
    3. Select Select Threshold.
  4. In the ACTION GROUP field select the action group(s) to receive alert notifications before escalation.
  5. In the ESCALATION GROUP field select the action group(s) to receive alert notifications after escalation.
  6. In the RENOTIFICATION INTERVAL field enter the number of minutes for OmniCenter to wait before sending another alert notification if the problem is not acknowledged by a user.
    • Alert notifications are sent to the action groups in the ACTION GROUP field.
    • The default value of 1440 minutes (24 hours) is recommended to minimize alert noise.
    • Setting a value of 0 (zero) will disable renotifications.
  7. In the ESCALATE AT field enter the number of alert notifications after the first for OmniCenter to wait before sending alert notifications to the action groups in the ESCALATION GROUP field, as well as to the groups in the ACTION GROUP field.
    • The default value of 1 means that a total of 2 alerts must be sent before escalation groups start receiving them.
  8. In the STATISTICAL GROUP field select the type that has the greatest relevance to the check. This field determines which statistical calculations this check contributes to for reports.
  9. (Optional) In the SUBSTRING field, enter a string or regular expression to include or exclude specific interfaces from this check using a match to the interface name/description.
    • If this field is left empty, the device template attempts to add the configured threshold check to every interface of every device it is applied to.
  10. (Optional) If you would like to configure static threshold monitoring (repeat these steps for each variable if two variables are present):
    1. In the HIGH warning field (yellow) enter the exact value at which the check should enter the WARNING state for high values.
      • Next to the value type, select the multiplier prefix.
    2. In the HIGH critical field (red) enter the exact value at which the check should enter the CRITICAL state for high values.
      • Next to the value type, select the multiplier prefix.
    3. In the LOW warning field (yellow) enter the exact value at which the check should enter the WARNING state for low values.
      • Next to the value type, select the multiplier prefix.
    4. In the LOW critical field (red) enter the exact value at which the check should enter the CRITICAL state for low values.
      • Next to the value type, select the multiplier prefix.
    5. In the TIME PERIOD field select the time period over which data values will be sampled for the calculated average.
      • See Best Practices below.
  11. (Optional) If you would like to configure anomaly detection (repeat these steps for each variable if two variables are present):
    1. In the Boundary field select whether to check for upper boundary anomalies, lower boundary anomalies or both.
    2. In the Sensitivity warning field (yellow) select the desired sensitivity. (This should always be at least one setting higher than the critical sensitivity field, so that the warning state occurs first.)
    3. In the Sensitivity critical field (red) select the desired sensitivity. (This should always be at least one setting lower than the warning sensitivity field, so that the warning state occurs first.)
    4. In the Season field select the desired season for the data samples.
    5. (Optional) In the Min Value field set the minimum value that a polled value must be to qualify for anomaly detection.
      • The value entered in this field should be specified in the same base unit displayed in the static threshold configuration, without the prefix (for example: bytes, not megabytes; seconds, not milliseconds). Note: For bandwidth monitoring (only), the value must be specified in bits per second, and not as a percentage.
  12. Select Create/Edit Thresholds.

If you’ve added a threshold check to a device template that is already applied to any devices, navigate back to the Device Templates Administration page using the arrow icon at the top left of the page and reapply your device templates.
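The interaction between the RENOTIFICATION INTERVAL and ESCALATE AT fields in the steps above can be sketched in Python as follows. This is one reading of the field descriptions, not confirmed product behavior: alerts are numbered from 1, and with ESCALATE AT set to 1 the first two alerts go only to the action groups, with escalation groups included from the third alert onward. All names are hypothetical.

```python
from typing import List


def recipients(alert_number: int, escalate_at: int = 1) -> List[str]:
    """Return which groups receive the given alert (1 = first alert).

    ASSUMPTION: escalation groups start receiving alerts once
    escalate_at + 1 alerts have already been sent.
    """
    groups = ["ACTION GROUP"]
    if alert_number > escalate_at + 1:
        groups.append("ESCALATION GROUP")
    return groups


def alert_minutes(renotification_interval: int = 1440, alerts: int = 3) -> List[int]:
    """Minutes after the alarm at which unacknowledged alerts are sent."""
    if renotification_interval == 0:
        return [0]  # renotifications disabled: only the initial alert
    return [n * renotification_interval for n in range(alerts)]


for number, minute in enumerate(alert_minutes(), start=1):
    print(minute, recipients(number))
# 0 ['ACTION GROUP']
# 1440 ['ACTION GROUP']
# 2880 ['ACTION GROUP', 'ESCALATION GROUP']
```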

Add a threshold check to a single device

Once a threshold check is added to a device it cannot be removed, only disabled.

  1. Locate the device to which you would like to add a threshold check and select it to open its device dashboard.
    • Specific devices can be located in OmniCenter by either drilling in to a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  2. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  3. Select the Instances tab.
  4. Locate the panel for the statistic type containing the statistic you would like to monitor and select it to open the panel.
    • If the statistic you would like to monitor is in the Network panel, use the pull-down menu at the top right and select Thresholds to display the network interfaces.
  5. Locate the specific statistic to which you would like to add a threshold check and select its add threshold icon (+) in the ACTIONS column.
  6. In the ACTION GROUP field select the action group(s) to receive alert notifications before escalation.
  7. In the ESCALATION GROUP field select the action group(s) to receive alert notifications after escalation.
  8. In the RENOTIFICATION INTERVAL field enter the number of minutes for OmniCenter to wait before sending another alert notification if the problem is not acknowledged by a user.
    • Alert notifications are sent to the action groups in the ACTION GROUP field.
    • The default value of 1440 minutes (24 hours) is recommended to minimize alert noise.
    • Setting a value of 0 (zero) will disable renotifications.
  9. In the ESCALATE AT field enter the number of alert notifications after the first for OmniCenter to wait before sending alert notifications to the action groups in the ESCALATION GROUP field, as well as to the groups in the ACTION GROUP field.
    • The default value of 1 means that a total of 2 alerts must be sent before escalation groups start receiving them.
  10. In the STATISTICAL GROUP field select the type that has the greatest relevance to the check. This field determines which statistical calculations this check contributes to for reports.
  11. (Optional) In the SUBSTRING field, enter a string or regular expression to include or exclude specific interfaces from this check using a match to the interface name/description.
    • If this field is left empty, the device template attempts to add the configured threshold check to every interface of every device it is applied to.
  12. (Optional) If you would like to configure static threshold monitoring (repeat these steps for each variable if two variables are present):
    1. In the HIGH warning field (yellow) enter the exact value at which the check should enter the WARNING state for high values.
      • Next to the value type, select the multiplier prefix.
    2. In the HIGH critical field (red) enter the exact value at which the check should enter the CRITICAL state for high values.
      • Next to the value type, select the multiplier prefix.
    3. In the LOW warning field (yellow) enter the exact value at which the check should enter the WARNING state for low values.
      • Next to the value type, select the multiplier prefix.
    4. In the LOW critical field (red) enter the exact value at which the check should enter the CRITICAL state for low values.
      • Next to the value type, select the multiplier prefix.
    5. In the TIME PERIOD field select the time period over which data values will be sampled for the calculated average.
      • See Best Practices below.
  13. (Optional) If you would like to configure anomaly detection (repeat these steps for each variable if two variables are present):
    1. In the Boundary field select whether to check for upper boundary anomalies, lower boundary anomalies or both.
    2. In the Sensitivity warning field (yellow) select the desired sensitivity. (This should always be at least one setting higher than the critical sensitivity field, so that the warning state occurs first.)
    3. In the Sensitivity critical field (red) select the desired sensitivity. (This should always be at least one setting lower than the warning sensitivity field, so that the warning state occurs first.)
    4. In the Season field select the desired season for the data samples.
    5. (Optional) In the Min Value field set the minimum value that a polled value must be to qualify for anomaly detection.
      • The value entered in this field should be specified in the same base unit displayed in the static threshold configuration, without the prefix (for example: bytes, not megabytes; seconds, not milliseconds). Note: For bandwidth monitoring (only), the value must be specified in bits per second, and not as a percentage.
  14. Select Create Threshold.

Edit a threshold check in a device template

  1. Go to the OmniCenter main menu and select Administration > Templates to open the Device Templates Administration page.
  2. Locate the device template which contains the threshold check you would like to edit and select its edit icon in the ACTIONS column.
  3. In the Threshold Checks section of the Template Components panel locate the threshold check you would like to edit and select its edit icon in the ACTIONS column.
  4. Edit the threshold check as desired.
  5. Select Create/Edit Threshold to save your settings.

If you’ve edited a threshold check in a device template that is already applied to any devices, navigate back to the Device Templates Administration page using the arrow icon at the top left of the page and reapply your device templates.

Edit a threshold check on a single device

  1. Locate the device for which you would like to edit a threshold check and select it to open its device dashboard.
    • Specific devices can be located in OmniCenter by either drilling in to a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  2. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  3. Select the Instances tab.
  4. Locate the panel for the statistic type containing the statistic for the threshold check you would like to edit and select it to open the panel.
    • If the statistic you would like to monitor is in the Network panel, use the pull-down menu at the top right and select Thresholds to display the network interfaces.
  5. Locate the specific threshold check you would like to edit and select its edit icon in the ACTIONS column.
    • If the edit icon is not present but instead shows a lock icon and a control icon, this threshold check has been added by a device template and is controlled there. Either:
      • Select the control icon to open the device template controlling the check and see Edit a threshold check in a device template above.
      • Or, select the lock icon to disable device template control for this specific threshold check, select the edit icon and proceed to the next step.
  6. Edit the threshold check as desired.
  7. Select Save Threshold to save your settings.

Disable a threshold check in a device template

Follow the steps for Edit a threshold check in a device template above. When editing the device template, select the Enable/Disable switch at the top of the page to disable the threshold. (Select again to re-enable.)

Disabling a threshold check in a device template still causes the check to be added to all devices affected by the template. The check is simply added disabled.

Disabling a threshold check prevents that specific check from monitoring its statistic, but the statistic is still polled for values and those values are still recorded.

Disable a threshold check for multiple devices

Disabling a threshold check prevents that specific check from monitoring its statistic, but the statistic is still polled for values and those values are still recorded.

  1. Go to the OmniCenter main menu and select Administration > Change Devices > Turn On/Off Thresholds to open the Deactivate Thresholds page.
  2. Select a functional group that contains the devices you would like to affect.
  3. Place a check next to the specific devices on which you would like to disable specific threshold checks.
  4. Select Select Device.
  5. Place a check next to the specific threshold checks that you would like to disable.
  6. Select Update Thresholds.

Disable a threshold check on a specific device

Follow the steps for Edit a threshold check on a single device above. When editing the threshold check, select the Enable/Disable switch at the top of the page to disable the threshold. (Select again to re-enable.)

Disabling a threshold check prevents that specific check from monitoring its statistic, but the statistic is still polled for values and those values are still recorded.

Turn off all threshold checks for multiple devices

Turning off threshold checks for a device prevents OmniCenter from polling any new values for any statistics from the device. Any existing history is preserved, however.

Turning off threshold checks for a device, and then later turning them back on, will produce a gap in the history of the statistic values of the length of time that polling was turned off.

  1. Go to the OmniCenter main menu and select Administration > Change Devices > Turn Polling & Monitoring On/Off to open the Device Polling & Monitoring page.
  2. Select a functional group that contains the devices you would like to affect.
  3. Place a check next to the specific devices on which you would like to turn off threshold checks.
  4. Select Turn Polling OFF. (Select Turn Polling ON to turn threshold checks back on for those devices.)

Turn off all threshold checks for a specific device

Turning off threshold checks for a device prevents OmniCenter from polling any new values for any statistics from the device. Any existing history is preserved, however.

Turning off threshold checks for a device, and then later turning them back on, will produce a gap in the history of the statistic values of the length of time that polling was turned off.

  1. Locate the device for which you would like to turn off threshold checks and select it to open its device dashboard.
    • Specific devices can be located in OmniCenter by either drilling in to a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  2. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  3. On the Main tab locate the Poll Device panel.
  4. Select the toggle to switch it to Disabled. (Select again to reactivate.)
  5. Select Apply Changes.

Best Practices

Device Templates

It is highly recommended that threshold checks be added to devices and managed through device templates, and not directly on devices. Even in unique device-specific circumstances, threshold checks for that device can still be managed using a device template that includes the desired threshold checks and is assigned directly to the device.

The only circumstance under which a threshold check should ever be added to a device directly is when that device has had its device template functionality turned off completely.

Static Threshold Check Time Periods

Since Netreo polls and records a statistic’s value every five minutes, selecting a TIME PERIOD of 5 Min when configuring a static threshold check means that it would only take one poll that exceeded the warning or critical threshold values to trigger a change in state. However, selecting a period of 15 Min would require three consecutive polls (with an average value exceeding the warning or critical threshold values) to trigger a state change. This field is an important adjustment for reducing false alarms.

Divide the TIME PERIOD value configured in the check by 5 to figure out the number of recent samples that will be averaged before being compared to the threshold values.
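The arithmetic can be sketched in Python as follows, assuming the five-minute polling interval stated above and a TIME PERIOD that is a whole multiple of it. The function names and sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence

POLL_INTERVAL_MINUTES = 5  # polling interval stated in this article


def samples_averaged(time_period_minutes: int) -> int:
    """Number of recent polls averaged for a given TIME PERIOD."""
    if time_period_minutes <= 0 or time_period_minutes % POLL_INTERVAL_MINUTES:
        raise ValueError("TIME PERIOD must be a positive multiple of 5 minutes")
    return time_period_minutes // POLL_INTERVAL_MINUTES


def averaged_value(polled_values: Sequence[float], time_period_minutes: int) -> float:
    """Average the most recent polls covered by the TIME PERIOD."""
    count = samples_averaged(time_period_minutes)
    return mean(polled_values[-count:])


polls = [40, 40, 100]  # oldest first; the last poll is a momentary spike

print(samples_averaged(15))       # 3
print(averaged_value(polls, 5))   # 100
print(averaged_value(polls, 15))  # 60
```

Against a high critical threshold of 90, the 5 Min setting would raise an alarm on the spike, while the 15 Min setting would not.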

Updated on March 27, 2020
