Service Checks

Description

A service check is an application-level check that allows you to monitor the status of an individual process or resource (application, interface, etc.) on a managed device according to a configurable schedule (default is 3 minutes).

A service check queries its target for a response code according to its configured schedule. If the target returns a failure code (or no code) the check displays as failed in OmniCenter dashboards and (typically) sends an alert notification to users. A service check can even be configured to run commands on the device experiencing the problem (such as reboot or restart service commands).

By default, OmniCenter automatically adds several different types of service check to every managed device to provide basic monitoring services. However, there are many more service checks that may be added to devices to suit your specific monitoring needs.

Additional service checks are available from the Netreo cloud libraries.

Service checks are categorized into the following basic types.

  • Cloud Checks – For monitoring cloud-based resources.
  • Firewall Checks – For monitoring firewall resources.
  • Generic Passive Checks – For monitoring resources via an indirect source.
  • HP Insight Manager Agent Checks – For monitoring certain hardware systems.
  • Interface Checks – For monitoring interfaces.
  • Network Application Checks – For monitoring network application resources.
  • Network Connectivity Checks – For monitoring various types of connectivity.
  • OmniCenter Checks – For monitoring elements of OmniCenter itself.
  • System Checks – For monitoring core processes.
  • Web Checks – For monitoring web-based resources.

The Host Availability Check

By default, all devices automatically get a “Ping this host” Network Connectivity service check added to them for host availability monitoring. Whenever you see a reference to a host availability check, this is the check to which it is referring. This service check is added by the “Default” device template.

Service Check States

The Tactical Overview dashboard widget is useful for displaying service check status for devices and groups of devices.

Service checks always display one of the following states when viewed in dashboard widgets.

  • OK (Green) – The check query has returned a success code.
  • WARNING (Yellow) – Very rare. The check query has returned a warning code. For service checks, warning codes are treated as failure codes.
  • CRITICAL (Red) – The check query has returned a failure code or no code.
  • ACKNOWLEDGED (Blue) – Indicates a service check in a CRITICAL state that has been acknowledged by a user. This state is technically an incident state, not a check state, but its display in the dashboards helps to distinguish between problems that are new and problems that are already being addressed.
  • UNKNOWN (Orange) – The check query has returned a value that the check cannot understand. This is likely due to a configuration error in the check.

The Tactical Overview dashboard widget showing aggregated statuses of different checks for different device groups. The SERVICES column for each row shows the total number of service checks in each state for that group. The HOSTS column reflects the statuses of the host availability service checks for the devices in that group.
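
As a rough illustration of the per-group aggregation described above, the following Python sketch counts checks by state. It is not OmniCenter code: the CheckState enum, the summarize_services function and the sample data are all hypothetical names introduced for this example.

```python
from collections import Counter
from enum import Enum


class CheckState(Enum):
    """The five display states, with the colors used in the dashboards."""
    OK = "green"
    WARNING = "yellow"
    CRITICAL = "red"
    ACKNOWLEDGED = "blue"
    UNKNOWN = "orange"


def summarize_services(states):
    """Count how many checks are in each state, similar to a SERVICES column."""
    counts = Counter(states)
    # Report every state, including those with a zero count.
    return {state.name: counts.get(state, 0) for state in CheckState}


# Hypothetical device group with five service checks.
group_states = [
    CheckState.OK,
    CheckState.OK,
    CheckState.CRITICAL,
    CheckState.ACKNOWLEDGED,
    CheckState.OK,
]
print(summarize_services(group_states))
```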

How Service Checks Work

Service checks are either active or passive.

Active service checks create their own processes in memory while they do their work and follow their own timing schedule for their query.

Passive service checks wait for some other process to update them (usually an active service check or some other active process). This means that passive service checks update according to the schedule of whatever process it is that updates them.

Active Service Checks

Active service checks query their specific process or resource for a response. The response code returned to the service check determines the state of that check.

If the response is a success code, the check remains in the OK state and continues to run its query according to its configured schedule.

If the response is a failure code, the service check enters what is called a soft CRITICAL state. While in this soft state the service check retries its query several times, typically at a faster frequency. When it reaches a set number of failure responses (default is 3), the check enters what is called a hard CRITICAL state and generates an alarm (new alarms open an incident in OmniCenter). The same applies to warning codes and the WARNING state.

A service check in a hard state continues to retry its query at the increased frequency. If at any time it again receives a success code it immediately recovers to the OK state, clears its alarm and signals any opened incident that it has recovered.

The reason service checks (typically) retry their query several times is to prevent them from immediately generating an alarm and alerting users at the slightest temporary glitch. The retry schedule and the number of failures required to generate an alarm are adjustable in the configuration options of each service check.
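
The soft and hard state behavior of an active service check can be sketched as a small state machine. This is only an illustrative model built from the description above, not OmniCenter's implementation; the ActiveCheck class and its field names are hypothetical, and the defaults (3-minute interval, 1-minute retry, 3 failures) are taken from this article.

```python
from dataclasses import dataclass


@dataclass
class ActiveCheck:
    check_interval: int = 3          # minutes between queries while OK
    retry_interval: int = 1          # minutes between queries after a failure
    failures_before_alert: int = 3   # failures needed to reach the hard state
    failures: int = 0
    alarm_open: bool = False

    def record(self, success: bool) -> str:
        """Apply one query result and return the resulting state label."""
        if success:
            # Any success recovers the check immediately and clears the alarm.
            self.failures = 0
            self.alarm_open = False
            return "OK"
        self.failures += 1
        if self.failures >= self.failures_before_alert:
            # Hard state: the alarm is generated and stays open until recovery.
            self.alarm_open = True
            return "HARD CRITICAL"
        return "SOFT CRITICAL"

    def next_delay(self) -> int:
        """Minutes until the next query: shorter while failures are outstanding."""
        return self.retry_interval if self.failures else self.check_interval


check = ActiveCheck()
history = [check.record(ok) for ok in (True, False, False, False, True)]
print(history)
```

With the default settings, the check passes through two soft CRITICAL results before the third failure opens the alarm, and a single success returns it to OK.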

Passive Service Checks

Passive service checks do nothing until they receive a response code from the active process that updates them. This means that they remain in their current state (whatever that state may be) until the active process updates them with a success or failure code.

If a passive service check is updated with a failure code, it increments its exception counter and enters the soft CRITICAL state (same as active service checks). When a set number of exceptions occurs (default is 3), the check enters the hard CRITICAL state (again, same as active service checks) and generates an alarm (new alarms open an incident in OmniCenter).

If a passive service check is updated with a success code, it immediately recovers to the OK state, resets its exception counter to zero, clears its alarm and signals any opened incident that it has recovered.

Like active service checks, the reason passive service checks require a number of exceptions is to prevent them from immediately generating an alarm and alerting users at the slightest temporary glitch. The number of exceptions required to generate an alarm is adjustable in the configuration options of each passive check.

Service Check Alarms

It is important to remember that though a service check may be showing as CRITICAL in the dashboards, an alarm is not generated (and thus an incident is not opened) until the check reaches the hard CRITICAL state.

A service check alarm always attempts to open a new incident in Netreo (although this may be prevented by the check’s host checking logic or Netreo’s incident management system for housekeeping purposes, such as if an incident already exists for the current issue).

Host Checking

The term host checking refers to the out-of-band execution of the host availability service check of a particular host for the purposes of incident management and root cause analysis.

Host checking is automatically triggered for a host when any of its assigned service checks has become critical and generated an alarm. (Host checking applies to service checks only. Due to their nature, no other monitoring check types use host checking logic.)

When a service check on any given host fails, that failure could potentially be caused by the host itself being down and not because of any actual problems with the monitored service. If the host does turn out to be down, sending alert notifications for every down service on that host in addition to the (far more important) alert about the downed host itself would be redundant, and would result in an unnecessary flood of alerts for what is essentially a single issue. By suppressing the alerts for the (obviously) downed services, host checking aids in root cause discovery.

If host checking determines that the host of a failed service is down, it then also runs the host availability service check of that host’s immediate parent on the network. Something such as a failing switch could be what’s responsible for the initial host being unavailable, and thus, be the actual problem. If the parent of the host is also down, the alert for the initial host being down is also suppressed and host checking then moves on to the next higher parent in the network hierarchy.

Host checking continues to run each higher parent’s host availability service check (suppressing alert notifications for every problem caused by a higher downed parent) until an available host is found. The downed host that is the child of that available host is the root cause of the problem. The alert notification for that downed host is the only one that is sent, and its host availability service check alarm is labeled as the primary alarm in the incident opened as a result of this whole process.

Suppressing redundant alerts is an excellent way to zero in on an issue’s root cause, but at the same time you don’t want to completely ignore the other alarms caused by the primary alarm. OmniCenter’s incident manager therefore bundles all of the lesser alarms into the opened incident as related alarms for reference purposes. This allows you to see exactly what is being affected by the primary alarm.

Of course, if the host of a failed service check is available, the resulting alarm is permitted to open an incident for the service as expected and send an appropriate alert notification.
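
The parent walk described in this section can be sketched as follows. This is a simplified model under stated assumptions, not OmniCenter's actual logic: the find_root_cause function, the parent_of mapping and the is_up callback are hypothetical, and the sketch assumes that a downed host with no parent is treated as the root cause.

```python
def find_root_cause(host, parent_of, is_up):
    """Walk up the hierarchy from a host whose service check failed.

    Returns (root_cause, suppressed). root_cause is the downed host whose
    parent is available, or None if the starting host is itself available.
    suppressed lists the downed hosts below the root cause whose alerts
    would be held back.
    """
    if is_up(host):
        # The host is reachable, so the service failure is the real problem.
        return None, []
    suppressed = []
    current = host
    while True:
        parent = parent_of.get(current)
        if parent is None or is_up(parent):
            # First downed host with an available parent (assumption: a
            # downed host with no parent is also treated as the root cause).
            return current, suppressed
        suppressed.append(current)
        current = parent


# Hypothetical topology: web01 -> switch-a -> core-router
parent_of = {"web01": "switch-a", "switch-a": "core-router"}
down = {"web01", "switch-a"}
root, suppressed = find_root_cause("web01", parent_of, lambda h: h not in down)
print(root, suppressed)
```

In this example the failing switch is identified as the root cause, and the alert for the host behind it is suppressed.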

Cloud Library Integration

Service checks are fully integrated with the Netreo cloud library. Service checks designed in OmniCenter may be uploaded to the Netreo cloud library and made available to the public after being approved by Netreo. New or updated service checks designed by Netreo may also occasionally be available for download.

If you are using the OmniCenter Overview product, service checks are fully integrated with the Overview cloud library. Service checks that are built in Overview (or an Overview client OmniCenter) may be uploaded to the Overview cloud library for administrator approval and then pushed to all client OmniCenters to enforce consistency in alert notifications.

Service Check Management

Only users with Admin access level or higher may manage service checks.

Add a service check to a device template

  1. Go to the OmniCenter main menu and select Administration > Templates to open the Device Templates Administration page.
  2. Locate the device template to which you would like to add a service check and select its edit icon in the ACTIONS column, or create a new device template.
  3. In the Service Checks section of the Template Components panel select the add service check button (+).
    1. Select the category containing the desired service check.
    2. Select the service check to be added.
    3. Select Select Service Check.
  4. If required by the check, in the check parameter fields enter appropriate values for check execution. These fields vary widely depending on the service check selected. Please contact a Netreo support engineer if you need assistance.
  5. In the DESCRIPTION field (if editable) enter a descriptive name for the check.
    • It is recommended that check names be unique. See Best Practices below.
  6. In the CONFIGURATION OPTIONS area configure the alert timing options.
    • If the check is a passive service check, in the ALERT AFTER field select the number of failures the check is allowed to experience before sending an alert notification (default is 3).
    • If the check is an active service check, in the ALERT AFTER field select one of the preset timers that determine how long OmniCenter will wait after the first detection of a problem by this check before sending an alert notification. (The default value of 5 Minutes is recommended.) Or select Custom to use custom alert timing.
      • To use custom alert timing:
        1. Select Custom in the ALERT AFTER field to display the ADVANCED OPTIONS panel.
        2. In the CHECK INTERVAL field enter the number of minutes to wait between execution of the service check under normal conditions (default is 3).
        3. In the ON FAILURE, RETRY EVERY field enter the number of minutes to wait between execution of the service check after a failure (default is 1).
        4. In the TOTAL FAILURES BEFORE ALERT field enter the total number of failures the check is allowed to experience before sending an alert notification (default is 3).
  7. In the RENOTIFICATION INTERVAL field enter the number of minutes for OmniCenter to wait before sending another alert notification if the problem is not acknowledged by a user.
    • Alert notifications are sent to the action groups in the ACTION GROUP field.
    • The default value of 1440 minutes (24 hours) is recommended to minimize alert noise.
    • Setting a value of 0 (zero) will disable renotifications.
  8. In the ESCALATE AT field enter the number of alert notifications after the first for OmniCenter to wait before sending alert notifications to the action groups in the ESCALATION GROUP field, as well as to the groups in the ACTION GROUP field.
    • The default value of 1 means that a total of 2 alerts must be sent before escalation groups start receiving them.
  9. In the ACTION GROUP field select the action group(s) to receive alert notifications before escalation.
  10. In the ESCALATION GROUP field select the action group(s) to receive alert notifications after escalation.
  11. In the STATISTICAL GROUP field select the type that has the greatest relevance to the check. This field determines which statistical calculations this check contributes to for reports.
  12. In the NOTES field enter any notes that you would like included in an alert notification about this check.
  13. Select Add to Template.

If you’ve added a service check to a device template that is already applied to any devices, navigate back to the Device Templates Administration page using the arrow icon at the top left of the page and reapply your device templates.

Add a service check to a single device

  1. Locate the device to which you would like to add a service check and select it to open its device dashboard.
    • Specific devices can be located in OmniCenter by either drilling in to a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  2. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  3. Select the Service tab to view the service check management area.
  4. From the Actions pull-down menu select Add Service Check.
    1. Select the category containing the desired service check.
    2. Select the service check to be added.
    3. Select Add Command.
  5. If required by the check, in the check parameter fields enter appropriate values for check execution. These fields vary widely depending on the service check selected. Please contact a Netreo support engineer if you need assistance.
  6. In the DESCRIPTION field (if editable) enter a descriptive name for the check.
    • It is recommended that check names be unique. See Best Practices below.
  7. In the CONFIGURATION OPTIONS area configure the alert timing options.
    • If the check is a passive service check, in the ALERT AFTER field select the number of failures the check is allowed to experience before sending an alert notification (default is 3).
    • If the check is an active service check, in the ALERT AFTER field select one of the preset timers that determine how long OmniCenter will wait after the first detection of a problem by this check before sending an alert notification. (The default value of 5 Minutes is recommended.) Or select Custom to use custom alert timing.
      • To use custom alert timing:
        1. Select Custom in the ALERT AFTER field to display the ADVANCED OPTIONS panel.
        2. In the CHECK INTERVAL field enter the number of minutes to wait between execution of the service check under normal conditions (default is 3).
        3. In the ON FAILURE, RETRY EVERY field enter the number of minutes to wait between execution of the service check after a failure (default is 1).
        4. In the TOTAL FAILURES BEFORE ALERT field enter the total number of failures the check is allowed to experience before sending an alert notification (default is 3).
  8. In the RENOTIFICATION INTERVAL field enter the number of minutes for OmniCenter to wait before sending another alert notification if the problem is not acknowledged by a user.
    • Alert notifications are sent to the action groups in the ACTION GROUP field.
    • The default value of 1440 minutes (24 hours) is recommended to minimize alert noise.
    • Setting a value of 0 (zero) will disable renotifications.
  9. In the ESCALATE AT field enter the number of alert notifications after the first for OmniCenter to wait before sending alert notifications to the action groups in the ESCALATION GROUP field, as well as to the groups in the ACTION GROUP field.
    • The default value of 1 means that a total of 2 alerts must be sent before escalation groups start receiving them.
  10. In the ACTION GROUP field select the action group(s) to receive alert notifications before escalation.
    • Action groups may also run commands on the affected device. See the article Action Group for more information about action groups and their uses.
  11. In the ESCALATION GROUP field select the action group(s) to receive alert notifications after escalation.
    • Action groups may also run commands on the affected device. See the article Action Group for more information about action groups and their uses.
  12. In the STATISTICAL GROUP field select the type that has the greatest relevance to the check. This field determines which statistical calculations this check contributes to for reports.
  13. In the NOTES field enter any notes that you would like included in an alert notification about this check.
  14. Select Create Service Check.

Your new service check is created and added to the device. It may take a few minutes before you start seeing results.

Device Template Overrides

If a device template being applied to a device includes a service check with an identical description to a directly added service check, the template check settings will override the check settings on the device.

Add a WMI service check to a single device

Only available for Windows-based devices.

  1. Locate the device to which you would like to add a service check and select it to open its device dashboard.
    • Specific devices can be located in Netreo by either drilling in to a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  2. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  3. Select the Service tab to view the service check management area.
  4. From the Actions pull-down menu select WMI Service Check Wizard.
    • Netreo will query the Windows device using WMI. You may proceed after a notification is shown that it has successfully retrieved the server name.
  5. In the ACTION GROUP field select the action group(s) to receive alert notifications before escalation (multiple selection is allowed).
    • Action groups may also run commands on the affected device. See the article Action Group for more information about action groups and their uses.
  6. In the ESCALATION GROUP field select the action group(s) to receive alert notifications after escalation (multiple selection is allowed).
    • Action groups may also run commands on the affected device. See the article Action Group for more information about action groups and their uses.
  7. A list of currently running services is displayed for the device. (Service checks cannot be added for services that are not active when the wizard is run.)
    • Any service that already has a service check monitoring it is highlighted in green and cannot be selected.
    • All services that are auto-started are automatically selected.
  8. Select the services that you wish to monitor (multiple selection is allowed).
  9. Select Add WMI Service Checks.
  10. The service check is added and appears in the Service Checks table.

Edit a service check in a device template

  1. Go to the OmniCenter main menu and select Administration > Templates to open the Device Templates Administration page.
  2. Locate the device template which contains the service check you would like to edit and select its edit icon in the ACTIONS column.
  3. In the Service Checks section of the Template Components panel locate the service check you would like to edit and select its edit icon in the ACTIONS column.
  4. Edit the service check as desired.
  5. Select Edit Template to save your settings.

If you’ve edited a service check in a device template that is already applied to any devices, navigate back to the Device Templates Administration page using the arrow icon at the top left of the page and reapply your device templates.

Edit a service check on a single device

  1. Locate the device on which you would like to edit a service check and select it to open its device dashboard.
    • Specific devices can be located in OmniCenter by either drilling in to a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  2. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  3. Select the Service tab to view the service check management area.
  4. Locate the service check you would like to edit and select its edit icon in the ACTIONS column.
    • If the only icon present is a lock icon, it means that this service check is being managed by one or more device templates. Select the lock icon to open the device template controlling the current settings on the device. See “Edit a service check in a device template” above for more information.
  5. Edit the service check as desired.
  6. Select Commit Changes to save your settings.

Turn off all service checks for multiple devices

  1. Go to the OmniCenter main menu and select Administration > Change Devices > Turn Polling & Monitoring On/Off to open the Device Polling & Monitoring page.
  2. Select a functional group that contains the devices you would like to affect.
  3. Place a check next to the specific devices on which you would like to turn off service checks.
  4. Select Turn Monitoring OFF. (Select Turn Monitoring ON to turn service checks back on for those devices.)

Be aware that turning service checks off for a device also disables host availability monitoring for that device.

Turn off all service checks for a single device

  1. Locate the device for which you would like to turn off service checks and select it to open its device dashboard.
    • Specific devices can be located in OmniCenter by either drilling in to a Tactical Overview dashboard widget or searching for the device by name using the search feature at the top of the main menu.
  2. Select the gear icon in the top right of the dashboard to open the dashboard administrative view.
  3. On the Main tab locate the Host & Service Monitoring panel.
  4. Select the toggle to switch it to Disabled. (Select again to reactivate.)
  5. Select Apply Changes.

Be aware that turning service checks off for a device also disables host availability monitoring for that device.

See the administrative view section of the Device Dashboard article for more information about turning off service checks.

Upload a service check to a cloud library

  1. Go to the OmniCenter main menu and select Administration > Change Devices > Manage Service Checks to open the Service Checks Administration page.
  2. Locate the service check that you wish to upload in the service check table.
  3. Select the upload icon for that service check.
  4. The service check is uploaded to the connected cloud library.

Note: If your OmniCenter is an Overview (or a client of an Overview) then service checks are uploaded to that Overview’s cloud library. Otherwise, they are uploaded to the Netreo cloud library, where they are subject to approval by Netreo before becoming available for download to the public.

Download a service check from a cloud library

  1. Go to the OmniCenter main menu and select Administration > Change Devices > Manage Service Checks to open the Service Checks Administration page.
  2. Select Service Checks Cloud Library to open the Service Check Cloud Library page.
  3. Locate the service check that you wish to download in the Service Checks Library panel.
  4. Select the download icon for that service check.
  5. The service check is downloaded from the connected cloud library and becomes available in the service check table.

This operation is primarily for downloading service checks from the Netreo cloud libraries, as OmniCenter Overview clients generally will have uploaded service checks pushed to them from the Overview.

Approve an uploaded service check in Overview

Note: This operation applies to OmniCenter Overview deployments only.

  1. Go to the OmniCenter main menu and select Administration > Change Devices > Manage Service Checks to open the Service Checks Administration page.
  2. Select Service Checks Cloud Library to open the Service Check Cloud Library page.
  3. Locate the service check that you wish to approve in the Service Checks Library panel.
  4. Select the edit icon for that service check.
  5. Optional: Enter a note for this check and select Save Note. The note is visible only in the edit screen for the check.
  6. Select Approve.
  7. The service check shows as Approved in the Service Checks Library panel and may now be downloaded or pushed to client OmniCenters.

Push a service check to clients of an Overview

Note: This operation applies to OmniCenter Overview deployments only.

  1. Go to the OmniCenter main menu and select Administration > Change Devices > Manage Service Checks to open the Service Checks Administration page.
  2. Select Service Checks Cloud Library to open the Service Check Cloud Library page.
  3. Locate the service check that you wish to push to your client OmniCenters in the Service Checks Library panel.
  4. Select the export icon for that service check.
  5. The service check is pushed from the Overview cloud library to all clients and becomes available in the service check table of each client.

Best Practices

Device Templates

It is highly recommended that service checks be added to devices and managed through device templates, and not directly on devices. Even in unique device-specific circumstances, service checks for that device can still be managed using a device template that includes the desired service checks and is assigned directly to the device.

The reason for this is that any service check added directly to a device will be overridden by any device template applied to that device that includes a service check with an identical DESCRIPTION field. Device template settings always override settings made directly on a device.

The only circumstance under which a service check should ever be added to a device directly is when that device has had its device template functionality turned off completely.

Service Check Names

Two or more service checks on a single device may not have the same DESCRIPTION field value. This value acts as the service check name in dashboards and alert notifications, so be sure to provide unique names for your service checks when creating them.

Best practice here is to enter a descriptive name that indicates what the check is doing along with any specifics of what it’s doing it to. As an extremely basic example, suppose you have two TCP port checks being added to the same device, one checking port 80 and the other checking port 110. Best practice would be to name the first check “TCP port 80 check” and the other check “TCP port 110 check.” This way each check will be clearly identifiable in everything from the dashboards to alert notifications.

Unique service check names are particularly important if you intend to override service check settings using device templates. A service check in a device template will only override another service check if the DESCRIPTION field matches exactly. So, be aware of this when configuring service checks in your device templates.
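
The exact-match rule can be pictured as a dictionary merge keyed on the DESCRIPTION value. This is an illustrative sketch only; the apply_template function and the alert_after_failures setting are hypothetical names, and the sketch assumes that a template check with no matching description is simply added to the device.

```python
def apply_template(device_checks, template_checks):
    """Return the device's checks after a template is applied.

    Both arguments map a DESCRIPTION value to a settings dict. A template
    check replaces a device check only when the descriptions match exactly.
    """
    merged = dict(device_checks)
    merged.update(template_checks)
    return merged


device_checks = {
    "TCP port 80 check": {"alert_after_failures": 3},
    "TCP port 110 check": {"alert_after_failures": 3},
}
template_checks = {
    # Exact match: overrides the device's own settings.
    "TCP port 80 check": {"alert_after_failures": 2},
    # Not an exact match: added alongside the device's check instead.
    "TCP port 110": {"alert_after_failures": 5},
}
result = apply_template(device_checks, template_checks)
for description, settings in result.items():
    print(description, settings)
```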

Custom Alert Timing

The following only applies to active service checks.

Setting ALERT AFTER
Using the default 5 Minutes selection OmniCenter will execute the service check query every 3 minutes until a failure is detected. Once a failure is detected, it will execute the query two more times at 1-minute intervals, leading to a worst-case alert notification response time of five minutes. Although you certainly may use the Custom selection for this field, it’s highly recommended that you do not do so without a very specific reason. The selection of choices available for the ALERT AFTER field should be adequate for most situations.

Setting CHECK INTERVAL
This field defines how often (in minutes) this service check will be executed under normal circumstances. After every successful query, OmniCenter will wait this interval before it executes the query again. This field has a significant performance impact: if you’re executing 10,000 service checks at 1-minute intervals, OmniCenter will have to execute roughly 167 checks per second, adding significant network traffic and system load. Use common sense and try to select a reasonable interval. OmniCenter will try to spread the queries out so that they don’t all run at the same time, but you can still overwhelm your network by overdoing the number of configured service checks.

The lowest that you’ll generally ever want to set this field to is 3 minutes (especially if the system is very heavily utilized). You may go lower, but the more frequently you execute the query, the heavier the load on the system and the more network overhead required to perform it.

Caution

It is recommended to limit the number of service checks configured with CHECK INTERVAL settings below 3 minutes, especially in large environments. A few 1-minute CHECK INTERVAL settings on a moderately loaded server are no big deal, but configuring 10,000 service checks at 1-minute intervals is going to create massive load and network traffic. That’s 167 queries per second!
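
The load figure quoted above is simple arithmetic, and it can be useful to run the same calculation for your own environment. The sketch below is a back-of-the-envelope estimate only; the queries_per_second function is a hypothetical helper, and it assumes that queries are spread perfectly evenly and that every check is on its normal CHECK INTERVAL schedule.

```python
def queries_per_second(check_count, check_interval_minutes):
    """Average query rate if all checks run once per interval, evenly spread."""
    return check_count / (check_interval_minutes * 60)


# 10,000 checks at a 1-minute interval, as in the caution above.
print(round(queries_per_second(10_000, 1)))
# The same 10,000 checks at the default 3-minute interval.
print(round(queries_per_second(10_000, 3)))
```

Moving the same number of checks from a 1-minute to a 3-minute interval cuts the average query rate to one third.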

Setting ON FAILURE, RETRY EVERY
This field defines the amount of time (in minutes) OmniCenter will wait to retry the query after an initial failure (during the soft state). This period should generally be considerably shorter than the CHECK INTERVAL period. OmniCenter will continue to retry the query at this interval even after an alert has been sent. If any of the retry queries return a success code the check will stop retrying, clear any current alarm and return to the normal CHECK INTERVAL schedule.

Setting TOTAL FAILURES BEFORE ALERT
This field defines the total number of failed queries required before an alert notification is sent. When the total number of failed queries (initial failure plus retries) reaches this number, the service check enters a hard state and an alert notification is sent. The service check will continue to query according to the ON FAILURE, RETRY EVERY timer value. It is recommended that you do not set this option to 1, as that will generate a significantly higher number of false alarms.

Common Values
If you need to be alerted to an outage immediately, you’ll probably want to go with the following custom settings.

  • CHECK INTERVAL = 3
  • ON FAILURE, RETRY EVERY = 1
  • TOTAL FAILURES BEFORE ALERT = 1

However, such a configuration means no soft state. This means that OmniCenter won’t do any verification to ensure that a problem is real before it sends an alert notification. Users have done this in the past and then complained that OmniCenter was spamming them with alert notifications. So be careful.

Another common configuration is as follows.

  • CHECK INTERVAL = 2
  • ON FAILURE, RETRY EVERY = 1
  • TOTAL FAILURES BEFORE ALERT = 2

If you do the math for such a configuration, the maximum possible time between a service outage and an alert is 3 minutes (up to 2 minutes until the first failed query, plus one retry 1 minute later). It works well, but remember the potential load problems of setting the CHECK INTERVAL below 3 minutes.
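
The worst-case figures in this section can be checked with a short calculation. The formula below is inferred from the descriptions of the three custom timing fields and is only an approximation; the worst_case_alert_minutes function is a hypothetical helper, and it ignores the time each query takes to run or time out.

```python
def worst_case_alert_minutes(check_interval, retry_interval, total_failures):
    """Longest time from the start of an outage to the alert, in minutes.

    Worst case: the outage begins just after a successful query, so the first
    failure is seen one full check interval later. Each remaining failure
    needed to reach the hard state takes one retry interval.
    """
    return check_interval + (total_failures - 1) * retry_interval


print(worst_case_alert_minutes(3, 1, 3))  # default "5 Minutes" preset
print(worst_case_alert_minutes(2, 1, 2))  # the 2 / 1 / 2 configuration
print(worst_case_alert_minutes(3, 1, 1))  # no soft state
```

Note that with TOTAL FAILURES BEFORE ALERT set to 1 the worst case is still a full CHECK INTERVAL, because the outage may begin right after a successful query.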

It is always recommended to avoid a TOTAL FAILURES BEFORE ALERT setting of 1, as any little hiccup on the network (like a lost ping packet) will immediately send an alert notification, which is probably not what you want if you’re looking to minimize false alarms.

Updated on May 22, 2020
