When a Netreo check assigned to a managed device enters a state that causes an alarm, Netreo needs a way to centralize and monitor all of the information surrounding that event. This is what an incident is for. When a failed check generates an alarm, Netreo opens a new incident to act as a record of the event, collecting in one place all of the data and history related to it. This information remains archived in the Netreo logs for three years before it is deleted. Any incident from within that period can be looked up in Netreo using its unique ID number.
An incident exists in one of four states: OPEN, ACKNOWLEDGED, ALARMS CLEARED, or CLOSED.
An incident in the OPEN state indicates that the alarm that opened it is ongoing and has not been addressed. When an incident is first opened, it immediately attempts to run all of the actions in the action groups assigned to the respective check in its alarm configuration. The device for which the incident has been opened continues to show the current alarm state in any dashboards it is included in, and the incident continues to run relevant actions and perform escalation in accordance with the settings in the alarm configuration.
The ACKNOWLEDGED state indicates that although an alarm is ongoing, someone is aware of the problem and is currently working on it. Once an incident has been acknowledged, it is this state that is displayed in the dashboards, instead of the current alarm state. ACKNOWLEDGED incidents also never escalate.
If the condition that caused an alarm clears by itself and the generating check returns to an OK state, the connected incident enters the ALARMS CLEARED state. It remains in that state for a specified period of time (the default is 5 minutes, but this is configurable) until Netreo is sure that the check which generated the original alarm is in a stable OK state. At that point the connected incident is automatically CLOSED, although it is still archived for historical and recording purposes. The ALARMS CLEARED period helps prevent a flapping alarm condition from creating additional new incidents. If the same alarm recurs while the incident is in the ALARMS CLEARED state, the incident returns to whichever state it was in before changing to ALARMS CLEARED: either OPEN or ACKNOWLEDGED.
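The state transitions described above can be sketched as a small state machine. This is only an illustrative model, not Netreo's implementation: the class, method, and field names (Incident, check_ok, check_alarm, tick, hold_seconds) are hypothetical, and time is passed in explicitly as seconds so the hold period is easy to follow.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    OPEN = "OPEN"
    ACKNOWLEDGED = "ACKNOWLEDGED"
    ALARMS_CLEARED = "ALARMS CLEARED"
    CLOSED = "CLOSED"


@dataclass
class Incident:
    """Hypothetical model of one incident's lifecycle."""
    state: State = State.OPEN
    previous_state: State = State.OPEN   # state held before ALARMS CLEARED
    cleared_at: Optional[float] = None   # when the check returned to OK
    hold_seconds: float = 300.0          # 5-minute default, configurable

    def acknowledge(self) -> None:
        # Someone has taken ownership of an ongoing alarm.
        if self.state is State.OPEN:
            self.state = State.ACKNOWLEDGED

    def check_ok(self, now: float) -> None:
        # The generating check returned to OK: start the hold period.
        if self.state in (State.OPEN, State.ACKNOWLEDGED):
            self.previous_state = self.state
            self.state = State.ALARMS_CLEARED
            self.cleared_at = now

    def check_alarm(self) -> None:
        # The same alarm recurred during the hold period (flapping):
        # go back to the state held before ALARMS CLEARED.
        if self.state is State.ALARMS_CLEARED:
            self.state = self.previous_state
            self.cleared_at = None

    def tick(self, now: float) -> None:
        # Close the incident once the check has stayed OK long enough.
        if (self.state is State.ALARMS_CLEARED
                and self.cleared_at is not None
                and now - self.cleared_at >= self.hold_seconds):
            self.state = State.CLOSED
```

In this sketch, an acknowledged incident whose alarm clears and then recurs inside the hold period returns to ACKNOWLEDGED rather than opening a new incident, which is the flapping protection the text describes.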
If the check whose alarm opened an incident had action groups assigned to it when the incident was opened, those action groups are executed every time the incident changes state (including CLOSED). The original actions from those groups also become locked to that incident, and any changes to their methods only affect future incidents (see action groups for more information). The Incident View dashboard for an incident also provides a means to run arbitrary action groups on it manually (see actions for more information). It is important to note that any “active response” methods contained in any executed actions are only run when the incident first opens or if the group is run manually.
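The action behavior above can be modeled roughly as follows. This is a simplified sketch under stated assumptions, not Netreo's code: every name here (Action, IncidentActions, active_response, run_manually) is hypothetical, and an "active response" method is reduced to a boolean flag on the action. The incident takes a copy of the check's actions when it opens, so later edits affect only future incidents.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    name: str
    active_response: bool = False  # hypothetical flag for an "active response" method


@dataclass
class IncidentActions:
    actions: List[Action]                         # snapshot locked at open time
    log: List[str] = field(default_factory=list)  # record of what was run

    @classmethod
    def open(cls, check_actions: List[Action]) -> "IncidentActions":
        # Copy the actions so later edits to the check do not reach this incident.
        inc = cls(actions=copy.deepcopy(check_actions))
        inc._run(inc.actions, "OPEN", allow_active=True)
        return inc

    def change_state(self, new_state: str) -> None:
        # Locked actions run on every state change, without active responses.
        self._run(self.actions, new_state, allow_active=False)

    def run_manually(self, group: List[Action]) -> None:
        # A manually run group may include active responses.
        self._run(group, "MANUAL", allow_active=True)

    def _run(self, group: List[Action], trigger: str, allow_active: bool) -> None:
        for action in group:
            if action.active_response and not allow_active:
                continue  # active responses only fire on open or manual run
            self.log.append(f"{trigger}:{action.name}")
```

For example, opening an incident with an email action and an active-response action logs both on open, but only the email action when the incident is later acknowledged or closed.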
To keep incidents as efficient as possible, Netreo includes a set of incident management tools that allow multiple alarms to be correlated within a single incident. If layer 3 parenting is properly set up within the network, an alarm condition that directly causes other alarm conditions (such as a host-down event, when all child devices of the host become unreachable) is automatically recognized by the incident management system. All subsequent alarms are then automatically bundled into the initial alarm’s incident, rather than each alarm opening its own separate incident. The incident then refers to the originating alarm as the “primary alarm” and the subsequent alarms as “related alarms.” Related alarms that have been correlated into an existing incident always have the actions of their assigned action groups suppressed. This avoids executing redundant actions, such as alert notifications. Acknowledging an incident acknowledges the primary alarm and all of the related alarms contained within that incident. Users can also manually add rules to the incident management system to forcibly correlate alarms into the same incident.
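Parent-based correlation of this kind can be illustrated with a short sketch. It is a deliberately simplified model and not how Netreo implements correlation: the names (CorrelatedIncident, correlate, parents) are hypothetical, each device is assumed to raise a single alarm, and the layer 3 parenting is represented as a plain child-to-parent dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CorrelatedIncident:
    primary: str                                      # device with the primary alarm
    related: List[str] = field(default_factory=list)  # devices with related alarms


def correlate(
    device: str,
    parents: Dict[str, str],
    active: Dict[str, CorrelatedIncident],
) -> Tuple[CorrelatedIncident, bool]:
    """Return (incident, run_actions) for a new alarm on `device`.

    Walks up the parent chain. If an ancestor already has an active
    incident, the alarm joins it as a related alarm and its actions
    are suppressed. Otherwise a new incident is opened.
    """
    seen = set()
    ancestor = parents.get(device)
    while ancestor is not None and ancestor not in seen:
        seen.add(ancestor)  # guard against a misconfigured parent loop
        if ancestor in active:
            incident = active[ancestor]
            incident.related.append(device)
            return incident, False  # related alarm: actions suppressed
        ancestor = parents.get(ancestor)

    incident = CorrelatedIncident(primary=device)
    active[device] = incident
    return incident, True  # primary alarm: actions run
```

With a hypothetical chain of server-1 under switch-1 under router-1, an alarm on router-1 opens the incident, and a later alarm on server-1 is attached to it as a related alarm instead of opening a second incident.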
Any incident that is not CLOSED is considered to be an “active incident.” All active incidents can be viewed on the “Active View” tab of the Incident Dashboard (Quick Views > Active Incidents in the main menu).