The Secret Diary of Han, Aged 0x29

twitter.com/h4n

Archive for September 2005

Improving alert handling

Alert types

  • Problem Alerts
    • Single – only one alert is generated for the problem. Examples: SNMP traps, SSM log alerts
    • Repeat – alerts keep coming until the problem goes away. Examples: SSM genalarms, SSM process alerts, ping, ISM alerts
  • Resolution Alerts
    • Single – only one resolution alert arrives when a problem condition is resolved
    • Repeat – alerts keep on coming to signify that a certain condition is OK.

Problem alerts may or may not have corresponding resolution alerts.

  • repeat problem alerts always have corresponding resolution alerts
    • but asymmetric thresholds are a problem (should be avoided)
  • single problem alerts usually don’t have corresponding resolution alerts
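As a rough sketch of the two dimensions above (problem or resolution, single or repeat), the types could be modelled like this. The names `Role`, `Cadence`, `AlertType` and `expects_resolution` are made up for illustration and are not Netcool fields.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PROBLEM = "problem"
    RESOLUTION = "resolution"


class Cadence(Enum):
    SINGLE = "single"  # one alert per event
    REPEAT = "repeat"  # alerts keep arriving while the condition holds


@dataclass(frozen=True)
class AlertType:
    role: Role
    cadence: Cadence


def expects_resolution(alert_type: AlertType) -> bool:
    """True when a corresponding resolution alert should eventually arrive.

    Repeat problem alerts always have one. Single problem alerts usually
    do not, so this sketch simply answers False for them.
    """
    return alert_type.role is Role.PROBLEM and alert_type.cadence is Cadence.REPEAT
```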

Alert lifecycle

All single resolution alerts are deleted from Netcool after a few minutes

Repeat resolution alerts are not deleted (this would be pointless)

Problem alerts signal a problem. The problem alerts are “cleared” if the problem goes away.

Problem alerts can be associated with Siebel tickets

  • by opening a ticket based on an alert
  • by adding alerts to already open tickets

Problem alerts are never deleted, unless they are cleared and no open ticket is associated

–> Problem alerts associated with an open ticket are never deleted, even if the alert is cleared

Cleared problem alerts with no open ticket associated are deleted after a few minutes
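The deletion rule for problem alerts can be sketched as a single predicate. This is only an illustration: `ProblemAlert`, `is_deletable` and the five minute `GRACE` value are assumptions (the rule itself only says "a few minutes").

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: "a few minutes" is not specified further; five is a placeholder.
GRACE = timedelta(minutes=5)


@dataclass
class ProblemAlert:
    cleared_at: Optional[datetime] = None  # None while the problem is still active
    has_open_ticket: bool = False


def is_deletable(alert: ProblemAlert, now: datetime) -> bool:
    """A problem alert may be deleted only when it is cleared, has no
    open ticket, and the grace period since clearing has passed."""
    if alert.cleared_at is None:
        return False  # uncleared problem alerts are never deleted
    if alert.has_open_ticket:
        return False  # an open ticket keeps the alert, even if cleared
    return now - alert.cleared_at >= GRACE
```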

Problem alerts can be cleared by

  • a corresponding resolution alert
  • an expiry time
  • the operator (manually)

Note that there will be no manual delete of alerts, only a manual clear of the problem.

When a cleared problem has an associated open ticket, a notification is sent to Siebel and added to the ticket as a private note saying that the alert has cleared.
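A minimal sketch of the clear operation, covering the three ways to clear and the ticket note. The `add_private_note` callback stands in for whatever the Siebel integration looks like and is purely hypothetical, as are the other names. Skipping the note on a second clear is my own assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ClearReason(Enum):
    RESOLUTION_ALERT = "resolution alert"
    EXPIRY = "expiry time"
    OPERATOR = "operator"


@dataclass
class ClearableAlert:
    summary: str
    ticket_id: Optional[str] = None  # open ticket, if any (hypothetical field)
    cleared_by: Optional[ClearReason] = None


def clear(
    alert: ClearableAlert,
    reason: ClearReason,
    add_private_note: Callable[[str, str], None],
) -> bool:
    """Mark the alert as cleared. Returns True if a ticket note was sent."""
    if alert.cleared_by is not None:
        return False  # assumption: an already cleared alert is not notified again
    alert.cleared_by = reason
    if alert.ticket_id is None:
        return False  # no open ticket, nothing to notify
    add_private_note(alert.ticket_id, f"Alert cleared ({reason.value}): {alert.summary}")
    return True
```

Note that there is deliberately no delete operation here, only clear.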

Operation

If a new problem alert appears, the operator does some diagnostics. Tier 2 engineers may be involved in this process. The goal is to determine whether there is a real problem.

  • If there is no problem, and this is a single alert, it can be cleared manually
  • Ongoing repeat alerts always mean there is a problem. Clearing them manually is meaningless, because a new repeat alert comes in every minute
  • (Note: Asymmetric thresholds currently exist for repeat alerts. When a value sits between the two thresholds, the problem alerts stop repeating but no resolution alerts come in either. This is difficult to detect, so the condition should be avoided. There is a risk of flapping if the thresholds for problem and resolution are the same, so in some cases, e.g. CPU usage, some asymmetry is OK. In other cases, e.g. disk space, the risk of flapping is small, so equal thresholds should be used.)

  • Alerts that are under investigation can be moved out of the primary view to the pending view
  • Repeat alerts that are not important can be removed from the primary view (again, clearing is meaningless).
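The trade-off in the note about thresholds can be made concrete with a small sketch. The function names and the sample values are invented for illustration. `raise_at` is the problem threshold and `clear_at` the resolution threshold.

```python
from typing import Iterable


def next_state(in_problem: bool, value: float, raise_at: float, clear_at: float) -> bool:
    """Return the new problem state after one sample (requires raise_at >= clear_at)."""
    if value >= raise_at:
        return True   # a problem alert would be sent
    if value < clear_at:
        return False  # a resolution alert would be sent
    # Between the thresholds neither alert is sent: keep the previous state.
    return in_problem


def count_transitions(samples: Iterable[float], raise_at: float, clear_at: float) -> int:
    """Count how often the state flips between OK and problem."""
    state = False
    changes = 0
    for value in samples:
        new_state = next_state(state, value, raise_at, clear_at)
        if new_state != state:
            changes += 1
        state = new_state
    return changes
```

For the samples `[89, 91, 89, 91, 89]`, equal thresholds at 90 give four transitions (flapping), while `raise_at=90, clear_at=80` gives one. The price of the asymmetric version is the silent zone between 80 and 90, which is exactly the hard to detect condition described in the note.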

Change from current situation:

  • Cycle 13:
    • Current “remove” item in menu will be replaced by a “clear” item. This is a more correct way of describing what is happening.
  • Future cycle:
    • Problem alerts will show whether they are single or repeat.
    • Automatic detection of repeat alerts that have stopped coming in
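One possible way to detect that a repeat alert has stopped coming in is to compare its last arrival time with the expected interval. This is only a sketch of the idea, not a design: the one minute interval comes from the text above, while the tolerance of three missed intervals and the function name are assumptions.

```python
from datetime import datetime, timedelta


def has_stopped_repeating(
    last_seen: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=1),  # repeat alerts arrive every minute
    missed: int = 3,                             # assumption: tolerate three missed repeats
) -> bool:
    """True when no repeat has arrived for more than `missed` intervals."""
    return now - last_seen > interval * missed
```

An alert flagged this way that never received a resolution alert is a candidate for the asymmetric threshold problem.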

Alert views

Primary view definition

  • Primary view shows those alerts that operators must pay attention to
  • Based on automatic base rules, and permanent, temporary and manual overrides of those rules
  • Base rules include all problem alerts that are not indeterminate or cleared or have an open ticket, that are not test alerts, and that have an Opsware deployment state of “live” or “unknown”.
  • Resolution alerts and problem alerts that are cleared or indeterminate are not shown in the primary view. Neither are alerts for servers that are “in deployment” or “off line” in Opsware.
  • These base rules can be refined using permanent or temporary overrides to include or exclude alerts. These rules are based on any combination of alert attribute values. For example, new customer compartments that are under construction can be excluded from the primary view based on IP address ranges.
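The base rules plus overrides could look roughly like the sketch below. All field and function names are hypothetical, and the "first override that has an opinion wins" precedence is an assumption. The open ticket condition is left out on purpose, because the wording of the base rule can be read two ways.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Callable, List, Optional


@dataclass
class Alert:
    kind: str              # "problem" or "resolution"
    cleared: bool
    indeterminate: bool
    is_test: bool
    deployment_state: str  # Opsware state, e.g. "live", "unknown", "in deployment"
    ip: str


def base_rule(a: Alert) -> bool:
    """Base rules for the primary view (ticket condition omitted, see above)."""
    return (
        a.kind == "problem"
        and not a.cleared
        and not a.indeterminate
        and not a.is_test
        and a.deployment_state in ("live", "unknown")
    )


# An override returns True (include), False (exclude) or None (no opinion).
Override = Callable[[Alert], Optional[bool]]


def exclude_network(cidr: str) -> Override:
    """Override that hides all alerts from one IP range."""
    net = ip_network(cidr)

    def rule(a: Alert) -> Optional[bool]:
        return False if ip_address(a.ip) in net else None

    return rule


def in_primary_view(a: Alert, overrides: List[Override]) -> bool:
    for rule in overrides:
        verdict = rule(a)
        if verdict is not None:
            return verdict  # assumption: first matching override wins
    return base_rule(a)
```

A compartment under construction would then be hidden with something like `in_primary_view(alert, [exclude_network("10.1.0.0/16")])`, where the range is of course made up.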

Changes:

  • Limited filters (exclusion only) in cycle 12
  • Base rules, with permanent, temporary and manual inclusions and exclusions in cycle 13
  • Configuration GUI in operator portal in future cycle.
  • Partitioning of primary view (if multiple operators each handle part of the alerts), in future cycle (if needed).

Assigning alerts

To assist in escalating alerts to Tier 2 the following improvements are made:

  1. Alert category is populated in alerts based on rules. The category helps the operator see quickly which engineering group the alert is for. Examples:
    • Even though an alert comes from the LDAP server (dc1) which is a tools server, the alert may signify a security problem, and may be better assigned to security team instead of the tools team.
    • A “heartbeat missed” alert can be assigned to the tools team even if it occurs on a non tools server.

    There will be many rules like this, and even if the rules themselves are simple, if there are many, it will be difficult for operators to find the correct engineering group.

  2. On call system. Once a group is determined, the OEs have to know quickly who to call. Fixed schedules do not work, as people are not available. An on call system can show quickly who can be called. This system needs maintenance by each team, so that schedules are up to date. However, since each team can be responsible for its own schedule, it is easier to keep up to date than a centralized system.
  3. Notification. The assigned group and managers can be notified by e-mail. The assigned person will be called.

    Notification e-mails can also be sent for notes, so that people have an up-to-date picture of a problem. There is a risk of over notification though.
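The category rules from item 1 could be an ordered list where the first match wins and the owner of the server is the fallback. This is a sketch under assumptions: the field names, the first match policy and the "login failure" text are invented, only dc1 and "heartbeat missed" come from the examples.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Alert:
    node: str
    summary: str
    server_group: str  # engineering group that owns the server (hypothetical field)


Rule = Tuple[Callable[[Alert], bool], str]

# Ordered rules: the first one that matches decides the category.
RULES: List[Rule] = [
    (lambda a: "heartbeat missed" in a.summary.lower(), "tools"),
    # Hypothetical security condition on the LDAP server.
    (lambda a: a.node == "dc1" and "login failure" in a.summary.lower(), "security"),
]


def categorize(alert: Alert, rules: List[Rule] = RULES) -> str:
    for matches, category in rules:
        if matches(alert):
            return category
    return alert.server_group  # fallback: the group that owns the server
```

Keeping the rules in data like this means the operator never has to remember them. The category just shows up in the alert.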

Customer assignment of alerts

  • Currently only based on Opsware information. This excludes network devices, and alerts that cannot be traced back to a server
  • Cycle 13: Customer information added to alerts based on rules.

Urgent development issue: the Netcool data server should handle failover of the objectserver. It is enough to return an error code when the connection to the database fails; the loadbalancer can do the rest. In client software, the Netcool data server should be configured as a single IP address: the loadbalancer’s VIP for the data server.
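The "return an error code" idea could be as small as the sketch below. It assumes the loadbalancer probes the data server with some health check and takes it out of rotation on a non OK answer. The HTTP style codes, the function name and the `ping_objectserver` callable are all assumptions, not how the data server works today.

```python
from typing import Callable

# Assumption: the loadbalancer understands HTTP style status codes.
OK = 200
UNAVAILABLE = 503


def health_status(ping_objectserver: Callable[[], None]) -> int:
    """Answer for the loadbalancer's health probe.

    `ping_objectserver` is a hypothetical callable that raises OSError
    (for example ConnectionRefusedError) when the database is unreachable.
    """
    try:
        ping_objectserver()
    except OSError:
        return UNAVAILABLE  # the loadbalancer can route clients elsewhere
    return OK
```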

Written by Han

September 16, 2005 at 00:01

Posted in Uncategorized

New Tags

I created the production cycle 12 tag yesterday:

1-7-0

This tag is on the 1-7-0-branch branch.

The branch for cycle 13 was also created:

Cycle13-branch

Head is now for cycle 14. Modifications and bug fixes to cycle 13 should go on the Cycle13-branch branch.

As you can see, the naming convention for tags and branches changed. Since nobody can remember which version belongs to which branch, it seemed better to use the branch in the tag name, instead of the weird version number. D. once did this in the past. I don’t know why I didn’t think this was a good idea….

Written by Han

September 6, 2005 at 16:12

Javascript Alert Viewer

Please check out my hobby project for the holidays:

Alerji, a JavaScript Netcool event viewer (update: now also in production). It runs completely in the browser and refreshes alerts dynamically.

Feedback and comments are welcome.

Written by Han

September 5, 2005 at 22:13