Archive for February 2003
Further classification of the graph concept
In order to further the development of the Graph Consolidation feature, it is useful to update the terminology a bit. From now on we shall consider the following three distinct entities that were all known as “graph” before:
- Data stream. This is the ever-updating stream of time-value points for a particular measurement. It refers to the values only, without implying anything about graphical representation.
- Plot. This refers to the rendering of one or more data streams into a single graphical representation, either a line or surface chart, bar chart or other. If multiple data streams are involved in a single plot, then they are consolidated by averaging, summing or otherwise calculating a single value for each plotted time point out of those multiple data streams.
- Graph. This refers to the rendering of one or more plots into a single 2D chart, complete with axes, legends, etc. If two or more plots are to be drawn into a single graph, their y-quantities have to be compatible.
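The consolidation step for a plot can be illustrated with a minimal Python sketch. It assumes a data stream can be modelled as a mapping from timestamp to value; the names `DataStream` and `consolidate` and the sample values are hypothetical and not part of the existing services.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Assumption: a data stream is a mapping from timestamp (seconds) to value.
DataStream = Dict[int, float]


def consolidate(streams: Sequence[DataStream],
                combine: Callable[[List[float]], float] = mean) -> DataStream:
    """Reduce several data streams to a single value per time point."""
    timestamps = sorted(set().union(*(s.keys() for s in streams)))
    plot: DataStream = {}
    for t in timestamps:
        # Only streams that have a point at time t contribute to it.
        values = [s[t] for s in streams if t in s]
        plot[t] = combine(values)
    return plot


# Hypothetical sample data: two streams, one with a missing point.
stream_a = {0: 10.0, 60: 20.0, 120: 30.0}
stream_b = {0: 30.0, 60: 40.0}

averaged = consolidate([stream_a, stream_b])            # consolidate by averaging
summed = consolidate([stream_a, stream_b], combine=sum)  # consolidate by summing
```

How to treat time points that are missing in some streams is a design choice; this sketch simply ignores the missing stream for that point.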
URL representation of a graph
A graph is to be represented by a URL in the following fashion:
For a graph containing a single plot, consisting of a single datastream:
For a graph possibly containing multiple plots and datastreams:
gxy denotes the stream id of the yth datastream in the xth plot
In order to enable the user to create graph combinations, rich metadata about datastreams can be obtained using a different service, which enhances the current GetGraphInformation service. What is returned is not information about the graph, but about the datastream (using the new terminology). This and component information can be used to co-plot and consolidate datastreams in meaningful manners.
Finally, the datastreams themselves will be available and URL-addressable too, so that other graphing tools and facilities can use them in the future. Their URLs should be predictable from graph URLs, perhaps as follows:
[baseUrl]/datastreams/[streamid].xml, or even
This allows the administrator to set up “templates” for each kind of device, in order to ease the provisioning of all the subcomponents and graphs of similar devices. An Alteon, for instance, might consist of a top-level component (representing the Alteon box itself) and numerous subcomponents, representing interfaces, ports, etc. Each of these subcomponents can have any number of graphs. Since all Alteons are similar, we can reuse a template for all of them.
A template can consist of fixed component and graph definitions.
It can also contain other templates.
Provisioning then is a two step process.
- Step one expands the template into a complete “instantiated template”.
- Step two uses the information in the instantiated template to provision all the components and graphs contained in it.
Template expansion needs input. This input can be of at least two kinds:
- Fixed values (like an IP address)
- Repetition values (telling us how many times a contained template should be instantiated)
These values can be input by a user, or obtained by other means. For instance, the number of interfaces on a router might be obtained through an SNMP call.
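Step one, the expansion, might look roughly like the Python sketch below. The `Template` class, the `expand` function, the record layout and all sample values (template names, graph names, the IP address) are assumptions made for illustration; they do not describe an existing implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Template:
    name: str
    graphs: List[str] = field(default_factory=list)           # fixed graph definitions
    children: List["Template"] = field(default_factory=list)  # contained templates


def expand(template: Template, fixed: Dict[str, str],
           repeats: Dict[str, int], path: str = "") -> List[Tuple[str, str, dict]]:
    """Expand a template into flat (component path, graph, fixed values) records."""
    here = f"{path}/{template.name}" if path else template.name
    records = [(here, graph, dict(fixed)) for graph in template.graphs]
    for child in template.children:
        # Repetition value: how many times the contained template is instantiated.
        for i in range(repeats.get(child.name, 1)):
            instance = Template(f"{child.name}[{i}]", child.graphs, child.children)
            records.extend(expand(instance, fixed, repeats, here))
    return records


# Hypothetical device template: a box with a CPU graph and N ports with a traffic graph.
port = Template("port", graphs=["traffic"])
alteon = Template("alteon", graphs=["cpu"], children=[port])

instantiated = expand(alteon, fixed={"ip": "192.0.2.1"}, repeats={"port": 2})
```

In this sketch the repetition value for `port` is passed in directly; it could just as well come from an SNMP query, as described above.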
Provisioning the instantiated template should be transactional (atomic). If something goes wrong halfway, the whole provisioning action should be rolled back. This can be implemented through a true database transaction, or through defined contingency actions that are executed on rollback. If we limit this batch provisioning to new components and graphs only, the contingency steps are simply deletions of records in databases and deletions of files (rrd files) on disk.
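The contingency-action variant can be sketched in Python as follows. This is only an illustration of the idea: `provision`, the in-memory `records` and `files` sets and the simulated failure are hypothetical stand-ins for real database and rrd file operations.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the contingency that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def provision(steps: List[Step]) -> bool:
    """Run all actions; on failure, undo the completed ones in reverse order."""
    done: List[Callable[[], None]] = []
    try:
        for action, contingency in steps:
            action()
            done.append(contingency)
    except Exception:
        for undo in reversed(done):
            undo()
        return False
    return True


# Stand-ins for a database table and a directory of rrd files.
records = set()
files = set()


def failing_step() -> None:
    raise RuntimeError("simulated failure halfway through provisioning")


ok = provision([
    (lambda: records.add("component-1"), lambda: records.discard("component-1")),
    (lambda: files.add("component-1.rrd"), lambda: files.discard("component-1.rrd")),
    (failing_step, lambda: None),
])
```

Because only new components and graphs are created, every contingency here is a plain deletion, which matches the restriction suggested above.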
It basically means that the dependency viewer application made by S (we need a new name for this…) not only updates the O and RRD databases, but also writes a line to the measurements.conf file of the SNMP poller. Since there can be many such pollers, usually on different machines, it is best to write a small web service that does the actual writing, and to have the dependency viewer call that web service. It also means that we need to know, for each component, which poller is going to poll it. Initially, we can just configure a list of pollers in the dependency viewer and choose one manually for each component. Ultimately, however, we should do this automagically (based on IP address, address space, etc.).
The measurements.conf file contains a single line for each measurement. This line basically contains component, measurement type, and poll interval. The component is specified using address, class and instance. The address space is not needed here, since it is assumed that a single poller works within one address space. Therefore, the address space is statically configured for each poller.
Also, it is assumed that only the measurements file is autoprovisioned. The OID and instance mappings still have to be done by hand, but since there will be only a modest number of each of these, that is OK.
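What the small web service does when it writes a measurement line could look roughly like this Python sketch. The field order, the space separator, the unit of the poll interval and the names `Measurement` and `append_measurement` are all assumptions; the actual layout of measurements.conf is not specified here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    address: str            # component address
    component_class: str    # component class
    instance: str           # component instance
    measurement_type: str
    poll_interval: int      # assumed to be in seconds

    def to_line(self) -> str:
        # Hypothetical layout: space-separated fields in a fixed order.
        return " ".join([self.address, self.component_class, self.instance,
                         self.measurement_type, str(self.poll_interval)])


def append_measurement(conf_path: str, measurement: Measurement) -> bool:
    """Append the measurement line unless an identical line already exists."""
    line = measurement.to_line()
    if os.path.exists(conf_path):
        with open(conf_path, encoding="utf-8") as fh:
            if line in (existing.rstrip("\n") for existing in fh):
                return False
    with open(conf_path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return True
```

Skipping duplicate lines makes the write safe to repeat, which helps if the dependency viewer retries a call to the web service.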
The Translation module (Ntt.Oss.Pm.Translator) in PMSupport was enhanced to allow unknown values to map to a standard default. Consider the following:
<Attribute down="Category" up="sourceId">
  <ValueMap down="Net" up="101426" />
  <ValueMap down="Unix" up="101427" />
  <ValueMap down="Win" up="101428" />
  <ValueMap down="Apps" up="101429" />
</Attribute>
This maps <Category>Net</Category> to <sourceId>101426</sourceId>.
However, if the original value is not in the given set (Net, Unix, Win or Apps), it is simply not translated. Onyx, however, is very unforgiving about unknown values. Instead of ignoring them, it will abort the transaction and return an error. Therefore it would be useful to map unspecified values to a given default. This has been added by allowing “*” for the down attribute. The following:
<Attribute down="Category" up="sourceId">
  <ValueMap down="Net" up="101426" />
  <ValueMap down="Unix" up="101427" />
  <ValueMap down="Win" up="101428" />
  <ValueMap down="Apps" up="101429" />
  <ValueMap down="*" up="101454" />
</Attribute>
maps all values other than Net, Unix, Win and Apps to 101454, which happens to be the default category in Onyx.
At a future date, the mapping will be extended to allow general regular expressions.
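The lookup behaviour described above, including the “*” default, can be illustrated with a short Python sketch. This is not the Ntt.Oss.Pm.Translator code; `build_translator` is a hypothetical helper that only mimics the described semantics using the standard `xml.etree.ElementTree` parser.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<Attribute down="Category" up="sourceId">
  <ValueMap down="Net" up="101426" />
  <ValueMap down="Unix" up="101427" />
  <ValueMap down="Win" up="101428" />
  <ValueMap down="Apps" up="101429" />
  <ValueMap down="*" up="101454" />
</Attribute>
"""


def build_translator(config_xml: str):
    """Return (down name, up name, translate function) for one Attribute element."""
    attribute = ET.fromstring(config_xml.strip())
    mapping = {vm.get("down"): vm.get("up") for vm in attribute.findall("ValueMap")}
    default = mapping.pop("*", None)  # the wildcard entry, if configured

    def translate(value: str) -> str:
        if value in mapping:
            return mapping[value]
        # Without a "*" entry an unknown value is left untranslated.
        return default if default is not None else value

    return attribute.get("down"), attribute.get("up"), translate


down_name, up_name, translate = build_translator(CONFIG)
known = translate("Net")     # explicit mapping
unknown = translate("none")  # falls back to the "*" default
```

Removing the wildcard entry before the lookup keeps a literal value of “*” from being treated specially; whether the real module does the same is not known.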
Tested flow through of events from CA Unicenter TNG to Onyx helpdesk today, through Micromuse and our tools. After fixing some minor problems this works.
In order to test this I ran two test scripts on ip.ad.dr.ess that simulate an open and close event on Unicenter by issuing a cawto into the console. Ticket opening worked flawlessly from the start. However, tickets would not be closed. The following problems were solved:
- The TNG rules file incorrectly looked for lower-case Status fields in the Unicenter message. Up, Down and Critical start with a capital. The default is Down, which is why opening a ticket worked.
- The test script set the Category to “none”. This turns out not to be a valid category in Onyx. Onyx is very unforgiving concerning unknown values and returns an error (without telling us which of the zillion fields contains the error). Changing the Category to “Net” of course solved the problem. However, in order to prevent further problems, the XmlTranslation module was enhanced with a default translation if none of the given values matches.
Now closing the event through TNG works. The event changes status from Ticket Opened to Ticket Closed and changes Severity from 5 to 0. When a new down event comes in, the count is increased, and due to our “UPDATE ON DEDUPLICATION” modifier, the Severity is changed as well, and the event is reprocessed. Closing the open ticket in Onyx works in the same way.
Finished implementation of the integration.
Added 4 fields to the alerts.status table:
- Ticket: To hold ticket number issued by helpdesk
- ComponentClass: part of component identification. AddressSpace will be added by the adapter
- Cat: The category that is used to route events in the helpdesk
- OrigSeverity: Closed tickets will be subject to a severity downgrade. The original severity will be stored here
Additionally, another 4 fields were added to the generic_clear_problem_events table. This is a temporary table that is used during up/down correlation:
- ComponentClass. See above.
- UpIdentifier. This is the Identifier field of the incoming up-event. Stored here so that we can find the event again later
- Severity. Used to later give the up-event the same severity as the original down event
- Ref. The up-event needs the tools reference number so that it knows which ticket to close
In addition, the Severity field of the alerts.status table was modified with “UPDATE ON DEDUPLICATION”. This implies that an event in closed status with a downgraded Severity can be revived if a new down event of the same kind and component comes in.
The implementation consists mainly of 7 object server automations:
- AA_Raise. This selects down events (Type = 1) that are in not-raised or closed status (Status = 0 or 4), and that are of Severity major or critical (4 or 5, or in Tools terms, Critical or Fatal). It then gives these events a unique Ref number, and bumps up their status to raising (Status = 1). This status is picked up by the Socket adapter (configuration in GW.conf) that filters for this status value. After sending it to the socket gateway listener, the status is bumped up to raised (2).
- AB_OpenedNotification. This selects ticket open notification events injected into object server by the event adapter (socket probe). These events have “Tools” as their manager value and a status of ticket opened (Status = 3). The action finds the corresponding down event that is still in “raised” status, based on the Reference number, and bumps up its status to “ticket opened” as well. The injected event is then deleted.
- AC_ClosedNotification. This event is also injected by the tools, now as a response to closing events. It sets the status of the original down event to “ticket closed” and deletes the injected event, and the up-event that caused the ticket close (if any).
- AD_GenericClear_Populate_Problems_Table. This, together with the AE and AF automations, takes care of up/down correlation. They are modifications of the generic events shipped with object server. This automation selects Type=1 (down) events that are currently open, based on whether there might be matching Type 2 (up) events, by checking for the following up-event attributes: Manager, AlertGroup, AlertKey, Node, and ComponentClass. Recall that these attributes define the component/event combination. If this is the case, a record is populated in the temporary generic_clear_problem_events table.
- AE_GenericClear_Correlate_Problems_Table. Selects up events of non-raised status and matches with open down events in the temporary table, based on the same 5 attributes described above for the AD event. It then marks the down event as resolved, and stores the identifier of the up event in the UpIdentifier field of the down event.
- AF_GenericClear_Correlate_Status_Table. Selects the resolved events in the trigger. Then raises the corresponding up-event in the action by setting its status to raising (Status=1). It also copies the down-event’s reference number and Severity. Note that AD to AF could have been coded in one rule, without a temporary table. Apparently, Micromuse coded the rules this way for speed. It might be wise, however, to do some measurements to determine if the extra complexity is worth it.
- CleanStaleUpEvents. This automation removes up-events that are stuck in the raised state. This can happen if they try to close a non-existent or already closed ticket. They are removed based on how long they have been in this state (currently set to 5 minutes).
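The net effect of the AD, AE and AF automations can be illustrated with a simplified Python sketch that does the correlation in a single pass, without the temporary table. This is not object server automation code: events are modelled as plain dictionaries, `correlate` is a hypothetical name, and treating Status 2 or 3 as “open” is an assumption.

```python
# The five attributes that define the component/event combination.
KEY_FIELDS = ("Manager", "AlertGroup", "AlertKey", "Node", "ComponentClass")

NOT_RAISED, RAISING, RAISED, TICKET_OPENED, TICKET_CLOSED = range(5)


def correlation_key(event: dict) -> tuple:
    return tuple(event[f] for f in KEY_FIELDS)


def correlate(events: list) -> None:
    """Match non-raised up events (Type 2) with open down events (Type 1)."""
    # Assumption: a down event counts as open when raised or ticket opened.
    open_downs = {correlation_key(e): e for e in events
                  if e["Type"] == 1 and e["Status"] in (RAISED, TICKET_OPENED)}
    for up in events:
        if up["Type"] != 2 or up["Status"] != NOT_RAISED:
            continue
        down = open_downs.get(correlation_key(up))
        if down is None:
            continue
        down["UpIdentifier"] = up["Identifier"]  # lets us find the up event later
        up["Ref"] = down["Ref"]                  # tells tools which ticket to close
        up["Severity"] = down["Severity"]
        up["Status"] = RAISING                   # picked up by the socket adapter


# Hypothetical sample events sharing the same five key attributes.
common = {"Manager": "TNG", "AlertGroup": "Ping", "AlertKey": "icmp",
          "Node": "host-a", "ComponentClass": "Server"}
down = {"Identifier": "down-1", "Type": 1, "Status": TICKET_OPENED,
        "Severity": 5, "Ref": 42, **common}
up = {"Identifier": "up-1", "Type": 2, "Status": NOT_RAISED,
      "Severity": 0, "Ref": None, **common}

correlate([down, up])
```

A single-pass version like this is also a convenient baseline when measuring whether the three-automation design with a temporary table is worth its extra complexity.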
The probe files (currently Simnet, TNG and Tivoli) have been updated so that Type is set correctly for up and down events. Previously, events were considered up events if their Severity was equal to 0. Now an explicit Type=2 is required. The probe files were also updated to make use of the generalized component / event identification using the 5 attributes described above.
The two Perl SOAP servers used to communicate with the tools (SocketGatewayListener.pl and EventAdapter.pl) were updated to reflect the changes and the new attributes that are passed in. The same holds for the socket probe rules file and the gateway configuration file.
Finally, previously Status was known as STATUS, and Ref was called REF. All references to these fields were renamed as well.
- Retry down events stuck in the raised state. They might be stuck due to a communication failure with tools. Specify a max number of retries, and inject an event if they cannot be sent even after retries.
- Write general clean-up automations for events that are not raised or closed, and that are around for a specified time.
- Category field is not yet filled in for Tivoli events
- OpenView events are not yet adapted
- Make sure, by adapting the probe rules files if necessary, that Type = 1 is set for problems and Type = 2 for resolutions.
- Also make sure the value for the Status attribute is 0 (not raised)
- Have an automation that selects all Type = 1 and Status = 0 and Severity >= 4 events.
Action: Set Status to 1 (raising), and generate reference number in Ref
- Have a filter on the socket adapter that forwards to tools which selects on Status=1 and updates Status to 2
- The Socket Probe will inject “Ticket Opened” events into Object Server. They have Status=3 and Manager = Tools
- Have an automation that selects these events on Status = 3 (opened Ticket) and Manager is Tools. Then find events with the same Ref and Status=2 (raised). Delete the Tools injected event and raise Status of the original event to 3 (Opened ticket)
- Modify the standard up/down correlation automations to select on all component / monitor fields.
- Modify the last standard up / down correlation automation to set the Status of the incoming Type 2 event to raised, and to set its Ref to the value from the correlated Type 1 event.
- Have an automation that selects events on Status = 4 (closed Ticket) and Manager is Tools. Then find events with the same Ref and Status=3 (opened Ticket). Delete the Tools injected event and raise Status of the original event to 4 (Closed ticket)
- Have an automation that periodically cleans up non-raisable events, and closed events.
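The Status transitions in the steps above can be traced with a small Python sketch. It models events as dictionaries and the automations as plain functions; the function names, the reference counter and the sample event are hypothetical, and the real logic lives in object server automations and the socket adapter filter.

```python
import itertools

NOT_RAISED, RAISING, RAISED, TICKET_OPENED, TICKET_CLOSED = range(5)
_next_ref = itertools.count(1)  # stand-in for the reference number generator


def raise_events(events: list) -> None:
    """Select Type = 1, Status = 0, Severity >= 4 and start raising them."""
    for e in events:
        if e["Type"] == 1 and e["Status"] == NOT_RAISED and e["Severity"] >= 4:
            e["Ref"] = next(_next_ref)
            e["Status"] = RAISING


def forward_to_tools(events: list) -> list:
    """Socket adapter filter: forward Status = 1 events and mark them raised."""
    sent = [e for e in events if e["Status"] == RAISING]
    for e in sent:
        e["Status"] = RAISED
    return sent


def apply_notifications(events: list, notified_status: int) -> None:
    """Handle Tools-injected events with Status 3 (opened) or 4 (closed)."""
    injected = [e for e in events
                if e["Manager"] == "Tools" and e["Status"] == notified_status]
    for note in injected:
        for e in events:
            if (e["Manager"] != "Tools" and e.get("Ref") == note["Ref"]
                    and e["Status"] == notified_status - 1):
                e["Status"] = notified_status
    # Delete the injected events once they have been processed.
    events[:] = [e for e in events if not any(e is n for n in injected)]


events = [{"Manager": "TNG", "Type": 1, "Status": NOT_RAISED, "Severity": 5}]
raise_events(events)
forward_to_tools(events)
ref = events[0]["Ref"]

events.append({"Manager": "Tools", "Type": 1, "Status": TICKET_OPENED,
               "Severity": 0, "Ref": ref})
apply_notifications(events, TICKET_OPENED)
status_after_open = events[0]["Status"]

events.append({"Manager": "Tools", "Type": 1, "Status": TICKET_CLOSED,
               "Severity": 0, "Ref": ref})
apply_notifications(events, TICKET_CLOSED)
```

Each transition only fires from the status directly below it, so a notification that arrives for an event in an unexpected status leaves that event untouched.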