The Secret Diary of Han, Aged 0x29


Archive for February 2003

Graph Consolidation (continued). Graph classification and URLs

Further classification of the graph concept

In order to further the development of the Graph Consolidation feature, it is useful to update the terminology a bit. From now on we shall consider the following three distinct entities that were all known as “graph” before:

  1. Data stream. This is the ever-updating stream of time-value points for a particular measurement. It only connotes the values, without implying anything about graphical representation.
  2. Plot. This refers to the rendering of one or more data streams into a single graphical representation, be it a line chart, surface chart, bar chart or other. If multiple data streams are involved in a single plot, they are consolidated by averaging, summing or otherwise calculating a single value for each plotted time point out of those multiple data streams.
  3. Graph. This refers to the rendering of one or more plots into a single 2D chart, complete with axes, legends, etc. If two or more plots are to be drawn into a single graph, their y-quantities have to be compatible.

URL representation of a graph

A graph is to be represented by a URL in the following fashion:

For a graph containing a single plot, consisting of a single datastream:

[baseUrl]/graphs/[streamid].png

For a graph possibly containing multiple plots and datastreams:

[baseUrl]/graph.png?s11=[streamid]&s12=[streamid]&s21=[streamid]

sxy denotes the stream id of the yth datastream in the xth plot.
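As an illustration, here is a minimal Python sketch of how such a URL could be assembled from the plot structure (the base URL and stream ids in the example are made up):

from urllib.parse import urlencode

def graph_url(base_url, plots):
    # plots[x][y] holds the stream id of the (y+1)th datastream in the (x+1)th plot,
    # so it becomes query parameter s{x+1}{y+1}: s11, s12, s21, ...
    params = {}
    for x, plot in enumerate(plots, start=1):
        for y, stream_id in enumerate(plot, start=1):
            params[f"s{x}{y}"] = stream_id
    return f"{base_url}/graph.png?{urlencode(params)}"

# Example: two plots, the first consolidating two datastreams
graph_url("http://oss.example", [[1042, 1043], [2001]])
# -> 'http://oss.example/graph.png?s11=1042&s12=1043&s21=2001'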

In order to enable the user to create graph combinations, rich metadata about datastreams can be obtained using a different service, which enhances the current GetGraphInformation service. What is returned is not information about the graph, but about the datastream (using the new terminology). This and component information can be used to co-plot and consolidate datastreams in meaningful ways.

Finally, the datastreams themselves will be available and URL-addressable too, in order to facilitate their use by other graphing facilities and tools in the future. Their URLs should be predictable from the graph URLs, perhaps as follows:

[baseUrl]/datastreams/[streamid].xml, or even
[baseUrl]/graphs/[streamid].xml

Written by Han

February 23, 2003 at 20:47

Posted in Uncategorized

Batch provisioning of devices

This allows the administrator to set up “templates” for each kind of device, in order to ease the provisioning of all the subcomponents and graphs of similar devices. An Alteon, for instance, might consist of a top-level component (representing the Alteon box itself) and numerous subcomponents representing interfaces, ports, etc. Each of these subcomponents can have any number of graphs. Since all Alteons are similar, we can reuse a single template for all of them.

A template can consist of fixed component and graph definitions.

It can also contain other templates.

Provisioning then is a two step process.

  1. Step one expands the template into a complete “instantiated template”
  2. Step two uses the information in the instantiated template to provision all the components and graphs contained in it.

Template expansion needs input. This input can be of at least two kinds:

  1. Fixed values (like an IP address)
  2. Repetition values (telling us how many times a contained template should be instantiated)

These values can be input by a user, or obtained by other means. For instance, the number of interfaces on a router might be obtained through an SNMP call.
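A rough Python sketch of the expansion step (the nested-dict template representation and the {placeholder} substitution are assumptions made for illustration, not the actual data model):

def expand(template, inputs):
    # Step one: turn a template plus its input values into an "instantiated template".
    # Fixed values (like an IP address) are substituted into component and graph
    # definitions; repetition values say how many copies of a nested template to make.
    instantiated = {
        "components": [c.format(**inputs) for c in template.get("components", [])],
        "graphs": [g.format(**inputs) for g in template.get("graphs", [])],
        "children": [],
    }
    for child in template.get("templates", []):
        count = inputs[child["repeat"]]          # e.g. interface count obtained via SNMP
        for i in range(count):
            child_inputs = dict(inputs, index=i)
            instantiated["children"].append(expand(child["template"], child_inputs))
    return instantiated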

Provisioning the instantiated template should be transactional (atomic). If something goes wrong halfway, the whole provisioning action should be rolled back. This can be implemented through a true database transaction, or through defined contingency actions that are executed on rollback. If we limit this batch provisioning to new components and graphs only, the contingency steps are simply deletions of records in databases and deletions of files (RRD files) on disk.
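A matching sketch of step two, using contingency actions rather than a database transaction (the db and disk interfaces are hypothetical):

def provision(instantiated_items, db, disk):
    # Step two: provision every component/graph in the instantiated template.
    # If anything fails halfway, run the contingency actions (deleting the
    # records and RRD files created so far) in reverse order.
    contingencies = []
    try:
        for item in instantiated_items:
            record_id = db.insert(item)                       # new database record
            contingencies.append(lambda rid=record_id: db.delete(rid))
            if "rrd_path" in item:
                disk.create_rrd(item["rrd_path"])             # new RRD file on disk
                contingencies.append(lambda p=item["rrd_path"]: disk.remove(p))
    except Exception:
        for undo in reversed(contingencies):
            undo()
        raise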

Written by Han

February 20, 2003 at 03:45

Posted in Uncategorized

Creating a single point of provisioning for reporting between layer 3 and 2

It basically means that the dependency viewer application made by S (we need a new name for this…) not only updates the O and RRD databases, but also writes a line in the measurements.conf file of the SNMP poller. Since there can be many such pollers, and since they usually run on different machines, it is best to write a small web service that does the actual writing; the dependency viewer can then access that web service. It means that each component somehow needs to know which poller is going to poll it. Initially, we can just configure a list in the dependency viewer and choose manually for each component. However, ultimately we should do this automagically (based on IP address, address space, etc.).

The measurements.conf file contains a single line for each measurement. This line basically contains the component, the measurement type, and the poll interval. The component is specified using address, class and instance. Address space is not needed here, since it is assumed that a single poller works within one address space. Therefore, the address space is statically configured for each poller.
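A minimal sketch of such a writer, which the small web service would expose to the dependency viewer (the field order and separator of the line are assumptions; the real measurements.conf syntax is defined by the poller):

def add_measurement(conf_path, address, comp_class, instance, measurement_type, poll_interval):
    # Append one measurement line to the poller's measurements.conf.
    # Address space is deliberately absent: it is configured statically per poller.
    line = f"{address} {comp_class} {instance} {measurement_type} {poll_interval}\n"
    with open(conf_path, "a") as conf:
        conf.write(line)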

Also, it is assumed that only the measurements file is autoprovisioned. The OID and instance mappings still have to be done by hand, but since there will be only a modest number of each of these, that is OK.

Written by Han

February 20, 2003 at 03:43

Posted in Uncategorized

Enhancement of Ntt.Oss.Pm.Translator

The Translation module (Ntt.Oss.Pm.Translator) in PMSupport was enhanced to allow unknown values to map to a standard default. Consider the following:

<Attribute down="Category" up="sourceId">
  <ValueMap down="Net" up="101426" />
  <ValueMap down="Unix" up="101427" />
  <ValueMap down="Win" up="101428" />
  <ValueMap down="Apps" up="101429" />
</Attribute>

 

This maps <Category>Net</Category> to <sourceId>101426</sourceId>.

However, if the original value is not in the given set (Net, Unix, Win or Apps), it is simply not translated. Onyx, however, is very unforgiving about unknown values: instead of ignoring an unknown value, it will abort the transaction and return an error. Therefore it is useful to map unspecified values to a given default. This has been added by allowing “*” for the down attribute. The following:

 

<Attribute down="Category" up="sourceId">
  <ValueMap down="Net" up="101426" />
  <ValueMap down="Unix" up="101427" />
  <ValueMap down="Win" up="101428" />
  <ValueMap down="Apps" up="101429" />
  <ValueMap down="*" up="101454" />
</Attribute>

     

maps all values other than Net, Unix, Win and Apps to 101454, which happens to be the default category in Onyx.
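The fallback rule itself boils down to a single lookup; in Python pseudocode (the Translator itself is the XML-driven .NET module above, this merely restates its behaviour):

category_to_source_id = {
    "Net": "101426",
    "Unix": "101427",
    "Win": "101428",
    "Apps": "101429",
    "*": "101454",      # wildcard: default for any value not listed above
}

def translate(value, value_map):
    # Return the mapped value, or the "*" default when the value is unknown,
    # so Onyx is never handed a value it does not recognise.
    return value_map.get(value, value_map["*"])

translate("Storage", category_to_source_id)   # -> '101454'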

At a future date, the mapping will be extended to allow general regular expressions.

 

Written by Han

February 20, 2003 at 03:15

Posted in Uncategorized

Smart24 – Micromuse integration is working

Tested the flow-through of events from CA Unicenter TNG to the Onyx helpdesk today, through Micromuse and our tools. After fixing some minor problems, this works.

In order to test this, I ran two test scripts on ip.ad.dr.ess that simulate an open and a close event on Unicenter by issuing a cawto into the console. Ticket opening worked flawlessly from the start. However, tickets would not be closed. The following problems were solved:

  • The TNG rules file incorrectly looked for lower-case Status values in the Unicenter message; Up, Down and Critical start with a capital. The default is Down, which is why opening a ticket worked.
  • The test script set the Category to “none”. This turns out not to be a valid category in Onyx. Onyx is very unforgiving concerning unknown values and returns an error (without telling us which of the zillion fields contains the error). Changing the Category to “Net” of course solved the problem. However, in order to prevent further problems, the XmlTranslation module was enhanced with a default translation for when none of the given values matches.

Now closing the event through TNG works. The event changes status from Ticket Opened to Ticket Closed and changes Severity from 5 to 0. When a new down event comes in, the count is increased, and due to our “UPDATE ON DEDUPLICATION” modifier, the Severity is changed as well, and the event is reprocessed. Closing the open ticket in Onyx works in the same way.

Written by Han

February 20, 2003 at 03:06

Posted in Uncategorized

Tools – Micromuse Integration (v2) implemented!

Finished implementation of the integration.

Added 4 fields to the alerts.status table:

  1. Ticket: To hold ticket number issued by helpdesk
  2. ComponentClass: part of component identification. AddressSpace will be added by the adapter
  3. Cat: The category that is used to route events in the helpdesk
  4. OrigSeverity: Closed tickets will be subject to a severity downgrade. The original severity will be stored here

Additionally, another 4 fields were added to the generic_clear_problem_events table. This is a temporary table that is used during up/down correlation:

  1. ComponentClass. See above.
  2. UpIdentifier. This is the Identifier field of the incoming up-event. Stored here so that we can find the event again later
  3. Severity. Used to later give the up-event the same severity as the original down event
  4. Ref. The up-event needs the tools reference number so that it knows which ticket to close

In addition, the Severity field of the alerts.status table was modified with “UPDATE ON DEDUPLICATION”. This implies that an event in closed status with a downgraded Severity can be revived if a new down event of the same kind and component comes in.

The implementation consists mainly of 7 Object Server automations:

  1. AA_Raise. This selects down events (Type = 1) that are in not-raised or closed status (Status = 0 or 4), and that are of Severity major or critical (4 or 5, or in Tools terms, Critical or Fatal). It then gives these events a unique Ref number, and bumps up their status to raising (Status = 1). This status is picked up by the Socket adapter (configuration in GW.conf) that filters for this status value. After sending it to the socket gateway listener, the status is bumped up to raised (2).
  2. AB_OpenedNotification. This selects ticket-open notification events injected into Object Server by the event adapter (socket probe). These events have “Tools” as their Manager value and a status of ticket opened (Status = 3). The action finds the corresponding down event that is still in “raised” status, based on the Ref number, and bumps up its status to “ticket opened” as well. The injected event is then deleted.
  3. AC_ClosedNotification. This event is also injected by the tools, now as a response to closing events. It sets the status of the original down event to “ticket closed” and deletes the injected event, and the up-event that caused the ticket close (if any).
  4. AD_GenericClear_Populate_Problems_Table. This, together with the AE and AF automations, takes care of up/down correlation. They are modifications of the generic automations shipped with Object Server. This automation selects open Type=1 (down) events for which there might be matching Type=2 (up) events, by checking the following up-event attributes: Manager, AlertGroup, AlertKey, Node and ComponentClass. Recall that these attributes define the component/event combination. If so, a record is added to the temporary generic_clear_problem_events table.
  5. AE_GenericClear_Correlate_Problems_Table. Selects up events of non-raised status and matches them with open down events in the temporary table, based on the same 5 attributes described above for the AD automation. It then marks the down event as resolved, and stores the identifier of the up event in the UpIdentifier field of the down event.
  6. AF_GenericClear_Correlate_Status_Table. Selects the resolved events in the trigger. The action then raises the corresponding up-event by setting its status to raising (Status=1). It also copies the down-event’s reference number and Severity. Note that AD to AF could have been coded in one rule, without a temporary table (a condensed sketch of this correlation follows the list). Apparently, Micromuse coded the rules this way for speed. It might be wise, however, to do some measurements to determine if the extra complexity is worth it.
  7. CleanStaleUpEvents. This automation removes up-events that are stuck in the raised state. This can happen if they try to close a non-existent or already closed ticket. They are removed based on how long they have been in this state (currently set to 5 minutes).
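For illustration only, the heart of the AD/AE/AF correlation can be condensed into a few lines of Python pseudocode; the real logic lives in the Object Server automations described above, and the dict-based events below are just a stand-in:

# The five attributes that together identify a component / event combination.
MATCH_FIELDS = ("Manager", "AlertGroup", "AlertKey", "Node", "ComponentClass")

def correlate_up_event(up_event, open_down_events):
    # Match an up event (Type=2) against the open down events (Type=1) on the
    # five identification attributes, then prepare the up event for raising.
    key = tuple(up_event[f] for f in MATCH_FIELDS)
    for down in open_down_events:
        if tuple(down[f] for f in MATCH_FIELDS) == key:
            down["UpIdentifier"] = up_event["Identifier"]   # remember what resolved it
            up_event["Ref"] = down["Ref"]                   # ticket the up event must close
            up_event["Severity"] = down["Severity"]         # inherit the original severity
            up_event["Status"] = 1                          # raising; the adapter forwards it
            return down
    return None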

The probe files (currently Simnet, TNG and Tivoli) have been updated so that Type is set correctly for up and down events. Previously, events were considered down events if their Severity was equal to 0. Now an explicit Type=2 is required. The probe files were also updated to make use of the generalized component / event identification using the 5 attributes described above.

The two Perl SOAP servers used to communicate with the tools (SocketGatewayListener.pl and EventAdapter.pl) were updated to reflect the changes and the new attributes being passed in. The same holds for the socket probe rules file and the gateway configuration file.

Finally, Status was previously known as STATUS, and Ref as REF. All references to these fields have been renamed as well.

To do:

  • Retry down events stuck in the raised state. They might be stuck due to a communication failure with tools. Specify a max number of retries, and inject an event if they cannot be sent even after retries.
  • Write general clean-up automations for events that are not raised or closed, and that are around for a specified time.
  • Category field is not yet filled in for Tivoli events
  • OpenView events are not yet adapted

Written by Han

February 19, 2003 at 01:38

Posted in Uncategorized

Mechanics of Tools – Micromuse Integration

Prerequisites:

  • Make sure, by adapting the probe rules files if necessary, that Type = 1 is set for problems and Type = 2 for resolutions.
  • Also make sure the value for the Status attribute is 0 (not raised)
  • Have an automation that selects all Type = 1 and Status = 0 and Severity >= 4 events.
    Action: Set Status to 1 (raising), and generate reference number in Ref
  • Have a filter on the socket adapter that forwards to the tools, selecting on Status=1 and updating Status to 2
  • The Socket Probe will inject “Ticket Opened” events into Object Server. They have Status=3 and Manager = Tools
  • Have an automation that selects these events on Status = 3 (ticket opened) and Manager is Tools. Then find events with the same Ref and Status=2 (raised). Delete the Tools-injected event and raise the Status of the original event to 3 (ticket opened)
  • Modify the standard up/down correlation automations to select on all component / monitor fields.
  • Modify the last standard up/down correlation automation to set the Status of the incoming Type 2 event to raising (so that it is forwarded as well), and to set Ref to the value from the correlated Type 1 event.
  • Have an automation that selects events on Status = 4 (ticket closed) and Manager is Tools. Then find events with the same Ref and Status=3 (ticket opened). Delete the Tools-injected event and raise the Status of the original event to 4 (ticket closed)
  • Have an automation that periodically cleans up non-raisable events and closed events (the status and type codes used above are summarized below).
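For reference, a compact summary of the status and type codes used in this recipe (the constant names are just illustrative labels, not actual field values):

NOT_RAISED    = 0   # fresh event, nothing sent to the tools yet
RAISING       = 1   # selected for forwarding; picked up by the socket adapter
RAISED        = 2   # forwarded to the tools, awaiting a ticket notification
TICKET_OPENED = 3   # tools reported the ticket as opened
TICKET_CLOSED = 4   # tools reported the ticket as closed

PROBLEM, RESOLUTION = 1, 2   # the Type values for down and up events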

Written by Han

February 18, 2003 at 01:48

Posted in Uncategorized

Components and monitors in Tools and Micromuse

Tools events have four attributes that together define a component: Address, AddressSpace, Component Class and Component Instance. All of these but Address are optional. Address usually specifies the IP address of a component, and AddressSpace is an optional disambiguator between overlapping private IP address ranges (10.x, etc.). An IP address itself is often too crude an indication of a component, and that is where Class and Instance come in. Class denotes the general class of a component, of which more than one instance may be present at a certain IP address. Examples are CPU 0 (class is CPU, instance is 0) and process httpd. (As an aside: technically the instance should be a PID, but PIDs are dynamic, whereas process names are not. In practice, using names is hardly a problem.)

Components are important in the tools, as an event is always correlated back to a component, on which state is kept, and to which services, customers, and other components are linked. Events are explicitly not classified, but are allowed to be free form. Multiple events (and tickets) per component are supported, but a component can only be in one state. Since events are free form, the tools do not attempt to correlate up events to down events. Instead this task is delegated to the layer 2 (system management) application, which is much better equipped to do this kind of low-level correlation. Specifically, the tools do not keep event state, which is required for up/down correlation of events, but only component state.

Micromuse, by contrast, does keep event state, but it does not keep component state. It keeps event state simply by keeping incoming events around. Events coming in later can be correlated with events already there. This is not only important for up/down correlation, but also for the collapsing of multiple identical events (something Micromuse calls “deduplication”). Deduplication works simply by constructing a primary key out of all the event attributes that define it as unique. Up/down correlation cannot work that way. Instead, Micromuse includes a set of standard automations that handle this task. In order to accomplish it, events should be identifiable unambiguously by a standard set of attributes. In addition, an attribute should record up/down status. The latter attribute is called “Type”. In order to identify an event, Micromuse uses four attributes: Node, Manager, AlertGroup and AlertKey. Between the up and the down events, these fields should have the same values in order for the correlation to match them up.

In order to understand how to map Micromuse events to tools components, it is important to understand how they conceptually relate to each other. An event originates from a component. It might be sent by the component, or, much more commonly, it can be generated by an agent or probe monitoring the component. Multiple events can originate from a component. If we choose the granularity of component decomposition right, a component and a monitor together uniquely identify an event. So, event = component + monitor.

This is clear enough. Unfortunately the terminology of the Micromuse attributes does not clearly indicate where the monitor is. Is it AlertKey, AlertGroup, Manager, or perhaps some permutation of these? Additionally, how do we map the tools’ four component attributes? Can we only use the Micromuse Node attribute here?

Analyzing existing OpenView, CA and Tivoli events, we note the following:

  • AlertGroup is hardly used
  • Manager is either a general indication of the management system (Openview NMS) or a specific instance of it (TEC-tecsvr01)
  • AlertKey is sometimes used as ComponentInstance (qfe0), but often it is empty
  • Node sometimes contains the IP address, sometimes a hostname, and sometimes something else entirely

Recommendation:

  • Use AlertGroup and Manager together as the monitor specification
  • Equate AlertKey to Component Instance
  • Create a new attribute for Component Class
  • Consistently use the IP address in Node, and equate it to Address
  • Create a new attribute for AddressSpace, but fill it only when necessary, in a dedicated automation or in the adapter (a sketch of this mapping follows the list).
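A minimal Python sketch of the recommended mapping (purely illustrative; how Manager and AlertGroup are combined into a single monitor specification is left open, so they are simply paired here):

def to_tools_component(event):
    # Map a Micromuse event onto the tools' component and monitor attributes,
    # following the recommendation above.
    return {
        "Address":           event["Node"],                     # always the IP address
        "AddressSpace":      event.get("AddressSpace"),          # new attribute, only when needed
        "ComponentClass":    event.get("ComponentClass"),        # new attribute
        "ComponentInstance": event.get("AlertKey"),              # e.g. qfe0
        "Monitor":           (event["Manager"], event["AlertGroup"]),  # monitor specification
    }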

Written by Han

February 18, 2003 at 00:46

Posted in Uncategorized

Severity and up/down status in Tools and Micromuse

Events coming into Omnibus have severity levels 0 to 5. Only events of severity 4 or 5 (critical or fatal) are forwarded to the tools. In order to automatically close problems, Tools supports the notion of “up” events, which indicate that a problem has been resolved. An “up” event has to be correlated to its corresponding “down” event, which is one of the severity 4 or 5 events that were forwarded in the first place.

In previous integrations with Tivoli and CA Unicenter, a severity level of 0 was used to indicate an up event. However, Tools itself keeps the notion of up and down distinct from severity levels. Two separate attributes are used, “Severity” and “TicketAction”. In fact, it is perfectly normal, and encouraged, to have an “up” event with a severity of critical or fatal. This signals that a critical or fatal event will be closed. The Tivoli and CA adapters into Tools made sure the severity of the up event was set to the original severity of the down event. One place where this is important is in the service correlator, which only looks at severity=fatal events, irrespective of whether they are up or down.

It turns out that Micromuse, in contrast to Tivoli and CA, has a similar concept, keeping severity and up/down status distinct. In Micromuse there is, in addition to the Severity attribute, a “Type” attribute that indicates up/down status, where a value of 1 means down or “problem” and 2 means up or “resolution”.

The Omnibus server comes with a couple of standard automations that correlate up events to down events based on Type (up/down status) and component.

Written by Han

February 17, 2003 at 23:57

Posted in Uncategorized

IBM / SPDE visit report

Feb 7, 2003

IBM presents a product suite called “Service Provider Delivery Environment” (SPDE, pronounced “speedy”). This is a product bundle around WebSphere Business Integration (WBI), with standard adapters and integrations (MetaSolv provisioning among others), and a “common object model” based around TeleManagement Forum’s NGOSS (Next Generation OSS).

Background:
IBM acquired CrossWorlds about a year ago and rechristened the CrossWorlds product WBI. This is a separate product from WebSphere Application Server. WBI also uses MQ Workflow at run-time, which is integrated with Holosofx (another IBM acquisition) workflow design and analysis tools. WBI works around a common object model. Products that integrate with WBI transform their native objects and data structures into the common object model in adapters. Workflow then works with this common object model. The data in this common object model is not persistent beyond workflow processes. Only identity mappings are persisted. As mentioned before, this common object model is based on an object model from NGOSS.

Siebel bought into this concept and named it UAN (Universal Application Network). In addition to a common object model, it also contains standard processes. Siebel and IBM cooperate on UAN. However, WBI supports any “common” object model and is not bound to what UAN prescribes.

The common object model is not XML-based. XML is just perceived as one of the serialization options of those objects. Consequently, transformations are not based on XQuery or XSL. Two lab-setting demos were shown by conference call: one involving ADSL provisioning, the other implementing Micromuse – Siebel integration through WBI. Apparently, Micromuse will support WBI directly through an adapter that is currently under implementation.

Seeing what’s coming up in both WebSphere and WebLogic on a single day provided some nice insights into where they are heading and how they differ. The most striking difference is the presence of the common object model in WBI, whereas WLI makes no such assumption. Instead, WLI makes data transformations and aggregations easy using Liquid Data and XQuery support. Additionally, in WLI, XML and web service support is at the core of the product, whereas in WBI it is merely one of the integration options.

The problem with a common object model is just that: it has to be common among a group of vendors, often with conflicting interests and requirements. Standardization on such a model is tedious and time-consuming and has never been very successful in the past (both OASIS and Microsoft have attempted it, but neither fared very well). If agreement on such a model can be reached, it usually turns out that time has passed and “things” have changed: requirements are different, and previously unanticipated usage needs to be supported. Change, it turns out, is the only constant factor. Therefore BEA’s strategy, focusing on ease of transformations, makes much more sense going forward.

In addition to this, the BEA suite seems much more tightly integrated and seems to offer better development support. Nevertheless, the sheer weight of IBM will probably still make WBI an important player. As for our OSS, it seems clear that for the problem management side both WLI and WBI are overkill. This is because problem handling deals with a multitude of lightweight events that don’t benefit from the relatively heavyweight workflow, translation services and transaction support that WLI and WBI offer. Only from a provisioning standpoint will these players gain importance, as provisioning touches various enterprise subsystems (billing, inventory management, CRM) in a process-transactional fashion.

Written by Han

February 12, 2003 at 01:08

Posted in Uncategorized