Archive for April 2003
Onyx would cut internal tickets without any problem. However, customer tickets were not created. Instead the following error message would be returned:
<errorDescription>クエリ エンジンは、空のリターン コードを返しました coming from wbospsiIncident</errorDescription> coming from <sourceName>wbospsiIncident</sourceName>. (The Japanese text reads: "The query engine returned an empty return code.")
Discussing this with Ken H and stepping through the wbocpscIncident configured stored procedure in Visual Studio .NET revealed that SQL Mail was failing. When a customer ticket is created, an e-mail is automatically sent to the account manager for that customer. When the xp_sendmail extended stored procedure fails, the configured code returns an error and the whole stored procedure fails. Obviously, we still want a customer ticket even if automatically forwarding a mail fails. Therefore, the code was changed to not return an error in the case of a mail failure.
Looking into the mail failure, it turned out that there was no mail profile defined for SQL Server. SQL Server needs this mail profile, which is provided by a local copy of Outlook 2000 because a MAPI provider is required, to talk over POP and SMTP to our Exchange server.
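The change described above amounts to treating the notification as best effort. The following is a minimal sketch of that idea in Python rather than T-SQL; `save_ticket`, `notify_account_manager` and `MailError` are hypothetical stand-ins for the ticket insert, the xp_sendmail call and its failure, not names from the actual Onyx code.

```python
import logging

logger = logging.getLogger("ticketing")


class MailError(Exception):
    """Raised by the (hypothetical) notifier when a mail cannot be sent."""


def create_customer_ticket(ticket, save_ticket, notify_account_manager):
    """Persist the ticket first, then try to notify the account manager.

    save_ticket and notify_account_manager are hypothetical callables that
    stand in for the database insert and the mail call respectively.
    """
    ticket_id = save_ticket(ticket)
    try:
        notify_account_manager(ticket_id)
        mail_sent = True
    except MailError as exc:
        # A failed mail is logged, not propagated: the ticket must survive.
        logger.warning("mail for ticket %s failed: %s", ticket_id, exc)
        mail_sent = False
    return ticket_id, mail_sent
```

Returning the `mail_sent` flag keeps the failure visible to the caller without making it fatal.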
The main problem we recently had with Micromuse is that once every few minutes the whole system would become completely unresponsive for a while. The GUI would stop, as would the processing of events. What’s more, this period of unresponsiveness was apparently growing longer and longer, up to a few minutes at the end. This turned out to be because the Micromuse ObjectServer is an in-memory SQL database that writes a dump of its memory out to disk once every few minutes, so the more events it holds, the longer each dump takes. On top of that, we got a disproportionately large number of events from the Firewall/1 probe (over a hundred thousand distinct events). Deactivating this probe and cleaning out the database solved the problem.
We should really take this as a warning that there are definite and not-so-far-away limits to the scalability of the ObjectServer. More testing is necessary to find out where the limits are and how to scale. Of course, in the tools we can always employ multiple ObjectServers reporting into the same infrastructure.
In addition to this show-stopping problem, a number of other problems popped up. For one, Micromuse alerts go through a lifecycle of being generated, raised and then resolved. Resolved events “sit” in the system for a while before being deleted. A fresh event that is the same as a resolved event in Micromuse will “reuse” the same event database record, updating its status and its severity. The other relevant fact here is that every Micromuse event receives a reference number when it is raised. This number is used to update or close the associated helpdesk ticket. The reference number was computed by taking the Micromuse event serial number and appending it to the ObjectServer instance name. However, when a fresh event reuses the same event slot, it gets an identical reference number (since the event serial number remains unchanged), so the wrong ticket could potentially be updated. To remedy this, the reference number was made unique by also including the Tally (count) of the alert. In contrast to the serial number, the Tally is updated on each new entry.
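To make the collision concrete, here is a small sketch of the two reference number schemes. The `Alert` class, the instance name and the `-` separator in the revised format are assumptions for illustration; only the ingredients (instance name, serial number, Tally) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    serial: int  # unchanged when a resolved record is reused by a fresh event
    tally: int   # count of the alert, updated on each new entry


def old_reference(instance: str, alert: Alert) -> str:
    """Original scheme: instance name followed by the serial number."""
    return f"{instance}{alert.serial}"


def new_reference(instance: str, alert: Alert) -> str:
    """Revised scheme: the Tally is included as well (separator is assumed)."""
    return f"{instance}-{alert.serial}-{alert.tally}"


# A resolved event and the fresh event that reuses its database record.
resolved = Alert(serial=1042, tally=1)
fresh = Alert(serial=1042, tally=2)
```

With the old scheme both alerts map to the same reference, so an update for `fresh` would land on the ticket that belongs to `resolved`; with the revised scheme the two references differ.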
Finally, the “up” events were able to close “down” events that occurred after the up event. This allowed Micromuse to get out of sync with the event producing system.
In preparation for a little internal demonstration of the tools, quite a few, though generally quite minor, bugs were found and fixed.
The Event class (PMEvent).
The purpose of the Event class is to support Problem management events (sometimes also called alerts). The class is purely a data container, which holds a small number of standard attributes plus any number of additional, non-standard attributes. These non-standard attributes serialize as an [XmlAnyElement] XmlElement array. In the current implementation the Event class is just a facade around an XmlDocument. In order to support the XmlSerializer, the standard attributes have dedicated property members. To differentiate between standard and non-standard attributes, a static string array is used. This led to a problem, as the same information is stored in two places: the property member and the string array. And naturally what happened was that a property name was changed without a corresponding change in the string array. This was fixed, but a more permanent fix would be to fill the array using reflection. Additionally, one may question the deserialization into properties that are then only used to create a DOM instance. Direct access to the XML streams should be preferred here, as recently described by Tim Ewald.
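The "fill the array using reflection" idea can be sketched as follows. This is a Python analogue using introspection rather than the .NET reflection API, and the `PMEvent` shown here is a hypothetical stand-in with made-up attribute names (`source`, `severity`), not the real class.

```python
class PMEvent:
    """Hypothetical stand-in for the Event class; attribute names are illustrative."""

    def __init__(self, source, severity, **extra):
        self._source = source
        self._severity = severity
        self.extra = dict(extra)  # non-standard attributes

    @property
    def source(self):
        return self._source

    @property
    def severity(self):
        return self._severity


def standard_attribute_names(cls):
    """Collect the names of all properties declared on cls via introspection."""
    return sorted(
        name for name, member in vars(cls).items() if isinstance(member, property)
    )


# Derived from the class itself, so renaming a property cannot leave
# a stale entry behind in a hand-maintained list.
STANDARD_ATTRIBUTES = standard_attribute_names(PMEvent)


def is_standard(name):
    return name in STANDARD_ATTRIBUTES
```

Because the list is computed from the property members, there is only one place where a standard attribute is declared.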
One more addition to what was said before. If there are multiple datastreams at a component, with identical measurement type, it would be handy to label one as default. Unless special action is taken (an extra attribute in the graph URL), only the default graph stream will be returned to graphstream queries.
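A minimal sketch of that default selection rule, assuming a simple in-memory list of datastreams. The `DataStream` fields and the `include_non_default` flag (standing in for the extra attribute in the graph URL) are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DataStream:
    name: str
    measurement_type: str
    is_default: bool = False


def select_streams(streams, measurement_type, include_non_default=False):
    """Return the streams of one measurement type, default only unless asked."""
    matching = [s for s in streams if s.measurement_type == measurement_type]
    if include_non_default:
        return matching
    return [s for s in matching if s.is_default]


streams = [
    DataStream("bw-avg", "bandwidth", is_default=True),
    DataStream("bw-max", "bandwidth"),
    DataStream("cpu-avg", "cpu", is_default=True),
]
```

A plain query such as `select_streams(streams, "bandwidth")` then yields only `bw-avg`, while passing `include_non_default=True` yields both bandwidth streams.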
As for the reporting services interfaces, they are in a bit of a mess right now. Some guidance is needed, which will be provided right here and now:
- Get everything you need in one pass
- But allow enough flexibility (which might require more passes)
- Do things that can be done on the client, on the client, after the relevant information is obtained from the server
In the context of the reporting service this translates to:
- A service that takes a component (or more components) and returns all datastreams of those components and optionally their children, and also optionally non-default datastreams (1 and 2)
- Client-side functionality that provides consolidation to easily create plots and graphs out of datastreams, using a variety of common consolidation mechanisms. The output is in the form of URLs for inclusion on a reporting website, or an XML doc suited for input into the following service (3)
- A service that takes an XML doc or a URL describing a graph, and returns the created graph as a Base64-encoded binary gif or png file.
- A service that takes a graph stream or URL indicating a graph stream and returns the raw data for an indicated period of time
It could be argued that component selection and datastream retrieval should be decoupled completely. However, due to principle 1 (get all you need) and the fact that it is envisioned we are often going to need to find those datastreams from component children, this functionality is included in the service.
Consolidation, which is now provided by a service, should be moved to the client, in order to minimize round trips and allow for flexible configurations (e.g. forwarding clients that are webservices in a DMZ). Additionally, this will blend in well with graph layout and customization functionality that is already present in the client.
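As an illustration of what client-side consolidation could look like, the sketch below reduces raw samples to one value per time bucket. It assumes that "consolidation" here means RRD-style reduction with avg, min or max, and that raw data arrives as `(timestamp, value)` pairs; both are assumptions, as are the function and parameter names.

```python
from statistics import mean

# Consolidation functions keyed by a short name (names are assumed).
CONSOLIDATORS = {"avg": mean, "min": min, "max": max}


def consolidate(samples, step, function="avg"):
    """Reduce (timestamp, value) samples to one value per step-second bucket."""
    reduce_fn = CONSOLIDATORS[function]
    buckets = {}
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % step
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, reduce_fn(values)) for start, values in sorted(buckets.items())]


raw = [(0, 1.0), (60, 3.0), (300, 10.0), (360, 20.0)]
five_minute_avg = consolidate(raw, step=300, function="avg")
five_minute_max = consolidate(raw, step=300, function="max")
```

Since this runs entirely on data the client already holds, trying a different step or function needs no extra round trip to the server.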
Now, let’s implement this quickly, and then follow up with decent component selection, based on things like customer ownership, customer service involvement, device types, location, etc.
In order to generalize the peak values functionality described yesterday, while minimizing implementation impact, it seems better not to store peak values (those consolidated with min and max) in a single RRD data file as multiple data sources. Instead, keeping them around as separate datastreams seems preferable.
The “consolidationfunction” attribute will move out of MeasurementType and into DataStream. A component could, until now, contain only one datastream of a particular measurementtype. From now on, a component can actually have multiple datastreams for a single measurement type, each with a different consolidation function.
At data collection time, data comes in for a combination of component and measurement type. Multiple datastreams may now be found, and they will all be updated from the same incoming SampleCollection.
At graph generation time, all datastreams are treated equally. In fact, no changes to code are necessary there to support this change. In order to show, say, average and max bandwidth usage info for a certain device, we select both datastreams with the appropriate measurement type and compose the graph as we wish. Additionally, peak streams can be collected at sparser intervals (using a different RRD setup), since the first RRA can be forgone. Finally, all peak data is available for consolidation like any other data, and configurations with avg, min and max values are possible.
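The revised data model and the collection step might look roughly like this. It is a sketch under assumptions: the class layout, the `points` list standing in for an RRD file, and the `collect` method are invented for illustration; only the move of the consolidation function into the datastream and the one-to-many update follow the text.

```python
from dataclasses import dataclass, field
from statistics import mean

CONSOLIDATION = {"avg": mean, "min": min, "max": max}


@dataclass
class DataStream:
    measurement_type: str
    consolidation_function: str  # now lives here instead of on MeasurementType
    points: list = field(default_factory=list)  # stand-in for the RRD storage

    def update(self, samples):
        reduce_fn = CONSOLIDATION[self.consolidation_function]
        self.points.append(reduce_fn(samples))


@dataclass
class Component:
    name: str
    datastreams: list = field(default_factory=list)

    def collect(self, measurement_type, samples):
        """Feed one incoming sample collection to every matching datastream."""
        matching = [d for d in self.datastreams if d.measurement_type == measurement_type]
        for stream in matching:
            stream.update(samples)
        return len(matching)


router = Component("router1", [
    DataStream("bandwidth", "avg"),
    DataStream("bandwidth", "max"),
    DataStream("cpu", "avg"),
])
updated = router.collect("bandwidth", [10.0, 30.0])
```

Here one incoming collection updates both bandwidth streams, each with its own consolidation function, while the cpu stream is left untouched.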