Archive for February 2004
Some alerts are not formatted properly yet. CPU Utilization for CPU 0 on ip.ad.dr.ess has this Alert Key: “CPU 0: sparcv9 440 MHz (FPU: sparcv9)”. It should simply be “0”. It is an SSM Gen-alarm.
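The fix could be as small as extracting the leading CPU number from the verbose key. A minimal sketch, assuming the key always starts with “CPU &lt;n&gt;” (the function name is hypothetical):

```python
import re

def normalize_cpu_alert_key(raw_key):
    """Reduce a verbose key like 'CPU 0: sparcv9 440 MHz (FPU: sparcv9)'
    to just the CPU number. Unrecognized keys are left untouched."""
    match = re.match(r"CPU\s+(\d+)", raw_key)
    if match:
        return match.group(1)
    return raw_key
```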
Also, probe disconnects (connection watch and probe watch) are not set up properly yet. They use the hostname instead of the IP address, and have a long number as their AlertKey.
Currently only critical and fatal alerts are forwarded to Tools and have tickets opened for them. Only these events are closed automatically when resolutions come in. We could also auto-close non-forwarded events (those with Severity 1, 2 or 3). This should be easier, since we don’t have to wait for a notification of ticket closure. Any thoughts, anyone?
In addition, what do we do with events that are Severity 4 or 5 but are not forwarded yet because they have PrecisionStatus or ImpactStatus set to 1? Currently we don’t auto-close them, but we could, and since no ticket has been opened yet, we can close them right away.
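Putting the two posts above together, the auto-close rule would look roughly like this. A sketch only; the field names and the assumption that higher Severity means worse are taken from the posts, and the function itself is hypothetical:

```python
def can_autoclose_immediately(event):
    """Decide whether a resolved event can be closed without waiting
    for a ticket-closure notification.
    - Severity 1-3: never forwarded, so no ticket exists.
    - Severity 4-5 held back (PrecisionStatus or ImpactStatus == 1):
      no ticket was opened yet either.
    - Otherwise a ticket exists; wait for the closure notification."""
    if event["Severity"] <= 3:
        return True
    if event.get("PrecisionStatus") == 1 or event.get("ImpactStatus") == 1:
        return True
    return False
```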
Customer and Operator portal are the same thing. It’s like the SSL-VPN portal, where customers only get access to a subset of the total functionality, whereas our admins/operators get access to everything. I’ll have a look at the stories and the naming, and perhaps combine a couple into one.
Tickets should not be created for components that are in planned downtime, since that would create confusion for operators and engineers. The idea is to create a simple event filter in Smart24. Smart24 already has a component database, and events are enriched with component information. We’ll flag the components that are in planned downtime, and filter events that come in for them. The filtering itself entails sending a NotOpenedNotification event back to Netcool, which will be correlated back to the original event. That event will then be demoted and put in the NotOpened status. This way the fact that an event came in remains visible in Netcool for a while, but no ticket will be opened. Any comments or ideas?
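The filter described above could be sketched as follows. This is a sketch under assumed names (`ComponentId`, `EventId`, the shape of the component database and of the NotOpenedNotification are all illustrative, not the real Smart24 schema):

```python
def filter_event(event, component_db, send_to_netcool):
    """Suppress ticket creation for components in planned downtime.
    component_db maps component id -> {'in_planned_downtime': bool}.
    For a flagged component, a NotOpenedNotification is sent back to
    Netcool, carrying the original event id so it can be correlated
    back and the original event demoted to NotOpened status."""
    comp = component_db.get(event["ComponentId"], {})
    if comp.get("in_planned_downtime"):
        send_to_netcool({
            "Type": "NotOpenedNotification",
            "CorrelationId": event["EventId"],  # ties back to the original
        })
        return False  # no ticket is opened
    return True  # proceed with normal ticket creation
```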
As you are all no doubt aware, we have a slight naming inconsistency for Severity levels. In Tools, we traditionally have “Fatal” for components that are down and don’t function anymore (e.g. ping fail), and “Critical” for components that are in trouble but still function somewhat (e.g. diskspace). Unfortunately, what Tools calls Fatal is called Critical in Netcool, and what Smart24 calls Critical is Major in Netcool.
Obviously this will cause tremendous confusion among operators. So we have to make naming consistent.
I feel the term “Fatal” is more in the spirit of the meaning we assigned to that Severity (namely, something that is out). Would it cause huge problems if we adopted the Smart24 naming convention by renaming the Severity levels in Netcool?
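For reference, the rename boils down to a two-entry mapping. Levels other than these two are assumed unchanged (that assumption is mine, not stated above):

```python
# Netcool label -> Smart24/Tools label, per the inconsistency described.
NETCOOL_TO_SMART24 = {
    "Critical": "Fatal",  # component is down (e.g. ping fail)
    "Major": "Critical",  # in trouble but still functioning somewhat
}

def rename_severity(netcool_label):
    """Map a Netcool severity label to the Smart24/Tools convention;
    labels outside the mapping pass through unchanged."""
    return NETCOOL_TO_SMART24.get(netcool_label, netcool_label)
```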
The dispatcher can dispatch events to 3 kinds of listeners:
- Sniffers. Events are sent to all matching sniffers first.
- Modifiers. When an event matches the rule for a modifier, it is sent there, and further dispatching stops. The modifier is expected to send the modified event back to the Dispatcher.
- Endpoints. These are matched last. An event is sent to all matching endpoints.
Event dispatching is not affected by any errors that occur in Sniffers or Endpoints. However, since a modifier is expected to post the event back to the dispatcher, if an error occurs in a modifier that prevents a post back, event dispatching stops. Since Modifiers are matched before end-points, an event would not reach the endpoint in such a case.
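The matching order above can be sketched like this. The `Listener` class and its `matches`/`send` methods are hypothetical stand-ins for the real listener API, purely to illustrate the order:

```python
class Listener:
    """Tiny stand-in for a registered listener (hypothetical API)."""
    def __init__(self, predicate):
        self.predicate = predicate
        self.received = []

    def matches(self, event):
        return self.predicate(event)

    def send(self, event):
        self.received.append(event)

def dispatch(event, sniffers, modifiers, endpoints):
    # Sniffers see the event first; every matching sniffer gets it.
    for sniffer in sniffers:
        if sniffer.matches(event):
            sniffer.send(event)
    # The first matching modifier takes the event and further
    # dispatching stops: the modifier is expected to post the
    # modified event back to the Dispatcher.
    for modifier in modifiers:
        if modifier.matches(event):
            modifier.send(event)
            return
    # Endpoints are matched last; every matching endpoint gets it.
    for endpoint in endpoints:
        if endpoint.matches(event):
            endpoint.send(event)
```

This also makes the failure mode visible: if a matching modifier swallows the event without posting it back, no endpoint ever sees it.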
In fact this situation occurred today, as the auto deployment of the ServiceCorrelator failed due to a locked PerformanceCounters.dll. Part of the ServiceCorrelator is the AddCustomer modifier. This modifier adds component and customer information to an event and then posts the enriched event back to the Dispatcher. Because deployment failed, the AddCustomer modifier was not running, and dispatching the event to this modifier resulted in an exception. Since there obviously was no post-back from this modifier, events did not reach their endpoints.
This situation was remedied today by having the Dispatcher check the result of the dispatch to modifiers. If an exception occurs, either because the modifier cannot be reached, or because the modifier faults while processing the event and returns a SOAP error, the modifier is now skipped in the Dispatcher.
To enable this, the call that dispatches to Modifiers had to be made synchronous. This poses a problem for the SocketGatewayListener. Since that Perl program is single-threaded, its call to the Dispatcher is synchronous and blocks until a result is returned. Therefore the synchronous dispatch to a modifier cannot occur in the same thread that handled the Accept call, as this would make the SocketGatewayListener wait out the timeout of any failing call from the Dispatcher to a Modifier! Dispatching events in the Dispatcher is therefore now delegated to a different thread on the threadpool, so the Accept call can return immediately. The dispatch to Modifiers is still synchronous with respect to dispatches to other listeners, but asynchronous with respect to the calling thread.
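Both halves of the fix can be sketched together: skip a faulting modifier so the event still reaches its endpoints, and run the whole (synchronous) dispatch on a worker thread so Accept returns immediately. The `Stub` class and function names are illustrative assumptions, not the real Dispatcher API:

```python
import concurrent.futures

class Stub:
    """Stand-in listener; fail=True simulates an unreachable modifier
    or a SOAP fault (hypothetical API)."""
    def __init__(self, predicate, fail=False):
        self.predicate = predicate
        self.fail = fail
        self.received = []

    def matches(self, event):
        return self.predicate(event)

    def send(self, event):
        if self.fail:
            raise RuntimeError("unreachable / SOAP fault")
        self.received.append(event)

def dispatch_with_skip(event, modifiers, endpoints):
    # The call to each matching modifier is synchronous so its result
    # can be checked; a modifier that raises is skipped, and the event
    # continues on to the endpoints instead of being lost.
    for modifier in modifiers:
        if modifier.matches(event):
            try:
                modifier.send(event)
                return  # modifier took the event; it will post it back
            except Exception:
                continue  # skip the failing modifier
    for endpoint in endpoints:
        if endpoint.matches(event):
            endpoint.send(event)

def accept(event, modifiers, endpoints, pool):
    # Accept hands the work to a threadpool thread and returns right
    # away, so a single-threaded caller (like the SocketGatewayListener)
    # is never blocked waiting for a modifier timeout.
    pool.submit(dispatch_with_skip, event, modifiers, endpoints)
```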
W, I doubt whether the AB and AC automations need to check for PrecisionStatus too. After all, the Opened and Closed notifications will only occur when an event has been raised to Smart24 (using the AA automation), and for that its PrecisionStatus had to be clear. What would happen if the PrecisionStatus of an event that had been raised were modified to 1 before the notification occurs? That event would remain stuck in a Status of 2, since the AB or AC automation wouldn’t fire on it. Therefore it may be better to remove the status check from AB and AC. Or did I miss something?
I modified the time before closed events are cleared to 1 hour, so that we can see closed events a little easier before they are deleted. What would be a good value for a production environment?