The Secret Diary of Han, Aged 0x29

twitter.com/h4n

Archive for February 2004

Some alerts not formatted right

Some alerts are not properly formatted yet. CPU Utilization for CPU 0 on ip.ad.dr.ess has this Alert Key: CPU 0: sparcv9 440 Mhz (FPU: sparcv9). It should be simply “0”. It is an SSM Gen-alarm.

Also, probe disconnects (connection watch and probe watch) are not set up properly yet. They use the hostname instead of the IP address, and have a long number as the alert key.

Written by Han

February 26, 2004 at 15:21

Posted in Uncategorized

Automatic closure of non-critical and non-fatal events

Currently only critical and fatal alerts are forwarded to Tools and have tickets opened for them. Only these events are closed automatically when resolutions come in. We could also auto-close non-forwarded events (those that are Severity 1, 2 or 3). This should be easier, since we don’t have to wait for a notification of ticket closure. Any thoughts, anyone?

In addition, what do we do with events that are Severity 4 or 5, but are not forwarded yet because they have PrecisionStatus or ImpactStatus set to 1? Currently we don’t auto-close them, but we could, and since no ticket has been opened yet, we can close them right away.
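
To make the proposal concrete, here is a minimal sketch in Python (purely illustrative; the real logic would live in Netcool automations). The class, field and function names are hypothetical. It assumes only what is stated above: Severity 4 and 5 events are forwarded unless PrecisionStatus or ImpactStatus is 1.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical field names, loosely modelled on the Netcool columns mentioned above.
    severity: int              # 1..5; 4 and 5 are the forwarded (critical/fatal) levels
    precision_status: int = 0  # 1 means the event is held back
    impact_status: int = 0     # 1 means the event is held back


def is_forwarded(event: Event) -> bool:
    """An event is forwarded only if it is Severity 4/5 and not held back."""
    return (
        event.severity >= 4
        and event.precision_status != 1
        and event.impact_status != 1
    )


def can_close_immediately(event: Event) -> bool:
    """When a resolution arrives, a non-forwarded event has no ticket to wait for."""
    return not is_forwarded(event)
```

With this rule a Severity 2 event, or a Severity 5 event that still has PrecisionStatus 1, could be closed as soon as its resolution comes in, while a forwarded Severity 5 event still waits for the ticket closure notification.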

Written by Han

February 26, 2004 at 15:12

Posted in Uncategorized

re: Customer Portal vs. Operator Portal

The Customer and Operator portals are the same thing. It’s like the SSL-VPN portal, where customers only get access to a subset of the total functionality, whereas our admins/operators get access to everything. I’ll have a look at the stories and the naming, and perhaps combine a couple into one.

Written by Han

February 26, 2004 at 12:33

Posted in Uncategorized

Event filter for components that are in planned downtime

Tickets should not be created for components that are in planned downtime, since that would create confusion for operators and engineers. The idea is to create a simple event filter in Smart24. Smart24 already has a component database, and events are enriched with component information. We’ll flag those components that are in planned downtime, and filter events that come in for them. The filtering itself entails sending a NotOpenedNotification event back to Netcool, which will be correlated back to the original event. That event will then be demoted and put in the NotOpened status. This way the fact that an event came in is visible in Netcool for a while, but no ticket will be opened. Any comments or ideas?
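
A rough sketch of what such a filter could look like, in Python for illustration only. The dictionary standing in for the component database, the component names, and the shape of the event and notification are all assumptions; only the NotOpenedNotification name comes from the plan above.

```python
from typing import Optional

# Hypothetical stand-in for the Smart24 component database:
# component id -> is the component flagged as being in planned downtime?
PLANNED_DOWNTIME = {"router-1": True, "web-1": False}


def filter_event(event: dict) -> Optional[dict]:
    """Return a NotOpenedNotification for events on components in planned
    downtime, or None when the event should be processed normally."""
    component = event.get("component")
    if PLANNED_DOWNTIME.get(component, False):
        # The notification carries the id of the original event so that
        # it can be correlated back to that event.
        return {"type": "NotOpenedNotification", "correlates_to": event["id"]}
    return None
```

An event for an unknown component is treated here as not being in downtime, so it would still get a ticket; that default is a design choice, not something we have decided.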

Written by Han

February 26, 2004 at 12:30

Posted in Uncategorized

Severity naming inconsistency

As you are all no doubt aware, we have a slight naming inconsistency for Severity levels. In Tools, traditionally we have “Fatal” for components that are down and don’t function anymore (e.g. ping fail), and “Critical” for components that are in trouble but still function somewhat (e.g. diskspace). Unfortunately, what Tools calls Fatal is called Critical in Netcool, and what Smart24 calls Critical is Major in Netcool.

Obviously this will cause tremendous confusion among operators. So we have to make naming consistent.

I feel the term “Fatal” is more in the spirit of the meaning we assigned to that Severity (namely something that is out). Would it cause huge problems if we adopt the Smart24 naming convention by renaming the Severity levels in Netcool?
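
To make the mismatch concrete, here is the translation an operator currently has to do in their head, written as a small Python lookup. It covers only the two levels named above; the function name is made up, and leaving other names unchanged is an assumption.

```python
# Netcool severity name -> Smart24/Tools name, for the two levels named above.
NETCOOL_TO_SMART24 = {
    "Critical": "Fatal",   # component is down
    "Major": "Critical",   # component is in trouble but still functions somewhat
}


def smart24_name(netcool_name: str) -> str:
    """Translate a Netcool severity name; names not listed are passed through."""
    return NETCOOL_TO_SMART24.get(netcool_name, netcool_name)
```

Note that “Critical” appears on both sides with different meanings, which is exactly why a rename is cleaner than a lookup like this.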

Written by Han

February 26, 2004 at 09:48

Posted in Uncategorized

Making the Dispatcher more robust

The dispatcher can dispatch events to 3 kinds of listeners:

  1. Sniffers. Events are sent to all matching sniffers first.
  2. Modifiers. When an event matches the rule for a modifier, it is sent there, and further dispatching stops. The modifier is expected to send the modified event back to the Dispatcher.
  3. Endpoints. These are matched last. An event is sent to all matching endpoints.

Event dispatching is not affected by any errors that occur in Sniffers or Endpoints. However, since a modifier is expected to post the event back to the dispatcher, if an error occurs in a modifier that prevents a post-back, event dispatching stops. Since Modifiers are matched before endpoints, an event would not reach the endpoints in such a case.
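
The dispatch order described above can be sketched as follows. This is Python for illustration only, not the actual Dispatcher code; the class layout, the rule/handler pairs and the helper name are assumptions.

```python
from typing import Callable, List, Tuple

Event = dict
Rule = Callable[[Event], bool]
Handler = Callable[[Event], None]


def _call_safely(handler: Handler, event: Event) -> None:
    """Errors in sniffers and endpoints must not affect dispatching."""
    try:
        handler(event)
    except Exception:
        pass


class Dispatcher:
    def __init__(self) -> None:
        self.sniffers: List[Tuple[Rule, Handler]] = []
        self.modifiers: List[Tuple[Rule, Handler]] = []
        self.endpoints: List[Tuple[Rule, Handler]] = []

    def dispatch(self, event: Event) -> None:
        # 1. Every matching sniffer sees the event first.
        for rule, sniffer in self.sniffers:
            if rule(event):
                _call_safely(sniffer, event)
        # 2. The first matching modifier takes the event and dispatching stops;
        #    the modifier is expected to post the modified event back.
        for rule, modifier in self.modifiers:
            if rule(event):
                modifier(event)  # an error here means the event is lost
                return
        # 3. Endpoints are matched last; all matching endpoints get the event.
        for rule, endpoint in self.endpoints:
            if rule(event):
                _call_safely(endpoint, event)
```

In this sketch a modifier rule has to stop matching once the event has been modified (for example by checking that a field is still missing), otherwise the post-back would loop forever.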

In fact this situation occurred today, as the auto deployment of the ServiceCorrelator failed due to a locked PerformanceCounters.dll. Part of the ServiceCorrelator is the AddCustomer modifier. This modifier adds component and customer information to an event and then posts the enriched event back to the dispatcher. Because deployment failed, the AddCustomer modifier was not running, and dispatching the event to this modifier resulted in an exception. Since there obviously was no post-back from this modifier, events did not reach their endpoints.

This situation was remedied today by having the Dispatcher check the result of each dispatch to a modifier. If an exception occurs, either because the modifier cannot be reached or because the modifier faults while processing the event and returns a SOAP error, the Dispatcher now skips that modifier. To enable this, the call that dispatches to Modifiers had to be made synchronous. This poses a problem for the SocketGatewayListener. Since this Perl program is single-threaded, its call to the Dispatcher is synchronous and blocks until a result is returned. The synchronous dispatch to a modifier therefore cannot occur in the same thread that called the Accept method, as this would cause the SocketGatewayListener to wait for the timeout of any call from the Dispatcher to a Modifier! Dispatching events in the Dispatcher is therefore now delegated to a different thread on the threadpool, so the Accept call can return immediately. The dispatch to Modifiers is still synchronous with respect to dispatches to other listeners, but asynchronous with respect to the calling thread.
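
The pattern can be sketched like this, again in Python purely for illustration. The class and method names, the pool size and the returned Future are assumptions of the sketch; in particular it assumes that “skipping” a failed modifier means carrying on with the remaining modifiers and then the endpoints.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Tuple

Event = dict
Rule = Callable[[Event], bool]
Handler = Callable[[Event], None]


class RobustDispatcher:
    def __init__(self) -> None:
        self.modifiers: List[Tuple[Rule, Handler]] = []
        self.endpoints: List[Tuple[Rule, Handler]] = []
        self._pool = ThreadPoolExecutor(max_workers=4)

    def accept(self, event: Event) -> Future:
        # Hand the work to a pool thread so the caller is not blocked
        # while the dispatcher waits on a slow or unreachable modifier.
        return self._pool.submit(self._dispatch, event)

    def _dispatch(self, event: Event) -> None:
        for rule, modifier in self.modifiers:
            if not rule(event):
                continue
            try:
                modifier(event)  # synchronous: wait for the result
                return           # the modifier took over and will post back
            except Exception:
                continue         # unreachable or faulting modifier: skip it
        for rule, endpoint in self.endpoints:
            if rule(event):
                try:
                    endpoint(event)
                except Exception:
                    pass         # endpoint errors do not affect dispatching
```

The Future returned by accept is only there to make the sketch easy to exercise; a single-threaded caller like the SocketGatewayListener would simply ignore it.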

Written by Han

February 26, 2004 at 09:12

Posted in Uncategorized

PrecisionStatus in AB and AC

W, I doubt whether the AB and AC automations need to check for PrecisionStatus too. After all, the Opened and Closed notifications will only occur when an event has been raised to Smart24 (using the AA automation), and for that its PrecisionStatus had to be clear. What would happen if the PrecisionStatus of an event that had been raised were modified to 1 before the notification occurs? This event would remain stuck in a Status of 2, since the AB or AC automation wouldn’t fire on it. Therefore it may be better to remove the status check from AB and AC. Or did I miss something?

I modified the time before closed events are cleared to 1 hour, so that we can see closed events a little easier before they are deleted. What would be a good value for a production environment?

Written by Han

February 26, 2004 at 08:57

Posted in Uncategorized

More namespace issues

Solving the Event namespace issue yesterday caused the Dispatcher to break, since it did not properly handle anything but the default namespace. This has been fixed today. Before each Event child, an “s:” prefix now indicates the http://ntt.com/webservices/oss namespace. This “s” has been predeclared as an abbreviation for the namespace to keep the rules readable. For some reason, Event itself does not need this namespace prefix. I will investigate why this is so next week.

Using SOAPscope I verified that the message format on the wire is correct, i.e. the namespace is set on Event, and the child elements don’t declare a namespace themselves but inherit it from Event.
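
The effect can be reproduced with any namespace-aware XML library. Below is a small Python example using the standard xml.etree.ElementTree module rather than the Dispatcher’s own rule engine. The child element names Severity and Node are made up for the example; only the namespace URI and the “s” prefix come from the text above.

```python
import xml.etree.ElementTree as ET

# Prefix map used when matching, like the predeclared "s" in the rules.
NS = {"s": "http://ntt.com/webservices/oss"}

# On the wire the namespace is declared once on Event; the children inherit it.
message = (
    '<Event xmlns="http://ntt.com/webservices/oss">'
    "<Severity>5</Severity>"
    "<Node>host1</Node>"
    "</Event>"
)

event = ET.fromstring(message)

# An unprefixed name means "no namespace", so it does not match the child.
unqualified = event.find("Severity")

# With the prefix the child is found, even though the message itself
# never uses a prefix.
qualified = event.find("s:Severity", NS)
```

Here unqualified is None while qualified is the Severity element, which is the same kind of failure the Dispatcher rules showed before the “s:” prefix was added.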

Written by Han

February 19, 2004 at 16:41

Posted in Uncategorized

Fixed ServiceCorrelator tests

I ran the ServiceCorrelator tests in the NUnit GUI this afternoon and found that they failed. This was surprising, since the tests seemed to succeed in the daily build. It turned out that the tests in the build failed as well, but due to a known problem in NAnt, the tests timed out before they got that far. They timed out because they take quite a while to execute, since a new Smart24 database is created and torn down on the fly for the test.

I fixed the problem by repairing the SQL script that creates the Smart24 database. When we changed the CustomerId and CustomerServiceIds to varchar(512), because they now contain URLs instead of int IDs, we forgot this database creation script. I fixed the timeout problem as well, by creating a slimmed-down database that only contains the Components and Dependencies tables.

It turned out eventually that the tests after the timed-out Service Correlator tests in the build were never run. All of these tests, which are Siebel Adapter related, fail, but we never noticed.

Written by Han

February 18, 2004 at 18:58

Posted in Uncategorized

Event serialization bug fixed

Smart24 events are open-ended: in addition to a number of required attributes, any optional attributes can be added. Upon serialization of a C# event object, those optional attributes (child elements of Event in the XML message) landed in the default namespace, whereas they should have been in the http://ntt.com/webservices/oss namespace. This was fixed today, and two test cases were added to verify the behavior.
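
For illustration, this is what the intended behavior looks like in a small Python sketch using the standard xml.etree.ElementTree module. The real serializer is C#, so this is not the actual fix; the function name and the attribute names used with it are invented.

```python
import xml.etree.ElementTree as ET

OSS = "http://ntt.com/webservices/oss"


def serialize(required: dict, optional: dict) -> str:
    """Serialize an event so that required AND optional attributes
    end up as child elements in the OSS namespace."""
    # Emit the OSS namespace as the default namespace instead of a prefix.
    ET.register_namespace("", OSS)
    event = ET.Element(f"{{{OSS}}}Event")
    for name, value in {**required, **optional}.items():
        # Qualify every child explicitly; leaving the namespace off here
        # is the kind of mistake that put optional attributes in no namespace.
        child = ET.SubElement(event, f"{{{OSS}}}{name}")
        child.text = str(value)
    return ET.tostring(event, encoding="unicode")
```

A check in the spirit of the two new test cases would parse the result back and verify that every child of Event carries the OSS namespace.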

Written by Han

February 18, 2004 at 18:45

Posted in Uncategorized