Archive for March 2004
The TestFlag event attribute is now mapped between ObjectServer alerts.status and Event. Since the Siebel mapping was already done, it should now be fully supported; testing follows tomorrow. Injected events should also set the TestFlag, which will require rules file changes in some cases (traps).
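As a rough sketch, the trap rules change might look like the fragment below. The `TestFlag` field name comes from the log above; the `$test` token and its values are assumptions, since the actual trap varbind that marks an injected event is not stated.

```
# Hypothetical fragment of a trap probe rules file: injected test events
# are assumed to carry a marker token ($test is an invented name) and
# should set the TestFlag field on the alert.
if ($test == "1")
{
        @TestFlag = 1
}
else
{
        @TestFlag = 0
}
```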
The OID mapping file in the SNMP poller allows you to specify an expression. The expression is evaluated against the value read in SNMP polls using that OID. This allows simple unit conversions and the like. For example, specifying “100 – $value” replaces the SNMP poll value with 100 minus that value. A bug introduced when combining polls to the same IP address caused this expression not to be evaluated in some cases. C found this problem today; it has been fixed.
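A minimal sketch of how such an expression might be applied, assuming the mapping file uses `$value` as a placeholder for the raw polled value (the function name and evaluation strategy are illustrative, not the poller's actual code):

```python
def apply_oid_expression(expression, value):
    """Apply a mapping-file expression to a polled SNMP value.

    The expression uses the token $value for the raw polled value,
    e.g. "100 - $value". With no expression, the raw value is kept.
    """
    if not expression:
        return value  # no expression configured for this OID
    # Substitute the raw value and evaluate the resulting arithmetic.
    # Builtins are stripped so only plain arithmetic can run.
    return eval(expression.replace("$value", repr(float(value))),
                {"__builtins__": {}})

print(apply_oid_expression("100 - $value", 37))  # 63.0
```

The important behavioural point from the bug fix is that this step must run for every poll result, including polls that were combined with others to the same IP address.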
The dynamic instance refresh was implemented as described. Also, on startup of the poller, and on reread of the config after a HUP signal, the instance mapping is no longer read immediately. Instead it is read on demand when the first poll that needs it is about to happen. This spreads the load and means that timeouts do not cause problems.
Additionally, sending collected data to the datacollector used to be done asynchronously because it was so slow. That has now been reverted to a synchronous action, which has the advantage that a failed send does not cause the collected data to be thrown away: the poller simply keeps collecting and resends at the next interval.
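The keep-and-resend behaviour can be sketched as follows. The class and callback names are made up for illustration; `send_func` stands in for the real synchronous send to the datacollector:

```python
class Collector:
    """Keep collected samples queued until a send succeeds."""

    def __init__(self, send_func):
        self.send = send_func
        self.pending = []  # samples not yet accepted by the datacollector

    def collect(self, sample):
        self.pending.append(sample)

    def flush(self):
        # Synchronous send: discard the data only once the send succeeds.
        # On failure the samples stay queued and go out next interval.
        if self.pending and self.send(list(self.pending)):
            self.pending.clear()

# Usage: a send that fails once, then succeeds.
results = iter([False, True])
c = Collector(lambda batch: next(results))
c.collect({"oid": "1.3.6.1.2.1.1.3.0", "value": 1234})
c.flush()   # send fails: the sample is kept
c.collect({"oid": "1.3.6.1.2.1.1.3.0", "value": 1296})
c.flush()   # send succeeds: both samples go out, queue is cleared
print(len(c.pending))  # 0
```

The asynchronous version lacked exactly this property: a send that went missing took its batch of collected data with it.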
The SNMP poller has a feature called Dynamic Instance Mapping. This is a means of mapping human-readable Smart24 component instance names (like “/tmp” or “inetinfo.exe”) to SNMP instance numbers. It does this by walking two SNMP tables, one containing the readable name, the other containing the SNMP instance. The poller currently rereads the tables at a low (but configurable) interval. The problem is that it reads the tables for every applicable IP in one loop, and it does so synchronously. If an IP is not reachable, the loop waits for a timeout before proceeding. This can make the refresh take a long time, and it limits the scalability of the poller.
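Joining the two table walks might look like the sketch below. The function name is invented and `walk` stands in for a real SNMP table walk that yields (row index, value) pairs; the OIDs in the usage example are made up:

```python
def build_instance_map(walk, names_oid, instances_oid):
    """Map human-readable component names to SNMP instance numbers.

    Walks two tables that share row indices: one holding the readable
    name, the other holding the SNMP instance number.
    """
    names = dict(walk(names_oid))          # row index -> readable name
    instances = dict(walk(instances_oid))  # row index -> SNMP instance
    return {name: instances[idx]
            for idx, name in names.items() if idx in instances}

# Usage with canned walk results (OIDs invented for illustration):
tables = {
    "1.3.6.1.4.1.9999.1.2": [(1, "/tmp"), (2, "inetinfo.exe")],
    "1.3.6.1.4.1.9999.1.3": [(1, 17), (2, 42)],
}
mapping = build_instance_map(tables.get,
                             "1.3.6.1.4.1.9999.1.2",
                             "1.3.6.1.4.1.9999.1.3")
print(mapping)  # {'/tmp': 17, 'inetinfo.exe': 42}
```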
To remedy this, the refreshes will be spread out randomly through time, per IP. Each refresh will still be synchronous, but since only one instance mapping is read at a time, the scalability problem is solved.
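The random spreading could be as simple as giving each IP its own next-refresh time at a random offset within the refresh interval (function and parameter names are illustrative):

```python
import random

def schedule_refreshes(ips, interval, now=0.0):
    """Spread instance-mapping refreshes randomly over one interval.

    Instead of walking every IP back to back in one loop, each IP gets
    its own next-refresh time, so a timeout on one unreachable IP no
    longer delays when all the others are due.
    """
    return {ip: now + random.uniform(0, interval) for ip in ips}

schedule = schedule_refreshes(["10.0.0.1", "10.0.0.2", "10.0.0.3"], 300.0)
```

Only the scheduling is randomized; each individual refresh still runs synchronously when its time comes.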
The complexity of our automations could be greatly reduced if we read from and wrote into the in-memory Sybase database directly. No more round-about integration tricks where we need to inject new Notification events, only to delete them a little later, etc. The problem is that the current method is the official and approved way to do it. The JDBC way, although conceptually much cleaner, is not supported by Micromuse. Since ultimately what we need is an easy-to-maintain system, I propose giving this a serious look in the next cycle.
While reviewing these automations, I found that events that do not have a ticket will never be closed. Although they were selected, the automations would then try to forward the resolution event. In the normal case, the corresponding ticket would be closed and a notification of this would be sent back into Netcool, where the event would be closed as well. If there was no ticket, however, there was nothing to close, and so no notification ever came back into Netcool.
Worked with W. to fix this. We modified the third generic clear automation to forward resolutions only if Netcool knows of a ticket (Smart24Status = 3). We then added a fourth automation that simply clears the problem event and deletes the resolution event in all other cases.
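The decision now split across the two automations can be sketched like this. The `Smart24Status = 3` value is from the log above; the function and callback names are stand-ins for the real automation actions:

```python
def handle_resolution(problem_event, resolution_event, forward, clear, delete):
    """Sketch of the logic split over the third and fourth clear automations.

    Smart24Status == 3 means Netcool knows of a ticket; forward, clear
    and delete stand in for the real automation actions.
    """
    if problem_event.get("Smart24Status") == 3:
        # Third automation: a ticket exists, so forward the resolution.
        # The ticket gets closed, and the close is notified back into
        # Netcool, which then closes the event as well.
        forward(resolution_event)
    else:
        # Fourth automation: no ticket, nothing to close remotely, so
        # clear the problem event and delete the resolution event here.
        clear(problem_event)
        delete(resolution_event)
```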
There is one potential problem with this: in cases where a resolution quickly follows a problem, the ticket-opened notification might not have reached Netcool yet. In those cases the Netcool event would be closed, but the ticket never would be. In practice, however, this is quite unlikely, since polls are always minutes apart, and we can configure the SSMs to send traps no less than a minute apart.
It might be good, however, to think a little more about a solution that does not have this drawback.
This solution has now been drawn up and coded, but it has not been installed yet.
Tried to combine all 532 polls to the core switches into a single poll. This failed with a timeout. I now cap the number of combined polls at 50, which seems to work.
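Splitting the polls under that cap amounts to simple chunking (names are illustrative, and the OIDs in the usage line are placeholders):

```python
MAX_COMBINED = 50  # combining all 532 polls in one request timed out

def batch_polls(oids, max_per_poll=MAX_COMBINED):
    """Split a list of OIDs into combined polls of at most max_per_poll each."""
    return [oids[i:i + max_per_poll]
            for i in range(0, len(oids), max_per_poll)]

batches = batch_polls(["1.3.6.1.2.1.2.2.1.10.%d" % i for i in range(532)])
print(len(batches))       # 11 combined polls
print(len(batches[-1]))   # 32 OIDs in the last, partial one
```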
The result of all this is that instead of 4000+ polls per 5 minutes, we now have only around 150.