Archive for October 2006
System uptime graphs on SSM machines really reflect agent uptime, not system uptime. Whenever the agent is reset, the graph drops down to zero. This leads to customer questions which annoy service managers. They demand a change.
The system uptime graphs, up to now, use a standard MIB-2 counter called sysUpTime. The SSM supports it too, but alas, reinterprets it to mean “agent uptime”. The SSM has a different counter that measures real machine uptime. This counter should be used instead. The problem is, only the SSM supports it. Other devices should keep using sysUpTime.
This leads to our configuration problem. We have components of class “system”, that show graphs with measurement type “SystemUptime”. However, this SystemUptime should map to different OIDs for different devices.
To configure this, we need to use 2 SNMP poller configuration concepts: mibnames and profiles.
First off, the (new) SNMP poller allows a different OID mapping for the same measurement type, by using a mibname. OID mappings are configured in “measurementtypes.conf”, which contains mappings from measurement type to OID. For example:
To support different mappings, the measurement type can be postfixed with a mibname, like so:
There is one restriction: All measurement types for a class must use the same mib name. In other words, the mibname is configured at the class level. The system uptime belongs to the system class. It is therefore necessary to define a class system configuration for mibname ssm in classes.conf:
Since the class is configured to use mibname ssm, when mapping measurements to OIDs, the mibname is automatically appended to the measurement type. SystemUptime will look for SystemUptime:ssm and LoadAvg15 uses LoadAvg15:ssm.
Any node that is configured with profile “server” will use the system:ssm class, and will measure uptime using the SSM hrSystemUptime counter. Note that the mibname concept is purely an SNMP poller configuration thing. It is dropped when communicating with the DataCollector, so that from the point of view of the other components (data collector and so on), we just have a SystemUptime graph for a system component.
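To make the lookup rule concrete, here is a minimal Python sketch of how a measurement type plus a class-level mibname could resolve to an OID. It is not the poller's code and it does not reproduce the real file formats: the dictionary, the function names, and the assumption that the SSM counter lives at the standard HOST-RESOURCES-MIB hrSystemUptime OID are all mine.

```python
# Hypothetical in-memory form of the measurement type to OID mappings.
# The real measurementtypes.conf syntax is not reproduced here.
OID_MAPPINGS = {
    "SystemUptime": "1.3.6.1.2.1.1.3.0",         # MIB-2 sysUpTime
    "SystemUptime:ssm": "1.3.6.1.2.1.25.1.1.0",  # hrSystemUptime (assumed for the SSM)
}


def mapping_key(measurement_type, mibname=None):
    """Append the class-level mibname, if any, to the measurement type."""
    return f"{measurement_type}:{mibname}" if mibname else measurement_type


def resolve_oid(measurement_type, mibname=None):
    """Look up the OID for a measurement type under an optional mibname."""
    return OID_MAPPINGS[mapping_key(measurement_type, mibname)]


def reported_type(key):
    """Drop the mibname again before reporting to the DataCollector."""
    return key.split(":", 1)[0]
```

With this sketch, resolve_oid("SystemUptime", "ssm") picks the SSM mapping, while reported_type("SystemUptime:ssm") still yields plain SystemUptime for everything outside the poller.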
Last long weekend was spent updating Siebel in production. Saturday was used for preparation. From Sunday 4 pm to Tuesday 1 am (4 am, if the subsequent operator call is included) we struggled to get things installed and running. Most of the Siebel work was done by C., but here are some takeaways from my side.
– Windows supports symbolic links, and they can be pretty useful too.
In addition to hard links at the file level, NTFS supports symbolic directory links, called junctions. Windows does not ship with any tools to make use of this, for inexplicable reasons. However, Sysinternals (from its pre-Microsoft days) has a handy tool called “junction” to fill the gap. It definitely helped us prod Siebel into a successful install.
– Don’t use URLs as identifiers.
This seemed a good idea at the time, 3 years ago, and we use URLs as IDs for tickets, customers and customer services. A URL is a URI, a uniform resource identifier, so what more can you ask for? However, people, when seeing a URL, have the insuppressible urge to consummate the link. And if that link points to something useful, like an XML representation of the resource, and the URL is not carefully chosen, things like server names may get hardwired into your data. In our case the name of the Siebel webserver was part of the URL, and when the webserver and its name changed, things broke. Unlikely things too. After this 36 hour marathon, whilst enjoying my first 30 minutes of sleep, the phone rang. No servers or services in the customer portal. Now Siebel was upgraded, and Siebel is about tickets, not about servers, so how come? Turns out that access to a resource in the customer portal triggers an authorization check for the user, which involves the customer ID, which is a URL, which in its stored form in the database was different from its live form, which comes from Siebel. Fortunately to the rescue came the convenient
– Replacement of strings in SQL Server
update Device set CustomerId = Replace(CustomerId, 'old name', 'new name')
Anyway, when something looks like a beer, smells like a beer and is cold, chances are people will drink it. Plain IDs would have been a lot better here.
– Re-IP-ing and renaming is easier done beforehand
After painstakingly setting up Siebel on 3 servers, things were thrown into a bit of disarray by renaming the servers and giving them a different IP. Which is necessary, since the Opsware-provided DHCP addresses are meant to be temporary. And now that we’re at it, don’t forget to set the default gateway when assigning new IPs…
We got repeated requests for a periodic interval filter. This would suppress alerts that occur at regular expected times, for example, a daily or monthly scheduled process restart. The existing date filter only works for a single non-repeating date range. Here is a small design for this extension:
Y 12-23 12:59:59 .. 12-24
M 02 07:00:00 .. 02 09:00:00
W Tue 13:00 .. Wed 13:00
D 17:00:00 .. 18:00:00
H 10:00 .. 11:00
The format will be validated, interpreted, converted and adjusted for timezone to UTC.
The result is a normalized time range.
Attributes coming in from alerts are normalized when compared to a filter.
Each period type has a mod function that normalizes a time value
For H, D and W, the mod function is a simple mod operation, since the period is regular.
For M and Y the period is irregular and the mod is performed by subtracting a precomputed value. An array of precomputed year and month values is stored. The largest one that does not result in a negative number when subtracted is applied.
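As an illustration of the precomputed-array idea for the monthly period, here is a small Python sketch. The table range (2000 to 2030), the use of epoch seconds in UTC, and the function names are assumptions of mine, not part of the design.

```python
import bisect
import calendar


def month_starts(first_year, last_year):
    """Precompute the UTC epoch second at which each month begins.

    The final entry (January of last_year + 1) only marks the end of the table.
    """
    starts = [
        calendar.timegm((year, month, 1, 0, 0, 0))
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]
    starts.append(calendar.timegm((last_year + 1, 1, 1, 0, 0, 0)))
    return starts


MONTH_STARTS = month_starts(2000, 2030)  # arbitrary range, assumed for the sketch


def normalize_month(epoch_seconds):
    """Return the seconds elapsed since the start of the containing month."""
    if not MONTH_STARTS[0] <= epoch_seconds < MONTH_STARTS[-1]:
        raise ValueError("timestamp outside the precomputed table")
    # Largest precomputed start that does not give a negative result.
    index = bisect.bisect_right(MONTH_STARTS, epoch_seconds) - 1
    return epoch_seconds - MONTH_STARTS[index]
```

A filter such as "M 02 07:00:00" would then correspond to a normalized value of one day plus seven hours, whatever the length of the month.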
It is then checked whether the normalized attribute is within the normalized range.
Note that inverse ranges are also allowed. For example: H 23:00 .. 01:00. This is interpreted as ! 01:00 .. 23:00, or the equivalent 00:00..01:00 OR 23:00..00:00
This is very useful for coping with timezone adjustments.
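The normalize-then-compare steps for the regular periods, including inverse ranges, can be sketched in Python as follows. Timestamps are assumed to be epoch seconds in UTC and range bounds are assumed to be offsets in seconds from the start of the period; the names are hypothetical.

```python
# Period lengths in seconds for the regular period types.
# Note: with a plain mod on epoch seconds, a "W" period starts on
# Thursday 00:00 UTC (1970-01-01 was a Thursday); a real implementation
# would shift this to whatever week start the filter format assumes.
PERIOD_SECONDS = {"H": 3600, "D": 86400, "W": 7 * 86400}


def normalize(period_type, epoch_seconds):
    """Map a UTC timestamp to its offset within the current period."""
    return epoch_seconds % PERIOD_SECONDS[period_type]


def in_range(value, start, end):
    """True if value lies in [start, end]; start > end means an inverse range."""
    if start <= end:
        return start <= value <= end
    # Inverse range: value is after start OR before end.
    return value >= start or value <= end


def matches(period_type, epoch_seconds, start, end):
    """Check a timestamp against a normalized repeating range."""
    return in_range(normalize(period_type, epoch_seconds), start, end)
```

For a daily filter "D 17:00:00 .. 18:00:00" the bounds would be 17 * 3600 and 18 * 3600, and a daily inverse range from 23:00 to 01:00 would be passed as start 23 * 3600 and end 3600.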
The Occurrence special attribute.
Because of Netcool deduplication, FirstOccurrence and LastOccurrence interact. To make the intended behavior of a filter easier to configure, an Occurrence special attribute will be introduced.
Consider the following: F is time of first occurrence, L is time of last Occurrence and — the intended filter interval
F —– L
1. First occurrence before start of interval and last occurrence after end of it. Alert should not be filtered.
2. First occurrence before start, last occurrence during. Alert should not be filtered.
3. First and last occurrence during interval. Alert should be filtered.
4. First occurrence during interval, last occurrence after end. Alert should be filtered at first, but become visible after interval has ended.
5. First occurrence during first interval, last occurrence during a later, different interval. Alert should be filtered at first, but become visible after first interval has ended. It should not get filtered afterwards, even during the second interval.
Cases 1 to 4 can be met by stipulating that both first occurrence and last occurrence values are within the filter range for it to be filtered.
Case 5 can be met by applying the same mod to both first occurrence and last occurrence, effectively testing them for the same interval.
Configuration is eased by specifying an interval for the pseudo-attribute “Occurrence”. Doing this will have two effects:
1. Both FirstOccurrence and LastOccurrence have to match for the filter to match
2. The mod applied to FirstOccurrence will also be applied to LastOccurrence so that they are not matched by different instances of a repeating time range filter
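The two effects can be sketched in a few lines of Python. This is only an illustration under my own assumptions: epoch seconds in UTC, a regular period type, a non-inverse range, and hypothetical names throughout.

```python
# Period lengths in seconds for the regular period types.
PERIOD_SECONDS = {"H": 3600, "D": 86400, "W": 7 * 86400}


def occurrence_filtered(period_type, first, last, start, end):
    """Decide whether an alert is filtered by an Occurrence interval.

    first and last are the FirstOccurrence and LastOccurrence timestamps;
    start and end are offsets in seconds from the start of the period.
    """
    period = PERIOD_SECONDS[period_type]
    # Start of the period instance that contains FirstOccurrence.
    base = (first // period) * period
    first_norm = first - base
    # The same offset is subtracted from LastOccurrence, so a value in a
    # later instance of the period stays larger than the period length
    # and cannot fall inside the range.
    last_norm = last - base
    return start <= first_norm <= end and start <= last_norm <= end
```

With a daily interval from 17:00 to 18:00, an alert first seen at 17:10 and last seen at 17:50 the same day is filtered, while one first seen at 17:10 and last seen at 17:20 the next day is not, because the second timestamp is normalized against the first day.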