Archive for March 2006
Background: The Precision servers in our development environment were reinstalled in November, wiping out the Precision dataserver.
Here is how to reinstall it:
1. Make sure DBD::mysql is installed. It is in version four of the Perl 5.8.7 package. Reconcile if needed. To find the package, browse the installed packages on a server that has it installed; it is somewhere near the end of a very long list of packages. Copy the name exactly, then use software-packages-search to find it.
2. Make sure the mysql package is installed. In truth, only the client libs are needed, but what the heck, this works. The package is easy to find under databases.
3. Make sure the /usr/local/mysql/lib/mysql dir is in the default library path. This is not LD_LIBRARY_PATH; crle is used. Type crle and copy the default library path, then issue crle -l <default path>:/usr/local/mysql/lib/mysql
4. Restart Apache…
There are many LDAP servers in the development environment, for various purposes. However, the one used for the SSL-VPN is on xx.xx.xx.xx. Therefore, browse to
And log in as
admin_pw / ...
(you know what)
Make the new user under the com.com company, even though we do not own this domain name, and never will:
- Create the user, preferably using the same account name as their production account; otherwise make something up with two numbers at the end.
- Give them a random password.
- Don’t forget to make them a member of the Users group. This is necessary for SSL-VPN access.
In production, the discovery app is installed on impact1 and poller1, in both cases in the /apps/discovery directory.
The steps to run discovery are as follows:
- export POLLERHOME=/apps/discovery
- create a discovery config file. These are located in /apps/discovery/discos. Copy the sample.conf file to 20060327_description.conf (in general <date>_<description>.conf). Basically, add one or more desired profiles for each IP address. The common profiles are “server” for the base server graphs and “networkdevice” for the base graphs on network devices.
- run perl Provisioner.pl <config file> <xml out filename> <measurements out filename>. The latter two are customarily named “xml.out” and “m.out”, respectively. The Provisioner app will report discovered instances on the console, create the XML config file for Venkat’s DB record and RRD creation tool (in xml.out), and write the Poller Measurements.conf config file settings (in m.out).
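As an illustration, a discovery config file might look something like the following. This is only a guess at the layout based on the description above (one or more profiles per IP address); the IPs are made up, and sample.conf remains the authoritative reference for the real syntax:

```
# discos/20060327_description.conf -- hypothetical example only;
# copy sample.conf for the actual syntax. One entry per target:
# an IP address followed by the desired profiles.
10.1.2.3    server
10.1.2.4    networkdevice
10.1.2.5    server networkdevice
```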
On Friday we found that the backup DataCollector on the spare machine was not updating the ISM performance graphs. All SNMP performance graphs were fine. It should have been updating the ISM graphs since March 9, when we deployed the new ISM collection tool. The previous tool used the old (passive) data collector, which supported no HA and ran only on reporting1. So before cycle 15, only SNMP performance data was collected by the HA data collector on spare1.
We deployed a large number of ISM graphs on the 15th. On Friday we then found that the RRDs had not been updated on the spare machine, though they were OK on the primary. Looking in the DataCollector log files, the data collector reported that the “last” parameter was set to “0” for the ISM poll group.
The last parameter keeps track of the last sample update the DataCollector received. It tracks this for each so-called poll group. A poll group is a set of pollers that poll the same data in an HA configuration. We have two poll groups in production: one for SNMP (on primary and spare) and one for ISM (on the primary monitoring server). Whenever poll results are received from a poll group, the “last” parameter is updated.

The only difficulty arises when the data collector is first started. To avoid starting with a “last” parameter of 0 and pulling in lots of data it doesn’t need, the data collector analyzes all RRD files for the poll group, looks at the last sample entered in each, and picks the one with the oldest timestamp. Except there is a small complication. If any RRD of the poll group is not being updated, perhaps because the server in question is not there anymore, or something has been misconfigured, the latest update timestamp of that RRD would be way back in the past. This would still cause the data collector to read a lot of superfluous data on startup, for no reason at all. Therefore, an additional parameter can be specified that determines a cut-off time to weed out these non-updating RRD files: if an RRD’s last update time is older than the cut-off (the default is 12 hours), it is considered dead and is not used to determine the value of “last”.
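The startup selection of “last” can be sketched roughly like this. This is a minimal Python sketch of the behavior described above, not the actual DataCollector code; the function and variable names are made up:

```python
import time

CUTOFF_SECONDS = 12 * 3600  # default cut-off from the notes: 12 hours


def initial_last(rrd_last_updates, now=None, cutoff=CUTOFF_SECONDS):
    """Pick the startup value of "last" for one poll group.

    rrd_last_updates: per-RRD last-sample timestamps (epoch seconds).
    RRDs whose last update is older than the cut-off are treated as
    dead and ignored; among the live ones the oldest timestamp wins,
    so no live RRD misses samples. With no live RRDs at all, "last"
    stays 0 and the poll group is skipped, which is exactly the
    failure mode described here.
    """
    if now is None:
        now = time.time()
    live = [t for t in rrd_last_updates if now - t <= cutoff]
    return min(live) if live else 0


now = 1_143_000_000  # an arbitrary "current time" for the example

# The incident in a nutshell: only a long-dead RRD exists -> last == 0.
assert initial_last([now - 30 * 86400], now=now) == 0

# One fresh RRD alongside a dead one: the dead one is ignored.
assert initial_last([now - 30 * 86400, now - 600], now=now) == now - 600

# Raising the cut-off (the temporary fix below) pulls the old RRD back in.
assert initial_last([now - 30 * 86400, now - 600], now=now,
                    cutoff=60 * 86400) == now - 30 * 86400
```

The third case mirrors the fix applied later: temporarily enlarging the cut-off forces old RRDs back into the calculation of “last”.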
Therefore, if at least one recently updated RRD file exists, the last parameter should not be 0. But for the ISM poll group, it was. Now, on the 15th, many new RRDs were created. A newly created RRD has a last timestamp equal to its creation time. Why did the data collector not pick these up?
It turns out this was due to two unrelated errors:
1. The old, pre-cycle-15 RRDs were never copied to the spare machine at cycle 15 deployment time.
2. The new customer’s RRDs for their ISM performance graphs were misconfigured in the database: they were accidentally entered in the SNMP poll group, not in the ISM poll group.
For these two reasons, the data collector on the spare machine could not find any RRDs at all for the ISM poll group, and the last parameter remained “0”. This caused the data collector to skip this poll group altogether, so the ISM RRDs were not updated. The primary data collector on reporting1 was OK, because it had the pre-cycle-15 ISM RRDs to set the “last” parameter.
To fix it, the poll groups of the ISM RRDs were corrected in the database. Since their last update times were still way in the past, though, the cut-off parameter was temporarily set to a large value to force the data collector to include them in the calculation of “last”. This fixed the problem and caused the data collector on spare to collect all data for the customer’s ISM graphs since they were started on March 15.
As a side effect of this, the last 24 hours of SNMP data (we don’t keep more than 24 hours on the poller server) was all read, as an old non-updating SNMP RRD now determined “last” for that poll group. This is a huge amount of data: more than a million samples were transmitted, which took about 50 seconds over the net, probably a transfer of around 100 Mb. The data collector (and SNMP poller) pulled this off without a hitch.
As a final step, the old pre-cycle 15 ISM RRD files still needed to be copied from primary to the spare.