Archive for September 2007
A substantial SNMP poller update was deployed today. Here is the lowdown:
- Separation of polling and discovery processes
- Slow memory leaks fixed
- Support for different poll frequencies per agent
- Support for polling classes at different ports in different agents
- Addition of node level XML instance files
- Override of various parameters by profile
- Lock and manual manipulation of per agent configuration
- Command interface to Discovery process
The internals have changed considerably. The poller process (“ndpoller.pl”) is now very simple. It reads the agent YAML files and polls based on them. From time to time it checks whether an agent has changed and, if so, reloads that agent’s YAML file and reschedules the agent’s polls. Its memory usage is stable: two months of sustained polling in production showed no increase at all.
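The reload check described above can be sketched roughly as follows. This is an illustration in Python, not the actual ndpoller.pl code: the class name `AgentWatcher`, the injected `load` function and the use of file modification times to detect a change are all assumptions.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class AgentWatcher:
    """Tracks agent files and reloads those whose modification time changed."""
    load: Callable[[str], object]  # hypothetical parser for one agent file
    mtimes: Dict[str, int] = field(default_factory=dict)
    configs: Dict[str, object] = field(default_factory=dict)

    def refresh(self, paths):
        """Reload new or changed files and return the paths that were loaded."""
        changed = []
        for path in paths:
            mtime = os.stat(path).st_mtime_ns
            if self.mtimes.get(path) != mtime:
                self.configs[path] = self.load(path)
                self.mtimes[path] = mtime
                # The caller would reschedule this agent's polls here.
                changed.append(path)
        return changed
```

A poller loop would call `refresh` every so often and reschedule only the agents it returns, leaving all other schedules untouched.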
All configuration and discovery related tasks are handled by a separate process, “discovery.pl”. It monitors every agent’s uptime and rediscovers an agent when its uptime decreases. It also rediscovers agents regularly at a slow pace. It watches for changes in the nodes.conf configuration file. This file is updated automatically by the nodes_writer process if a database change is detected. Finally, it listens for HUP signals and for user commands entered in a command file. Any change in agent configuration or agent discovery information results in an update of the agent’s YAML file and a notification to the ndpoller process and to external listeners.
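As an illustration of the uptime rule, here is a minimal Python sketch. The class and method names are hypothetical and the real discovery.pl may work differently. If the uptime comes from a wrapping counter such as SNMP sysUpTime, a counter wrap would also look like a decrease, which this sketch does not distinguish from a reboot.

```python
class UptimeMonitor:
    """Remembers the last uptime seen per agent and flags decreases."""

    def __init__(self):
        self.last_uptime = {}

    def needs_rediscovery(self, agent, uptime):
        """Return True when the uptime is lower than the previous sample."""
        previous = self.last_uptime.get(agent)
        self.last_uptime[agent] = uptime
        # The first sample has nothing to compare against.
        return previous is not None and uptime < previous
```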
Also note that the discovery process can safely be terminated without affecting the polling process. When the discovery process is restarted, it will read a cached copy of the configuration and compare it to the actual configuration at the time of restart. Usually, it will be able to start without doing a full discovery, although certain configuration changes may trigger one.
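The restart comparison can be pictured as a diff between the cached and the actual configuration. The Python sketch below assumes both are plain dictionaries keyed by agent name, which is an illustrative simplification. It only covers per-agent differences, not the configuration changes that would force a full discovery.

```python
def plan_restart(cached, actual):
    """Compare cached and actual agent settings.

    Returns (rediscover, removed): agents that are new or changed,
    and agents that no longer appear in the actual configuration.
    """
    rediscover = sorted(
        name for name, settings in actual.items()
        if name not in cached or cached[name] != settings
    )
    removed = sorted(name for name in cached if name not in actual)
    return rediscover, removed
```

Agents that appear in neither list can keep their existing discovery results, which is why a restart can usually avoid a full discovery.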
The new poller is deployed in /apps/np3 on the primary and backup poller machines. Except for the addition of a few new parameters, the configuration files have not changed between the previous and the current poller. However, the apache configuration did change. Apache is used for a few web services surrounding the poller, most notably the “/getperfdata” service that serves up polled data and discovery information to the data collectors. Apart from this there are the live poller service (“/poll”), which is used by the poller information pages, and the “/getinstances” and “/getoids” services. These services are now configured in a virtual hosts section, based on port. This port is 8080 on both pollers. This was done so that the existing poller could remain accessible for a while on a different port (8083). It made it possible to configure and test the updated poller, with its web services running on a different port, without affecting the existing poller. At cut-over time, the apache configuration file was simply switched, so that the updated poller’s data was served. It also allows easy roll-back, if necessary, simply by restoring the original httpd.conf.
The web server root for the virtual host was changed from /usr/local/apache2/htdocs to /apps/np3/www, and the apache log files are now found in /apps/np3/logs.
The phup script, previously used to send a HUP signal to the poller process, now sends a HUP only to the discovery process. Usually it will not be necessary to run this script, since changes to nodes.conf are detected and picked up automatically. Changes to any of the other config files still need the HUP to be sent.
The polling process also responds to HUP, but only re-reads a few parameters: the SNMP-related parameters and the directory and virtual poller settings. Note that one piece of agent-related information is not written to the YAML files, since it has always been resolved on the fly. This is the conversion from measurement type to OID, as configured in the “measurements.conf” configuration file. Therefore, if this file changes (which is very rare), the poller needs to be hupped as well.
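A process that re-reads only a subset of its parameters on HUP could be sketched like this in Python. The parameter names in `RELOADABLE`, the class name and the `read_config` callable are hypothetical. The source only states which kinds of settings are re-read, not how.

```python
import signal

# Hypothetical parameter names, standing in for the SNMP-related,
# directory and virtual poller settings mentioned in the text.
RELOADABLE = ("snmp_timeout", "snmp_retries", "data_dir", "virtual_pollers")


class HupReloader:
    """Defers reload work out of the signal handler and limits what is re-read."""

    def __init__(self, config, read_config):
        self.config = dict(config)
        self.read_config = read_config  # callable returning a fresh dict
        self.pending = False

    def on_hup(self, signum, frame):
        # Keep the handler trivial: just remember that a HUP arrived.
        self.pending = True

    def install(self):
        # SIGHUP is available on Unix-like systems only.
        signal.signal(signal.SIGHUP, self.on_hup)

    def apply_pending(self):
        """Called from the main loop; returns the keys that changed."""
        if not self.pending:
            return []
        self.pending = False
        fresh = self.read_config()
        changed = [key for key in RELOADABLE
                   if key in fresh and fresh[key] != self.config.get(key)]
        for key in changed:
            self.config[key] = fresh[key]
        return changed
```

Setting a flag in the handler and doing the work in the main loop keeps the reload from interrupting a poll that is in progress.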
Deployment was started on the backup poller.
After starting up the discovery process and letting the initial discovery finish, the generated agent YAML files were compared to those of the existing poller using the “check_discoveries.pl” script. Apart from differences in the sizes of certain SAN disks, the discovered instances appeared identical. After starting the polling process, spot checks of polled data revealed no differences either, and the number of data points polled seemed identical.
Successful completion of these checks was followed by switching the apache configuration file, so that the updated poller’s data was delivered on the normal /getperfdata port. This data was then collected by the backup data collector, which was temporarily driven off a copy of the database. In this copied database, the primary pollgroup URL for the new data collector was changed to point to the backup poller. A backup copy of the rrd files had been made beforehand. A copy of the operator portal, also driven off the same database copy and pointed at the backup graph service so that it showed the data polled from the backup poller, was consulted to verify that the graphs looked OK. Spot checks performed on the graphs using this operator portal copy revealed no problems. The statistics reported by the backup data collector matched those of the primary data collector.
Given that all tests were positive, attention was turned to the primary poller. Here discovery and polling were set up in an identical fashion, discovery results and polls were checked, and the apache configuration was switched. The primary data collector and graph service, and therefore the operator portal and customer portal, were now receiving data from the new poller. All statistics and graphs seemed in order. Finally, the backup data collector was switched back to the live database.
The previous version of the SNMP poller is being kept alive for the time being on both pollers. If no problems arise, it will be stopped early next week.
The script to automatically update the nodes.conf configuration file from the database is not started yet. It will be activated when the new provisioning tool is deployed, next week. Before it can be run a small database schema update is needed, and this update is part of the deployment procedure of the provisioning tool.
The installation and switch went off without a hitch (so far). All graphs seamlessly continued.
Today, the generalized alert filter tool (a.k.a. alert processor) was deployed. It has been deployed separately from its predecessor, the alert filter tool, on a different server. The reason is that the alert processor has not been through QA yet. It is deployed pre-QA since the storage team needs the tool urgently for backup job monitoring. And so, a separate copy was deployed.
The alert filter tool continues to run using the old version until QA is finished. Once that is done, it will be upgraded, and the two separate tools will be merged.
– No member of the tools team has admin rights to the RSA admin tool. This means we cannot create roles or assign roles to Agilit members.
– The alert processor administrator role was named “AlertFilterAdmins”. Being a member of this role allows you to edit Filters and Lookup Tables in the tool. For each filter, a separate role can be set, and users must have that role to edit rules of the filter.
– The AlertFilter Windows service and Mongrel cannot run under any of the special accounts, like localnetwork, because the old Tools Windows servers have a proxy server configured for all accounts. This is the legacy proxy server, which is clearly mistaken, but there seems to be no easy way to remove it. As a work-around the service can be run using the administrator account, but this is not advisable. I therefore configured a separate local user account, called alertfiltersvc, gave it User and Remote Login User rights, logged in remotely as that user, and unset the proxy server setting in IE. Then I configured both the alert filter service and the Mongrel service to run under this account. Mongrel is the web server that serves the alert filter UI.
– The alert filter UI uses Ruby on Rails, and uses a Windows-specific Ruby library to connect to Active Directory. This library needs the Visual C 7 redistributable library, which, due to myriad registry settings, needs to be installed using an installer. The Windows installer service is used for this, but the version of this installer on our production servers is too old. This causes the install to fail with a cryptic error. In order to resolve this, the installer must be upgraded. Fortunately it uses a different installer… The need-to-reboot message at the end can be ignored. After this, the VC7 library installer runs correctly.
– As mentioned before, the Mongrel web server is installed to serve up the Ruby on Rails GUI. It is installed as a Windows service. The name of the service is “Rubella”, in keeping with our sickly naming convention. Before installing Mongrel as a service, it is useful to test it on the command line from the Ruby on Rails application base directory, using mongrel_rails start -e production. Any fatal errors can be spotted a lot more easily that way.
– Rails has the concept of different environments (development, staging, production). In production we obviously use the production environment. Each environment has its own configuration file, so be sure to use the config/environments/production.yml file.
– The RuleChecker web application, to test which rules match an alert and vice versa, has been installed under /RulesChecker/RulesChecker.aspx. It wants either a rule= parameter, with the rule id, or a Serial= parameter with the serial number of the alert to be tested. A list of comma separated serial numbers works too.
– In addition to AlertFilterAdmins, an AlertFilterTest role was created, and the same suspects were made members. Useful for testing out new filters.
– An update to NetcoolData server was installed. It adds col=“columnname” attributes to the td elements returned, to ease navigation using XPath. It also adds optimistic locking to the update method, using the lockcolumn=, lockid= and lockvalue= parameters. lockid and lockvalue are obligatory; lockcolumn defaults to “StateChange”. If either of the obligatory parameters is absent, a normal, non-locked update is performed. If the parameters are present, the value of lockcolumn for the alert with serial lockid is read and compared to the value of lockvalue. If these are different, the lock fails.
– A small update to SiebelServices was installed. This adds the TroubleTicketNote.asmx/InsertNote?ticketid=ticketnumber&msg=messagetext method, that is invocable using POST, sans SOAP. This can be used to automatically forward alert clear messages to ticket notes in Siebel.
– sendalert.html pages were installed on both objectservers. They can be used to send test alerts to Netcool.
– Test filters and rules were created and tested in production, and they worked flawlessly. They have been removed again.
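The optimistic locking added to the NetcoolData update method, as described in the list above, can be sketched in Python as follows. This is an in-memory illustration only: the function name and the dictionary of alerts keyed by serial are assumptions, and a real implementation would have to make the compare and the update atomic.

```python
def update_alert(alerts, serial, changes,
                 lockid=None, lockvalue=None, lockcolumn="StateChange"):
    """Apply `changes` to the alert with the given serial.

    With both lockid and lockvalue present, the update only goes ahead if
    the lock column of alert `lockid` still holds `lockvalue`. Without
    them, a normal, non-locked update is performed.
    """
    if lockid is not None and lockvalue is not None:
        if alerts[lockid][lockcolumn] != lockvalue:
            return False  # the alert changed since it was read: lock fails
    alerts[serial].update(changes)
    return True
```

A caller would read the alert, remember its StateChange value, and pass that value back as lockvalue with the update. A failed lock means the alert was modified in between and should be read again.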