Archive for November 2005
A first version of the new discovery webservice is up.
General URL format:
The port and agent can be set per class to support multiple agents running simultaneously. Do we have an Insight Manager agent deployed somewhere to test this?
to be done:
- timeout still too long
- error handling
- discovery of real treasures
Here is the config file, server side, that makes this possible:
# accepted formats:
#
# class, agent, "dynamic", SNMP instance, app instance [, filter table, filter]+
# class, agent, "static", filename
# class, agent, "ipaddress", SNMP instance (under construction)
# class, agent, "index", app instance [, filter table, filter]+
# class, agent, "ciscoqos"
# class, agent, "none"
#
# agent can be blank for SSM or standard MIBs.
system,, none
disk,, dynamic, hrFSIndex, hrFSMountPoint, hrFSType, NTFS|BerkeleyFFS|FAT32|LinuxExt2
cpu,, index, hrDeviceDescr, hrDeviceType, Processor
physicaldisk,, dynamic, hrDeviceIndex, hrDeviceDescr, hrDiskStorageMedia, hardDisk, hrDeviceStatus, running|warning
interface,, dynamic, ifIndex, ifIndex, ifType, ethernet, ifAdminStatus, up, ifOperStatus, up
oracle,, index, oraInstanceSID
tablespace,, index, oraTableSpaceName
oracleprocess,, index, oraProcessProgram
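As an illustration of how these config lines could be interpreted, here is a minimal parsing sketch. The function and field names are my own invention, not the actual server code:

```python
# Hypothetical sketch of parsing the discovery config format above;
# names and structure are assumptions, not the real implementation.

def parse_discovery_line(line):
    """Split one non-comment config line into a discovery rule."""
    fields = [f.strip() for f in line.split(",")]
    cls, agent, method = fields[0], fields[1], fields[2]
    rule = {"class": cls, "agent": agent or None, "method": method}
    if method == "dynamic":
        # SNMP instance, app instance, then (filter table, filter) pairs
        rule["snmp_instance"] = fields[3]
        rule["app_instance"] = fields[4]
        rule["filters"] = list(zip(fields[5::2], fields[6::2]))
    elif method == "index":
        # app instance, then (filter table, filter) pairs
        rule["app_instance"] = fields[3]
        rule["filters"] = list(zip(fields[4::2], fields[5::2]))
    elif method == "static":
        rule["filename"] = fields[3]
    return rule

rule = parse_discovery_line(
    "disk,, dynamic, hrFSIndex, hrFSMountPoint,"
    " hrFSType, NTFS|BerkeleyFFS|FAT32|LinuxExt2")
```

Note how the blank second field maps to "no agent", i.e. the standard SSM agent or a standard MIB.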
Components, classes and discovery
The root component has sub components (a sub component can only “belong” to one component). Note: currently component – sub component relationships are denoted using a dependency relation, in which a component is “dependent on” a sub component. However, dependency has subtly different semantics, and is many to many.

A component has a class; a class has instances, which are components of that class. Discovery is the act of finding the components (instances) of a class on a certain IP address: given a class and an IP address, discovery finds the instances of that class. Component provisioning is the act of using discovery to find and set up sub components for a root component. A root component is typically a server or network device, but it could also be a service component.

In order to do component provisioning, it is sufficient to have a list of possible classes of sub components for a root component. The list of classes applicable to sub components of a root component is variable and not dependent on the class of the root component. For example, the list of classes of subcomponents for a network device will be different from that for a server. And even between components that are servers running the same OS, differences will exist; an Oracle server, for example, may need Oracle specific sub components. The list of classes of subcomponents for a root component therefore cannot be derived from the class of the root component. Properties like the type of device, OS, installed applications, and even hardware brand and model number may all influence the classes of subcomponents. Temperature monitoring is an example of the latter: it is measured using a hardware specific agent.

A simple way to model this variation of possibilities is to label components with tags. For example, an Oracle server could be labelled with “Solaris” and “Oracle”. Tags, in this model, are also associated with component classes.
At component provisioning time, tagging a server, then, is all that needs to be done to select the right sub component classes and kick off discovery.
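The tag-driven class selection could be sketched as follows. The class list and tags below are invented for illustration; none of this is actual system code:

```python
# Sketch of tag-based selection of sub component classes at
# provisioning time.  Classes and tags are illustrative examples.

CLASS_TAGS = {
    "disk":        {"Solaris", "Linux", "Windows"},
    "temperature": {"Sun"},
    "tablespace":  {"Oracle"},
    "interface":   {"Solaris", "Linux", "Windows", "Cisco"},
}

def classes_for(component_tags):
    """Select classes whose tags intersect the component's tags."""
    tags = set(component_tags)
    return sorted(c for c, t in CLASS_TAGS.items() if t & tags)

# An Oracle server on Solaris gets disk, interface and tablespace
# sub component classes, but not the Sun-hardware temperature class:
print(classes_for({"Solaris", "Oracle"}))
```

Whether matching should be "any tag in common" (as here) or something stricter is a design choice the text leaves open.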
A final hurdle to component discovery is the fact that multiple SNMP agents may be active on a given server. Typically an SSM agent runs on the standard SNMP port, and a hardware or vendor specific agent may be present and active on a different port. This hardware specific agent is used for non-generic, hardware or vendor specific low-level measurements, such as temperature or RAID related parameters.

At times, measurements that are logically the same are carried by different agents using different OIDs. Logically identical measurements are represented by the same MeasurementType. Usually a MeasurementType combined with a component class maps to an OID. The SNMP poller has an additional field, called MibName, that is taken into account in this mapping and that enables agent variability. In addition, a different port can be specified in the IP address using ip.add.re.ss:port syntax. Thus the SNMP poller, even currently, supports agent variability. The problem, however, is how the component provisioning process and discovery can take this into account. Clearly, class and measurement type are generic concepts that should be unaware of vendor specific concepts.

Fortunately, it turns out the tagging mechanism outlined above can be used. One way to do this is to set up a separate table for SNMP agent types. Each record contains metadata for an SNMP agent type, including mibname and port. In addition, each record can be associated with one or more tags. Finally, measurement types can also have tags.

The dynamics are then as follows. Suppose component provisioning is done for a server component that has a tag called “Sun”. Based on this tag, a number of classes is selected. The poller type table is checked for pollers that have this tag. Poller metadata is read and passed on to the discovery process, so that instances can be discovered correctly using the right agent. When configuring the agent, the tags on measurement types are used.
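A sketch of the agent type lookup and the ip.add.re.ss:port addressing described above. The table contents, field names and helper functions are assumptions, not the actual schema:

```python
# Sketch of an SNMP agent type table keyed by tags, and of the
# ip.add.re.ss:port addressing the poller supports.  The records and
# helper names here are invented for illustration.

AGENT_TYPES = [
    {"name": "ssm", "port": 161,  "mibname": "",    "tags": set()},
    {"name": "sun", "port": 2106, "mibname": "sun", "tags": {"Sun"}},
]

def agents_for(component_tags):
    """Agent types whose tags are all present on the component."""
    tags = set(component_tags)
    return [a for a in AGENT_TYPES if a["tags"] <= tags]

def target(ip, agent):
    """Build the ip.add.re.ss:port address for a non-standard agent."""
    return "%s:%d" % (ip, agent["port"])

# A server tagged "Sun" matches both the standard SSM agent and the
# vendor specific one running on its own port:
for agent in agents_for({"Sun"}):
    print(agent["name"], target("10.0.0.5", agent))
```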
This would be needed if different agents are used for the same measurement type, but not every agent supports all measurements. Finally, it should be noted that discovery of instances of a class can be carried out by only one agent. This introduces a limitation if two agents are present on a server that measure parameters on components of the same class, but the agents don’t agree on the names of the instances. In such cases, different classes should be used.
Implementation of tags
Since tags are a way of loosely coupling instances of various types, they do not mesh well with the relational database model. It would be fairly hard and inflexible to set up a many to many table that links tables based on tags. A much easier solution is to include a single character string attribute in tables that support tags, and enter tags as comma separated quoted values. Clearly, this limits the ability to query the database on tags directly; its usefulness is therefore limited to scenarios where queries on large tables are not needed.

Two scenarios using tags have been described. The first requires that a selection of classes is made based on tags. The list of classes is expected to remain fairly small, and is not expected to grow with the number of components. The second scenario again uses the query on classes, and in addition calls for a selection of poller types; this table, too, is expected to be of limited size. It also requires a check of measurement types, and fortunately that table is likewise expected to remain fairly small and constant. Note, however, that the tag mechanism should not be misused for searches on large and growing tables like components or datastreams.
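The single string attribute could work roughly like this. The exact quoting convention is an assumption; the point is that quoting the values makes substring matching safe:

```python
# Sketch of storing tags as one comma separated, quoted string in a
# single table attribute, as described above.  Details are assumptions.

def encode_tags(tags):
    """Encode a set of tags into a single string attribute."""
    return ",".join('"%s"' % t for t in sorted(tags))

def decode_tags(s):
    """Decode the attribute back into a set of tags."""
    return set(t.strip().strip('"') for t in s.split(",") if t.strip())

def has_tag(s, tag):
    # The quotes prevent false substring matches ("Sun" vs "Sunos").
    return '"%s"' % tag in s

row = encode_tags({"Solaris", "Oracle"})   # '"Oracle","Solaris"'
```

Application code filters rows with `has_tag` after reading them; as noted above, this only scales for small tables, since the database itself cannot index these values.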
Support for non-SNMP based pollers
In principle, support for non-SNMP based pollers should work exactly the same as support for different SNMP pollers. Using the tagging mechanism, the right poller metadata is selected and passed on to the discovery and poller configuration processes. Since no assumption can be made about which data is necessary, it is hard to model this a priori. However, no component other than the poller will use the data, so it is enough to store this data in the poller type table as text strings and treat it as opaque. Since two processes need to be supported, discovery and configuration, the table will have two such attributes. Format and usage are up to the individual pollers. For the SNMP poller, a format like “port=2106, mibname=sun” could be sufficient.
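Parsing such an opaque string on the poller side is trivial; a sketch, with a hypothetical function name:

```python
# Sketch of how a poller could parse its opaque metadata string,
# e.g. "port=2106, mibname=sun".  The function name is hypothetical.

def parse_poller_data(s):
    """Parse an opaque 'key=value, key=value' string into a dict."""
    result = {}
    for part in s.split(","):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        result[key.strip()] = value.strip()
    return result

print(parse_poller_data("port=2106, mibname=sun"))
```

Each poller remains free to pick another format entirely, since nothing outside the poller interprets these attributes.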
It seemed a good idea to support high availability in the new data collector and SNMP poller by giving the pollgroup and collgroup a link to a primary graph service or poller through a “primary” field. The idea was that the primary poller or graph service would be tried first and, if it failed, the implementation would cycle through the others.
However, this turns out to be a bad idea. There should be no direct coupling between the service (the pollgroup or the collgroup) and its implementation (the poller or graph service). The services should just have a URL, and leave it up to the infrastructure to determine the exact path from service to implementation.
One great way to achieve load balancing is to use a hardware load balancer. In production we can use the Alteons in the management zone for this. But even when this hardware is not available, it would be better to use a load-balancing proxy between client and service than to program it into the service access code.
The reasons for externalizing fail-over and load balancing include the fact that a service and/or client implementation, not having to involve itself with fail-over, can be simpler. Doing the fail-over in client software usually means that each client of a service will have to redo it, and each will do it a bit differently, leading to subtle problems. Finally, and devastatingly, implementing it in the service usually makes an external implementation hard or impossible.
And so it is in the data collector. First, the fail-over code is programmed twice: once for accessing the poller, and another time for accessing the graph service. Moreover, it complicates the code in places. And worst of all, it makes it hard to use an external load balancer, since the collector uses URLs that are configured for individual services and tries to impose its own logic for accessing them.
For these reasons, and for the greater good of mankind, we’ll remove the load balancing code and turn the “Primary” field in the collgroup and pollgroup records into a URL.
- The collector now checks the last update dates and starts at the earliest of those dates when requesting data from the poller
- A warning is produced for rrds that have been out of date for a configurable period, 24 hours by default
- Data collectors automatically reload if the UpdateCount property of the DataCollector record in the db changes. No more explicit service restarts necessary.