Archive for April 2007
When upgrading to new data collectors, we found that the number of graphs supported is a function of the memory size of the server, and more specifically of the number of RRD files that the server can cache in memory. Once the cache overflows, writing collected samples to RRD files slows down by a factor of 20.
In production we had a little over 40,000 graphs at the time of the upgrade. Both the new and the old servers had 4 GB of memory, but on the new servers only 3.25 GB was available, for unknown reasons. This pushed the cache just over the limit, and updating the graphs became slow.
It is important to understand that once the cache overflows, updating of all RRD files gets slow. It is a dramatic all-or-nothing effect. The reason is that the cache evicts files on a Least-Recently-Used (LRU) basis: when the cache is full, the files that have not been updated for the longest time are kicked out first. If the RRD files are numbered R1 to Rn, the data collector updates the files in sequence, starting from R1 and ending with Rn. When only the last file Rn does not fit in the cache, the oldest file, R1, is evicted to make room for it. Upon the next update cycle, R1 has to be loaded from disk, since it is no longer in the cache. As the cache is still full, the oldest file is evicted again; this time it is R2, which is precisely the next file due to be updated. When R2 is loaded, R3 is kicked out, and so on. This chain reaction of cache evictions and reloads makes the whole update process very slow at the very moment the cache overflows.
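The all-or-nothing effect can be demonstrated with a minimal simulation (a sketch, not our actual data collector): an LRU cache of 1,000 slots, with files updated sequentially each cycle. With 1,000 files everything hits after the first cold pass; with just one file too many, every single access becomes a miss.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: access() loads on miss, evicting the least-recently-used entry when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.misses = 0

    def access(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)     # hit: mark as most recently used
            return
        self.misses += 1                      # miss: simulate a slow load from disk
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used file
        self.entries[key] = True

def update_cycles(cache, n_files, cycles):
    """Update files R1..Rn in sequence, as the data collector does, for several cycles."""
    for _ in range(cycles):
        for f in range(n_files):
            cache.access(f)
    return cache.misses

# Working set fits the cache: only the initial cold misses.
fits = LRUCache(capacity=1000)
print(update_cycles(fits, n_files=1000, cycles=10))    # 1000 misses (cold start only)

# Working set exceeds capacity by a single file: after warm-up, every access misses.
thrash = LRUCache(capacity=1000)
print(update_cycles(thrash, n_files=1001, cycles=10))  # 10010 misses: total collapse
```

Going from 1,000 to 1,001 files takes the miss rate from roughly zero to 100%, which is exactly the cliff we hit in production.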
To remedy this situation, the new servers were upgraded to 8 GB, and since the non-enterprise edition of Windows Server 2003 supports only 4 GB, the OS was upgraded as well.
We ran a capacity test to ascertain how many graphs an 8 GB server can handle before the cache fills up, using a test program that updates a large number of artificial graphs. We got unpredictable results, ranging from 20,000 to 90,000 graphs supported. Investigating this, we found that the following registry setting is necessary to reliably configure the size of the Windows System Cache:
The default for this parameter is 0, which means “let Windows calculate the ‘optimal’ value”. Apparently, Windows calculated a different ‘optimal’ value upon each reboot. Setting it to 0xFFFFFFFF ensures that the System Cache is always maximized.
In addition, a second parameter was set to 98.
This parameter controls the fill percentage at which the cache starts evicting pages. The default is 80; at that value, only about 70,000 graphs are supported. At a value of 98, 90,000 graphs work fine.
Note that these parameters work in addition to the LargeSystemCache variable, which has to be set to 1 for the above parameters to take effect. The LargeSystemCache variable can be set by manipulating the registry or, more conveniently, by setting the ‘Optimize for File Server’ property under My Computer / Properties / Advanced / Performance Settings / Advanced / Memory Usage / ‘Adjust for best performance of System Cache’.
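For reference, LargeSystemCache lives under the standard Memory Management key, so it can also be applied with a .reg fragment like the following (this shows only the documented LargeSystemCache value; the other two parameters mentioned above are not reproduced here):

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"LargeSystemCache"=dword:00000001
```

A reboot is required for the change to take effect.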