Client Cache Tracing

Note

This document contains information based on the original. View that document for more information.

An important question for RelStorage users is: how large should the RelStorage local cache be? RelStorage 2 (as of RelStorage 2.0b3) has a new feature that lets you collect a trace of cache activity and tools to analyze this trace, enabling you to make an informed decision about the cache size.

Don’t confuse the RelStorage local cache with the ZODB object cache. The RelStorage local cache is only used when an object is not in the ZODB object cache; the RelStorage local cache avoids roundtrips to the database server.

Enabling Cache Tracing

To enable cache tracing, you must use a persistent cache (specify a cache-local-dir name), and set the environment variable ZEO_CACHE_TRACE to a non-empty value. The path to the trace file is derived from the path to the persistent cache directory and the process’s PID. If the file doesn’t exist, RelStorage will try to create it. If the file does exist, it’s opened for appending (previous trace information is not overwritten). If there are problems with the file, a warning message is logged. To start or stop tracing, the RelStorage instance must be closed and re-opened (usually this means stopping and restarting the entire process).

Tip

If you are using only one client process per cache-local-dir, set ZEO_CACHE_TRACE to single. This will allow process restarts to be recorded in the trace file. Otherwise (if you are using multiple processes sharing the same cache directory or use a value different than single, each process will use its own trace file.) Note that using a value of single with multiple processes is undefined.

The trace file can grow pretty quickly; on a moderately loaded server, we observed it growing by 7 MB per hour. The file consists of binary records, each 34 bytes long if 8-byte oids are in use. No sensitive data are logged: data record sizes (but not data records), and binary object and transaction ids are logged, but no object pickles, object types or names, user names, transaction comments, access paths, or machine information (such as machine name or IP address) is logged.

Analyzing a Cache Trace

The cache_stats.py command-line tool (python -m ZEO.scripts.cache_stats) is the first-line tool to analyze a cache trace. Its default output consists of two parts: a one-line summary of essential statistics for each segment of 15 minutes, interspersed with lines indicating client restarts, followed by a more detailed summary of overall statistics.

The most important statistic is the “hit rate”, a percentage indicating how many requests to load an object could be satisfied from the cache. Hit rates around 70% are good. 90% is excellent. If you see a hit rate under 60% you can probably improve the cache performance (and hence your application server’s performance) by increasing the local cache size. This is normally configured using key cache-local-mb in the relstorage section of your configuration file. The default cache size is 10 MB, which is small.

Note

The trace file records all cache activity. If you are using Memcached servers (via the cache-servers setting), hits they produce will also be recorded. To isolate just the effects of the local cache, disable cache-servers. On the other hand, comparing trace files from instances with Memcached enabled and it disabled can help show if the additional overhead produces enough extra hits to be worthwhile.

The cache_stats.py tool (python -m ZEO.scripts.cache_stats) shows its command line syntax when invoked without arguments. The tracefile argument can be a gzipped file if it has a .gz extension. It will be read from stdin (assuming uncompressed data) if the tracefile argument is ‘-‘.

usage: cache_stats.py [--verbose | --quiet] [--sizes] [--no-stats]
                      [--load-histogram] [--check] [--interval INTERVAL]
                      tracefile

Trace file statistics analyzer

positional arguments:
  tracefile             The trace to read; may be gzipped

optional arguments:
  --verbose, -v         Be verbose; print each record
  --quiet, -q           Reduce output; don't print summaries
  --sizes, -s           print histogram of object sizes
  --no-stats, -S        don't print statistics
  --load-histogram, -h  print histogram of object load frequencies
  --check, -X           enable heuristic checking for misaligned records: oids
                        > 2**32 will be rejected; this requires the tracefile
                        to be seekable
  --interval INTERVAL, -i INTERVAL
                        summarizing interval in minutes (default 15; max 60)

Simulating Different Cache Sizes

Based on a cache trace file, you can make a prediction of how well the cache might do with a different cache size. The cache_simul.py (python -m ZEO.scripts.cache_simul) tool runs a simulation of the ZEO client cache implementation based upon the events read from a trace file. A new simulation is started each time the trace file records a client restart event; if a trace file contains more than one restart event, a separate line is printed for each simulation, and a line with overall statistics is added at the end.

usage: cache_simul.py [-h] [--size CACHELIMIT] [--interval INTERVAL]
                      [--rearrange REARRANGE]
                      tracefile

Cache simulation. Note: - The simulation isn't perfect. - The simulation will
be far off if the trace file was created starting with a non-empty cache

positional arguments:
  tracefile             The trace to read; may be gzipped

optional arguments:
  -h, --help            show this help message and exit
  --size CACHELIMIT, -s CACHELIMIT
                        cache size in MB (default 20MB)
  --interval INTERVAL, -i INTERVAL
                        summarizing interval in minutes (default 15; max 60)
  --rearrange REARRANGE, -r REARRANGE
                        rearrange factor

Example, assuming the trace file is in /tmp/cachetrace.log:

$ python -m ZEO.scripts.cache_simul.py -s 4 /tmp/cachetrace.log
CircularCacheSimulation, cache size 4,194,304 bytes
  START TIME  DURATION    LOADS     HITS INVALS WRITES HITRATE  EVICTS   INUSE
Jul 22 22:22     39:09  3218856  1429329  24046  41517   44.4%   40776    99.8

This shows that with a 4 MB cache size, the cache hit rate is 44.4%, the percentage 1429329 (number of cache hits) is of 3218856 (number of load requests). The cache simulated 40776 evictions, to make room for new object states. At the end, 99.8% of the bytes reserved for the cache file were in use to hold object state (the remaining 0.2% consists of “holes”, bytes freed by object eviction and not yet reused to hold another object’s state).

Let’s try this again with an 8 MB cache:

$ python -m ZEO.scripts.cache_simul.py -s 8 /tmp/cachetrace.log
CircularCacheSimulation, cache size 8,388,608 bytes
  START TIME  DURATION    LOADS     HITS INVALS WRITES HITRATE  EVICTS   INUSE
Jul 22 22:22     39:09  3218856  2182722  31315  41517   67.8%   40016   100.0

That’s a huge improvement in hit rate, which isn’t surprising since these are very small cache sizes. The default cache size is 20 MB, which is still on the small side:

$ python -m ZEO.scripts.cache_simul.py /tmp/cachetrace.log
CircularCacheSimulation, cache size 20,971,520 bytes
  START TIME  DURATION    LOADS     HITS INVALS WRITES HITRATE  EVICTS   INUSE
Jul 22 22:22     39:09  3218856  2982589  37922  41517   92.7%   37761    99.9

Again a very nice improvement in hit rate, and there’s not a lot of room left for improvement. Let’s try 100 MB:

$ python -m ZEO.scripts.cache_simul.py -s 100 /tmp/cachetrace.log
CircularCacheSimulation, cache size 104,857,600 bytes
  START TIME  DURATION    LOADS     HITS INVALS WRITES HITRATE  EVICTS   INUSE
Jul 22 22:22     39:09  3218856  3218741  39572  41517  100.0%   22778   100.0

It’s very unusual to see a hit rate so high. The application here frequently modified a very large BTree, so given enough cache space to hold the entire BTree it rarely needed to ask the ZEO server for data: this application reused the same objects over and over.

More typical is that a substantial number of objects will be referenced only once. Whenever an object turns out to be loaded only once, it’s a pure loss for the cache: the first (and only) load is a cache miss; storing the object evicts other objects, possibly causing more cache misses; and the object is never loaded again. If, for example, a third of the objects are loaded only once, it’s quite possible for the theoretical maximum hit rate to be 67%, no matter how large the cache.

The cache_simul.py script also contains code to simulate different cache strategies. Since none of these are implemented, and only the default cache strategy’s code has been updated to be aware of MVCC, these are not further documented here.

Simulation Limitations

The cache simulation is an approximation, and actual hit rate may be higher or lower than the simulated result. These are some factors that inhibit exact simulation:

  • Important The simulation tools were designed for ZEO and may not be accurate for RelStorage due to the different mechanisms of invalidation employed. Still, they may give a useful idea.

    Because of the invalidation differences, some trace files might cause the script to produce an AssertionError. These can typically be ignored and commented out in the script itself to proceed.

  • Each time a load of an object O in the trace file was a cache hit, but the simulated cache has evicted O, the simulated cache has no way to repair its knowledge about O. This is more frequent when simulating caches smaller than the cache used to produce the trace file. When a real cache suffers a cache miss, it asks the database server for the needed information about O, and saves O in the local cache. The simulated cache doesn’t have a database server to ask, and O continues to be absent in the simulated cache. Further requests for O will continue to be simulated cache misses, although in a real cache they’ll likely be cache hits. On the other hand, the simulated cache doesn’t need to evict any objects to make room for O, so it may enjoy further cache hits on objects a real cache would have evicted.