Python caches
The fastest code is the code you don’t run
July 2, 2018 — November 30, 2023
I need to cache stuff in Python sometimes, and I would like it to be easy. For my (scientific computation) needs, ideally, I would like a cache that requires no server process, provides locking or otherwise safe access for multiprocess writes, stores big binary blobs of data, and has minimal installation requirements. If I can get all that at once, my life will be easy.
1 DiskCache
Current front-runner:
DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.
The cloud-based computing of 2023 puts a premium on memory. Gigabytes of empty space is left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis) which is used as a cache. Wouldn’t it be nice to leverage empty disk space for caching? […]
DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There’s no need for a C compiler or running another process. Performance is a feature and testing has 100% coverage with unit tests and hours of stress.
It promises thread-safety and multiprocess-safety. For scientific cluster computing, where persistent server processes are hard to run but caching to disk is easy, this is typically what I want.
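A minimal sketch of typical usage (the cache directory is arbitrary); Cache, get, and memoize are part of the documented DiskCache API:

    import diskcache

    # the directory is created if it does not already exist
    cache = diskcache.Cache("/tmp/demo_cache")

    # dictionary-style access, safe across threads and processes
    cache["answer"] = 42
    print(cache.get("answer"))

    # memoize an expensive function to disk
    @cache.memoize()
    def slow_square(x):
        return x * x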
2 LMDB
LMDB is not Python-specific, but it might do the job:
Symas LMDB is an extraordinarily fast, memory-efficient database we developed for the OpenLDAP Project. With memory-mapped files, LMDB has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.
Bottom line, with only 32KB of object code, LMDB may seem tiny. But it’s the right 32KB. Compact and efficient are two sides of a coin; that’s part of what makes LMDB so powerful.
- Ordered-map interface
  - keys are always sorted; range lookups are supported
- Fully transactional
  - full ACID semantics with MVCC
- Reader/writer transactions
  - readers don’t block writers; writers don’t block readers
- Fully serialized writers
  - writes are always deadlock-free
- Extremely cheap read transactions
  - can be performed using no mallocs or any other blocking calls
- Multi-thread and multi-process concurrency supported
  - environments may be opened by multiple processes on the same host
- Multiple sub-databases may be created
  - transactions cover all sub-databases
- Memory-mapped
  - allows for zero-copy lookup and iteration
- Maintenance-free
  - no external process or background cleanup or compaction required
- Crash-proof
  - no logs or crash recovery procedures required
- No application-level caching
  - LMDB fully exploits the operating system’s buffer cache
- 32KB of object code and 6KLOC of C
  - fits in CPU L1 cache for maximum performance
It has a Python API, the py-lmdb binding. LMDB seems to target small records of 1–2 kilobytes; larger values are handled less efficiently.
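For a feel of the binding, here is a minimal sketch (path and map_size are arbitrary); note that LMDB keys and values are raw bytes, so anything else must be serialized first:

    import lmdb

    # map_size caps how large the memory-mapped database may grow
    env = lmdb.open("/tmp/demo_lmdb", map_size=2**20)

    with env.begin(write=True) as txn:  # a write transaction
        txn.put(b"key", b"value")

    with env.begin() as txn:            # a cheap read transaction
        print(txn.get(b"key"))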
3 Dogpile.cache
Previous front-runner.
Dogpile consists of two subsystems, one building on top of the other.
dogpile provides the concept of a “dogpile lock”, a control structure which allows a single thread of execution to be selected as the “creator” of some resource, while allowing other threads of execution to refer to the previous version of this resource as the creation proceeds; if there is no previous version, then those threads block until the object is available.

dogpile.cache is a caching API which provides a generic interface to caching backends of any variety, and additionally provides API hooks which integrate these cache backends with the locking mechanism of dogpile. […] Included backends feature three memcached backends (python-memcached, pylibmc, bmemcached), a Redis backend, a backend based on Python’s anydbm, and a plain dictionary backend.
Pro: it is definitely smart and modern about locking.
Con: plain file disk persistence is not supported by default. The dbm-type wrappers can probably get us something acceptable for data that serializes well in a dbm database; presumably dogpile makes them safe for concurrent writes. Not sure how it would go with heavy numerical work.
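For what it’s worth, wiring up the dbm backend looks something like this (the filename is arbitrary):

    from dogpile.cache import make_region

    region = make_region().configure(
        "dogpile.cache.dbm",
        expiration_time=3600,  # seconds
        arguments={"filename": "/tmp/demo_cache.dbm"},
    )

    # cache this function's results, keyed on its arguments
    @region.cache_on_arguments()
    def expensive(x):
        return x ** 2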
There is an advanced version for higher-performance Redis.
4 joblib.Memory
The joblib cache looks convenient:
Transparent and fast disk-caching of output value: a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary.
I can’t work out whether it is safe for concurrent writes, or whether it is meant to be invoked only from a single master process and thus not to need locking. Surely, as part of joblib, it should be multiprocess-safe?
It only supports function memoization, so if you want to access results some other way, or to access partial results, things can get convoluted unless your code naturally factors into memoizable functions.
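Basic usage is straightforward, though (the cache directory is arbitrary):

    import numpy as np
    from joblib import Memory

    memory = Memory("/tmp/joblib_cache", verbose=0)

    @memory.cache
    def big_computation(n):
        return np.arange(n) ** 2

    big_computation(10)  # computed and written to disk
    big_computation(10)  # loaded back from the cache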
5 Klepto
klepto (source) is also focused on scientific computation (it is part of the pathos project):
klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as lfu_cache and mru_cache. While caching is meant for fast access to saved results, klepto also has archiving capabilities, for longer-term storage. klepto uses a simple dictionary-style interface for all caches and archives, and all caches can be applied to any Python function as a decorator. Keymaps are algorithms for converting a function’s input signature to a unique dictionary, where the function’s results are the dictionary value. Thus for y = f(x), y will be stored in cache[x] (e.g. {x: y}).

klepto provides both standard and “safe” caching, where “safe” caches are slower but can recover from hashing errors. klepto is intended to be used for distributed and parallel computing, where several of the keymaps serialize the stored objects. Caches and archives are intended to be read/write accessible from different threads and processes. klepto enables a user to decorate a function, save the results to a file or database archive, close the interpreter, start a new session, and reload the function and its cache.
Given the use cases, one would assume it is safe for concurrent writes from multiple processes, but I cannot find this confirmed in the documentation.
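The dictionary-style archive interface, at least, is simple; a sketch based on its README (the filename is arbitrary):

    from klepto.archives import file_archive

    db = file_archive("results.pkl", serialized=True)
    db["x"] = 42
    db.dump()    # flush the in-memory dict to the file

    # ... later, possibly in a new interpreter session:
    db2 = file_archive("results.pkl", serialized=True)
    db2.load()   # read the archive back from disk
    print(db2["x"])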
6 Incoming
cachetools extends the Python 3 lru_cache reference implementation.
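As far as I can tell it is in-memory only, but the decorator interface is tidy; a minimal sketch:

    from cachetools import LRUCache, cached

    # memoize with a least-recently-used eviction policy
    @cached(cache=LRUCache(maxsize=128))
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)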