Python caches
The fastest code is the code you don’t run
2018-07-01 — 2025-10-19
Wherein a survey of Python caches is presented, and disk‑backed, multiprocess‑safe solutions that store large binary blobs without a server process and with minimal installation are examined.
I sometimes need to cache things in Python, and I’d like it to be easy. For my scientific computation needs, ideally I’d have a cache that requires no server process, provides locking or otherwise safe multiprocess write access, stores large binary blobs, and has minimal installation requirements. If I can get all of that at once, my life will be much easier.
1 DiskCache
Our current front-runner:
DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.
The cloud-based computing of 2023 puts a premium on memory. Gigabytes of empty space is left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis) which is used as a cache. Wouldn’t it be nice to leverage empty disk space for caching? […]
DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There’s no need for a C compiler or running another process. Performance is a feature and testing has 100% coverage with unit tests and hours of stress.
DiskCache promises thread-safety and multiprocess-safety. For scientific cluster computing, where running persistent server processes is hard but caching is easy, this is typically what I want.
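Going by the diskcache tutorial, memoizing an expensive function against a shared on-disk cache looks something like this; the cache directory and expiry time here are arbitrary:

```python
import time

import diskcache

# A directory-backed cache; any process that opens the same path shares it.
cache = diskcache.Cache("./demo_cache")

@cache.memoize(expire=3600)  # cached results persist on disk for an hour
def slow_square(x):
    time.sleep(2)  # stand-in for an expensive computation
    return x * x

print(slow_square(4))  # slow on first call
print(slow_square(4))  # served from disk thereafter, even from another process
```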
2 LMDB
LMDB is not Python-specific, but it might do the job?
Symas LMDB is an extraordinarily fast, memory-efficient database we developed for the OpenLDAP Project. With memory-mapped files, LMDB has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.
Bottom line, with only 32KB of object code, LMDB may seem tiny. But it’s the right 32KB. Compact and efficient are two sides of a coin; that’s part of what makes LMDB so powerful.
- Ordered-map interface: keys are always sorted; range lookups are supported
- Fully transactional: full ACID semantics with MVCC
- Reader/writer transactions: readers don’t block writers; writers don’t block readers
- Fully serialized writers: writes are always deadlock-free
- Extremely cheap read transactions: can be performed using no mallocs or any other blocking calls
- Multi-thread and multi-process concurrency supported: environments may be opened by multiple processes on the same host
- Multiple sub-databases may be created: transactions cover all sub-databases
- Memory-mapped: allows for zero-copy lookup and iteration
- Maintenance-free: no external process or background cleanup or compaction required
- Crash-proof: no logs or crash recovery procedures required
- No application-level caching: LMDB fully exploits the operating system’s buffer cache
- 32KB of object code and 6KLOC of C: fits in CPU L1 cache for maximum performance
It has a Python API, py-lmdb. It seems to target small records of 1–2 kilobytes; larger items are less efficient.
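For reference, basic usage via the py-lmdb binding looks roughly like the following. Keys and values are raw bytes, so serialization is up to us; the path and map_size below are arbitrary:

```python
import lmdb

# Open (or create) an environment; map_size caps the total database size.
env = lmdb.open("./demo_lmdb", map_size=2**30)  # 1 GiB cap, arbitrary

with env.begin(write=True) as txn:  # a write transaction
    txn.put(b"answer", b"42")

with env.begin() as txn:  # a cheap read-only transaction
    print(txn.get(b"answer"))  # b'42'
```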
3 Dogpile.cache
It was our previous front-runner.
Dogpile consists of two subsystems, one building on top of the other.
dogpile provides the concept of a “dogpile lock”, a control structure which allows a single thread of execution to be selected as the “creator” of some resource, while allowing other threads of execution to refer to the previous version of this resource as the creation proceeds; if there is no previous version, then those threads block until the object is available.
dogpile.cache is a caching API which provides a generic interface to caching backends of any variety, and additionally provides API hooks which integrate these cache backends with the locking mechanism of dogpile. […] Included backends feature three memcached backends (python-memcached, pylibmc, bmemcached), a Redis backend, a backend based on Python’s anydbm, and a plain dictionary backend.
Pro: it’s definitely smart and modern about locking.
Con: plain-file disk persistence isn’t supported by default. The dbm-type wrappers can probably give us something acceptable for data that serializes well into a dbm database; presumably dogpile makes them safe for concurrent writes. We’re not sure how that would go under heavy numerical work.
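A sketch of that dbm route, following the dogpile.cache configuration docs; the backend arguments and expiry here are illustrative:

```python
from dogpile.cache import make_region

# A file-backed region using the dbm backend.
region = make_region().configure(
    "dogpile.cache.dbm",
    expiration_time=3600,
    arguments={"filename": "./demo_cache.dbm"},
)

@region.cache_on_arguments()
def expensive(x):
    return x * x  # stand-in for real work

expensive(3)  # computed once, then served from the dbm file
```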
There’s an advanced version for higher-performance Redis.
4 joblib.Memory
The joblib cache looks convenient:
Transparent and fast disk-caching of output value: a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary.
I can’t work out whether it’s safe for multiple writers, or whether it’s meant to be invoked only from a master process and thus doesn’t need locking. Surely, since it’s part of joblib, it should be safe for multiple processes?
It only supports function memoization, so if we want to access results some other way, or retrieve partial results, things get convoluted unless we can naturally factor our code into memoized functions.
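When the code does factor that way, the pattern is pleasant; a minimal sketch, with an arbitrary cache directory:

```python
import numpy as np
from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def simulate(seed, n):
    # Stand-in for an expensive numerical computation.
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n, n)) @ rng.standard_normal((n, n))

simulate(0, 500)  # computed and written to disk
simulate(0, 500)  # loaded from disk on subsequent calls
```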
5 Klepto
klepto (source) is also focused on scientific computation (it’s part of the pathos project):
klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as lfu_cache and mru_cache. While caching is meant for fast access to saved results, klepto also has archiving capabilities, for longer-term storage. klepto uses a simple dictionary-style interface for all caches and archives, and all caches can be applied to any Python function as a decorator. Keymaps are algorithms for converting a function’s input signature to a unique dictionary, where the function’s results are the dictionary value. Thus for y = f(x), y will be stored in cache[x] (e.g. {x: y}).
klepto provides both standard and “safe” caching, where “safe” caches are slower but can recover from hashing errors. klepto is intended to be used for distributed and parallel computing, where several of the keymaps serialize the stored objects. Caches and archives are intended to be read/write accessible from different threads and processes. klepto enables a user to decorate a function, save the results to a file or database archive, close the interpreter, start a new session, and reload the function and its cache.
Given the use cases, we’d assume it’s safe for concurrent writes from multiple processes, but I can’t find any information in the documentation.
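A sketch of the dictionary-style archive interface the docs describe, assuming file_archive’s dump/load semantics; the file name is arbitrary:

```python
from klepto.archives import file_archive

# A dict-style archive backed by a single file.
arch = file_archive("results.pkl")
arch["x"] = 42  # held in the in-memory cache
arch.dump()     # push cached entries out to the file

# ... later, possibly in a fresh interpreter session:
arch2 = file_archive("results.pkl")
arch2.load()       # pull archived entries back into memory
print(arch2["x"])  # 42
```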
6 Flexicache
fastcore flexicache is “Like lru_cache, but customizable with policy funcs”. It’s backed by Jeremy Howard at Answer.AI. See Exploring flexicache.
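Going by the fastcore docs, usage looks roughly like this; the time_policy helper and its timeout are illustrative:

```python
from fastcore.all import flexicache, time_policy

# Like lru_cache, but entries are invalidated by the supplied policy
# functions; here, a time-based policy expiring entries after 60 seconds.
@flexicache(time_policy(60))
def lookup(x):
    return x * x  # stand-in for real work
```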
7 Incoming
cachetools extends the Python standard library’s functools.lru_cache with various memoizing collections and decorators.
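Note that cachetools is in-memory only, so it doesn’t meet the disk-persistence criterion above. Still, its decorator interface is tidy; a sketch, with arbitrary size and TTL:

```python
from cachetools import TTLCache, cached

# A 128-entry in-memory cache whose entries expire after 300 seconds.
@cached(cache=TTLCache(maxsize=128, ttl=300))
def fetch(x):
    return x * x  # stand-in for real work
```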
