Python caches

The fastest code is the code you don’t run

2018-07-01 — 2025-10-19

Wherein a survey of Python caches is presented, and disk‑backed, multiprocess‑safe solutions that store large binary blobs without a server process and with minimal installation are examined.

computers are awful
python

I sometimes need to cache things in Python, and I’d like it to be easy. For my scientific computation needs, ideally I’d have a cache that requires no server process, provides locking or otherwise safe multiprocess write access, stores large binary blobs, and has minimal installation requirements. If I can get all that at once, my life will be much easier.


1 DiskCache

Our current front-runner:

DiskCache: Disk Backed Cache

DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.

The cloud-based computing of 2023 puts a premium on memory. Gigabytes of empty space is left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis) which is used as a cache. Wouldn’t it be nice to leverage empty disk space for caching? […]

DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There’s no need for a C compiler or running another process. Performance is a feature and testing has 100% coverage with unit tests and hours of stress.

DiskCache promises thread-safety and multiprocess-safety. For scientific cluster computing, where running persistent server processes is hard but caching is easy, this is typically what I want.
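
A minimal sketch of the dict-style and memoize interfaces, following the documented diskcache API (the cache directory and the memoized function are hypothetical):

```python
import diskcache

# Hypothetical cache directory; any writable path works.
cache = diskcache.Cache("/tmp/demo_cache")

# Dict-style access; safe across threads and processes.
cache["alpha"] = b"\x00" * 1024  # large binary blobs are fine
print(cache["alpha"][:4])

# Memoize an expensive function; results persist on disk across sessions.
@cache.memoize(expire=3600)
def slow_square(x):
    return x * x

print(slow_square(12))  # computed once, then served from disk
cache.close()
```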

2 LMDB

LMDB is not Python-specific, but it might do the job?

LMDB | symas

Symas LMDB is an extraordinarily fast, memory-efficient database we developed for the OpenLDAP Project. With memory-mapped files, LMDB has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.

Bottom line, with only 32KB of object code, LMDB may seem tiny. But it’s the right 32KB. Compact and efficient are two sides of a coin; that’s part of what makes LMDB so powerful.

Ordered-map interface: keys are always sorted; range lookups are supported
Fully transactional: full ACID semantics with MVCC
Reader/writer transactions: readers don’t block writers; writers don’t block readers
Fully serialized writers: writes are always deadlock-free
Extremely cheap read transactions: can be performed using no mallocs or any other blocking calls
Multi-thread and multi-process concurrency supported: environments may be opened by multiple processes on the same host
Multiple sub-databases may be created: transactions cover all sub-databases
Memory-mapped: allows for zero-copy lookup and iteration
Maintenance-free: no external process or background cleanup or compaction required
Crash-proof: no logs or crash recovery procedures required
No application-level caching: LMDB fully exploits the operating system’s buffer cache
32KB of object code and 6KLOC of C: fits in CPU L1 cache for maximum performance

It has a Python API. It seems to be optimized for small records of 1–2 kilobytes; larger items are handled less efficiently.
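
A minimal sketch of the transactional key-value interface via the lmdb Python binding (the path and map_size are illustrative; values must be bytes):

```python
import lmdb

# map_size caps the memory map (here ~1 GiB); the path is hypothetical.
env = lmdb.open("/tmp/demo_lmdb", map_size=2**30)

# Writes happen inside a write transaction; writers are fully serialized.
with env.begin(write=True) as txn:
    txn.put(b"key", b"value")

# Reads use cheap read transactions; readers don't block writers.
with env.begin() as txn:
    print(txn.get(b"key"))

env.close()
```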

3 Dogpile.cache

It was our previous front-runner.

dogpile.cache:

Dogpile consists of two subsystems, one building on top of the other.

dogpile provides the concept of a “dogpile lock”, a control structure which allows a single thread of execution to be selected as the “creator” of some resource, while allowing other threads of execution to refer to the previous version of this resource as the creation proceeds; if there is no previous version, then those threads block until the object is available.

dogpile.cache is a caching API which provides a generic interface to caching backends of any variety, and additionally provides API hooks which integrate these cache backends with the locking mechanism of dogpile. […]

Included backends feature three memcached backends (python-memcached, pylibmc, bmemcached), a Redis backend, a backend based on Python’s anydbm, and a plain dictionary backend.

Pro: it’s definitely smart and modern about locking.

Con: plain file disk persistence isn’t supported by default. The dbm-type wrappers can probably give us something acceptable for data that serializes well in a dbm database; presumably dogpile makes them safe for concurrent writes. We’re not sure how that would go with heavy numerical work.

There’s also a Redis backend if we need higher performance, though that reintroduces a server process.
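
A sketch of the dbm-backed route mentioned above, using the backend name and arguments from the dogpile.cache documentation (the filename and memoized function are hypothetical):

```python
from dogpile.cache import make_region

# A dbm-backed region; the filename is hypothetical.
region = make_region().configure(
    "dogpile.cache.dbm",
    expiration_time=3600,
    arguments={"filename": "/tmp/demo_cache.dbm"},
)

# The dogpile lock lets one caller regenerate a stale value while
# other callers keep reading the previous version.
@region.cache_on_arguments()
def slow_lookup(key):
    return key.upper()

print(slow_lookup("hello"))
```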

4 joblib.Memory

The joblib cache looks convenient:

Transparent and fast disk-caching of output value: a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary.

I can’t work out whether it’s safe for multiple writers, or whether it’s meant to be invoked only from a master process and thus doesn’t need locking. Surely, since it’s part of joblib, it should be safe for multiple processes?

It only supports function memoization, so accessing results some other way, or accessing partial results, gets convoluted unless we can naturally factor our code into memoized functions.
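
A minimal sketch of the memoization interface (the cache directory and function are hypothetical):

```python
import numpy as np
from joblib import Memory

# Hypothetical cache directory on disk.
memory = Memory("/tmp/joblib_cache", verbose=0)

@memory.cache
def simulate(n):
    # Large numpy outputs are what joblib is designed to store efficiently.
    return np.random.default_rng(n).standard_normal((n, n))

first = simulate(100)   # computed and written to disk
second = simulate(100)  # loaded from the disk cache
```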

5 Klepto

klepto (source) is also focused on scientific computation (it is part of the pathos project):

klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as lfu_cache and mru_cache. While caching is meant for fast access to saved results, klepto also has archiving capabilities, for longer-term storage. klepto uses a simple dictionary-style interface for all caches and archives, and all caches can be applied to any Python function as a decorator. Keymaps are algorithms for converting a function’s input signature to a unique dictionary, where the function’s results are the dictionary value. Thus for y = f(x), y will be stored in cache[x] (e.g. {x:y}).

klepto provides both standard and “safe” caching, where “safe” caches are slower but can recover from hashing errors. klepto is intended to be used for distributed and parallel computing, where several of the keymaps serialize the stored objects. Caches and archives are intended to be read/write accessible from different threads and processes. klepto enables a user to decorate a function, save the results to a file or database archive, close the interpreter, start a new session, and reload the function and its cache.

Given the use cases, we’d assume it’s safe for concurrent writes from multiple processes, but I can’t find any information in the documentation.
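
A minimal sketch of the dictionary-style archive interface, assuming klepto’s file_archive (the filename is hypothetical):

```python
from klepto.archives import file_archive

# A dict-like archive backed by a single file; the path is hypothetical.
arch = file_archive("results.pkl")
arch["x"] = 42
arch.dump()   # write the in-memory dict to disk

# In a later session, reload the archive:
arch2 = file_archive("results.pkl")
arch2.load()  # read the archive back into memory
print(arch2["x"])
```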

6 Flexicache

fastcore flexicache is “Like lru_cache, but customizable with policy funcs”. It’s backed by Jeremy Howard at Answer.AI. See Exploring flexicache.

7 Incoming

cachetools extends the standard library’s functools.lru_cache with standalone cache classes and decorators implementing further eviction policies (LFU, TTL, RR and so on).
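
A minimal sketch of its decorator interface (the maxsize and ttl values are illustrative; note these caches are in-memory only, not disk-backed):

```python
from cachetools import TTLCache, cached

# Entries are evicted after ttl seconds or when maxsize is exceeded.
@cached(cache=TTLCache(maxsize=128, ttl=600))
def lookup(name):
    return name.upper()

print(lookup("spam"))
```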