I need to cache stuff in Python sometimes, and I would like it to be easy. For my (scientific computation) needs, I would ideally like a cache that requires no server process, that nonetheless provides locking or otherwise safe access for multiprocess writes, that stores big binary blobs of data, and that has minimal installation requirements. If I can get all that at once, my life will be easy.
DiskCache is an Apache2 licensed disk and file backed cache library, written in pure-Python, and compatible with Django.
The cloud-based computing of 2023 puts a premium on memory. Gigabytes of empty space is left on disks as processes vie for memory. Among these processes is Memcached (and sometimes Redis) which is used as a cache. Wouldn’t it be nice to leverage empty disk space for caching? […]
DiskCache efficiently makes gigabytes of storage space available for caching. By leveraging rock-solid database libraries and memory-mapped files, cache performance can match and exceed industry-standard solutions. There’s no need for a C compiler or running another process. Performance is a feature and testing has 100% coverage with unit tests and hours of stress.
It promises thread-safety and multiprocess-safety. For scientific cluster computing, where persistent server processes are hard but caching to disk is easy, this is typically what I want.
LMDB is not Python-specific, but it possibly does the job?
Symas LMDB is an extraordinarily fast, memory-efficient database we developed for the OpenLDAP Project. With memory-mapped files, LMDB has the read performance of a pure in-memory database while retaining the persistence of standard disk-based databases.
Bottom line, with only 32KB of object code, LMDB may seem tiny. But it's the right 32KB. Compact and efficient are two sides of a coin; that's part of what makes LMDB so powerful.
- Ordered-map interface
  - keys are always sorted; range lookups are supported
- Fully transactional
  - full ACID semantics with MVCC
- Reader/writer transactions
  - readers don't block writers; writers don't block readers
- Fully serialized writers
  - writes are always deadlock-free
- Extremely cheap read transactions
  - can be performed using no mallocs or any other blocking calls
- Multi-thread and multi-process concurrency supported
  - environments may be opened by multiple processes on the same host
- Multiple sub-databases may be created
  - transactions cover all sub-databases
- allows for zero-copy lookup and iteration
- no external process or background cleanup or compaction required
- no logs or crash recovery procedures required
- No application-level caching
  - LMDB fully exploits the operating system's buffer cache
- 32KB of object code and 6KLOC of C
  - fits in CPU L1 cache for maximum performance
It has a Python API.
It seems to target small records, of 1-2 kilobytes, and bigger stuff is less efficient.
Dogpile consists of two subsystems, one building on top of the other.
dogpile provides the concept of a "dogpile lock", a control structure which allows a single thread of execution to be selected as the "creator" of some resource, while allowing other threads of execution to refer to the previous version of this resource as the creation proceeds; if there is no previous version, then those threads block until the object is available.
dogpile.cache is a caching API which provides a generic interface to caching backends of any variety, and additionally provides API hooks which integrate these cache backends with the locking mechanism of dogpile.
Included backends feature three memcached backends (python-memcached, pylibmc, bmemcached), a Redis backend, a backend based on Python's anydbm, and a plain dictionary backend.
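The "dogpile lock" idea in the description above can be sketched with nothing but the standard library. This is my own toy illustration of the concept, not dogpile's implementation or API; the `DogpileValue` class and its method names are hypothetical.

```python
import threading
import time


class DogpileValue:
    """Toy sketch of the dogpile-lock idea (hypothetical, not dogpile's API)."""

    def __init__(self, creator, expiration_time):
        self._creator = creator
        self._expiration = expiration_time
        self._value = None
        self._created_at = None  # None means "never created"
        self._lock = threading.Lock()

    def _is_stale(self):
        if self._created_at is None:
            return True
        return time.monotonic() - self._created_at > self._expiration

    def get(self):
        if not self._is_stale():
            return self._value
        have_previous = self._created_at is not None
        # With a previous version available, do not wait for the lock:
        # one thread regenerates while the others return the old value.
        # With no previous version, block until the creator is done.
        if self._lock.acquire(blocking=not have_previous):
            try:
                if self._is_stale():  # another thread may have just finished
                    self._value = self._creator()
                    self._created_at = time.monotonic()
            finally:
                self._lock.release()
        return self._value


calls = []


def creator():
    calls.append(1)
    time.sleep(0.05)  # pretend this is expensive
    return "resource"


shared = DogpileValue(creator, expiration_time=60)
threads = [threading.Thread(target=shared.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
result = shared.get()
```

Eight threads ask for the value at once, but only one of them runs `creator`; the rest wait for it because no previous version exists yet.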
Pro: it is definitely smart and modern about locking.
Con: plain file disk persistence is not supported by default. The dbm-type wrappers can probably get us something acceptable for data that serialises well in a dbm database; presumably dogpile makes them safe for concurrent writes. Not sure how it would go with heavy numerical work.
There is an advanced version for higher-performance Redis.
The joblib cache looks convenient:
Transparent and fast disk-caching of output value: a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary.
I can’t work out if it’s multi-write safe, or supposed to be only invoked from some master process and thus not need locking. Surely, as part of joblib, it should be multi-process safe?
It only supports function memoization, so accessing results some other way, or accessing partial results, can get convoluted unless your code factors naturally into memoized functions.
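A minimal sketch of the memoization pattern, assuming `joblib` is installed. The cache directory and the `simulate` function are made up; the `calls` list is only there to show when the function body actually runs.

```python
import tempfile

from joblib import Memory

# Hypothetical cache location; verbose=0 silences joblib's logging.
memory = Memory(tempfile.mkdtemp(prefix="joblib-demo-"), verbose=0)

calls = []


@memory.cache
def simulate(n, scale):
    calls.append((n, scale))
    return [scale * i for i in range(n)]


first = simulate(4, 0.5)   # computed, then written to disk
second = simulate(4, 0.5)  # same arguments: loaded from disk
third = simulate(3, 0.5)   # new arguments: computed again
```

The only handle on a stored result is calling the function again with the same arguments, which is the limitation described above.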
klepto extends Python’s lru_cache to utilize different keymaps and alternate caching algorithms, such as mru_cache. While caching is meant for fast access to saved results, klepto also has archiving capabilities, for longer-term storage.
klepto uses a simple dictionary-style interface for all caches and archives, and all caches can be applied to any Python function as a decorator. Keymaps are algorithms for converting a function’s input signature to a unique dictionary, where the function’s results are the dictionary value. Thus for y = f(x), y will be stored in cache[keymap(x)].
klepto provides both standard and “safe” caching, where “safe” caches are slower but can recover from hashing errors.
klepto is intended to be used for distributed and parallel computing, where several of the keymaps serialize the stored objects. Caches and archives are intended to be read/write accessible from different threads and processes.
klepto enables a user to decorate a function, save the results to a file or database archive, close the interpreter, start a new session, and reload the function and its cache.
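The keymap idea from the klepto description can be sketched with the standard library alone. This is my own illustration of the concept, not klepto's API; `keymap` and `cached` are hypothetical names, and the plain dict stands in for whatever cache or archive you would really use.

```python
import functools
import hashlib
import pickle


def keymap(*args, **kwargs):
    """Hypothetical keymap: hash the pickled call signature to a string key."""
    payload = pickle.dumps((args, sorted(kwargs.items())))
    return hashlib.sha256(payload).hexdigest()


def cached(cache):
    """Decorator that stores results in `cache`, any dict-like object."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = keymap(*args, **kwargs)
            if key not in cache:
                cache[key] = func(*args, **kwargs)
            return cache[key]
        return wrapper
    return decorator


cache = {}


@cached(cache)
def f(x):
    return x ** 2


y = f(3)        # computed and stored under keymap(3)
y_again = f(3)  # served from the dict
```

Swapping the dict for something persistent is where an archive would come in.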
Given the use cases, one would assume this is safe for concurrent writes from multiple processes, but I cannot find info in the documentation.