Stream processing and reactive programming

Lazy bookmark for practical details to processing and transforming possibly-infinite streams of data, from signals to parse trees. Disambiguating “transducers”.

Used in parallel/offline processing of very large data sets that do not fit in core, or processing things that happen in realtime such as UI.

I am imagining more general objects than singly-indexed real-valued signals; Tokens, maybe. Classic DSP can be elsewhere. Infrastructure to do this is under message queues.

In statistics and machine learning, this connects with online learning; incorporating data as it comes in, but there you probably care about good out-of-core optimisation algorithms.

See also artificial chemistry, parallel computing. (Streams aren’t necessarily parallel, but they are a convenient way of managing asynchrony.)

CSP/ FRP/ reactive programming

A parallelism/streaming thing. Communicating Sequential Processes. Functional Reactive Programming.

  • CSP and transducers – I don’t think these are transducers as I understand them, i.e. stack machines, but I could be wrong. See also the expositions from the clojure authors:

  • ReactiveX is a particular stream processing paradigm with implementations for many languages

  • Reactive manifesto:

    We believe that a coherent approach to systems architecture is needed, and we believe that all necessary aspects are already recognized individually: we want systems that are Responsive, Resilient, Elastic and Message Driven. We call these Reactive Systems.

    Systems built as Reactive Systems are more flexible, loosely-coupled and scalable. This makes them easier to develop and amenable to change. They are significantly more tolerant of failure and when failure does occur they meet it with elegance rather than disaster. Reactive Systems are highly responsive, giving users effective interactive feedback.

  • A collection of links for streaming algorithms and data structures

Javascript

See FRP in Javascript.

Python

See Python See FRP in Python.

Streaming data analysis

Online, possibly realtime, certainly memory-constrained.

  • Heka

    • Loading and parsing log files from a file system.

    • Accepting statsd type metrics data for aggregation and forwarding to upstream time series data stores such as graphite or InfluxDB.

    • Launching external processes to gather operational data from the local system.

    • Performing real time analysis, graphing, and anomaly detection on any data flowing through the Heka pipeline.

    • Shipping data from one location to another via the use of an external transport (such as AMQP) or directly (via TCP).

    • Delivering processed data to one or more persistent data stores.

    Written in Go, plugins in Lua.

  • Apache Storm

    • Storm-compatible, Heron aims to be Storm-but-more-reliable.

To read

Hu, Tao, Cengiz Pehlevan, and Dmitri B. Chklovskii. 2014. “A Hebbian/Anti-Hebbian Network for Online Sparse Dictionary Learning Derived from Symmetric Matrix Factorization.” In 2014 48th Asilomar Conference on Signals, Systems and Computers. https://doi.org/10.1109/ACSSC.2014.7094519.

McSherry, Frank D., Rebecca Isaacs, Michael A. Isard, and Derek G. Murray. 2013. Differential dataflow. 20130304744 A1. U.S. patent, issued November 14, 2013. https://www.google.com/patents/US20130304744.

Murray, Derek G., Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. “Naiad: A Timely Dataflow System.” In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 439–55. SOSP ’13. New York, NY, USA: ACM. https://doi.org/10.1145/2517349.2522738.

Pan, Gang, Wangsheng Zhang, Zhaohui Wu, and Shijian Li. 2014. “Online Community Detection for Large Complex Networks.” PLoS ONE 9 (7): e102799. https://doi.org/10.1371/journal.pone.0102799.

Ryabko, Daniil, and Boris Ryabko. 2010. “Nonparametric Statistical Inference for Ergodic Processes.” IEEE Transactions on Information Theory 56 (3): 1430–5. https://doi.org/10.1109/TIT.2009.2039169.

Sorensen, Andrew, and Henry Gardner. 2010. “Programming with Time: Cyber-Physical Programming with Impromptu.” In ACM Sigplan Notices, 45:822. ACM Press. https://doi.org/10.1145/1869459.1869526.