tl;dr I’m not currently using transfer entropy, so I should not be taken as an expert. But I have dumped some notes here from an email I was writing to a physicist, explaining why I don’t think it is, in general, a meaningful thing to estimate from data “non-parametrically”.
That explanation needs to be written, but I never got around to finishing it. The key point is that if you want to estimate this quantity empirically, you should just use an appropriate time-series graphical model instead. Then you recover the main utility of transfer entropy, with more general interaction structures than the discrete-time multivariate series that transfer entropy uses. You also get to choose your favourite conditional independence test, and your estimation theory is better, or at least not worse, and more general, or at least not less general. You can, for example, use an information-theoretic dependence test in that framing, if that is important to you for any reason, or a kernel mean embedding, or \(\chi^2\)…
Transfer entropy is, like Granger causality, a quantity summarising, between two random processes, a particular species of Wiener-causation. As Granger summarized it:
The statement about causality has just two components:
The cause occurs before the effect; and
The cause contains information about the effect that is unique, and is in no other variable.
In practice this “in no other variable” business is usually quietly ignored in favour of “in no other variable that I have to hand”; for a better approach to this see causal DAGs.
Transfer entropy is the brainchild of Thomas Schreiber, Peter Grassberger, Andreas Kaiser and others. It makes particular assumptions about the form of the data (discrete-time series) and about the method by which one quantifies dependence: it is based on the KL-divergence (a kind of information measure) between two models of a pair of stochastic processes. In the first model, you assume that the two processes are both Markov, but completely independent. In the second, you assume that the two sequences are jointly Markov. The transfer entropy is the KL-divergence between the distributions of the next time step under each model. Intuitively, it tells us how much predictive power we lose by assuming that the sequences are independent.
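As a sketch of what this looks like in symbols, assume histories of length one and write \(X\) for the target process and \(Y\) for the source (this notation is an assumption for illustration, not taken from the papers). A common form of the definition is then

\[
T_{Y \to X} = \sum_{x_{t+1},\, x_t,\, y_t} p(x_{t+1}, x_t, y_t) \log \frac{p(x_{t+1} \mid x_t, y_t)}{p(x_{t+1} \mid x_t)}.
\]

This is the KL-divergence between the two predictive distributions for \(x_{t+1}\), averaged over the joint distribution of the current states; equivalently, it is the conditional mutual information \(I(X_{t+1}; Y_t \mid X_t)\). It is zero exactly when \(Y_t\) adds nothing to the prediction of \(X_{t+1}\) beyond what \(X_t\) already gives.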
One needs to make this concrete by plugging in specific assumptions on the form of the process. One such special type of Wiener-causality, Granger-causality, is based on linear vector ARIMA time series. Barnett, Barrett, and Seth (2009) show that for the special case where your processes follow a jointly autoregressive linear model with Gaussian noise, transfer entropy is equivalent to Granger causality.
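To sketch the Gaussian case in symbols (again, the notation is assumed for illustration): regress \(x_{t+1}\) on its own past alone, giving residual variance \(\sigma^2_{\text{restricted}}\), and on its own past together with the past of \(y\), giving residual variance \(\sigma^2_{\text{full}}\). For a scalar target, the equivalence is usually stated as

\[
F_{Y \to X} = \ln \frac{\sigma^2_{\text{restricted}}}{\sigma^2_{\text{full}}}, \qquad T_{Y \to X} = \tfrac{1}{2} F_{Y \to X},
\]

with the transfer entropy measured in nats, so the two quantities agree up to a factor of 2.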
Or if your time series is a finite discrete random variable you can just use discrete Markov chains, as in Lizier and Prokopenko (2010). Other models are possible, but I haven’t used any such.
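For the finite discrete case, a minimal sketch of a plug-in (count-based) estimate with history length one might look like the following. The function name `transfer_entropy` and the toy series are hypothetical, chosen for illustration; this is not the estimator from Lizier and Prokopenko (2010), just the naive one you get by replacing probabilities with empirical frequencies.

```python
import random
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in estimate of TE(source -> target) in bits, history length 1.

    Assumes both series are equal-length sequences of hashable symbols.
    """
    # Count triples (x_{t+1}, x_t, y_t), with x the target and y the source.
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    n = sum(triples.values())

    # Marginal counts needed for the two conditional distributions.
    count_x0_y0 = Counter()  # (x_t, y_t)
    count_x1_x0 = Counter()  # (x_{t+1}, x_t)
    count_x0 = Counter()     # x_t
    for (x1, x0, y0), c in triples.items():
        count_x0_y0[(x0, y0)] += c
        count_x1_x0[(x1, x0)] += c
        count_x0[x0] += c

    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint_model = c / count_x0_y0[(x0, y0)]           # p(x1 | x0, y0)
        p_self_model = count_x1_x0[(x1, x0)] / count_x0[x0]  # p(x1 | x0)
        te += (c / n) * log2(p_joint_model / p_self_model)
    return te


# Toy example: y is a fair coin, and x copies y with a one-step delay.
rng = random.Random(0)
y = [rng.randrange(2) for _ in range(5000)]
x = [0] + y[:-1]

te_y_to_x = transfer_entropy(y, x)  # y's present determines x's future
te_x_to_y = transfer_entropy(x, y)  # x tells us nothing new about y's future
print(f"TE(y -> x) = {te_y_to_x:.3f} bits, TE(x -> y) = {te_x_to_y:.3f} bits")
```

In this toy setup the first estimate should come out close to 1 bit and the second close to 0, which is the asymmetry transfer entropy is meant to detect.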
Why do we care about this model of causation?
There is a famous data set from an ancient Santa Fe time series data analysis contest, of ECG and breath data. Transfer entropy has been applied to this to measure whether heart rate “t-causes” breath rate or vice versa. However, if you really wish to know whether heart rate “causes” breath rate or breath rate “causes” heart rate, at least one experiment to work it out has been done many times: stop either someone’s heart or breath for long enough, and the other one will stop shortly after. Homework: What is the relationship between the model of causation implicit in this experiment and the one from the observational time series data?
Like all Wiener-causation, TE does not measure causal influence per se but predictive usefulness. G-causation (or t-causation?) is not like intuitive causation. Specifically, we are often interested not only in how well we can predict one from the other, but also in how we can change overall system behaviour by intervening in it. This is a complicated question, and a different one from asking which parts of a system are informative about which others. See, e.g., causal DAGs.
Estimating from data
All this is about stochastic processes for which we know the parameters. Why do we want to calculate this predictive importance measure for such processes? More usually, you want to gain insight into some real-world stochastic process from which you have data, but imperfect knowledge of the parameters.
If you have to estimate the transfer entropy between processes with unknown parameters from noisy observations, you have now arrived in the world of statistics.
How can you estimate it? Which parametric models work? Which nonparametric methods?
TODO: mention how to do this; specifically, mention what might go wrong with information estimation.
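One well-known thing that can go wrong is that naive count-based information estimates are biased upwards in small samples. The sketch below, with hypothetical helper names (`entropy_bits`, `plugin_te`, `mean_te`) and an arbitrary 4-symbol alphabet, estimates transfer entropy between two independent i.i.d. series, where the true value is zero. It is an illustration of the bias under these assumptions, not a recommendation of this estimator.

```python
import random
from collections import Counter
from math import log2


def entropy_bits(samples):
    """Plug-in (empirical frequency) entropy of a list of symbols, in bits."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())


def plugin_te(source, target):
    """Plug-in TE(source -> target) with history length 1, written as the
    conditional mutual information
    H(X1, X0) + H(X0, Y0) - H(X0) - H(X1, X0, Y0)."""
    x1, x0, y0 = target[1:], target[:-1], source[:-1]
    return (entropy_bits(list(zip(x1, x0)))
            + entropy_bits(list(zip(x0, y0)))
            - entropy_bits(list(x0))
            - entropy_bits(list(zip(x1, x0, y0))))


rng = random.Random(1)


def iid_series(n, k=4):
    """An i.i.d. uniform series over k symbols (k=4 is an arbitrary choice)."""
    return [rng.randrange(k) for _ in range(n)]


def mean_te(n, trials=20):
    """Average plug-in TE between pairs of independent series of length n."""
    total = sum(plugin_te(iid_series(n), iid_series(n)) for _ in range(trials))
    return total / trials


te_small = mean_te(200)    # short series: estimate is visibly above zero
te_large = mean_te(20000)  # long series: estimate shrinks towards zero
print(f"n=200: {te_small:.4f} bits, n=20000: {te_large:.4f} bits")
```

Since the series are independent by construction, any positive value here is pure estimation error, so a raw plug-in estimate cannot be read as evidence of dependence without some null distribution or bias correction to compare against.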