Abstract
Are viral dynamics, the peer-to-peer propagation of ideas and trends, important in social media systems compared to external influence? Are such dynamics quantifiable? Are they predictable? We suspect they are, in at least some cases, but quantifying how and when such dynamics are significant, and how and when we can detect this, remains an open question.
This thesis investigates how to estimate the parameters of branching dynamics in a large heterogeneous social media time series data set.
The specific model that I use, the class of Hawkes processes, has been used to model a variety of phenomena characterized by "self-exciting" dynamics: broadly speaking, time series where "lots of things happening recently" is the best predictor of "more things happening soon", conditional upon the external input to the system.
Variants have been applied as models of seismic events, financial market dynamics, opportunistic crime, epidemic disease spread, and viral marketing. Detecting self-exciting dynamics is of huge importance in these application areas, where it can make a large difference to the certainty and accuracy of prediction, and to the usefulness and practicality of interventions to change the behavior of the system.
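For orientation, one common way to write the conditional intensity of a linear Hawkes process is sketched below. The symbols $\mu$, $\eta$, $\phi$ and $t_i$ are notational assumptions made for this sketch and need not match the notation used in the body of the thesis.

```latex
\lambda(t \mid \mathcal{F}_t) \;=\; \mu \;+\; \eta \sum_{t_i < t} \phi(t - t_i),
\qquad \phi \ge 0, \quad \int_0^\infty \phi(s)\,\mathrm{d}s = 1 .
```

Here $\mu$ is the constant background (exogenous) rate, $t_i$ are the past event times, $\phi$ is a normalized kernel describing how quickly the influence of a past event decays, and $\eta$ is the expected number of direct "offspring" events triggered by each event. When $\eta < 1$ the process can be stationary, with mean rate $\mu / (1 - \eta)$.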
The data I investigate, documenting the time evolution of Youtube view counters, was collected by Crane and Sornette [32].
The notoriously viral nature of Youtube popularity suggests that this could supply an opportunity to attempt to quantify these viral dynamics.
The data set has various characteristics which make it a novel source of insight, both into self-exciting phenomena and into the difficulties of estimating them. The time series exhibit a huge variety of different behavioral regimes and different characteristics. While this data set contains many observations, it is also incomplete, in the sense that rather than a complete set of occurrence times, only sample statistics for that data are available.
These qualities present challenges both to the model I attempt to fit, and the estimator that I use to fit the model.
This places some constraints upon how precisely I can identify branching dynamics, with how much certainty, and the kinds of hypotheses I can support.
This thesis consists of two major phases.
In the first phase, I attempt to address the question: What component of the Youtube video views may be ascribed to self-excitation dynamics? In this regard I will attempt to estimate the parameters of generating Hawkes process models for various time series to identify the "branching coefficient" of these models, which is one measure of the significance of viral dynamics.
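To make the notion of a branching coefficient concrete, the following is a minimal simulation sketch, not the estimator used in this thesis. It assumes a constant background rate and an exponential kernel, neither of which is specified in this abstract, and the function name `simulate_hawkes_cluster` and its parameters are hypothetical. It uses the cluster representation of a Hawkes process: exogenous "immigrant" events arrive as a Poisson process, and every event independently triggers a Poisson number of offspring whose mean is the branching coefficient.

```python
import numpy as np


def simulate_hawkes_cluster(mu, eta, decay, t_max, rng):
    """Simulate Hawkes event times on [0, t_max] by the cluster representation.

    Illustrative assumptions (not taken from the thesis):
      mu    -- constant background (exogenous) rate
      eta   -- branching coefficient, mean direct offspring per event (< 1)
      decay -- rate of the exponential kernel governing offspring delays
    Returns the sorted event times and the number of exogenous events.
    """
    if not 0.0 <= eta < 1.0:
        raise ValueError("eta must lie in [0, 1) for a subcritical process")

    # Generation 0: exogenous immigrants from a homogeneous Poisson process.
    n_immigrants = rng.poisson(mu * t_max)
    generation = rng.uniform(0.0, t_max, size=n_immigrants)
    events = [generation]

    # Each event spawns Poisson(eta) children at exponentially distributed delays.
    while generation.size > 0:
        n_children = rng.poisson(eta, size=generation.size)
        parents = np.repeat(generation, n_children)
        children = parents + rng.exponential(1.0 / decay, size=parents.size)
        children = children[children <= t_max]  # discard events after the window
        events.append(children)
        generation = children

    return np.sort(np.concatenate(events)), n_immigrants


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    times, n_immigrants = simulate_hawkes_cluster(1.0, 0.5, 1.0, 5000.0, rng)
    endogenous_fraction = 1.0 - n_immigrants / times.size
    print(f"{times.size} events, endogenous fraction {endogenous_fraction:.3f}")
```

In a long simulated series the fraction of events that are not immigrants is close to `eta`, which is the sense in which the branching coefficient measures the share of activity attributable to self-excitation. In real data the parent of each event is unobserved, so this share has to be inferred rather than counted.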
Based on a naive application of the model, I find the evidence is ambiguous. Whilst I cannot reject the hypothesis of branching dynamics, I show that the model class is unidentifiable within this framework due to several problems.
The first is the unusual form of the dataset; the incompleteness of the time series leads to missing data problems with no immediate and computationally tractable solution.
But even with complete data, I face a second class of problems due to the misspecified model. For example, we should be surprised to ever fail to find branching dynamics at work, since branching dynamics is the only explanation permitted for intensity variation in this particular model. The classical Hawkes model assumes away any other source of time-variability, including exogenous influences.
The homogeneity assumption is not essential to modeling self-exciting systems, but merely a convenient assumption in a particular model class.
Therefore, in the second phase of the project I consider how to address these difficulties by weakening this assumption. I address the most commonly mentioned source of inhomogeneous behavior, exogenous influence, in what I believe to be a novel fashion.
I apply penalized semi-parametric kernel estimators to the data to simultaneously recover exogenous drivers of system behavior and the system parameters. A simple implementation of this idea recovers model parameters under plausible values for the dataset.
The particular combination of estimators and penalties I use here is, to the best of my knowledge, novel, and there are limited statistical guarantees available. I address this deficit with simulations, and discuss how the results might be given a more rigorous statistical foundation.
When applied to the data set in hand, the Youtube data, I find that there is support for the significance of branching dynamics; however, the parameters of the inferred process are different to those of the homogeneous estimator. This result shows that it is crucial to consider the driving process when fitting such models, and demonstrates the utility of investigating methods such as the one I use here to do so.