Results for the homogeneous Hawkes model

Here I discuss the results of applying the estimators to various subsets of the time series.

One possible option to deal with the problems identified in the simulation stage would be to manually select some "best" data sets that I believe to be free from inhomogeneity, and fit the estimator to those. This has the unfortunate quality that I have no well-specified notion of what the sampling process of selecting data that "looks like it fits my model" allows me to say about the data more generally. Certainly, finding data sets that "look" endogenous is a trivial procedure, and I have fit some as a diagnostic.

I return later to the idea of filtering the data sets to find tractable ones in a principled fashion, by suggesting that we can simply identify inhomogeneity using the right sort of estimator.

For now, I restrict myself to random sampling. I estimate model parameters from time series chosen uniformly without replacement from the set of YouTube videos. As discussed earlier, it is not clear whether this will give us information about the population of YouTube videos per se, but the same criticism could be made of many schemes. I let the computing cluster run through the time series in a random order until the batch jobs are terminated by exceeding their time limits. At the end of the process, there are 92183 time series results.

Using the AIC procedure, I examine the question of which model is selected.

Model selected       Count        %
No self-excitation       7     0.01
Exponential kernel   42259    45.84
Omori kernel         49917    54.15
Total                92183   100.00
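
The selection rule itself is mechanical. The following is a minimal sketch, assuming each candidate fit yields a maximized log-likelihood and a parameter count; the fitting results and parameter counts shown are illustrative placeholders, not actual output for any series.

```python
def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2*log-likelihood (lower is better)."""
    return 2 * n_params - 2 * log_lik

def select_kernel(fits):
    """Pick the candidate with the smallest AIC.

    `fits` maps a model name to a (maximized log-likelihood, parameter count) pair.
    """
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    return min(scores, key=scores.get), scores

# Illustrative numbers only, not actual fit output.
best, scores = select_kernel({
    "poisson": (-5210.0, 1),      # background rate only
    "exponential": (-5030.5, 3),  # mu, eta, kappa
    "omori": (-5031.2, 4),        # mu, eta plus two kernel parameters
})
print(best)  # -> "exponential"
```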

The Poisson distribution is, as expected, rejected apart from a handful of time series of near-constant rate. Much as with the misspecified test data, the kernel choice is reasonably evenly divided between the two alternative hypothetical kernel shapes. The data set is now too large for violin plots; however, some histograms convey the idea. I show a raw histogram of the estimated results; it is not, for example, weighted by a goodness-of-fit measure or AIC difference.

Histogram of AIC values for the fitted models. Positive numbers select the exponential kernel.

Cumulative histogram of AIC values for the fitted models. Positive numbers select the exponential kernel.

The distribution is qualitatively similar to the "lead balloon" fits earlier. We need sharper statistical tools, however, to see whether this means anything.

The estimates suggest a rather high branching ratio in general, although, as mentioned, we have implicitly excluded the alternative in the model choice stage.

Estimated branching ratio

The estimated parameters of the Omori kernels are messy, as with the simulated data. Once again I resort to the plugin kernel median estimate to give us a timescale. The plugin median estimates for both the Omori and exponential kernel fits are given here. The distribution is broad but shows, across both types, a peak at around 0.1-0.2, corresponding to a median influence decay on the order of 4 hours. This is not an implausible time scale to estimate from our data. For comparison, I also plot the histogram of observation intervals.

Estimated plugin kernel median timescale for the Omori and exponential fits, with the histogram of observation intervals for comparison.
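
For concreteness, the plugin median has a closed form once a kernel parameterization is fixed. The sketch below assumes an exponential kernel with rate \(\kappa\) and an Omori kernel of the Pareto type \(\theta c^\theta (t+c)^{-(1+\theta)}\); these parameterizations are my assumption here, not necessarily the ones used by the fitting code.

```python
import numpy as np

def exp_kernel_median(kappa):
    """Median delay of an exponential kernel with density kappa * exp(-kappa * t)."""
    return np.log(2.0) / kappa

def omori_kernel_median(c, theta):
    """Median delay of an Omori-type kernel with (assumed) density
    theta * c**theta * (t + c)**(-(1 + theta)) for t >= 0.
    Solving the CDF 1 - (c / (t + c))**theta = 1/2 gives the closed form below."""
    return c * (2.0 ** (1.0 / theta) - 1.0)
```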

Note, however, that if we believe these estimates are meaningful, we need to recall that the interpolation process has introduced an upward bias to these values; the "real" time scale is likely even shorter. This effect could be estimated by parametric bootstrap from the estimated parameters.
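
As a sketch of how that parametric bootstrap might look: simulate from the fitted parameters, push each replicate through the same censoring/interpolation/refitting pipeline, and compare the refitted values with the inputs. The simulator below assumes an exponential kernel with rate \(\kappa\) and branching ratio \(\eta < 1\); the `refit` argument is a placeholder for the actual pipeline, which is not reproduced here.

```python
import numpy as np

def simulate_exp_hawkes(mu, eta, kappa, T, rng=None):
    """Simulate an exponential-kernel Hawkes process on [0, T] via its branching
    (cluster) representation: immigrants arrive as a Poisson process of rate mu;
    each event spawns Poisson(eta) children with Exponential(rate=kappa) delays.
    Requires eta < 1 (subcritical)."""
    rng = np.random.default_rng() if rng is None else rng
    events = list(rng.uniform(0.0, T, rng.poisson(mu * T)))  # immigrants
    queue = list(events)
    while queue:
        parent = queue.pop()
        for delay in rng.exponential(1.0 / kappa, rng.poisson(eta)):
            child = parent + delay
            if child < T:  # descendants of dropped events also fall beyond T
                events.append(child)
                queue.append(child)
    return np.sort(np.array(events))

def parametric_bootstrap(mu_hat, eta_hat, kappa_hat, T, refit, n_boot=200):
    """Simulate from the fitted parameters and push each replicate through the
    same censoring/interpolation/refitting pipeline. `refit` stands in for that
    pipeline: a function mapping event times to (mu, eta, kappa) estimates.
    The spread of the returned estimates around the inputs indicates the bias
    and variance introduced by the pipeline."""
    rng = np.random.default_rng(1)
    return np.array([
        refit(simulate_exp_hawkes(mu_hat, eta_hat, kappa_hat, T, rng))
        for _ in range(n_boot)
    ])
```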

I might consider how much the estimates are influenced by the lead balloons in particular. A simple statistic to quantify this is the median occurrence time in the time series, considered as a fraction of the length of its observation period. If the rate of video views were constant, we would expect this to cluster at the 50% line. If half the views a video ever has were to occur in the first 5% of its sampling window, then this would be at the 5% line.
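
Computing that statistic is a one-liner; here is a minimal version, assuming the start and end of the observation window are known for each series.

```python
import numpy as np

def median_occurrence_fraction(event_times, t_start, t_end):
    """Median occurrence time of a series as a fraction of its observation window.
    Roughly 0.5 for a constant rate; values far below 0.5 mark 'lead balloon'
    behaviour (most views early, then collapse)."""
    return (np.median(event_times) - t_start) / (t_end - t_start)
```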

Distribution of time series median occurrence time.

This prevalence of lead-balloons is itself not uniformly distributed with regard to time series size. Rather, high view-rate time series are disproportionately likely to be lead balloons.

Distribution of median occurrence times for the top 5% of time series by view rate, showing an even more extreme skew towards sudden collapse.

It seems that early success is not necessarily associated with overall success in a simple manner. On one hand, this shows the interest of the data set: there are clearly non-trivial dynamics in play. On the other hand, these dynamics are ones that we know to be problematic.

We can informally diagnose at least one type of outlier. To see whether the distribution of these estimates is determined by the lead-balloon-type outliers, I filter out all time series whose median sample point occurs before the 50% mark. This restricts the estimates to the 29418 time series that are, by this informal measure, definitely not lead balloons. We are still in exploratory mode here, so I simply plot some histograms to informally inspect the changes in the distributions of estimates.
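
The filter itself, using the `median_occurrence_fraction` helper sketched above; `series` and `estimates` are stand-in names for however the per-video data and fit results are actually stored.

```python
# Retain only series whose median occurrence falls at or after the midpoint
# of the observation window. `series` is assumed to hold per-video
# (event_times, t_start, t_end) triples, and `estimates` the matching fits.
not_lead_balloon = [
    est
    for (times, t0, t1), est in zip(series, estimates)
    if median_occurrence_fraction(times, t0, t1) >= 0.5
]
```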

Estimated branching ratio for non-lead-balloons

The alteration is not spectacular; this particular method of selecting results does not produce a substantially different set of estimates.

If we return to the time series that I showed at the beginning of this report, there are signs that individual estimates are indeed misleading. Consider, for example, the archetypal "lead balloon" time series of David Navarro's fight, which I showed at the beginning; it is interesting for two reasons.

First, I note the estimates for this time series with the exponential kernel, which is selected over the Poisson. We have, to two decimal places, \(\hat{\mu}=31.8,\,\hat{\eta}=0.99,\,\hat{\kappa}=0.01\). Sure enough, the estimator has determined that this least viral of time series is very nearly critical, and it has detected a time scale of approximately 13 minutes, which is so far below the sampling resolution that it seems implausible to resolve. The extreme criticality estimate, however, shows that the estimator is matching our intuitions poorly. If we believe the model is well specified, we can easily bootstrap ourselves some confidence intervals, but even that seems like too much faith at the moment.

The second reason that it is interesting is that I do not have an Omori kernel estimate for this time series. The reason, it turns out, is that the estimation of those kernel parameters did not terminate in the permitted time. This time series, with 36198 occurrences, is not especially large, but calculation with such time series is simply expensive, and this one was too expensive. There is, then, a degree of censoring in the data regarding Omori law kernels. We could use alternative estimators that approximate the Omori kernel, especially if we think that Omori law kernels are implicated in the extreme behaviour of this system. On the other hand, since the simulations lead us to believe that we cannot precisely detect the extreme heavy-tailed Omori kernels that are of interest here, there does not seem to be immediate benefit to this particular use of CPU cycles.

Further work

Expanding the candidate set

The case for expanding the class of models we consider here is clear. Even a simple linear trend model for the background intensity might be a start, and it has a reasonably simple estimator [77].

It turns out not to be so easy to do this immediately, for basic technical reasons: the software package I use, pyhawkes, and indeed most of the commonly used packages for this estimation procedure, have no support for inference with a variable background rate. One can manually simulate a variable background rate by ad hoc procedures such as time-transformation of the data.

I did in fact attempt this procedure in an earlier stage of the project, but the problems with goodness of fit and model selection procedures were already severe enough that I decided not to add this particular layer of confusion. If I am committed to deriving a new estimator, then I need to choose the one with the highest practical return on investment, and that would involve a flexible and principled choice of inhomogeneity modeling. I believe the estimators outlined in the next chapter are my best chance in that regard.

Summary data estimation

If we are deeply invested in this data set, or in others with similar missing-data problems, we may wish to consider how to reduce the uncertainty due to the data interpolation, since the list of shaky steps in the estimator's construction is clearly too long at the moment, in the light of the results of the last chapter.

There are many potential ways to address this.

Without any change to the current estimator, we could aim to improve confidence intervals by using simulation methods that are robust against mis-specification. One can construct confidence intervals, and possibly even bias corrections, using the bootstrap; in the case of this model, ideally a nonparametric bootstrap [71, 72, 73, 21]. This is not trivial for time series with long-range dependence, but there is work in the area.
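
Whatever resampling scheme is used, turning bootstrap replicates into an interval is straightforward; a minimal percentile interval, assuming the replicated estimates are stacked row-wise:

```python
import numpy as np

def percentile_ci(boot_estimates, level=0.95):
    """Percentile bootstrap interval, computed column-wise over the replicates
    (rows = bootstrap replicates, columns = parameters)."""
    alpha = (1.0 - level) / 2.0
    return np.quantile(boot_estimates, [alpha, 1.0 - alpha], axis=0)
```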

We could try to construct an estimator that addresses the missing-data problem analytically. The problem of estimating Hawkes model parameters from summary statistics is not trivial, but the literature suggests potential solutions. I mention some of these here.

  1. The brute force method:

    Maximize the likelihood with respect to both the parameters and the missing time stamps. Then, at least, we are on firmer ground regarding the use of maximum likelihood estimation theory, since we will have maximised the likelihood over all the unknowns, not only the parameters of interest. This idea, while simple in principle, results in an optimization over \(N_T+\dim(\theta)\) unknowns with a complicated derivative, many constraints and strong coupling between the parameters. It turned out to be computationally prohibitive in my basic implementation.

  2. Stochastic expectation maximisation:

    Expectation Maximisation (EM) is a common iterative procedure for estimating "missing-data"-type problems in an ML framework. Informally, this estimator [38, 121] alternates between estimating the missing data and estimating the parameters given that missing data. While this particular form of the estimator does not seem to be any more tractable than the brute force method, stochastic variants [25, 37, 118, 70] allow us to sample from much simpler distributions to approximate the missing data, and EM procedures have been used for other problems in Hawkes process inference [116, 55]. Deriving an estimator for summary data is addressed by [5] (for small data sets) and more fully by [113], and the solution ultimately involves an EM-type procedure.

  3. Deconvolution: Cucala [33] gives a deconvoluting kernel estimate of point process intensity, which provides certain robustness guarantees against uncertainty in measurement times and might possibly be extended to our case.

  4. Bayesian inference: There are several attempts to derive Bayesian estimators for the state of self-exciting processes, both offline, in Markov Chain Monte Carlo settings [94], and online, as Sequential Monte Carlo [76].

  5. Summary estimator:

    It seems plausible that an estimator could be constructed that uses the observation summary statistics to calculate the full maximum likelihood estimate, computing the likelihood from the summarized observations directly without inferring occurrences. There is no simple closed-form representation for the conditional distributions here, but there are certain inequalities. For the exponential kernel the higher moments have a simple form [86], and for some other kernels at least a manageable form [19, 7]. We could also consider the use of moderate deviation principles and similar inequalities to bound estimates from subsampled data [61, 124].

  6. Indirect Inference:

    Inference by matching summary statistics between the data and simulations from the hypothesized model is well established in econometrics. Smith and Gourieroux introduced such methods for continuous-valued time series (Gourieroux 1993; Smith 1993), although there are point-process versions [66, 30]. The technique is not purely econometric, having been used in ecology as well [68]. It is involved and computationally expensive, but uncontroversial, and comes with its own analogue of maximum likelihood estimation theory, including various asymptotic results; a sketch of the basic matching loop follows this list.
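
A sketch of that matching loop, using occurrence counts per observation interval as the auxiliary statistic and borrowing the `simulate_exp_hawkes` branching simulator from the parametric-bootstrap sketch above; the statistic, the least-squares objective and the starting values are all illustrative choices, not recommendations.

```python
import numpy as np
from scipy.optimize import minimize

# `simulate_exp_hawkes` is the branching simulator sketched earlier in this chapter.

def bin_counts(event_times, bin_edges):
    """Auxiliary statistic: occurrence counts per observation interval."""
    counts, _ = np.histogram(event_times, bins=bin_edges)
    return counts

def indirect_inference(observed_counts, bin_edges, n_sims=20, seed=0):
    """Crude simulated-moments fit: choose (mu, eta, kappa) so that the average
    simulated bin counts match the observed ones in least squares. Assumes the
    observation window starts at time 0; a real implementation would use richer
    auxiliary statistics and a proper weighting matrix."""
    T = float(bin_edges[-1])

    def objective(params):
        mu, eta, kappa = params
        if mu <= 0.0 or not (0.0 < eta < 1.0) or kappa <= 0.0:
            return np.inf
        # Re-seeding per evaluation makes the objective a deterministic
        # function of the parameters (common random numbers).
        rng = np.random.default_rng(seed)
        sims = [
            bin_counts(simulate_exp_hawkes(mu, eta, kappa, T, rng), bin_edges)
            for _ in range(n_sims)
        ]
        return float(np.sum((np.mean(sims, axis=0) - observed_counts) ** 2))

    return minimize(objective, x0=[1.0, 0.5, 1.0], method="Nelder-Mead")
```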

Aside from the lack of guarantees, a shortcoming of the constant-rate interpolation estimator is that it has poor computational scaling behaviour. While a single YouTube video time series might encompass, say, a hundred thousand views, I still only have a few hundred sample points to use for inference. And yet evaluating the likelihood function involves simulating a hundred thousand synthetic data points, constructing in turn a kernel intensity from those hundred thousand synthetic points, and then evaluating that likelihood at each of the hundred thousand synthetic points. The computational cost of naïve evaluation of the likelihood function scales as \(\mathcal{O}(N^2)\) for \(N\) occurrences observed at \(n\) times, where in general \(N\gg n\). Optimizations exist to improve the scaling properties for exponential [92] and Omori kernels [90], although only the first of these is implemented in my software.
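
To make the scaling concrete: for the exponential kernel, the standard recursion reduces the log-likelihood evaluation from \(\mathcal{O}(N^2)\) to \(\mathcal{O}(N)\). A sketch, assuming intensity \(\mu + \sum_{t_i<t}\eta\kappa e^{-\kappa(t-t_i)}\) on \([0,T]\), which is my assumed parameterization rather than necessarily the one used by the software:

```python
import numpy as np

def exp_hawkes_loglik(times, mu, eta, kappa, T):
    """O(N) log-likelihood for a Hawkes process with intensity
    mu + sum_{t_i < t} eta * kappa * exp(-kappa * (t - t_i)) on [0, T],
    using the recursion A_i = exp(-kappa * (t_i - t_{i-1})) * (1 + A_{i-1})."""
    times = np.asarray(times)
    # Compensator term: -integral of the intensity over [0, T].
    loglik = -mu * T - eta * np.sum(1.0 - np.exp(-kappa * (T - times)))
    A = 0.0
    prev = None
    for t in times:
        if prev is not None:
            A = np.exp(-kappa * (t - prev)) * (1.0 + A)
        loglik += np.log(mu + eta * kappa * A)
        prev = t
    return loglik
```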

Ideally, we would have an estimator whose computational cost scaled with the number of observations rather than the number of occurrences, i.e. \(\mathcal{O}(n)\) rather than \(\mathcal{O}(N)\), with \(n \ll N\). I know of no such estimator. Rather, the estimation procedure is slower and has weak guarantees.

Computational efficiency is, like plumbing, something we would rather have work without paying it any attention. However, as with many large data sets, computational efficiency does become a factor later in this project, when I attempt to extend the model.

Moreover, if we did ultimately wish to use a model averaging procedure, the fact that the AIC scales with the number of occurrences rather than the number of observations should be concerning.

One potential fix for this problem is to exploit the decomposability of linear simple point processes such as the Hawkes process. We may decompose a point process and its intensity function into two simultaneous independent point processes [97]. This suggests that we might be able to "downsample" the time series by thinning, construct the estimate on the smaller series, and use those estimates to infer the behaviour of the larger series. Once again, I leave that for future work.
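
A rough sketch of the thinning step only: keep each occurrence independently with some probability and refit on the retained events. How the thinned-series estimates map back to the full-series parameters is exactly the part left open above; `fit_hawkes` is a placeholder name for whichever estimator is in use.

```python
import numpy as np

def thin_events(event_times, keep_prob, seed=0):
    """Bernoulli-thin a series of occurrence times, keeping each event
    independently with probability keep_prob."""
    rng = np.random.default_rng(seed)
    event_times = np.asarray(event_times)
    return event_times[rng.random(event_times.size) < keep_prob]

# e.g. fit on ~10% of the occurrences, then relate the result to the full series:
# thinned = thin_events(times, keep_prob=0.1)
# theta_thinned = fit_hawkes(thinned)  # placeholder estimator
```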

Regardless of the method, if we want to handle this data in a principled fashion, and especially if we care about estimating the heavy-tailed kernel types reliably, then pursuing a better estimator for this class of data is essential.

Goodness of fit

Finally, although it was clear here that the models fit "badly" in some sense, and thus I did not need a goodness-of-fit test, this absence will be pressing in future work, where we would like a statistical guarantee. Without one, we do not have a general way of diagnosing the shortcomings of the candidate model class, and so expanding the candidate model set is a dangerous proposition.