Results for the inhomogeneous Hawkes model

Once again, I fit the estimator to selected time series from the YouTube data set, withholding concrete hypotheses for the moment, and I report the estimates.

My limits in this section are sharper. My algorithm is far more computationally intensive, and my code far less highly optimized, than pyhawkes. I will thus, in an exploratory mode, fit parameters to small subsets to validate that this idea in fact gets us somewhere, and see what further directions are supported for this kind of analysis.

Single time series detailed analysis

Turning to specific examples, I recall the time series with id '-2IXE5DcWzg', Valentin Elizalde, Volvere a amar. The question I posed at the start was whether his biographer's hypothesized milestone was the cause of the spike in the data. Did this singer get more views on YouTube because Billboard magazine listed him, or did Billboard magazine list him because of his surging popularity? Until this moment we have had no tools that could even hint at the answer to this kind of question.

I fit the inhomogeneous estimator to the data. Using AICc, I select the fit with penalty \(\pi=0.466\), corresponding to \(\hat{\mu}=5.59,\, \hat{\eta}=0.963,\, \hat{\kappa} = 0.186\).
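The selection step can be sketched as follows. This is a minimal illustration rather than the code used for the fits reported here: it assumes that each candidate fit along the penalty path records its maximized log-likelihood and an effective parameter count `k`. The function names and the numbers in `fits` are hypothetical, chosen only to show the mechanics.

```python
import math

def aicc(log_lik, k, n):
    """Small-sample corrected AIC for k effective parameters and n observations."""
    if n - k - 1 <= 0:
        return math.inf  # correction term undefined; treat the fit as inadmissible
    return 2 * k - 2 * log_lik + (2 * k * (k + 1)) / (n - k - 1)

def select_by_aicc(fits, n):
    """Return the candidate fit with the lowest AICc."""
    return min(fits, key=lambda f: aicc(f["log_lik"], f["k"], n))

# Hypothetical fits along a penalty path (invented numbers, not results).
fits = [
    {"penalty": 0.1, "log_lik": -1200.0, "k": 12},
    {"penalty": 0.5, "log_lik": -1204.0, "k": 5},
    {"penalty": 2.0, "log_lik": -1230.0, "k": 3},
]
best = select_by_aicc(fits, n=500)
print(best["penalty"])
```

A heavier penalty leaves fewer free parameters but a lower likelihood; AICc trades the two off, and the extra correction term matters most when the number of observations is small relative to `k`.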

Exemplary time series

View-rate estimate \(\hat{\lambda}_{\mathrm{simple}}(t)\), for time series Valentin Elizalde, Volvere a amar

Graphing the estimated background intensity, I find that the model does estimate an increased intensity at around the start of that surge in interest. However, the date it suggests is 2007-02-21, substantially before the Billboard listing on 2007-03-3. This model suggests that we need to look elsewhere to find an exogenous trigger. At the same time, it suggests that the singer was highly viral, with an endogenous

Aggregate analysis

Turning to bulk analysis, I try to fit as many models as possible.

The price of my improvements in the estimator is a high computational burden. Running the estimator through the random list of time series, I find that I have so far estimated parameters for only 913 models.

I give summaries of these here. This is purely informational. I would need to know more about the sampling distribution of the estimates in order to draw strong conclusions about population properties, even before I consider how to address the various other difficulties with the data.

Branching ratio estimates

Branching ratio estimates for the inhomogeneous estimator

Time scale estimates

Kernel delay estimates for the inhomogeneous estimator. I show the median here for consistency with the analysis of the homogeneous estimator.

Despite the significant differences that considering the background rate makes on selected time series, over the ensemble my innovations turn out to make little difference to the sampling distribution of estimates for randomly chosen series from the data. What is happening here?

There are several possibilities. First, the problematic "lead balloon" and "spiky" time series that I chose to test the estimator may not be significant on the scale of the population. Second, we might be missing other significant types of inhomogeneity, such as slowly varying fluctuations. Finally, it could be that this random sample is not representative of the population.

Certainly, there is more to do.

We might consider constraining our search space by hypothesizing that the influence kernel of these videos has a universal decay time, so that we could exploit the huge amount of data available. Once we estimate only the background rate and branching ratio for each individual time series, rather than all the other parameters anew, we can exploit the data more effectively.
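To make the shared-kernel idea concrete, here is a hedged sketch of a pooled log-likelihood in which every series keeps its own background rate and branching ratio but all share one decay time. It rests on simplifying assumptions that are mine rather than the model fitted above: a constant background rate `mu` per series, an exponential kernel \(\eta\, e^{-s/\kappa}/\kappa\), and exact event times. All function names are hypothetical.

```python
import math

def hawkes_loglik(times, T, mu, eta, kappa):
    """Log-likelihood on [0, T] of a Hawkes process with constant background mu
    and exponential kernel eta * exp(-s / kappa) / kappa (both assumptions)."""
    a = 0.0        # running sum of exp(-(t_i - t_j) / kappa) over earlier events j
    prev = None
    ll = -mu * T   # compensator contribution of the background rate
    for t in times:
        if prev is not None:
            a = math.exp(-(t - prev) / kappa) * (a + 1.0)
        ll += math.log(mu + eta * a / kappa)             # log intensity at the event
        ll -= eta * (1.0 - math.exp(-(T - t) / kappa))   # compensator of this event's kernel
        prev = t
    return ll

def pooled_loglik(series, params, kappa):
    """Sum of per-series log-likelihoods under one shared decay time kappa.

    series: list of (times, T) pairs; params: list of (mu, eta), one per series.
    """
    return sum(
        hawkes_loglik(times, T, mu, eta, kappa)
        for (times, T), (mu, eta) in zip(series, params)
    )
```

Maximizing `pooled_loglik` over `kappa` and the per-series pairs would let every series contribute to the estimate of the decay time, which is the sense in which the pooled data are used more effectively.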

The most logical next step, however, would be to set the estimator running on the most problematic sets of time series in the database, and then, while the computing cluster is humming away, get out a pencil and derive a new goodness-of-fit test for the model.