# Simulations for the homogeneous estimator

## Point estimates

Maximum likelihood estimators for complete time series are well studied. [29, 87, 92, 99] The novel quality of the problem here is the summarization and interpolation step. I estimate the significance of this by simulation.

As mentioned, the observation interval is variable both within and between time series, and there is no known model for the intervals. For want of a better alternative, in my simulations I will use a constant observation interval within each time series. I start with the case of estimating the parameters of a Hawkes model with exponential triggering kernel from data generated by such a process.

I choose a set of 9 different sampling intervals \(\{\Delta \tau_{-4}=2^{-4}, \Delta \tau_{-3}=2^{-3}, \dots, \Delta \tau_{4}=2^{4}\}\). I fix the number of observations per realization at \(n=201\) and the branching ratio at \(\eta=0.9\), and hold the kernel parameters \(\kappa\) fixed. To keep the number of occurrences comparable, I choose \(\mu_0=5.0\) and \(\mu_i=\mu_0/\Delta\tau_i\). I pick \(M=300\) simulations.
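The experimental design above can be sketched in a few lines. This is an illustrative setup only; the variable names (`eta`, `kappa`, `mu_0`, `delta_taus`) follow the text, but the exact bookkeeping is my assumption:

```python
import numpy as np

n_obs = 201          # observations per realization
eta = 0.9            # branching ratio
kappa = 1.0          # exponential kernel parameter
M = 300              # simulations per interval

# 9 dyadic sampling intervals, 2^-4 .. 2^4
delta_taus = 2.0 ** np.arange(-4, 5)

# scale the background rate so the expected occurrence count is comparable
mu_0 = 5.0
mus = mu_0 / delta_taus

# total simulation window for each interval: 200 steps of size delta_tau
windows = (n_obs - 1) * delta_taus
```

With this scaling the expected occurrence count per realization is the same for every interval, since \(\mu_i \cdot 200\Delta\tau_i = 200\mu_0\) regardless of \(i\).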

For each \(i\), I simulate a realization of a Hawkes process \(N_{m,i}(\cdot)\) over the interval \([0,200\Delta\tau_i]\). I construct the maximum likelihood estimate \(\hat{\theta}_\mathrm{complete}\) for the parameters from this realization. Next, I reduce the realization to summary tuples.

I synthesize "complete data" for these summary tuples with a piecewise constant-rate process to create a new time series \(N'_{m,i}(\cdot)\), and estimate the parameters from the new series.
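A minimal sketch of this summarize-and-reinterpolate step, assuming the summary tuple for each bin is (bin start, bin end, occurrence count); the helper names are mine, not from the original code:

```python
import numpy as np

def summarize(times, bin_edges):
    """Reduce event times to per-bin occurrence counts."""
    counts, _ = np.histogram(times, bins=bin_edges)
    return counts

def reinterpolate(counts, bin_edges, rng):
    """Synthesize 'complete data' consistent with a piecewise-constant rate:
    place each bin's count uniformly at random within that bin."""
    new_times = []
    for k, c in enumerate(counts):
        lo, hi = bin_edges[k], bin_edges[k + 1]
        new_times.append(np.sort(rng.uniform(lo, hi, size=c)))
    return np.concatenate(new_times)
```

By construction, summarizing the reinterpolated series recovers exactly the original counts; only the within-bin timing information is randomized away.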

Observe that each realization constructed this way has a similar expected number of occurrences \(\mathrm{E}N(200\Delta\tau_i)\). Due to edge effects we should expect slightly fewer occurrences when the time step, and hence the whole simulation interval, is shorter. Each realization has an identical number (201) of observations of those occurrences.

I do not resample any single realization with multiple observation intervals, since that would confound the number of observations with the influence of the sample window; rather, I simulate each time with a different seed for the pseudorandom number generator.

Over the 300 simulations I get an estimate of the sampling distribution of the parameter point estimates as the sample step size changes relative to the kernel size. Equivalently I could change the kernel size and leave the step size constant; as there is no other inherent scale in the process, the results are equivalent.

The procedure described thus far corresponds to time series started "from zero", with no history before the start of the data collection, and thus an initially lower rate than the stationary process. An alternative is to sample from the stationary model. Practically, the "stationary" case corresponds to simulating over the interval \([0,C+200\Delta\tau_i]\) but fitting over the interval \([C,C+200\Delta\tau_i]\) for some sufficiently large positive \(C\).
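Only the windowing logic of this "stationary" variant needs illustration; the Hawkes simulator itself is assumed to exist elsewhere. A sketch:

```python
def stationary_window(times, C, T):
    """Discard the burn-in [0, C): keep events in the fitting window
    [C, C + T], re-origined to start at 0."""
    return [t - C for t in times if C <= t <= C + T]
```

The burn-in length \(C\) must be large relative to the kernel timescale so that the process has effectively reached stationarity by time \(C\).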

Since ultimately the datasets that I use are themselves usually started "from zero", in that the first data point is close to the start of the series, the non-stationary case seems more relevant.
I display only the "started from zero" data here.
Estimation from stationary simulated data is generally *worse*, in the sense that the point estimates have larger bias.

I set \(\kappa=1\). This gives a mean delay of 1.

I now plot the inferred sampling distributions against the sampling distribution of the complete data estimator. Across the 300 simulations, I get an average of 8627 occurrences per realization.

This is reasonable performance.
The estimates for the branching ratio are somewhat biased, as are those for the background rate.
It would be interesting to know whether these estimates are *consistent*, with a bias that vanishes in the large-sample limit.
That question is irrelevant for the current data, which has limited sample sizes.

The estimates for the kernel time scale are the most affected by the resampling process. Compared to the complete data case, almost all estimates have significant bias, and the degree of this bias depends upon the scale of the kernel. In the best case, when the kernel scale is of the same order as the observation interval, the estimate acquires only a mild upward bias by comparison with the complete data estimates. In all other cases the bias is large. The performance is good enough to use as-is for the current experiment. When the sampling window hits 16 the estimate is wrong by a factor of approximately two --- but then this estimate is for the parameters of a kernel whose time scale is \(\frac{1}{16}\) of the observation interval, a regime at which one might give up hope of estimating the parameters at all. Moreover, the bias is regular enough that one could possibly correct it for a given parameter estimate by bootstrap simulation. On this range the sample mean point estimate is apparently monotonic, in the sense that, with large empirical probability, \(\hat{\theta}_{\Delta\tau_i} < 2\hat{\theta}_{2\Delta\tau_i}\).
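A parametric-bootstrap bias correction of the kind mentioned above might be sketched as follows. Here `simulate` and `estimate` are hypothetical stand-ins for the process simulator and the interpolation-based estimator; only the correction logic is shown:

```python
import numpy as np

def bootstrap_bias_correct(theta_hat, simulate, estimate, B, rng):
    """Parametric bootstrap bias correction: re-estimate from B fresh
    simulations at theta_hat, then subtract the apparent bias, i.e.
    theta_corrected = 2 * theta_hat - mean(theta_boot)."""
    boot = np.array([estimate(simulate(theta_hat, rng)) for _ in range(B)])
    return 2.0 * theta_hat - boot.mean(axis=0)
```

This first-order correction assumes the bias is roughly constant in a neighborhood of the true parameter, which the apparent regularity of the bias here makes plausible.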

That avenue, however, will not be pursued in this project, as I will encounter more pressing problems.
Note also that *all* re-interpolation, even where the observation timescale is much smaller than the supposed kernel timescale, introduces additional variance into the estimator.

These graphs all show marginal estimator values. Point estimates of components of the parameter vector for any given data set are not independent, in the sense that the Fisher information matrix is not diagonal. [87] Likewise we should not expect the different parameter estimates for fits to the interpolation-based estimator to be independent; for the moment, marginal distributions of the estimates are informative enough.

I now turn to heavy tailed kernel parameter point estimation.

The Omori kernel has two parameters which determine the time scale. Since it is standing in for the whole class of heavy-tailed distributions, it seems wise to test a strongly heavy-tailed parameter set. Accordingly, I choose tail exponent \(\theta=3/2\). Maintaining the same mean delay as the exponential kernel requires me to choose the other parameter \(\kappa=1/2\). Note that heavy-tailed kernels such as this can lead to estimation uncertainty even with complete data, so problems with this data are likely. [105]
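This choice can be checked under one common normalization of the Omori (Lomax/Pareto-II) kernel; the exact parameterization is my assumption, since it is not spelled out here:

```latex
\phi(t) = \frac{\theta\,\kappa^{\theta}}{(t+\kappa)^{\theta+1}},
\qquad t \ge 0,
\qquad
\mathbb{E}[T] = \frac{\kappa}{\theta - 1} \quad (\theta > 1).
```

With \(\theta = 3/2\), matching the exponential kernel's mean delay of 1 requires \(\kappa/(\theta-1) = 2\kappa = 1\), i.e. \(\kappa = 1/2\).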

Indeed, the uncertainties in the interpolated estimates are large. I consider first the branching ratio and background rate estimates, both of which are reasonably good, although substantially worse than those of the complete data estimator.

Across the 300 simulations, I get an average of 8823 occurrences per realization.

The time scale estimates are bad enough that they are difficult even to display graphically.

Considered individually, the kernel parameter estimates are not consistent, spanning many orders of magnitude. The sampling distribution has more in common with modernist painting than statistical analysis. I show one example, although estimates for both parameters are similar. Note the large vertical scale.

This problem could be due to the estimator becoming trapped in local minima, possibly for purely numerical reasons. Indeed, to save CPU cycles, I restarted the numerical optimizations from only a small number of initial points when estimating parameter values.

I posit that the problem with the estimator is that it is getting the *shape* of the kernel wrong, but that the estimates might still be correct in terms of the time scale when the kernel parameters are considered together.

I need some care to plot this, since the Omori law does not necessarily have finite moments of any order, and indeed the estimates often give me a kernel with no finite moments.
It seems that mean delay was not a wise choice.
I use the classic trick with heavy-tailed distributions and consider the *median* delay.

I use the plug-in estimator of the median given the estimated parameters. Plotting the sampling distribution of this reveals some support for this idea, showing me a somewhat similar picture to the exponential kernel case.
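The plug-in median has a closed form under the Lomax/Pareto-II parameterization assumed earlier (an assumption, not stated in the original): solving \((\kappa/(t+\kappa))^\theta = 1/2\) gives \(t_{\mathrm{med}} = \kappa(2^{1/\theta}-1)\). A sketch:

```python
def omori_median_delay(theta, kappa):
    """Plug-in median delay for an Omori kernel with assumed survival
    function S(t) = (kappa / (t + kappa))**theta: solve S(t) = 1/2."""
    return kappa * (2.0 ** (1.0 / theta) - 1.0)
```

Unlike the mean, this is finite for every \(\theta > 0\), which is exactly why the median is the safer summary here.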

Indeed, informally speaking, the Omori kernel has "more information" destroyed by the randomization process than the exponential kernel. The parameters change not just the average scale of interaction, but the relative weighting of local and long-range interaction.

If we suspect heavy-tailed interaction kernels are important for this kind of data, a different heavy-tailed kernel family might resolve this problem. For now, I will observe that we should be suspicious of the inferred shapes for Omori kernel fits. In particular, our estimate of the heaviness of the tail, which is our motivation for using this kernel, is suspect.

So much for point estimates. I still need to examine the AIC model selection question.

## Model selection

Here I largely follow the methodology and parameter choices of the previous sections.

At this point, I have 3 different candidate models available: the Hawkes model with exponential kernel, the Hawkes model with Omori kernel, and the constant-rate Poisson process corresponding to either of the Hawkes models with a zero branching ratio.

Now, instead of merely fitting the Hawkes process with Omori kernel to data generated by a Hawkes process with Omori kernel, I am concerned with whether I can work out *which* model to fit given only the data.

I simulate and interpolate data as before, but now I fit each of the 3 candidate models to the same interpolated data set and compare the AIC statistic for each fit. The goal is to identify the correct model class with high probability.
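The comparison step itself is simple; a sketch, where the `fits` mapping of per-model maximized log-likelihoods and parameter counts is a hypothetical stand-in for the actual fitting output:

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2 * n_params - 2 * loglik

def select_model(fits):
    """fits: dict of model name -> (max log-likelihood, n free params).
    Return the name with the smallest AIC, plus all scores."""
    scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
    return min(scores, key=scores.get), scores
```

Note that only AIC *differences* within a single dataset carry meaning; the absolute values depend on the data.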

Even with 3 models there are many permutations here.
I will demonstrate only a couple.
The primary difference from the previous section is that I will not show the statistic distribution for complete data, for technical reasons.[^1]

[^1]: I accidentally deleted it.

First I consider whether I select the Hawkes process with exponential kernel when the true model is in fact a constant rate Poisson process.

Reassuringly, the proposed procedure usually gets this right at all observation intervals, although there is a significant tail of false acceptances of the Hawkes process.

In the converse case we also select the correct model, although it should be noted that if we take the magnitude of the AIC difference as an indicator of certainty, the larger sampling intervals give us spurious confidence.

The case is less clear if we try to identify the correct kernel. When selecting between Omori and exponential kernels, the AIC difference depends strongly on the relationship between the kernel and observation-interval timescales.

Qualitatively, the AIC model selection usually selects a Hawkes model when the true generating process is a Hawkes process and rejects it for a constant-rate Poisson process. When we need to select between the two different kernel types, however, the AIC distribution is messy and timescale-dependent, and the magnitudes of the differences are generally misleading.

This leaves the interpolation-based estimator in a curious position.

Consider a toy world in which all time series are generated by one of the three models I have identified, and in which we must use the interpolation-based estimator, and select between models using the AIC.

In this world, for at least the parameter ranges I have used here, and setting aside the question of the influence of uneven sampling intervals, I can get a good estimate of the presence or absence of self-exciting dynamics. I can get a reasonable estimate of the branching ratio, the background intensity, and even some idea of the characteristic timescale of the process. My ability to estimate specifics of the kernel shape beyond that is virtually non-existent.

This might be acceptable, depending on our purposes. Certainly, in this world we have *some* idea of branching ratio and dynamics for the time series we observe.

I now turn to the question of what happens if we expand the parameters of the toy world to consider the possibility of time series generated by processes from outside this class.

## Empirical validation of estimators on inhomogeneous data

I turn to the phenomenon of isolated spikes in the data, and consider the behavior that we can expect from the estimators in handling these in particular. We should of course bear in mind that there are many possible inhomogeneities in the data, and many plausible generating mechanisms outside the candidate set. We might nonetheless prefer a restricted candidate model set for ease of interpretation or computational efficiency, so long as the behavior is reasonable despite the misspecification.

I simulate a stylized "lead balloon" spike. This I define as the inhomogeneous Poisson process \(N(t)\) with the rate function

I thus have an example of lead-balloon-type behavior, where the median timestamp should occur somewhere around \(t\approx 50\), or 25% of the series' total observation window, which is not particularly extreme for this data set. Apart from the single inhomogeneity, the process has zero branching ratio.
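Simulating an inhomogeneous Poisson process with piecewise-constant rate is straightforward. The rate function above is not reproduced here, so the `edges` and `rates` used for illustration are an assumed stand-in (a large initial spike decaying to a low base rate), not the actual parameters:

```python
import numpy as np

def simulate_piecewise_poisson(edges, rates, rng):
    """Simulate an inhomogeneous Poisson process whose rate is
    rates[k] on the segment [edges[k], edges[k+1]): draw a Poisson
    count per segment, then place events uniformly within it."""
    times = []
    for k, lam in enumerate(rates):
        lo, hi = edges[k], edges[k + 1]
        n = rng.poisson(lam * (hi - lo))
        times.append(rng.uniform(lo, hi, size=n))
    return np.sort(np.concatenate(times))
```

This exploits the fact that, conditional on the count in a constant-rate segment, event times are i.i.d. uniform on that segment.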

Once again I simulate and fit this model 300 times using the estimator. Resampling and reinterpolation make little difference with this piecewise-constant intensity function, so I do not bother with variable observation interval sizes and so on, but fit using the complete data estimator.

Using the AIC test, the Poisson model comes last all 300 times. That is, a Hawkes model with a positive branching ratio is selected every time, by a large margin. I picture the evidence, in the form of AIC differences, in favor of the Hawkes models. Since I have been using violin plots so far, I will continue to do so here for consistency, although it should be borne in mind that AIC comparisons are only meaningful within a single dataset, and these comparisons pool across data sets. Nonetheless we can still learn something from the ensemble distribution - for example, that this test never prefers the Poisson model.

The estimation procedure turns out to be reasonably agnostic between the exponential and Omori laws for the kernel shape, much as with the summarized data.

We also get low-variance estimates of the series parameters, although biased to a spurious value. Consider the branching ratio, for example: discounting some fits where the parameter values diverged, the models are reasonably consistent, producing estimates of around 0.54 for both Omori and exponential kernels.

Similarly, the estimator happily reports a characteristic timescale with low sampling variance.

These spurious estimators are data-dependent. By choosing, for example, more extreme spikes, I can cause the estimator to pick a higher branching ratio.

However, the real point is not to investigate this particular mis-specified model. Rather, it is to bear in mind that the admirably elegant set of models that we can fit with `pyhawkes` out of the box is too small to plausibly handle the kind of behavior that the Youtube data suggests, and that all results will be colored by this fact.

Nonetheless, I cross my fingers and hope for the moment, and turn to the data.