Estimation of parameters of the homogeneous Hawkes model

Here I discuss estimating parameters of the Hawkes model.

For the current section, this means the techniques for parameter estimation from data where the true generating process is a stationary Hawkes process, as implemented by the pyhawkes code which I use for this part. I also cover model selection procedures for this method, i.e. how well we can choose which generating model is the correct one given the data.

Later I discuss the vicissitudes of using this estimator for the available data, and the limitations of the methods available.

I begin by discussing the case of estimating the parameters of the homogeneous Hawkes model in the case with complete data. The case of “summary” data, where we estimate the model from summary statistics, I will examine shortly.

That is, we are given an interval of length \(T\), taken without loss of generality to be \([0,T]\), and the (monotonic) sequence of occurrence times of the increments of the observed process on that interval, \(\{t_i\}_{1\leq i\leq N}\equiv t_{1:N}\). I say "complete" to denote the common case for time series data: that we have access to the timestamps of every occurrence in the time series realization, \(t_{1:N}\).

In this chapter I use \(\theta_i\) to denote the generic \(i\)th component of the model parameter, and \(\hat{\theta}_i\) to denote the estimates of it. When I need to be clear, I name components. In general \(\theta=(\mu,\eta,\kappa)\) for background rate \(\mu\), branching ratio \(\eta\), and kernel parameters \(\kappa\), which depend upon the kernel under consideration. With the Omori kernel I require an additional kernel parameter, and I occasionally recycle \(\theta\) when it is unambiguous from context.

Estimating parameters from occurrence timestamps

Parameter estimation for Hawkes process models is framed as a Maximum Likelihood (ML) estimation problem. It is not clear that this minimizes prediction error in any norm; for prediction with these models one often uses simple smoothing [119, 62] or weighted model averaging [50] instead. There exist various methods for estimating parameters via second order statistics [7, 9, 98, 2]. There are also Bayesian estimators: see [94] for a survey of general offline Bayesian techniques, and the optimal filtering literature for online sequential Monte Carlo [76, 103].

For now, the classic ML method is my starting point. This is the most widely used technique, or at least the most widely cited technique [92, 87]. I suppose for now that we are interested in estimating the "true" parameters \(\theta\) of the hypothesized generating model, or as close to that as we can get in some sense, and that this true generating model is a Hawkes process. We assume that we have a realization \(t_{1:N}\) of all timestamps from a Hawkes model over an interval \([0,T]\).

We consider the hypothesized joint probability density \(f_\theta(t_{1:N})\) of that model, here called the likelihood, and choose the values for the parameter \(\theta\) which maximize the value of the joint likelihood for the observed data \(t_{1:N}\). Practically, we maximize the log likelihood given the data, \(L_\theta(t_{1:N}):=\log f_\theta(t_{1:N})\). I will derive the exact form for this informally.

\[\hat{\theta}(t_{1:N}) = \operatorname{argmax}_\theta L_\theta(t_{1:N})\]

Using the regular point process representation of the probability density of the occurrences, we have the following joint log likelihood for all the occurrences, [92] [1]

\begin{align*} L_\theta(t_{1:N}) &:= -\int_0^T\lambda^*_\theta(t)dt + \int_0^T\log \lambda^*_\theta(t) dN_t\\ &= -\int_0^T\lambda^*_\theta(t)dt + \sum_{j=1}^{N} \log \lambda^*_\theta(t_j) \end{align*}

I recall that the intensity for the Hawkes process in particular is

\[\lambda^*(t) = \mu + \int_{-\infty}^t \eta\phi_\kappa(t-s)dN_s\]

where \(\phi_\kappa(t)\) is the influence kernel with parameter \(\kappa\), \(\eta\) the branching ratio, and the star \(\lambda^*(t)\) is shorthand for \(\lambda^*(t|\mathcal{F}_t)\).

Now it only remains to maximize this:

\[\hat{\theta}(t_{1:N}) = \operatorname{argmax}_\theta L_\theta(t_{1:N})\]

There is no general closed-form expression for the location of the extrema, but they are simple to find by numerical optimization.
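
To make this concrete, the following is a minimal sketch (mine, not the pyhawkes routine) of the log likelihood for the exponential kernel \(\phi_\kappa(t)=\kappa e^{-\kappa t}\), using the well-known recursion for the self-excitation sum, together with its numerical maximization via scipy. The function names and starting values are illustrative assumptions.

```python
# A minimal sketch of ML estimation for the exponential-kernel Hawkes model;
# not the pyhawkes implementation. Parameters follow the text: background
# rate mu, branching ratio eta, and kernel decay kappa.
import numpy as np
from scipy.optimize import minimize

def hawkes_exp_loglik(params, times, T):
    """Log likelihood of occurrence times on [0, T] under
    lambda*(t) = mu + eta * sum_{t_j < t} kappa * exp(-kappa * (t - t_j))."""
    mu, eta, kappa = params
    times = np.asarray(times)
    # Recursion: A[i] = sum_{j < i} exp(-kappa * (t_i - t_j)).
    A = np.zeros(len(times))
    for i in range(1, len(times)):
        A[i] = np.exp(-kappa * (times[i] - times[i - 1])) * (1.0 + A[i - 1])
    log_intensities = np.log(mu + eta * kappa * A)
    # Compensator: integral of the intensity over [0, T].
    compensator = mu * T + eta * np.sum(1.0 - np.exp(-kappa * (T - times)))
    return np.sum(log_intensities) - compensator

def fit_hawkes_exp(times, T, start=(1.0, 0.5, 1.0)):
    """Maximize the log likelihood by minimizing its negative numerically."""
    objective = lambda p: -hawkes_exp_loglik(p, times, T)
    bounds = [(1e-8, None), (1e-8, 1 - 1e-8), (1e-8, None)]  # mu>0, 0<eta<1, kappa>0
    return minimize(objective, np.array(start), bounds=bounds, method="L-BFGS-B")
```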

While our observations are by no means independently or identically distributed, this is still recognizably an ML estimator of the sort common in i.i.d. parameter estimation. Indeed, the same sort of large-sample asymptotic theory, as \(T\to\infty\), does apply for this kind of estimator, given the assumptions of stationarity and certain other technical requirements [87]. Note however that one does not usually use a large sample theory for these estimators in the sense of collecting many time series and trying to estimate shared parameters for all of them.

[1] The full log likelihood on \([0,T]\), pace Ozaki, includes a final extra term to denote the contribution to the likelihood made by stipulating that no occurrences fell in \((t_N,T]\); i.e. the likelihood of \(N\) points observed on \((0,T]\) is the joint density of the occurrences \(\{t_1 \ldots t_N\}, t_i<T\), and no occurrences on \((t_N,T]\). It is tedious to write this down here. However one can show that it is equivalent to the likelihood function of the extended series \(N'\) with an occurrence at time \(T\), such that \(N'(t):= (N(t)\wedge N(T)) + \mathbb{I}_{t>T}\). For the duration of these estimation theory chapters, when I refer to \(N(\cdot)\) on an interval \((0,T]\), I will really mean \(N'(\cdot)\). The difference is in any case small for my data sets, and the increase in clarity is significant.

There are various disadvantages to this estimator. Naïve maximization can become trapped in local extrema [89], or fail to converge over parameter ranges where the shallow likelihood gradient is dominated by numerical error [116], especially under mis-specification [46] or timestamp randomization [58]. Techniques such as Expectation Maximization or logarithmic transformation of the parameters are sometimes used to improve convergence [116].
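
As a sketch of the latter remedy, the optimizer can work on an unconstrained logarithmic scale and back-transform afterwards. The helper below is an illustrative assumption of mine; it takes any log-likelihood function of the form sketched above.

```python
# Optimize over log-transformed parameters so positivity is handled implicitly.
# `loglik` is any log-likelihood function of (params, times, T), e.g. the
# hawkes_exp_loglik sketch above.
import numpy as np
from scipy.optimize import minimize

def fit_logspace(loglik, times, T, start=(1.0, 0.5, 1.0)):
    objective = lambda log_params: -loglik(np.exp(log_params), times, T)
    result = minimize(objective, np.log(np.array(start)), method="BFGS")
    return np.exp(result.x)  # back-transform to the natural parameter scale
```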

In addition to the above-identified problems, the usual criticisms of ML estimation can be made - e.g. lack of small sample guarantees, lack of guarantees regarding prediction error under various loss functions, and so on.

If we assume, for example, an incorrect shape for the kernel then estimates for the other parameters may become poor. [6, 57, 46, 58] discuss various non-parametric or semi-parametric estimates of the kernel shape in order to avoid this problem. Rather than implementing such outré techniques here, I will restrict myself to the “classic” kernel shapes, the exponential and the Omori law which I introduced in the previous chapter, as these two will cause me enough trouble as it is.

Set against these is the practical advantage of being simple to estimate, and the usual benefits of Maximum Likelihood estimation - specifically, a comprehensive asymptotic theory, and model selection procedures based upon it. The simplicity in particular will turn out to be useful for the current problem, so with those caveats I move on.

Estimation from summary statistics

Recalling the data, this estimator is useless without some further development. It tells us how to estimate parameter values from a time series realization comprising a sequence of occurrence timestamps \(\{t_i\}_{1 \leq i \leq N(T)}\). The data here is, in contrast, a sequence of observation tuples \(\{(\tau_j,N(\tau_j))\}_{1 \leq j \leq n}\) for some \(n \ll N(T)\) and, under any parameterisation of the Hawkes model, \(\forall k,\,\forall j,\, \mathrm P(\tau_k=t_j)=0\).

I take the domain of each realization to be \(\tau_1=0,\tau_n=T\). (If necessary I translate the observation timestamps to ensure this.)

The problem of estimating the process parameters from such summary statistics is an unusual one. There is much work in the seismology literature (e.g. [10, 100]) on estimating the model parameters from censored data, where some proportion of the timestamp data is missing due to some presumed censoring process. However, censored data is a different problem from summarized data, upon which far less work has been done (with some exceptions: [5, 113]).

It is the latter problem that we face here. There are no occurrence timestamps available at all, and thus we need to invent some. I will write \(\hat{t}_i\) for the \(i\)th invented occurrence timestamp.

To be a true ML estimator, we would have to choose all estimates \(\hat{t}_i\) such that they maximized the likelihood of the model given the data. This would entail choosing them differently depending on, for example, kernel parameters. Conducting such a maximization turns out to be computationally intractable even for tiny numbers of points, however, and some time series have millions, so we need an alternative scheme.

For the invented timestamps to be plausible, the restrictions are that:

  1. by the assumptions of the model, increments of the process occur simultaneously with probability zero, so we cannot place multiple occurrences at the one location;
  2. we must place all the occurrences, and only the occurrences, that belong to each interval in that interval, so that \(\tau_j \leq \hat{t}_i \lt \tau_{j+1}, \,\forall N(\tau_j) \leq i \lt N(\tau_{j+1})\).

Apart from that, there is no a priori reason to prefer any particular scheme. We could distribute them uniformly at random, or place them equidistantly, or in a converging sequence etc. I choose uniformly at random. Placing the points uniformly at random upon each interval corresponds to a piecewise constant Poisson rate conditional upon the correct number of occurrences in that time interval. Thus, the approximation that I introduced earlier for plotting purposes becomes the actual estimate of the instantaneous intensity of the process, and I interpolate the occurrences from the observations according to this estimate.

Figure: Piecewise constant estimate of the intensity function. Estimation of the hypothetical unobserved true intensity function \(\lambda(t)\) by a simple function \(\hat{\lambda}_{\mathrm{simple}}(t)\).
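
The following is a sketch of that interpolation step under the restrictions above; the function name and the input format (cumulative counts at the observation times) are illustrative assumptions of mine, not part of pyhawkes.

```python
# Place the occurrences belonging to each observation interval uniformly at
# random within it, which corresponds to a piecewise constant Poisson rate
# conditional on the observed counts.
import numpy as np

def interpolate_timestamps(obs_times, obs_counts, rng=None):
    """obs_times: increasing array tau_1..tau_n with tau_1 = 0, tau_n = T;
    obs_counts: cumulative counts N(tau_1)..N(tau_n).
    Returns a sorted array of invented occurrence times hat_t_i."""
    rng = np.random.default_rng() if rng is None else rng
    pieces = []
    for j in range(len(obs_times) - 1):
        k = obs_counts[j + 1] - obs_counts[j]  # occurrences observed in this interval
        if k > 0:
            # Continuous uniform draws are almost surely distinct, so no ties.
            pieces.append(rng.uniform(obs_times[j], obs_times[j + 1], size=k))
    return np.sort(np.concatenate(pieces)) if pieces else np.array([])
```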

The questions are then: Is this process statistically valid? Can it be improved?

Certainly, applying the Maximum Likelihood estimation software to arbitrarily interpolated data like this trivially does not produce a Maximum Likelihood estimator. We have not maximized over all the unknown parameters, which include not only the branching ratio and kernel parameters etc, but also the precise value of each occurrence time. Rather, we guess those other unknown parameters and do not attempt to optimize with respect to them.

Working out how well this approximates an actual ML estimate is not trivial. I will largely ignore this issue here, because for this particular research group it is a higher priority to see how far we can get with the approximation than to spend much time analyzing the approximation analytically. If we have promising results, then we can attempt to improve the estimator, or to correct for its sampling distribution using a bootstrap procedure. I therefore use simulation to reassure us that the idea is not outright crazy, and that we are plausibly approaching the ML estimator, and move on. I do suggest some ways that the problem might be solved analytically at the end of the chapter.

Model selection

The Akaike Information Criterion

Here I discuss the classic model selection procedure for this class of model, the Akaike Information Criterion, or AIC [3, 26].

AIC model selection is a well-known procedure for Hawkes-type models [88]. In the literature it is vastly more common than, for example, the Likelihood Ratio identification test of Rubin [97], although for nested models they are equivalent [20].

The AIC formula, for a model \(M\) fit to a given data set \(X\), with estimated parameter vector \(\hat{\theta}^M\), maximized log likelihood \(L^M\), and degrees of freedom \(d^M\), is

\[\mathrm{AIC}(X,M)=2d^M-2L^M(X,\hat{\theta}^M)\]

The degrees of freedom are usually equal to the length of the parameter vector \(\theta\), although this depends upon the model and fitting procedure [42].

Given two ML fitted models, \((M_1,\hat{\theta}^{M_1})\) and \((M_2,\hat{\theta}^{M_2})\), the difference in their AIC values is an estimate of the difference between the Kullback-Leibler divergences of the inferred measures \(\hat{\mu}_1,\hat{\mu}_2\) from the unknown true distribution \(\mu\), i.e.

\[\mathrm{AIC}(X,M_1)-\mathrm{AIC}(X,M_2) \simeq D_\mathrm{KL}(\mu\|\hat{\mu}_1) - D_\mathrm{KL}(\mu\|\hat{\mu}_2)\]

Without going into the derivation, it suffices for us to know that it is an information-theoretically-motivated measure of relative goodness of model fit to a given dataset. The lower the relative AIC of a model for a given dataset, the more we prefer it. The decision procedure using the AIC is to rank all candidate models fit to a given data set by AIC value, and to select the one with the lowest value. Heuristically, we see that a model is “rewarded” for a higher likelihood fit to a given data set, and penalized for the number of parameters it requires to attain this fit.
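
In code the decision step is trivial; a sketch, where fits is an assumed list of (model name, maximized log likelihood, degrees of freedom) tuples produced by fitting routines such as the one sketched earlier.

```python
# Rank fitted candidate models by AIC and keep the one with the lowest value.
def aic(max_log_likelihood, n_params):
    return 2 * n_params - 2 * max_log_likelihood

def select_by_aic(fits):
    """fits: iterable of (name, max_log_likelihood, n_params) tuples."""
    scored = [(name, aic(loglik, d)) for name, loglik, d in fits]
    return min(scored, key=lambda pair: pair[1])  # (best model's name, its AIC)
```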

General difficulties with AIC

There are certain subtleties to the application of this idea.

The AIC, like many ML estimation procedures, is an asymptotic result, relying on the large-sample limiting distribution of the estimators. Just as with ML procedures in general, it is not robust against outlier observations ([26] §2), and we might prefer a robust estimator if the data set has been contaminated by outliers.

Additionally, unlike many test statistics, there is not necessarily a known sampling distribution of the AIC difference between two models, even asymptotically. The difference in AIC between two nested models approaches a \(\chi^2\) distribution under fairly weak assumptions, and the procedure becomes an ordinary likelihood ratio test [26]. In the case of non-nested models, we have to estimate the statistic's distribution by simulation or from analytic qualities of the particular models.

The derivation of the AIC does not presume that the true generating model is in the candidate set, and so we may use it to find the “least worst” model in such cases. We could, for example, have several candidate models for a point process, find that they are all rejected by some significance test at the 0.05 level, and the AIC will still give us a “winner” from among this set of rejected models. The “winner” in the Kullback-Leibler metric may of course not give us particularly good performance under other metrics.

More generally, Akaike Information Criterion estimates may lose even asymptotic guarantees under model misspecification for badly behaved distributions [Konishi1996], and so our model selection procedure may be faulty. One may introduce model-misspecification guarantees using the more general Takeuchi Information Criterion [69, 26]. A more commonly preferred solution is simply to expand the candidate set.

Of these difficulties, the problem of model mis-specification will be the more urgent in the current phase. Problems of small-sample corrections I will ignore at this point, but when I add more parameters to the model in the second part of this report, that issue too becomes pressing - see Model selection.

Finally, we also need to recall that although I use an ML-based estimation procedure, due to the interpolation I am not really doing true ML estimation from the data, but rather hoping that my estimates approach the true ML estimates. I know of no results that extend the AIC to this case. Once again I will use simulation to see if this procedure is at least plausible, but we do need to bear this shaky theoretical premise in mind.

Experimental hypotheses

Model selection with large data sets

The classic setup for use of the AIC is to propose a candidate set of parametric models of the data, with varying numbers of parameters, and then use the AIC to select between them based on the particular tradeoff of goodness-of-fit against model complexity.

For this YouTube data, for example, we might construct the following set of candidates:

  1. Each time series \(N_v\) is generated by a Poisson process with rate \(\lambda\) (\(d=1\))
  2. Each time series \(N_v\) is generated by a Poisson process with rate \(\lambda_v\) where \(\lambda_v\sim \mathrm{Pareto}(\lambda_{\min}, \alpha_\lambda)\). (\(d=2\))
  3. Each time series \(N_v\) is generated by a Hawkes process with exponential kernel, background rate \(\mu\), branching ratio \(\eta\), and kernel parameters \(\kappa\). (\(d=3\))
  4. Each time series \(N_v\) is generated by a Hawkes process with exponential kernel, background rate \(\mu_v\), branching ratio \(\eta_v\), and kernel parameters \(\kappa\), where \(\mu_v\sim \mathrm{Pareto}(\mu_{\min}, \alpha_\mu)\) and \(\eta_v\sim \mathrm{Beta}(\alpha_\eta, \beta_\eta)\).
  5. ...

The more data we have, the more complex a hypothesis we can support. We could include regression here, e.g. that the branching ratio is predicted by time of day of upload ([56]), or that parameters follow a simple trend, etc.

We might also ignore some dimensions if we consider some of the parameters to be nuisance parameters; i.e. we do not care about the distribution of \(\mu_v\), but we might suspect that \(\kappa\) has a universal value [53, 31], or a limited number of possible values [32].

Under the AIC method, the complexity of the hypothesis we can support increases, in general, with the available amount of data. It follows that this stupendously large data set would support stupendously complex hypotheses; we are faced with, in principle, a combinatorial explosion of possible hypotheses, all of them computationally expensive to evaluate - and practically, very few of them are supported by the available software.

We can avoid that issue for now since, I argue, we need to infer models that can handle the variation within one series adequately before testing composite hypotheses that couple various parameters together.

Multiple testing considerations

A frequent alternative in, for example, financial time series is to give up the attempt to find a small number of universal parameters, and instead estimate parameters independently for each time series. Then we report on the estimated values.

This has issues of its own. If I wish to report, for example, the confidence intervals for \(10^6\) separate estimates fitted to \(10^6\) time series, then I am likely to find something by sheer force of numbers; this is the multiple testing problem. Moreover, if I am relying on bootstrap estimates of estimator variance, I face potentially as many bootstrap estimates as I have point estimates. The question of how to construct and report confidence sets or hypothesis tests in this case is a complex and active area of research [13, 14, 1, 80, 117, 49, 12, 79, 115, 85, 114].

While not discounting the importance of these concerns, I argue that there are other methodological questions about the estimator that need to be addressed before I can approach a confidence set for a single time series, let alone multiple ones, and so I set this issue aside.

Goodness-of-fit tests

Traditionally, residual analysis is used to diagnose goodness of fit of the Hawkes process parameters, using a time change of the process into a uniform unit-rate Poisson process [88]. I do not test residuals in this project, since I am aware of no test that calculates residuals from the summary statistics used here.

Informally, the residual test for point process intensity estimates examines whether the process “looks like” a unit rate Poisson process when scaled, according to the estimated intensity, to have unit rate. Since my estimation procedure here involves arbitrary interpolation of the series, we do not have residuals in the pointwise per-occurrence sense assumed by Ogata.

A residual-based goodness-of-fit test must be a last defense against a bad model, and therefore, if nothing else, must be a statistic with some basic guarantees against Type I error. There is no sense in going through the motions of applying such model diagnostics if they provide, at worst, false confidence, and at best, no information.

In any case, I will provide ample alternative sources of evidence that the fitting procedure is problematic without requiring the goodness of fit test.