# Estimating branching effects for inhomogeneous processes¶

To recap: before us we have a huge data set of variable quality. It seems pregnant with possible conclusions, and yet, the tools we have available to extract them, estimators based upon stationarity assumptions, have limited ability to draw inference from them.

In this section, I extend my toolset to solve this problem I relax the assumptions of homogeneity (and also, implicitly, stationarity) that the previous section relied upon.

So far I have estimated parameters or lot of data sets, but not managed to reassure myself that I should have confidence in the estimates. My estimates are based on various assumptions:

• that I am sampling time series in the large-$$T$$ limit from a stationary process
• that, therefore, the process has homogeneous parameter values
• that the self excitation dynamics have a certain parametric forms
• that my interpolation from the summary statistics does not unduly bias estimates
• ...and so on.

The question is now, is if the conclusion robust to relaxing some of these assumptions. Does this data set still suggest near-critical branching dynamics under more plausible assumptions?

A great deal of the literature on branching-type dynamics, including prior work on this data set, suggests that it is crucial to understand them in terms of isloated “exogenous shocks” external influences upon the system which temporarily change their behavior [106, 108, 39, 32, 31, 93]. Such external shocks can be celebrity endorsements, natural disasters, or news events, etc. The key question is how the model coudl be extended to account for them. Importance of regime changes in inflating estimates of branching ratio is discussed much in recent work. [46, 45, 58, 47]. We are invited to consider estimating, for example, the "renormalized" kernel; the mean effect of an imposed exogenous shock upon the system. [61, 8]

There are many ways that this can be done. I concern myself specifically with inferring deterministic inhomogeneous time-varying background rates $$\mu(t)$$. Allowing other parameters to vary is of course possible (e.g. [58] varies kernel parameters); One can also assume the parameters themselves to be stochastic, then infer the parameters of the underlying distribution (e.g. [28, 27, 84, 83, 89, 76, 51, 36]) I try one thing at a time, however.

## Semiparametric kernel density estimation¶

Observe that the Hawkes process is a kind of kernel estimation, in the sense of convolution kernels. It is not qualitatively different in this sense from, for example, kernel density estimators. [102] Using convolution kernels of various types to estimate point process intensity even fro non-Hawkes-type processes is well-established area. [40, 16, 74, 16] Admittedly, the particular case of the Hawkes estimator has unusual features if regarded for kernel estimation problem.

Firstly, the convolution kernels are causal; that the intensity process is predictable with respect to the filtration generated by the occurrence times. Equivalently, the kernel has mass only on the positive half-line. The “classic” kernel-density estimator for example, uses a zero-centered Gaussian density as the kernel.

Secondly, in ML estimation of the Hawkes model parameters, we have an unusual bandwidth selection procedure, based on maximum likelihood estimation of the model's presumed dynamics. Classic kernel density estimation uses different methods, such as rule-of-thumb, or cross-validation procedures. We have, in fact, a parametric estimation problem, whose form happens resemble the nonparametric problems that convolutions kernels are used to solve.

I consider, then, what alternate kernel decompositions are plausible, and in particular the combination of parametric and non parametric estimates. This is precisely the set-up for semi-parametric estimation methods I argue that there is a particular semi-parametric method that seems particularly appropriate to this data set.

## The algorithm¶

Assembling such penalized regression estimators a job of assembling several different interdependent pieces. Pedagogically it is clearer to introduce each piece as I need it to explain the overall algorithm, and so I do that here.

I consider estimating the parameters of the Hawkes model where the constant background rate $$\mu$$ is replaced with inhomogeneous rate $$\mu_t(t)$$. On one hand, this is a sacrifice; in the inhomogeneous case we no longer meet Ogata's sufficient conditions for asymptotic efficiency and normality of the maximum likelihood estimator. [87] On the other hand, there is nothing to lose. The insufficiently general candidate model class was also uncertain ground, and even if it were better, the composite interpolated data estimator I am constructing has no such guarantees.. In any case, many point process estimators even with complete data are used without asymptotic efficiency guarantees :cite:pSchoenberg2005 or unimodal likelihood functions [46].

For technical reasons (see On the complexity of the simplest possible thing) I discard the pyhawkes library for estimation; I retain it, as a known-good implementation of the ML estimation procedure, for checking the results of my own implementation. (Likelihood estimates agree to within numerical precision for all values, and point estimates usually agree, although my estimator is more prone to numerical instability)

My non-parametric of choice here will be of convolution kernel type. That is, I will introduce an additional convolution kernel $$\psi$$, and functions of the form

$\mu(t) = \mu + \sum_{1 \leq j \leq p}\omega_i\psi_{\nu_j}(t-t_j)$

for some set of kernel bandwidths $$\{\nu_j\}_{1 \leq j \leq p}$$, kernel weights $$\{\omega_{\nu_j}\}_{1 \leq j \leq p}$$, kernel locations $$\{t_j\}_{1 \leq j \leq p}$$.

There are many kernels available. For reasons of computational efficiency I would like to have one with compact support. For reasons of minimizing errors in my working, I would like to start with the simplest possible option as far as implementation.

By these criteria, the logical choice is the top hat kernel, the piecewise-constant function.

$\psi_{\nu}(t):= \frac{\mathbb{I}_{0\lt t \leq \nu}}{\nu}$

Traditionally the top hat kernel is taken to be centered on 0. I choose it to have positive support because it makes no difference to the working, at this stage but that, as with the Hawkes influence kernels, it is causal in the right filtration generated by the observation times. In particular, this background rate estimate could be made predictable with respect to the observation process; we could hypothetically do this kind of estimate on an online setting. I stick to the offline setting here, however.

If we would prefer other qualities, such as smoothness, we may of course choose different kernels. Indeed, in an offline setting we can also surrender causality.

Weighted combinations of such functions give me simple, i.e. piecewise-constant, functions.

$\mu(t) = \mu + \sum_{1\leq j\le p}\omega_j\frac{\mathbb{I}_{(0, \nu]}(t-t_j)}{\nu}$

By this point, the parallel should be apparent with the piecewise-constant intensity estimate that I have already used in diagnostic plots;

$\hat{\lambda}_{\mathrm{simple}}(t) := \sum_{i=2}^n \frac{N(\tau_{i})-N(\tau_{i-1})}{\tau_i-\tau_{i-1}} \left(\mathbb{I}_{[\tau_{i-1},\tau_i)}(t)\right)$

Our interaction kernels are in fact kernel estimates of the conditional intensity function.

This, in turn, suggests a particular form for the nonparametric kernel intensity function, for observations times $$\{\tau_j\}_{j\leq n}$$

$\begin{split}\mu_t(t) := \mu + \omega(t)\\\end{split}$

where

$\omega(t) = \sum_{j=2}^n \omega_j \mathbb{I}_{[\tau_{j-1},\tau_j)}(t)$

With only summary observations there doesn't seem to be any point in trying to estimate rates on a finer scale than this, and so I adopt it as a starting point.

I write $$\boldsymbol \omega$$ for the vector of all $$\omega_j$$ values. Now I take the data vector $$\mathbf t$$ is taken to contain the observation times $$\{\tau_j\}$$ and the occurrence times $$\{t_i\}$$ although in practice these will be interpolated as before, but we are ignoring that for now. One is free to choose any set of kernel locations - say, one per day or per hour; the $$\{\tau_j\}$$ values are merely convenient.

I augment the parameter vector to include the kernel weights $$\theta':=( \mu,\eta,\kappa, \boldsymbol\omega)$$ The hypothesized generating model now has conditional intensity process

$\lambda_{\theta'}(t|\mathcal{F}_t) = \mu + \sum_{j=2}^n \omega_j \mathbb{I}_{[\tau_{j-1},\tau_j)}(t) +\eta \sum_{t_i\lt t}\phi_\kappa(t-t_i)$

To remain well-defined I require $$\forall t,\,\mu(t)\gt 0\Rightarrow \forall j,\, \omega_j>-\mu$$.

It is not immediately clear what this has gained us, since there are now more parameters to estimate than data points. Additional restrictions are necessary to make the estimation problem well-posed.

One such assumption would be penalization [111, 63] -- I apply an additional penalty to the log likelihood function, penalizing particular values of the parameters.

For the penalized log likelihood for parameter $$\theta$$ and penalty $$\pi$$, I write

$L_\pi(\mathbf t, \theta'):= L_\pi(\mathbf t, \theta')- \pi P(\theta')$

$$P$$ here is a non-negative functional which penalizes certain values of the parameter vector, and $$\pi\geq 0$$ is the penalty weight hyperparameter.

As before, the estimate is defined as the maximizer:

$\hat{\theta}_\pi(\mathbf t) = \operatorname{argmax}_\theta L_\pi(\mathbf t;\theta)$

For non-parametric extension to a parametric model, one usually penalizes only the non-parametric extensions to the model, such that they vanish as the penalty term increases. [52]

$\lim_{\pi\to\infty}\hat{\theta}_\pi'(\mathbf t) =\hat{\theta}(\mathbf t)$

I follow this practice here, penalizing only the deviations of $$\omega(t)$$ from the parametric estimate.

Hereafter, I will drop the prime from the augmented parameter vector and simply write $$\theta$$.

Penalization is more frequently presented in the context of generalized linear regression from i.id. samples, but it also fits within a generalized Maximum Likelihood estimation theory.

Many alternative choices penalization open up to us here. We could favor low variation in the background rate by including $$\sum|\omega_i-\omega_{i-1}|$$ in the penalty. We could penalize values of $$\eta$$, or $$\theta$$ etc. Many other loss functionals are possible. If sparsity were not a priority we could use $$L_2$$ norms, or mixtures of norms. We could add additional hyperparameters to weight various penalty functions differently. This choice will depend on the assumptions on the background rate process. Verifying such assumptions is a whole additional question which I will not address here.

For the kind of “signals” we might expect from the examples that I have shown expect to recover from Youtube data, infrequent spikes, a logical choice penalty would be a sparsifying $$L_1$$ penalty on deviations in the background rate. This is the penalty made famous by Lasso regression, compressive sensing and so on. [110, 44, 23, 41] and also applied to point process models [54]. Its key feature is favoring estimates of parameter vectors where many entries in that vector are zero - in other words, it identifies isolated spikes. Since spikes are what I hope to control for, this sounds like a logical first choice.

$P(( \mu,\eta,\kappa, \boldsymbol\omega)) := \|\boldsymbol \omega_\theta(t)\|_1$

where $$\|\boldsymbol \omega_\theta(t)\|_1 = \int_0^T |\omega(t)|dt$$ The $$L_1$$ penalty is favored for sparse signal recovery, since it creates sparse solution (i.e. with many components zero) with high probability and yet can typically be efficiently numerically solved. if indeed the true process is represented by a sparse $$\boldsymbol \omega$$ vector. $$\pi$$ in this context interpolates between various assumed levels of sparsity.

I am aware of no off-the-shelf penalized estimation procedures for the Hawkes model, and so derivations for this particular model, and so I derive a new one.

I therefore impose an additional pragmatic restriction: I implement only the exponential kernel for the Hawkes intensity estimation. The exponential kernel is simpler and easiest for me to debug, and the indications are that the sparse observation intervals make Omori kernels hard to fit. We may in any case construct general interaction kernels from combinations of exponential kernels, [57] so this is a logical building block to solving the problem of more general kernel types.

With these in mind, I construct the penalized log likelihood function for fixed $$\pi$$. Selection of appropriate penalty $$\pi$$ will be left until later. The derivation of the estimator is an exercise in tedious calculus.

I calculate derivatives so that we can do gradient ascent. (or if you'd prefer, gradient descent for the negative penalized log likelihood)

I write down the penalized log likelihood.

$L_\pi(\mathbf t;\theta):=-\int_0^T\lambda_\theta(t)dt + \int_0^T\log \lambda_\theta(t) dN_t - \pi\|\boldsymbol \omega_\theta(t)\|_1$

Calculating this likelihood is computationally taxing. Fortunately, we get partial derivatives nearly "free" when likelihood the likelihood, so gradient ascent can be done efficiently at the cost of keeping the intermediate partial sums in memory.

Following Ozaki, I differentiate with respect to an arbitrary component of the parameter vector $$\theta_x$$ [92]

\begin{align*} { \scriptstyle \frac{\partial}{\partial\theta_x} } L_\pi(\mathbf t;\theta)&=-\int_0^T{ \scriptstyle \frac{\partial}{\partial\theta_x} }\lambda_\theta(t)dt + \int_0^T { \scriptstyle \frac{\partial}{\partial\theta_x} }\log \lambda_\theta(t) dN_t -{ \scriptstyle \frac{\partial}{\partial\theta_x} }\|\boldsymbol \omega_\theta(t)\|_1 \\ &=-\int_0^T{ \scriptstyle \frac{\partial}{\partial\theta_x} }\lambda_\theta(t)dt + \sum_{0\leq t_i\leq T} \frac{{ \scriptstyle \frac{\partial}{\partial\theta_x} }\lambda_\theta(t_i)}{\lambda_\theta(t_i)} -{ \scriptstyle \frac{\partial}{\partial\theta_x} }\|\boldsymbol \omega_\theta(t)\|_1 \end{align*}

By construction, $${ \scriptstyle \frac{\partial}{\partial\mu} }\|\boldsymbol \omega_\theta(t)\|_1 = { \scriptstyle \frac{\partial}{\partial\eta} }\|\boldsymbol \omega_\theta(t)\|_1 = { \scriptstyle \frac{\partial}{\partial\kappa} }\|\boldsymbol \omega_\theta(t)\|_1 = 0$$.

Taking $$\theta_x=\mu$$,

${ \scriptstyle \frac{\partial}{\partial\mu} }\lambda_\theta(t|\mathcal{F}_t) = 1$

I use higher derivatives for $$\mu$$ so that I may optimize this component using a higher order Newton method, since we know that typically the optimal value is particularly slow to converge for this component [116] and the values are simple.

\begin{align*} { \scriptstyle \frac{\partial}{\partial\mu} } L_\pi(\mathbf t;\theta) &=-T + \sum_{0\leq t_i\leq T} \frac{1}{\lambda_\theta(t_i)} \\ { \scriptstyle \frac{\partial^2}{\partial\mu^2} } L_\pi(\mathbf t;\theta) &= \sum_{0\leq t_i\leq T} \frac{-1}{\lambda_\theta^2(t_i)}\\ { \scriptstyle \frac{\partial^3}{\partial\mu^3} } L_\pi(\mathbf t;\theta) &= \sum_{0\leq t_i\leq T} \frac{2}{\lambda_\theta^3(t_i)} \end{align*}

Now I handle $$\theta_x=\eta$$,

${ \scriptstyle \frac{\partial}{\partial\eta} }\lambda_\theta(t|\mathcal{F}_t) = \sum_{t_i\lt t}\phi_\kappa(t-t_i)$

so that

\begin{align*} { \scriptstyle \frac{\partial}{\partial\eta} } L_\pi(\mathbf t;\theta)&=-\int_0^T \sum_{t_i\lt t}\phi_\kappa(t-t_i) dt + \sum_{0\leq t_i\leq T} \frac{\sum_{t_k\lt t_i}\phi_\kappa(t_i-t_k)}{\lambda_\theta(t_i)} \end{align*}

Taking $$\theta_x=\kappa$$, we find

${ \scriptstyle \frac{\partial}{\partial\kappa} }\lambda_\theta(t|\mathcal{F}_t) = \eta\sum_{t_i\lt t}{ \scriptstyle \frac{\partial}{\partial\kappa} }\phi_\kappa(t-t_i)$

and so

${ \scriptstyle \frac{\partial}{\partial\kappa} } L_\pi(\mathbf t;\theta) =-\eta\int_0^T \sum_{t_i\lt t}{ \scriptstyle \frac{\partial}{\partial\kappa} }\phi_\kappa(t-t_i) dt + \sum_{0\leq t_i\leq T} \frac{ \eta\sum_{t_k\lt t_i}{ \scriptstyle \frac{\partial}{\partial\kappa} }\phi_\kappa(t_i-t_k)}{\lambda_\theta(t_i)}$

I take the rate parameterisation of the exponential interaction kernel, such that, $$\phi=\mathbb{I}_{\mathbb{R}^+}(t)\kappa e^{-\kappa t}$$, to make the derivative more compact. It is an easy matter if invert the parameter if you want the more usual parameterisation. Suppressing the indicator function - we are only evaluating this kernel on its (non-negative) support,

${ \scriptstyle \frac{\partial}{\partial\kappa} }\phi_\kappa(t)= e^{-\kappa t} - \kappa t e^{-\kappa t}$

and

$\int_0^{t_i} { \scriptstyle \frac{\partial}{\partial\kappa} }\phi_\kappa(t-i)dt = t_ie^{-\kappa t_i}$

giving

${ \scriptstyle \frac{\partial}{\partial\kappa} } L_\pi(\mathbf t;\theta) = -\eta\sum_{t_i \lt T}\left[(t-t_i)e^{-\kappa(t-t_i)}\right]_{t=0\vee t_i}^T+ \sum_{0\leq t_i\leq T} \frac{ \eta\sum_{t_k\lt t_i}e^{-\kappa(t_i-t_k)}(1-\kappa(t_i-t_k)}{\lambda_\theta(t_i)} .$

One may repeat these steps to produce the Hessian matrix of the Hawkes model [87, 92], but I deemed that excessive for my immediate needs.

Finally, I handle the $$\omega_j$$ values; these are similar to $$\mu$$ part from the un-penalised part.

Taking $$\theta_x=\omega_j$$, and defining $$\Delta\tau_j:= {\tau_{j-1}}-{\tau_j}$$ we find

\begin{align}\begin{aligned}:nowrap:\\\begin{split}\begin{align*} { \scriptstyle \frac{\partial}{\partial\omega_j} }\pi\|\boldsymbol \omega_\theta(t)\|_1 &= { \scriptstyle \frac{\partial}{\partial\omega_j} }\pi \int_{\tau_{j-1}}^{\tau_j} \left|\omega_j\right|dt\\ &=\pi\Delta\tau_j\operatorname{sgn} \omega_j\\ { \scriptstyle \frac{\partial}{\partial\omega_j} }\lambda_\theta(t|\mathcal{F}_t) &= \Delta\tau_j\mathbb{I}_{[\tau_{j-1},\tau_j)}(t)\\ { \scriptstyle \frac{\partial}{\partial\omega_j} } L_\pi(\mathbf t;\theta) &=-\Delta\tau_j-\pi\Delta\tau_j\operatorname{sgn} \omega_j + \sum_{\tau_{j-1}\leq t_i\leq \tau_j} \frac{1}{\lambda_\theta(t_i)} \end{align*}\end{split}\end{aligned}\end{align}

Higher partial derivatives are also analogous to the partial derivatives with respect to $$\mu$$, although of course the penalty introduces a discontinuity at $$\omega_j=0$$. This last formula is the key to the gradient ascent algorithm.

Note that the $$\omega_j$$ values are mutually orthogonal, in the sense that $${ \scriptstyle \frac{\partial^2}{\partial\omega_i\omega_j}} =0$$ if $$i\ne j$$. I can treat these components, more or less, as separate univariate components when optimizing, and the Hessian will be sparse off the diagonal.

Note also that although the partial derivative is undefined when $$\omega_j=0$$, we can still tell whether to update it in the gradient ascent algorithm by using the elbow formula:

$\left|\sum_{\tau_{j-1}\leq t_i\leq \tau_j} \frac{1}{\Delta\tau_j\lambda_\theta(t_i)}-1\right|\leq \pi$

that the value of $$\omega_j$$ is trapped at 0, in that we know the sign of the partial derivative is different on either side of 0, and we don't need to bother doing any further updates for that parameter.

Now we have an estimate for a given fixed penalty.

I now present the forward approximate stage-wise path algorithm.

1. Fit the unpenalized parameters of the model with your choice of numerical optimization, $$\hat{\theta}(\mathbf t) = \operatorname{argmax}_\theta L(\mathbf t;\theta)$$ leaving $$\boldsymbol \omega\equiv 0$$. Call this estimate $$\hat{\theta}_0:math:$$. By construction, this is equivalent to taking the penalty weight $$\pi$$ large enough that the regularized fit is the same as the non-regularized ML fit.
2. By evaluating the elbow formula fro each component $$\omega_j$$, we can find the smallest value of $$\pi$$ such that we would not alter the estimated value for any $$\omega$$ if we used it as the penalty parameter. Call this $$\pi_0$$.
3. Now choose a new $$\pi_1=\pi_0-\Delta$$ for $$\Delta$$ small.
4. By construction, $$\hat{\theta}_1:= \operatorname{argmax}_\theta L_{\pi_1}(\mathbf t;\theta)\ne\operatorname{argmax}_\theta L_{\pi_0}(\mathbf t;\theta)$$. Using $$\hat{\theta}_0$$ as an initial estimate, ascend the penalized log-likelihood gradient until the new maxima is attained. You can use the elbow formula to check which $$\omega_j$$ values are stuck at zero without performing updates once all the non-zero parameters have been updated. Any non-stuck parameters are now introduced to the model and their parameters estimated.
5. Choose $$\pi_{i+1}=\pi_i-\Delta$$. Repeat from step 4.
6. When $$\pi_m=0$$ for some $$m$$, stop.

The output of this algorithm is the whole regularization path of $$m$$ different parameters estimates indexed by the hyperparameter $$\pi_m$$ Having calculated it, one can choose from the available set by some model selection criteria.

I am deliberately vague about the choice of gradient ascent technique and the choice of step size $$\Delta$$. These penalized regression problems are typically calculated by using a range of $$\pi$$ values and updating the estimates progressively, using the estimate calculated at each state as an approximation to the next stage,as the parameter is varied. For linear regression, for example, there are particularly efficient methods for calculating these regularization paths. [48].

For the Hawkes model I know of no such specific algorithms.

More smaller steps is better, but also more computationally expensive. I could choose number of steps and size adaptively etc. In practice I use a rule-of-thumb logarithmic spacing with a number os steps equal to the number of parameters.

Various gradient ascent algorithms are known to perform well for :cite::Simon2011,Wu2008. Pragmatically, for performance reasons, I used a mixed strategy. My algorithm attempts to update marginal estimates for each $$\omega_i$$ parameter more rapidly first via Newton's method [11, 92] and uses conjugate gradient descent for the parametric estimates of the Hawkes parameters. If the updates steps are small this seems to be stable. There are various tuning parameters to such algorithms.

Many variants of pathwise regularization and exist, such as backwards stepwise regularization, versions with random initialization, Least Angle Regression, [44] selection with integrated cross validation and so on. [48] For this particular model I have found few results giving me analytic guarantees, so I use heuristics and simulation-based verification. I do not declare that this method "best" in any sense, or that it will find the correct global maximum penalized likelihood etc. The method, however, is simple enough to prototype rapidly and, as it turns out, performs well on test data. Results will be presented in the next chapter.

## Model selection¶

In an exploratory project such as this, I take an informal approach to the complexities of model selection in high dimensional penalized regression. This is a current and active area of research; [41, 23, 78, 117, 81, 115, 85, 123, 114, 75, 22] There is however, to my knowledge, no published prêt-à-porter selection procedure for the particular model family I use here. I therefore proceed heuristically, which is to say, I will adopt rules of thumb from the literature, but observe that if one desires formal quantitative guarantees about confidence intervals, that more work will need to be done. I will assume that we may ignore the question of whether the model is well-specified, and further, I apply use some results derived for the special case of linear regression to the estimation problem here.

Even proceeding informally, we still need to revisit the AIC model selection procedure. I recall the AIC definition, for a model with log likelihood $$L$$ and degrees of freedom $$d$$

${\mathrm {AIC}(L_\hat{\theta}(X))}=2d-2\ln(L)$

We have few data points per time series, and potentially many parameters. The naïve AIC estimate used so far is asymptotically consistent, but negatively biased for small sample sizes. [64]. Where the number of degrees of freedom in the model is on the same order as the number of parameters, we should expect to see that the use of the AIC formula favors complex models.

To remedy this, I use Sugiura's finite-sample-corrected version, the AICc [109].

${\mathrm {AICc}}:={\mathrm {AIC}}+{\frac {2d(d+1)}{N-d-1}}$

for number of observations $$N=|X|$$. Whilst Sugiura originally derived this correction for linear regression models, it has been shown in practice to function as a good model selection statistic for a broader class. [64, 24] I therefore adopt it as an approximation here.

Other alternatives would include correcting for finite sample bias by deriving an analytic or simulation-based empirical correction e.g. [26] §6.5 gives convenient constructions based on the Hessian of the estimate, which we get “for free” in this estimator. I leave such refinements for a future project.

Taking $$N$$ as the number of samples in the data set, this leaves the question of the effective degrees of freedom $$d$$. For the unpenalized Hawkes process with a single parameter kernel, $$d=3$$. []. For penalized regression the calculation of degrees of freedom can be unintuitive, and can even fail entirely to summarize model complexity. [67]. However, under certain “niceness” conditions, which I will assume here without proof, there is a remarkably simple solution: the number of non-zero coefficients fit under a particular value of the regularization penalty $$\pi$$ in an $$\ell_1$$ regression model provides an unbiassed estimate of the effective degrees of freedom of the model and functions as an effective model complexity measure. [43, 125]

Taking these together, we get the following “rule of thumb” model selection statistic:

${\widehat{\mathrm {AICc}}_\pi(X, \hat{\theta}_\pi)} = {\frac {2d_\pi N}{N-d_\pi-1}} - 2\ln(L_\hat{\theta}_\pi(X), X))$

where $$d_\pi$$ counts the number of non-zero coefficients estimated under regularization coefficient $$\pi$$.

As with the usual asymptotic AIC, we may choose between models within this regularization path based on this criteria.

$\pi_\mathrm{opt}=\operatorname{argmin}_\pi{\mathrm {AICc}_\pi}$

Finally, I note that this estimator is both discontinuous and non-monotonic in $$\pi$$ by construction, and thus potentially tricky to optimize. Further, it is based upon the regularization paths of the model which themselves are found by computationally expensive numerical optimization. Optimizing $${\widehat{\mathrm {AICc}}_\pi}$$ with respect to a continuum of $$\pi$$ values is tedious, so I will evaluate this statistic only at specified finite set of values of $$\{\pi_i\}$$, and choose models from this subset of all possible values of $$\pi$$. The estimated optimal $$\pi$$ value is then,

$\hat{pi}_\mathrm{opt} = \operatorname{argmin}_{\pi\in\{\pi_i\}}{\widehat{\mathrm {AICc}}_\pi}$

Various improvements to this scheme are possible, such as choosing the regularization parameters over the entire data set, using cross-validation or choosing $$\{\pi_i\}$$ adaptively during model fit [15] and so on. A thoroughgoing theoretical treatment of the estimator is, however, inappropriate for this data-driven exploration, and so we move on.