# The data

I present a qualitative description of the cleaned data here. Technical details of the cleaning process are available in the supplement.

As far as possible, mathematics and analysis are reserved for later chapters, with the exception of some essential terms.

## Nomenclature

The data set comprises many separate time series, each comprising certain summary statistics for an underlying view-count process.

I will call the underlying process, that is, the increments of the view-counter time series for a given video, occurrences. The summaries of how many views have occurred by what time are observations. Each time series is made up of many observations.

Model of the observation procedure of the time series

I will denote the underlying view-counter process by $$N_v(t)$$, where $$t$$ indexes time and the subscript $$v$$ indexes over all time series. Normally I will omit the subscript, unless I need to distinguish between two time series.

For a given time series, I have only $$n$$ observations of the value of the view counter on an interval $$[0,T]$$, at times $$\{\tau_i\}_{1 \leq i \leq n}$$, where I set $$\tau_1=0$$ and $$\tau_n=T$$. I write such observation tuples $$\{(\tau_i,N(\tau_i))\}_{1 \leq i \leq n}$$. It will always be clear from the context which time series a given set of timestamps belongs to, although it should be understood that there is an implicit index: $$\{(\tau_{v,i},N_v(\tau_{v,i}))\}_{1 \leq i \leq n_v}$$.

The dataset was gathered from 13 October 2006 until 25 May 2007 for use in an article published in 2008. [32] Information was scraped from Youtube, which is to say, extracted by machine text processing of web page data by an automated web browser. The web pages in question are the pages for individual videos displayed on Youtube. To pick one example, the time series encoded as epUk3T2Kfno is available at https://www.youtube.com/watch?v=epUk3T2Kfno and corresponds to a video entitled Otters holding hands, uploaded on Mar 19, 2007, with summary information

Vancouver Aquarium: two sea otters float around, napping, holding hands. SO CUTE!

which is an accurate summary of the 88 second cinematic masterpiece.

Screen capture of "Otters holding hands". Content copyright by the Youtube user "Otters holding hands."

Source code for the Youtube sampling is no longer available, and limited communication with the author has been possible, so I adopt a conservative approach to interpretation of the available data.

One unusual quality of the data is an administrative one: at the time of data collection, there was no known prohibition against automated data collection from Youtube. At the time of writing, however, the current Youtube Terms of Service agreement for Switzerland (dated 2013/4/3) expressly prohibits the use of automated data collection. Even if I could find a jurisdiction with more permissive Terms of Service, I would have to circumvent complex software defense mechanisms designed to prevent automated data extraction. I am thus precluded from automated verification of hypotheses developed from this data; I may, however, legally verify a small number of hypotheses manually, insofar as that is possible from the information normally available to the user of a browser. This fact will be significant in discussing optimal semiparametric regression strategies later on.

Timespans for individual video series cover subsets of the overall interval, and are sampled at varying rates. The observation interval for a given video can vary from seconds to days. After my data set cleaning, details of which are discussed elsewhere (see Data extraction and cleaning), the rate is approximately 3 observations per calendar day, but varies apparently randomly over time and between videos. There is no obvious correspondence between the observation rates of different videos' time series, or between the observation rate and qualities of the video itself, such as popularity.

The timestamp of the $$i$$th such increment I take to be $$\tau_i$$. One could consider taking this data as a noisy estimate of the true unobserved observation time $$\hat{\tau}_i$$. A principled approach to this data decontamination would then be to construct a stochastic process model for the observation time to reflect the stochastic relationship between recorded counter value and true counter value. One could also attempt to correct for view rates to allow for the time-zone a video is likely to be viewed from and when its viewers would be awake and so forth. The sampling intervals are messy enough that I doubt we could extract such information. An analysis of the robustness of the estimator under perturbation of timestamps to estimate the significance of these assumptions would be wise. I leave that to later work.

As I depend upon asymptotic results in the estimation packages, I cannot learn much from small time series. I discard all series with fewer than $$200$$ observations. This value is somewhat arbitrary, but is chosen to include a “hump” in the frequency of time series with around $$220$$ observations. This constitutes non-random censoring of the time series by the data cleaning process. The data is likely already censored, however, as discussed in the technical supplement, and I put this problem aside for future research.
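The length filter described above can be sketched in a few lines. This is an illustrative sketch only: the dictionary layout (series identifier mapped to a list of `(timestamp, counter value)` pairs) and the names `filter_short_series` and `MIN_OBSERVATIONS` are assumptions, not the format or code actually used for the cleaning.

```python
from typing import Dict, List, Tuple

# Hypothetical container: one list of (timestamp, counter value) pairs per series.
Series = List[Tuple[float, int]]

MIN_OBSERVATIONS = 200  # threshold stated in the text


def filter_short_series(
    data: Dict[str, Series], min_obs: int = MIN_OBSERVATIONS
) -> Dict[str, Series]:
    """Keep only the series with at least `min_obs` observations."""
    return {vid: obs for vid, obs in data.items() if len(obs) >= min_obs}
```

A series with exactly 200 observations is kept; only series with fewer are discarded.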

After filtering, $$253,326$$ time series remain. These time series exhibit a range of different behavior, different sampling densities, total number of occurrences, and view rates.

Distribution of the 240659 time series with at least 200 observations, in terms of number of data points and mean daily rate. Each successive contour encloses approximately an extra 5% of the total number of time series, totaling 95% of the observations. Some of the final 5% possess mean rate values orders of magnitude greater than the median, and the 0% contour line is therefore excluded for clarity.

I approximate the instantaneous rate of views for a given time series by a piecewise constant function for visualization.

For compatibility with the notation I use later, I denote this estimate $$\hat{\lambda}_{\mathrm{simple}}(t)$$, and define it

$\hat{\lambda}_{\mathrm{simple}}(t) := \sum_{i=2}^n \frac{N(\tau_{i})-N(\tau_{i-1})}{\tau_i-\tau_{i-1}} \mathbb{I}_{[\tau_{i-1},\tau_i)}(t)$

$$\mathbb{I}_A$$ is the indicator function for set $$A$$.
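As a minimal sketch of this estimate, the function below evaluates the piecewise-constant rate at a single time point: it finds the observation interval containing $$t$$ and returns the counter increment divided by the interval length. The function name `simple_rate` and the plain-list inputs are illustrative assumptions, not the code used to produce the figures.

```python
from bisect import bisect_right
from typing import Sequence


def simple_rate(taus: Sequence[float], counts: Sequence[float], t: float) -> float:
    """Piecewise-constant view-rate estimate at time t.

    taus   -- strictly increasing observation times tau_1 < ... < tau_n
    counts -- counter values N(tau_1), ..., N(tau_n)
    Returns the average rate on the interval [tau_{i-1}, tau_i) containing t,
    and 0.0 if t lies outside [tau_1, tau_n).
    """
    if len(taus) != len(counts) or len(taus) < 2:
        raise ValueError("need two or more observations, with matching lengths")
    if t < taus[0] or t >= taus[-1]:
        return 0.0
    # Index i (0-based) such that taus[i - 1] <= t < taus[i].
    i = bisect_right(taus, t)
    return (counts[i] - counts[i - 1]) / (taus[i] - taus[i - 1])
```

The result is in occurrences per unit of whatever time scale the timestamps use, so timestamps in days give a daily rate.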

An example is pictured.

View-rate estimate $$\hat{\lambda}_{\mathrm{simple}}(t)$$, for time series Valentin Elizalde, Volvere a amar

We might ask whether the spikes in this video can be explained by endogenous branching dynamics or by exogenous influence. What could explain the variability in this time series? Is it a video shared for its intrinsic interest, or is it responding to external events?

Sleuthing reveals that the subject of the video, Mexican singer-songwriter Valentin Elizalde, was assassinated at around the upload time of this video. That is a plausible explanation for the initial peak in interest. But the later resurgence?

A biography suggests one hypothesis:

When he was alive, he never had a best-selling album. But less than four months after his murder and half a year after "To My Enemies" became an Internet hit, Elizalde made it big. On March 3, when Billboard came out with its list of best-selling Latin albums in the United States, Elizalde occupied the top two spots. [96]

Was it Elizalde's success in Billboard magazine that led to the spike in video views? I will return to this question later.

## Outliers and Dragon Kings

An example of a video from the top 5% of mean view rates without any initial spike.

(The short transient spike after a long low rate could be evidence for errors in sampling time; these will be largely ignored here.)

We need to consider whether the kind of behavior that we witness amongst large time series, in the sense of having many occurrences recorded, is similar to that of small time series. For one thing, this kind of regularity is precisely the kind of thing that we would like to discover. For another, if there is no such regularity, that would be nice to know too, as the estimators I use scale very poorly in efficiency with increasing occurrence count.

I consider here the distribution of sizes amongst the time series.

Cumulative distribution of observations by time series. Dotted red line denotes the curve of a hypothetical uniform allocation of observations to time series.

Cumulative distribution of occurrences by time series. Dotted red line denotes the curve of a hypothetical uniform allocation of occurrences to time series.

Cumulative distribution of occurrences by time series, log-log scale. Dotted red line denotes the curve of a hypothetical uniform allocation of occurrences to time series.

25% of the total occurrences recorded by the (filtered) data set are contained in only 671 time series.

If we wish to ultimately understand this data set, the extremely large number of total views concentrated in a small proportion of total videos will be significant in determining a sampling strategy.
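A concentration figure of this kind can be computed by sorting the series by total occurrence count and accumulating from the largest downwards. The sketch below assumes the per-series totals are already available as a plain sequence; the function name `series_needed_for_share` is hypothetical.

```python
from typing import Sequence


def series_needed_for_share(totals: Sequence[float], share: float) -> int:
    """Smallest number of largest series that together hold `share` of all occurrences."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must lie in [0, 1]")
    target = share * sum(totals)
    if target <= 0:
        return 0
    running = 0.0
    # Accumulate from the largest series downwards until the target is reached.
    for k, size in enumerate(sorted(totals, reverse=True), start=1):
        running += size
        if running >= target:
            return k
    return len(totals)
```

Under a uniform allocation the answer grows in proportion to `share`; a result far below that baseline indicates the kind of concentration described above.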

It is tempting to draw a comparison with Sornette's “Dragon King” distributions [Sornette2009c], although given the unknown data censoring process, I will not attempt to draw conclusions about the population of Youtube videos from this sample.

The self-exciting model is interesting precisely because it can produce variable dynamics. As such, extreme rate variation within a time series or between time series is not necessarily a problem for the model. On the other hand, the Maximum Likelihood estimators that I develop here are sensitive to outliers, so we need to see the kind of problems the data presents, especially where they represent the kind of extreme behavior that will be an outlier with respect to the model.

There are time series where unambiguous external evidence leads us to suspect that the process has undergone an exogenous shock, leading to a sudden increase or decrease in view rate. Sometimes this is due to a clear time limit on a video's relevance.

A time series with rapid decline

More extreme than sudden loss of interest are the sudden rate “spikes” early in the life of a time series, containing most of the information. There is massive activity at the beginning of the time series, and virtually none thereafter. I call these series lead balloons, after their trajectories.

A time series with enormous early rate spike. The view rate collapses so suddenly that it is nearly invisible.

It is not immediately clear whether these spikes reflect genuine collapses in the popularity of a video, or whether they are a technical artifact. In the case of the last example, the initial spike dwarfs all other activity in the time series, although activity never stops entirely. I repeat it on a logarithmic scale, where we can see that the initial rate is orders of magnitude above later activity.

The same time series with enormous early rate spike, log vertical scale to show continued activity.

Presuming these spikes are a real phenomenon, one explanation would be that something, perhaps a mention on television, has promoted interest, but that the video itself has absolutely no viral potential.

Some sleuthing reveals that this example was a video of a notorious brawl at the 2007/3/6 Inter Milan versus Valencia football game, leading to a 7 month ban for Valencia player David Navarro. The video was uploaded shortly after the controversial match. It seems plausible that millions of soccer fans who switched off the uneventful game resorted to Youtube to watch the fight they missed at the end. But David Navarro has little viral potential; once you have seen him brawling once, that is enough.

The majority of these lead balloons have no metadata available in my data set, and one often cannot acquire any additional metadata even with effort, as videos in this category have often been removed from Youtube. This suggests that perhaps they represent controversial or illegal content which was briefly wildly popular but quickly censored. However, the view counter for deleted videos is, at time of writing, not visible, so we would expect that time series for deleted videos would simply be truncated entirely, not vastly reduced. There is no easy way of deciding this here, but I return to the issue later.

Research on similar systems suggests such sudden spikes are likely to be a common and important part of the dynamics. For example, celebrity mentions affect book sales [39, 108] and natural disasters affect charity donations. [31]

There are other classes of stylized dynamics, but the sorts listed here already comprise enough complexity and difficulty for one paper, and accordingly I leave the dataset analysis for the time being.

## Hypotheses

The data has many stylized features of other famous “social contagion” data: it has variable dynamics, a concentration of much activity into a small number of members of the data set, and so on.

Whether this fits into the particular framework of the Hawkes process is another question. It seems unlikely that a branching process fit to such data would support a single background rate or branching ratio for all the data. We might hypothesize about the distribution of such parameters, e.g. that the generating process has an Omori kernel with a certain exponent. The hypothesis that there might be characteristic timescales or other stylized behavior for such data also seems reasonable. The question is whether the significance of such effects, if any, is quantifiable or identifiable with the tools we have.
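For orientation, the quantities named above can be located in one common textbook parameterization of a Hawkes process with a power-law (Omori) kernel. This is a sketch under assumed notation: the symbols $$\mu$$, $$\eta$$, $$\theta$$ and $$\kappa$$ are illustrative choices, and need not match those used in later chapters.

$\lambda(t) = \mu + \eta \sum_{t_j < t} \phi(t - t_j), \qquad \phi(s) = \frac{\theta \kappa^{\theta}}{(s+\kappa)^{1+\theta}}, \quad s \geq 0$

Here $$\mu$$ would be the background rate, $$\eta$$ the branching ratio, $$t_j$$ the occurrence times, and $$\phi$$ a kernel normalized to integrate to one, with exponent $$\theta > 0$$ and time scale $$\kappa > 0$$. A single value of $$(\mu, \eta)$$ shared by every time series is the hypothesis that seems unlikely to hold.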