# The data¶

I present a qualitative description of the cleaned data here. Technical details of the cleaning process are available in the supplement.

As much as possible, mathematics and analysis are reserved for later chapters, with the exception of some essential terms.

## Nomenclature¶

The data set comprises many separate time series, each consisting of certain summary statistics for an underlying view-count process.

The increments of the underlying view-counter process for a given video I will call *occurrences*.
The summaries, recording how many views have occurred by a given time,
are *observations*.
Each time series is made of many observations.

I will denote the underlying view-counter process \(N_v(t)\), where \(t\) indexes time and the subscript \(v\) indexes over all time series. Normally I will omit the subscript, unless I need to distinguish between two time series.

For a given time series, I have only \(n\) observations of the value of
the view counter on an interval \([0,T]\): times \(\{\tau_i\}_{1 \leq i \leq n}\),
where I set \(\tau_1=0\) and \(\tau_n=T\).
I write such observation tuples \(\{(\tau_i,N(\tau_i))\}_{1 \leq i \leq n}\).
It will always be clear from the context *which* time series a given set of
timestamps belongs to, although it should be understood that there is an implicit
index: \(\{(\tau_{v,i},N_v(\tau_{v,i}))\}_{1 \leq i \leq n_v}\).
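For concreteness, one such series of observation tuples can be sketched as a pair of aligned arrays. This is purely an illustration of the notation above; the values and the layout are made up, not the storage format actually used in the study.

```python
import numpy as np

# One time series: observation times tau (with tau_1 = 0, tau_n = T)
# and the view-counter values N(tau_i). Values are made up for illustration.
tau = np.array([0.0, 30000.0, 61000.0, 95000.0])
counts = np.array([0, 120, 310, 480])

observations = list(zip(tau, counts))  # {(tau_i, N(tau_i))}

# Sanity checks implied by the definitions:
assert tau[0] == 0.0                  # first timestamp anchored at zero
assert np.all(np.diff(tau) > 0)       # observation times strictly increase
assert np.all(np.diff(counts) >= 0)   # the counter never decreases
```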

The dataset was gathered from 13 October 2006 until 25 May 2007
for use in an article published in 2008 [32].
Information was *scraped* from Youtube,
which is to say, extracted by
machine text processing of web page data by an automated web browser.
The web pages in question are the pages for individual videos displayed on Youtube.
To pick one example, the time series encoded as `epUk3T2Kfno`
is available at
https://www.youtube.com/watch?v=epUk3T2Kfno
and corresponds to a video entitled *Otters holding hands*, uploaded on Mar 19, 2007, with summary information

> Vancouver Aquarium: two sea otters float around, napping, holding hands. SO CUTE!

which is an accurate summary of the 88-second cinematic masterpiece.

Source code for the Youtube sampling is no longer available, and limited communication with the author has been possible, so I adopt a conservative approach to interpretation of the available data.

One unusual quality of the data is an administrative one:
at the time of data collection, there was no known prohibition against automated data collection from Youtube.
At the time of writing, however, the current Youtube Terms of Service agreement for Switzerland
(dated 2013/4/3) expressly prohibits the use of automated data collection.
Even if I could find a jurisdiction with more permissive Terms of Service, I would have to circumvent complex software defense mechanisms designed to prevent automated data extraction.
I am thus precluded from automated verification of hypotheses developed from this data;
I may, however, legally *manually* verify a small number of hypotheses, insofar as that is possible from normal information available to the user of a browser.
This fact will be significant in discussing optimal semiparametric regression strategies later on.

Timespans for individual video series span subsets of the overall interval, and are sampled at varying rates. The observation interval for a given video can vary from seconds to days. After data cleaning, details of which are discussed in the supplement (*Data extraction and cleaning*), the rate is approximately 3 observations per calendar day, but varies apparently randomly over time and between videos. There is no obvious correspondence between the observation rates of different videos' time series, or between the observation rate and qualities of the video itself, such as popularity.

The timestamp of the \(i\)th such increment I take to be \(\tau_i\).
One could instead treat this recorded timestamp as a noisy estimate of the true, unobserved observation time \(\hat{\tau}_i\).
A principled approach to this data decontamination would then be to construct a stochastic process
model for the observation times,
reflecting the stochastic relationship between the *recorded* counter value and the *true*
counter value.
One could also attempt to correct for view rates to allow for the time-zone a video is likely to be viewed from and when its viewers would be awake and so forth.
The sampling intervals are messy enough that I doubt we could extract such information.
An analysis of the robustness of the estimator under perturbation of
timestamps to estimate the significance of these assumptions would be wise.
I leave that to later work.

As I depend upon asymptotic results in the estimation packages, I cannot learn much from small time series, so I discard all series with fewer than \(200\) observations. This value is somewhat arbitrary, but is chosen to include a “hump” in the frequency of time series with around \(220\) observations. This constitutes non-random censoring of the time series due to the data cleaning process, as discussed in the technical supplement. The data is likely already censored in any case, as also discussed there, and I put this problem aside for future research.
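The filtering step itself is simple. A minimal sketch, assuming series are held in a dictionary keyed by video id (a hypothetical layout, not the original pipeline's code):

```python
import numpy as np

MIN_OBSERVATIONS = 200  # cutoff from the text; somewhat arbitrary

def filter_series(series_by_id):
    """Keep only series with at least MIN_OBSERVATIONS observations.

    `series_by_id` maps a video id to an array of observation times
    (an illustrative layout, not the actual storage format).
    """
    return {vid: obs for vid, obs in series_by_id.items()
            if len(obs) >= MIN_OBSERVATIONS}

# Toy example: one series passes the cutoff, one does not.
series = {
    "epUk3T2Kfno": np.arange(220),   # 220 observations -> kept
    "shortSeries": np.arange(50),    # 50 observations  -> discarded
}
kept = filter_series(series)
```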

After filtering, \(253{,}326\) time series remain. These time series exhibit a range of different behaviors, sampling densities, total numbers of occurrences, and view rates.

For visualization, I approximate the instantaneous rate of views for a given time series by a piecewise-constant function.

For compatibility with the notation I use later, I denote this estimate \(\hat{\lambda}_{\mathrm{simple}}(t)\), and define it

\[\hat{\lambda}_{\mathrm{simple}}(t) := \sum_{i=2}^{n} \frac{N(\tau_i) - N(\tau_{i-1})}{\tau_i - \tau_{i-1}} \mathbb{I}_{[\tau_{i-1},\tau_i)}(t),\]

where \(\mathbb{I}_A\) is the indicator function for set \(A\).

An example is pictured.
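The estimator above is straightforward to compute. A sketch, assuming the array layout from earlier (not the code used to produce the figures):

```python
import numpy as np

def lambda_simple(t, tau, counts):
    """Piecewise-constant view-rate estimate at query times `t`.

    On each interval [tau_{i-1}, tau_i) the rate is the counter increment
    divided by the interval length; the estimate is zero outside the
    observation window. A sketch of the estimator defined in the text.
    """
    tau = np.asarray(tau, dtype=float)
    counts = np.asarray(counts, dtype=float)
    rates = np.diff(counts) / np.diff(tau)           # rate on each interval
    idx = np.searchsorted(tau, t, side="right") - 1  # interval index per t
    out = np.zeros_like(np.asarray(t, dtype=float))
    valid = (idx >= 0) & (idx < len(rates))
    out[valid] = rates[idx[valid]]
    return out

tau = [0.0, 10.0, 30.0]
counts = [0, 50, 90]
rate = lambda_simple(np.array([5.0, 20.0]), tau, counts)  # [5.0, 2.0]
```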

Finally we are in a position to actually frame questions about this data.

We might ask if the spikes in this video can be explained by endogenous branching dynamics, or exogenous influence. What could explain the variability in this time series? Is it a video shared for its intrinsic interest, or is it responding to external events?

Sleuthing reveals that the subject of the video, Mexican singer-songwriter Valentin Elizalde, was assassinated at around the upload time of this video. That is a plausible explanation for the initial peak in interest. But the later resurgence?

A biography suggests one hypothesis:

> When he was alive, he never had a best-selling album. But less than four months after his murder and half a year after "To My Enemies" became an Internet hit, Elizalde made it big. On March 3, when Billboard came out with its list of best-selling Latin albums in the United States, Elizalde occupied the top two spots. [96]

Was it Elizalde's success in Billboard magazine that led to the spike in video views? I will return to this question later.

## Outliers and Dragon Kings¶

We need to consider whether the behavior we witness among large time series, in the sense of those with many recorded occurrences, is similar to that of small time series. For one thing, such regularity is precisely the kind of thing we would like to discover. For another, if there is no such regularity, that would be useful to know too, as the estimators I use scale poorly in efficiency with increasing occurrence count.

I consider here the distribution of sizes *amongst* the time series.

25% of the total occurrences recorded by the (filtered) data set are contained in only 671 time series.
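Given per-series occurrence totals, this concentration figure can be computed by sorting and accumulating. A sketch with synthetic numbers (the real data set yields 671 series at the 25% mark):

```python
import numpy as np

def series_holding_fraction(totals, fraction=0.25):
    """Smallest number of series whose combined occurrence counts reach
    `fraction` of the grand total, counting the largest series first."""
    totals = np.sort(np.asarray(totals, dtype=float))[::-1]  # largest first
    cum = np.cumsum(totals)
    target = fraction * cum[-1]
    return int(np.searchsorted(cum, target) + 1)

# Toy example: one series dominates, so it alone holds 25% of the total.
k = series_holding_fraction([100, 50, 25, 25], fraction=0.25)  # k == 1
```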

If we wish to ultimately understand this data set, the extremely large number of total views concentrated in a small proportion of total videos will be significant in determining a sampling strategy.

It is tempting to draw a comparison with Sornette's “Dragon King” distributions [Sornette2009c], although given the unknown data censoring process, I will not attempt to draw conclusions about the population of Youtube videos from this sample.

## Lead Balloons¶

The self-exciting model is interesting precisely *because* it can produce
variable dynamics.
As such, extreme rate variation within a time series or between time series is
not necessarily a problem for the model.
On the other hand, the Maximum Likelihood estimators that I develop here are
sensitive to outliers, so we need to see the kind of problems the data presents,
especially where they represent the kind of extreme behavior that will be an outlier with respect to the model.

There are time series where unambiguous external evidence leads us to suspect that the process has undergone an exogenous shock, leading to a sudden increase or decrease in view rate. Sometimes this is due to a clear time limit on a video's relevance.

More extreme than sudden loss of interest are the sudden rate “spikes”
early in the life of a time series, containing most of the information:
there is massive activity at the beginning of the time series,
and virtually none thereafter.
I call these series *lead balloons*, after their trajectories.

It is not immediately clear if these spikes are due to genuine collapses in the popularity of a video, or if they are a technical artifact. In the case of the last example, the initial spike dwarfs all other activity in the time series, although activity never stops entirely. I repeat it on a logarithmic scale, where we can see that the initial rate is orders of magnitude above later activity.

Presuming these spikes are a real phenomenon, one explanation would be that something, perhaps a mention on television, has promoted interest, but that the video itself has absolutely no viral potential.

Some sleuthing reveals that this example is video of a notorious brawl at the 2007/3/6 Inter Milan versus Valencia football game, which led to a seven-month ban for Valencia player David Navarro. The video was uploaded shortly after the controversial match. It seems plausible that millions of soccer fans who switched off the uneventful game resorted to Youtube to watch the fight they missed at the end. But David Navarro has little viral potential; once you have seen him brawling, that is enough.

The majority of these lead balloons have no metadata available in my data set, and one often cannot acquire additional metadata even with effort, as videos in this category have often been removed from Youtube. This suggests that perhaps they represent controversial or illegal content which was briefly wildly popular but quickly censored. However, the view counter for deleted videos is, at the time of writing, not visible, so we would expect time series for deleted videos to simply be truncated entirely, not vastly reduced. There is no easy way of deciding this here, but I return to the issue later.

Research on similar systems suggests such sudden spikes are likely to be a common and important part of the dynamics. For example, celebrity mentions affect book sales [39, 108] and natural disasters affect charity donations. [31]

There are other classes of stylized dynamics, but the sorts listed here already comprise enough complexity and difficulty for one paper, and accordingly I leave the dataset analysis for the time being.

## Hypotheses¶

The data has many stylized features of other famous “social contagion” data: it has variable dynamics, a concentration of much activity into a small number of members of the data set, and so on.

Whether this fits into the particular framework of the Hawkes process is another question. It seems unlikely that a branching process fit to such data would support a single background rate or branching ratio for all the data. We might instead hypothesize about the distribution of such parameters, e.g. that the generating process has an Omori kernel with a certain exponent. The hypothesis that there might be characteristic timescales or other stylized behavior for such data also seems reasonable. The question is whether the significance of such effects, if any, is quantifiable or identifiable with the tools we have.