# Technical Notes

## Data extraction and cleaning

One problem with the data set is its sheer size: I begin with an undocumented MySQL database with a disk footprint of approximately 40 gigabytes. Although certain queries run rapidly, most aggregate and summary statistics do not, instead terminating with resource-usage errors. Based on naming conventions, I identify tables of particular interest: one apparently containing metadata for particular videos, and one containing time series of video activity. I store these tables as plain Hierarchical Data Format (HDF5) files, divided into 256 "shards" based on the hash value of the video identifier.
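A minimal sketch of this sharding step, assuming pandas-style tables and a `video_id` column (the function and file names here are my own; the original schema is undocumented):

```python
# Sketch of the 256-way sharding described above: route each row to an
# HDF5 file chosen by a stable hash of its video identifier. Names here
# (shard_of, write_shards, the file prefix) are illustrative assumptions.
import hashlib

import pandas as pd


def shard_of(video_id: str, n_shards: int = 256) -> int:
    """Map a video identifier to one of n_shards buckets via a stable hash."""
    digest = hashlib.md5(video_id.encode("utf-8")).digest()
    return digest[0] % n_shards  # one byte suffices for 256 shards


def write_shards(df: pd.DataFrame, prefix: str, n_shards: int = 256) -> None:
    """Append each row of df to the HDF5 shard file its video_id hashes to."""
    buckets = df["video_id"].map(shard_of)
    for shard, group in df.groupby(buckets):
        group.to_hdf(f"{prefix}-{shard:03d}.h5", key="data",
                     mode="a", format="table", append=True)
```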

The metadata table is of limited use because of various problems with incomplete or inconsistent data. Metadata for many time series is not available, or contains various types of corrupt or invalid data. Below is a representative example. Note that it contains varying types of missing data, some marked as such and some merely implausible:

video_id | author | title | upload_time | length
---|---|---|---|---
5BqQLtA7Q1g | Deathslayer31 | MH - Quality Test | 1171138984 | 66
E4udVcrzhG0 | TimelessWorld | NaN | 0 | 0
5vG4BaP_c0I | gimenoalbert | Start M always helps commercial | 1163615987 | 35
5FUbVVKaZhA | snappyscan | piñu pol! | 1171544146 | 12
0T99eHrDLCk | RonBats | Fabienne de Vries | 1173617924 | 75
8df3B6YkJqc | T_UNDEFINED | circusmeisje | 1177431779 | 90
brcl0zQ7hTU | kfezzie | NaN | 0 | 0
3f6UbrDL3aM | T_UNDEFINED | breaking a tackle | 1174915331 | 8
bpklKD6DCu8 | leoleffa | NaN | 0 | 0
36jaVnqYjMo | rgarethbrown | Pablo e Cassio TIFOSI em uma canção de amor.... | 1172846785 | 54
a38jmP0FrYE | Drunky90 | Shine Take that | 1176389346 | 204
9itnxKn2yxY | T_UNDEFINED | delalalahaha | 1168571195 | 8
... | ... | ... | ... | ...

What I cannot depict is the missing data: many records have no metadata available at all, or, where it does exist, it would require more extensive excavation of the database to extract. I use this metadata where available, but I do not restrict my investigation to data points with available metadata.

Leaving metadata aside, I turn to the time series themselves.

I retrieve \(676,638,684\) distinct records from the database, corresponding to \(4,880,136\) distinct videos. Dividing one figure by the other might suggest that I have nearly 5 million individual time series, with an average of roughly 139 observations each.

This is not so, for two reasons:

- Random sampling reveals that the time series are not all of similar length. In fact, the data set is dominated by short series, on the order of 10 observations each.
- Even the remaining series can be shorter than they appear: the majority of the recorded observations are spurious and must be discarded, as explained below.

The cleaning and analysis that each series requires is complex enough that I cannot process these series with database queries *per se*.
Instead, I download them all and inspect each individually.
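In practice this means streaming through the shard files one at a time. A sketch, under the same assumed layout as the hypothetical sharding code above:

```python
# Iterate over all per-video series across the 256 shards written earlier.
# File naming and layout follow the hypothetical sharding sketch above.
import pandas as pd


def iter_series(prefix: str, n_shards: int = 256):
    """Yield (video_id, DataFrame) pairs, one per video, shard by shard."""
    for shard in range(n_shards):
        df = pd.read_hdf(f"{prefix}-{shard:03d}.h5", key="data")
        for video_id, group in df.groupby("video_id"):
            yield video_id, group
```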

Firstly, a data sample:

| | video_id | run_time | view_count |
|---|---|---|---|
| ... | ... | ... | ... |
| 10 | -2IXE5DcWzg | 1165050318 | 921 |
| 11 | -2IXE5DcWzg | 1165081008 | 1035 |
| 12 | -2IXE5DcWzg | 1165084724 | 1035 |
| 13 | -2IXE5DcWzg | 1165115641 | 1306 |
| 14 | -2IXE5DcWzg | 1165139660 | 1662 |
| 15 | -2IXE5DcWzg | 1165146641 | 1726 |
| 16 | -2IXE5DcWzg | 1165177526 | 1756 |
| 17 | -2IXE5DcWzg | 1165177671 | 1756 |
| 18 | -2IXE5DcWzg | 1165191787 | 1876 |
| 19 | -2IXE5DcWzg | 1165209383 | 1876 |
| 20 | -2IXE5DcWzg | 1165235421 | 2001 |
| 21 | -2IXE5DcWzg | 1165241236 | 2001 |
| 22 | -2IXE5DcWzg | 1165243133 | 2001 |
| 23 | -2IXE5DcWzg | 1165264017 | 2067 |
| 24 | -2IXE5DcWzg | 1165274487 | 2067 |
| 25 | -2IXE5DcWzg | 1165306214 | 2349 |
| ... | ... | ... | ... |

I take `run_time` to correspond to the \(\tau_i\) values; I assume it is measured in *epoch timestamps*, the number of seconds since new year 1970 UTC. I take `view_count` to denote \(N_v(\tau_i)\), and `video_id` to be a unique index \(v\) of the time series.

Note that many `view_count` values are repeated.
Analysis of the data reveals many series like this one, with repeated values.
This could be evidence that no views occurred in a given time window.
However, based on partial notes from the original author, and the sudden
extreme increments interspersed between these "null increments",
there is a more probable explanation:
these are "cache hits":
stale data presented to the user by the network,
for performance reasons, in lieu of current information.
I therefore preprocess each time series to retain only the observations that
strictly increase the view count, and discard the rest.
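This filter is simple to state in code. A minimal sketch, assuming a single video's series sits in a pandas DataFrame with the columns shown above (the actual preprocessing pipeline is more involved):

```python
# Minimal sketch of the cache-hit filter described above. Assumes one
# video's observations in a pandas DataFrame with the column names shown
# in the tables of this section.
import pandas as pd


def strictly_increasing(series: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose view_count exceeds every earlier view_count."""
    series = series.sort_values("run_time")
    prev_max = series["view_count"].cummax().shift(1, fill_value=-1)
    return series[series["view_count"] > prev_max]
```

Applied to the excerpt above, this drops exactly the repeated rows (12, 17, 19, 21, 22, and 24). Series left with fewer than two observations after filtering cannot be interpolated, a point I return to below.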

With these caveats, I repeat the time series excerpt for video `-2IXE5DcWzg` after preprocessing:

| | video_id | run_time | view_count |
|---|---|---|---|
| ... | ... | ... | ... |
| 10 | -2IXE5DcWzg | 1165036079 | 921 |
| 11 | -2IXE5DcWzg | 1165081008 | 1035 |
| 13 | -2IXE5DcWzg | 1165115641 | 1306 |
| 14 | -2IXE5DcWzg | 1165139660 | 1662 |
| 15 | -2IXE5DcWzg | 1165146641 | 1726 |
| 16 | -2IXE5DcWzg | 1165177526 | 1756 |
| 18 | -2IXE5DcWzg | 1165191787 | 1876 |
| 20 | -2IXE5DcWzg | 1165235421 | 2001 |
| 23 | -2IXE5DcWzg | 1165264017 | 2067 |
| 25 | -2IXE5DcWzg | 1165306214 | 2349 |
| ... | ... | ... | ... |

I have effectively discarded all view increments of size zero.
I am also effectively censoring all inactive time series:
we cannot "see" any time series with only zero or one observations,
since there must be at least two distinct view counts to interpolate between.
Given what I know about the data, there is no way to measure precisely how
large this censored proportion is;
it could easily be the vast majority of videos that fall into this category.
After all, the phrase "long tail" was notoriously popularized by *Wired* in
2004 to describe the preponderance of asymmetric distributions of
popularity online [4], and we should suspect that
Youtube is such a system.
It is entirely possible that most videos are *never* viewed,
and that this data cleaning has censored such videos from the analysis.
The simple solution is to exclude this unknown proportion from the analysis.
Therefore, throughout this work, it should be understood that the estimates I construct are all *conditional on sustained activity*.

## On the complexity of the simplest possible thing

The first half of the analysis in this report uses the statistical library `pyhawkes`, and the second half uses hand-built code.
The reason for this is technical rather than mathematical.

`pyhawkes` is an amazing project: optimized and featureful, it supports multivariate and marked processes and a wide variety of density-kernel types, with mathematically and technically sophisticated optimizations.

It is also the kind of specialized racing vehicle that requires expert maintenance by qualified service personnel.

I did try to use it for the semi-parametric regression, but ultimately, when my needs were simple (optimizing parameters with respect to a simple loss function), I found myself introducing bugs rather than removing them.

When I added features, matters got messier still, in that I encountered problems of a different kind. I tried to implement the non-parametric background rate using an off-the-shelf Gaussian kernel density estimator library; the performance of that library was poor, its support for variable-width and variable-shape kernels was limited, and taking derivatives with respect to the kernel parameters required me to re-implement large parts of that library.
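For concreteness, the missing primitive was roughly the following: evaluating a Gaussian kernel density estimate together with its analytic derivative with respect to the bandwidth. This is a hand-rolled sketch under a simplifying assumption (a single global bandwidth), not the off-the-shelf library's API:

```python
# Hand-rolled Gaussian KDE with an analytic bandwidth derivative, the kind
# of primitive the off-the-shelf estimator did not expose. Simplified to a
# single global bandwidth h; the variable-width case is more involved.
import numpy as np


def kde_and_bandwidth_grad(x, samples, h):
    """Return the KDE of `samples` at points `x`, and d(KDE)/dh."""
    z = (x[:, None] - samples[None, :]) / h           # shape (n_x, n_samples)
    phi = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)  # standard normal pdf
    density = phi.mean(axis=1) / h
    # d/dh [phi(z)/h] = phi(z) * (z**2 - 1) / h**2
    grad = (phi * (z**2 - 1.0)).mean(axis=1) / h**2
    return density, grad
```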

In the end, rather than modifying, combining, and partially reimplementing two high-complexity libraries to achieve a mathematically simple end,
I judged it the safer course to stitch together *simple* components to achieve a simple end.

The upshot is that my code (let us call it `excited`) is not API-compatible with `pyhawkes`. Not even close.
It is written mostly in Python, with numba used to dynamically compile the inner loop.
It exploits the Scipy library's Newton's-method and L-BFGS solvers to find optima, which are technical innovations over `pyhawkes`.
On the other hand, it does not implement Akaike's recursion relation to optimize the calculation of exponential kernels, and it is missing the other response kernels available in `pyhawkes`.
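To make the division of labor concrete, here is a sketch of the general pattern: a numba-compiled negative log-likelihood handed to Scipy's L-BFGS-B solver. The exponential-kernel Hawkes parameterization \((\mu, \alpha, \beta)\) and the naive \(O(n^2)\) inner loop (no recursion, as noted above) are illustrative assumptions, not `excited`'s exact internals:

```python
# Sketch of the wiring described above: a numba-compiled negative
# log-likelihood for an exponential-kernel Hawkes process, minimized
# with Scipy's L-BFGS-B solver.
import numpy as np
from numba import njit
from scipy.optimize import minimize


@njit
def neg_log_lik(params, times, T):
    """Negative log-likelihood of an exponential-kernel Hawkes process."""
    mu, alpha, beta = params[0], params[1], params[2]
    ll = -mu * T
    for i in range(times.shape[0]):
        lam = mu  # intensity at times[i]: baseline plus excitation
        for j in range(i):
            lam += alpha * beta * np.exp(-beta * (times[i] - times[j]))
        ll += np.log(lam)
        # compensator contribution of event i over (times[i], T]
        ll -= alpha * (1.0 - np.exp(-beta * (T - times[i])))
    return -ll


# Stand-in event times; in practice these are a video's view timestamps.
times = np.sort(np.random.uniform(0.0, 100.0, size=200))
result = minimize(neg_log_lik, x0=np.array([1.0, 0.5, 1.0]),
                  args=(times, 100.0), method="L-BFGS-B",
                  bounds=[(1e-6, None), (0.0, None), (1e-6, None)])
print(result.x)  # fitted (mu, alpha, beta)
```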

This situation is not ideal;
in a perfect world, these features would all be combined into one package.
In the real world, however, I am enrolled in a *statistics* program rather than a *software engineering* one, and I would be punished accordingly if I sacrificed thoroughness in my statistical analysis in order to observe the niceties of software development.

It turned out that the simplest possible bit of code that could solve my statistical problem was in fact complex. Thus, although access to the code is available upon request, consider yourself warned.