Technical Notes

Data extraction and cleaning

One problem with the data set is its size alone. I begin with an undocumented MySQL database with a disk footprint of approximately 40 gigabytes. Although certain queries run rapidly, most aggregate and summary queries do not, terminating with resource-usage errors. Based on naming conventions, I identify two tables of particular interest: one apparently containing metadata for individual videos, and one containing time series of video activity. These tables I store as plain Hierarchical Data Format (HDF) files, divided into 256 "shards" based on a hash of the video identifier.
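To make the sharding step concrete, here is a minimal sketch of how the rows for a single video might be routed to one of the 256 shard files. The hash function (md5), the file naming, and the helper itself are illustrative assumptions, not a record of the actual export.

    import hashlib

    import pandas as pd

    N_SHARDS = 256

    def shard_of(video_id: str) -> int:
        # Stable hash of the video identifier, reduced modulo the shard count.
        # md5 is an illustrative choice; any stable hash would do.
        digest = hashlib.md5(video_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % N_SHARDS

    def append_to_shard(rows: pd.DataFrame, video_id: str, prefix: str = "timeseries") -> None:
        # Append one video's rows to its HDF shard file (requires the PyTables backend).
        path = f"{prefix}_{shard_of(video_id):03d}.h5"
        rows.to_hdf(path, key="data", mode="a", append=True, format="table")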

The metadata table is of limited use because of incomplete and inconsistent data: metadata for many time series is unavailable, or is corrupt or invalid in various ways. Below is a representative example. Note that it contains several kinds of missing data, some marked as such and some merely implausible:

video_id author title upload_time length
5BqQLtA7Q1g Deathslayer31 MH - Quality Test 1171138984 66
E4udVcrzhG0 TimelessWorld NaN 0 0
5vG4BaP_c0I gimenoalbert Start M always helps commercial 1163615987 35
5FUbVVKaZhA snappyscan piñu pol! 1171544146 12
0T99eHrDLCk RonBats Fabienne de Vries 1173617924 75
8df3B6YkJqc T_UNDEFINED circusmeisje 1177431779 90
brcl0zQ7hTU kfezzie NaN 0 0
3f6UbrDL3aM T_UNDEFINED breaking a tackle 1174915331 8
bpklKD6DCu8 leoleffa NaN 0 0
36jaVnqYjMo rgarethbrown Pablo e Cassio TIFOSI em uma canção de amor.... 1172846785 54
a38jmP0FrYE Drunky90 Shine Take that 1176389346 204
9itnxKn2yxY T_UNDEFINED delalalahaha 1168571195 8
... ... ... ... ...

What I cannot depict here is the missing data: many records apparently have no metadata available at all, or, where it does exist, it would require more extensive excavation of the database to extract. Where metadata is available I use it, but I do not restrict my investigation to data points with available metadata.

Leaving metadata aside, I turn to the time series themselves.

I retrieve \(676,638,684\) distinct records from the database, corresponding to \(4,880,136\) distinct videos. Dividing one figure by the other might suggest that I have nearly 5 million individual time series with, on average, roughly 140 observations each.

This is not so, for two reasons:

  1. Random sampling reveals that the time series are far from equally long. In fact, the data set is dominated by short series, on the order of 10 data points each.
  2. Even the remaining series can be shorter than expected: the majority of the recorded observations are spurious and must be discarded, as explained below.

The cleaning and analysis that each series requires is complex enough that it cannot be done with database queries alone. Instead, I download all the series and inspect each one individually.

Firstly, a data sample:

  video_id run_time view_count
... ... ... ...
10 -2IXE5DcWzg 1165050318 921
11 -2IXE5DcWzg 1165081008 1035
12 -2IXE5DcWzg 1165084724 1035
13 -2IXE5DcWzg 1165115641 1306
14 -2IXE5DcWzg 1165139660 1662
15 -2IXE5DcWzg 1165146641 1726
16 -2IXE5DcWzg 1165177526 1756
17 -2IXE5DcWzg 1165177671 1756
18 -2IXE5DcWzg 1165191787 1876
19 -2IXE5DcWzg 1165209383 1876
20 -2IXE5DcWzg 1165235421 2001
21 -2IXE5DcWzg 1165241236 2001
22 -2IXE5DcWzg 1165243133 2001
23 -2IXE5DcWzg 1165264017 2067
24 -2IXE5DcWzg 1165274487 2067
25 -2IXE5DcWzg 1165306214 2349
... ... ... ...

run_time I take to correspond to the \(\tau_i\) values. I assume it to be measured in Unix epoch timestamps - the number of seconds since 1 January 1970 UTC. view_count I take to denote \(N_v(\tau_i)\), and video_id is the unique index \(v\) of the time series.

Note that many view_count values are repeated; analysis reveals many series like this. Repetition could be evidence that no views occurred in a given time window. However, based on partial notes from the original author, and on the sudden extreme increments interspersed between these “null increments”, there is a more probable explanation: these are “cache hits”, stale data presented to the user by the network for performance reasons, in lieu of current information. I therefore preprocess each time series to keep only observations whose view count strictly exceeds every earlier one, and discard the rest.
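As a sketch of this preprocessing step - assuming each shard is loaded as a pandas DataFrame with the video_id, run_time and view_count columns shown above; the function name is mine - keeping only the strictly increasing observations can be expressed as follows:

    import pandas as pd

    def keep_strict_increases(series: pd.DataFrame) -> pd.DataFrame:
        # Sort one video's observations by their epoch-second timestamps.
        series = series.sort_values("run_time")
        # Running maximum of all *earlier* view counts (NaN for the first row).
        prior_max = series["view_count"].shift().cummax()
        # Keep the first row, plus every row whose count strictly exceeds
        # everything seen before it; repeated or lower counts are treated
        # as cache hits and dropped.
        mask = prior_max.isna() | (series["view_count"] > prior_max)
        return series[mask]

Applied per video (for example via a groupby on video_id), this keeps only the first record at which each new view count appears.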

With these caveats, I repeat the time series excerpt for video -2IXE5DcWzg after preprocessing:

  video_id run_time view_count
... ... ... ...
10 -2IXE5DcWzg 1165036079 921
11 -2IXE5DcWzg 1165081008 1035
13 -2IXE5DcWzg 1165115641 1306
14 -2IXE5DcWzg 1165139660 1662
15 -2IXE5DcWzg 1165146641 1726
16 -2IXE5DcWzg 1165177526 1756
18 -2IXE5DcWzg 1165191787 1876
20 -2IXE5DcWzg 1165235421 2001
23 -2IXE5DcWzg 1165264017 2067
25 -2IXE5DcWzg 1165306214 2349
... ... ... ...

I have effectively discarded all view increments of size zero. I am also effectively censoring all inactive time series: we cannot “see” any time series with only zero or one observations, since there must be at least two distinct view counts to interpolate between. Given what I know about the data, there is no way to measure precisely how significant this censored proportion is; it could easily be the vast majority of videos that fall into this category. After all, the phrase “long tail” was notoriously popularized by Wired in 2004 to describe the preponderance of asymmetric distributions of popularity online [4], and we should suspect that YouTube is such a system. It is entirely possible that most videos are never viewed, and that this data cleaning has censored such videos from the analysis. The simple solution is to exclude this unknown proportion from the analysis. Therefore, throughout this work, it should be understood that the estimates I construct are all conditional on sustained activity.

On the complexity of the simplest possible thing

The first half of the analysis in this report uses the statistical library pyhawkes; the second half uses hand-built code. The reason for this is technical rather than mathematical.

pyhawkes is an amazing project: optimized and featureful, it supports multivariate and marked processes and a wide variety of kernel types, and it incorporates mathematically and technically sophisticated optimizations.

It is also the kind of specialized racing vehicle that requires expert maintenance by qualified service personnel.

I did try to use it for the semi-parametric regression, but ultimately, since my needs were simple --- optimizing parameters with respect to a simple loss function --- I found myself introducing bugs rather than removing them.

When I added features, things got messier still, and I encountered a different set of problems. I tried to implement the non-parametric background rate using an off-the-shelf Gaussian kernel density estimator library; the performance of that library was poor, its support for variable-width and variable-shape kernels was limited, and taking derivatives with respect to the kernel parameters would have required me to re-implement large parts of it.
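For flavour, here is a minimal sketch of the kind of small, transparent component I mean: a direct evaluation of a variable-bandwidth Gaussian kernel estimate of a background rate, with one bandwidth per event, simple enough to differentiate by hand with respect to those bandwidths. The interface is illustrative, not the actual code.

    import numpy as np

    def gaussian_kde_rate(t, event_times, bandwidths, weights=None):
        # Evaluate sum_j w_j * Normal(t; mean=event_times[j], sd=bandwidths[j])
        # at each query time t, i.e. a variable-bandwidth Gaussian KDE.
        t = np.asarray(t, dtype=float)[:, None]             # (n_eval, 1)
        mu = np.asarray(event_times, dtype=float)[None, :]  # (1, n_events)
        sd = np.asarray(bandwidths, dtype=float)[None, :]   # (1, n_events)
        w = np.ones(mu.shape[1]) if weights is None else np.asarray(weights, dtype=float)
        z = (t - mu) / sd
        bumps = np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sd)
        return bumps @ w                                     # (n_eval,)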

In the end, rather than modifying, combining, and partially reimplementing two high-complexity libraries to achieve a mathematically simple end, I judged it the safer course to stitch together simple components to achieve a simple end.

The upshot is that my code - let us call it excited - is not API-compatible with pyhawkes. Not even close. It is written mostly in Python, with numba to dynamically compile the inner loop. It exploits SciPy's Newton-method and L-BFGS solvers to find optima, which are technical advances over pyhawkes. On the other hand, it does not implement Akaike's recursion relation to optimize the calculation of exponential kernels, and it is missing the other response kernels available in pyhawkes.
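To indicate the shape of that arrangement - and only the shape: the parameterization \(\lambda(t) = \mu + \alpha \sum_{t_j < t} \beta e^{-\beta (t - t_j)}\), the function names, the starting point, and the bounds are illustrative assumptions, not the excited API - a numba-compiled negative log-likelihood driven by SciPy's L-BFGS-B solver looks roughly like this. Note the direct double loop: as stated above, no recursion is used for the exponential kernel.

    import numpy as np
    from numba import njit
    from scipy.optimize import minimize

    @njit
    def neg_log_lik(params, times, T):
        # Negative log-likelihood of a univariate Hawkes process with an
        # exponential kernel, computed with a direct O(n^2) double loop.
        mu, alpha, beta = params[0], params[1], params[2]
        ll = 0.0
        for i in range(times.shape[0]):
            s = 0.0
            for j in range(i):
                s += beta * np.exp(-beta * (times[i] - times[j]))
            ll += np.log(mu + alpha * s)
        # Compensator: the integral of the intensity over [0, T].
        comp = mu * T
        for i in range(times.shape[0]):
            comp += alpha * (1.0 - np.exp(-beta * (T - times[i])))
        return -(ll - comp)

    def fit_exponential_hawkes(times, T):
        # Bounded quasi-Newton optimization of (mu, alpha, beta); the starting
        # point and bounds here are ad hoc illustrative choices.
        x0 = np.array([times.shape[0] / T, 0.5, 1.0])
        res = minimize(
            lambda p: neg_log_lik(p, times, T),
            x0,
            method="L-BFGS-B",
            bounds=[(1e-8, None), (1e-8, 1.0 - 1e-6), (1e-8, None)],
        )
        return res.x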

This situation is not ideal; in a perfect world, these features would all be combined into one package. In the real world, however, I am enrolled in a statistics program rather than a software engineering one, and would be punished accordingly if I sacrificed thoroughness in my statistical analysis in order to observe the niceties of software development.

It turned out that the simplest possible bit of code that could solve my statistical problem was in fact complex. Thus, although the code is available upon request, consider yourself warned.