The framework to use for deep learning if you groupthink like Google

July 11, 2016 — July 7, 2021

computers are awful
neural nets
premature optimization

Assumed audience:

people using tensorflow in 2016

Figure 1

A C++/Python/etc neural network toolkit by Google. I was using it for solving general machine-learning problems, and frequently enough that I have notes. My, do I have notes.

See, of course, nostalgebraist’s rant from December 2019 (supplemental from April 2021), which confirms that I am not just imagining it: Tensorflow was (is?) a horrible mess, which lures you in with easy examples in tutorials that are completely unreflective of the nasty chaos of doing anything non-trivial with it, even more so than average for software.

…“Datasets” and “TFRecords” containing “tf.Examples” (who knew serializing dicts of ints could be so painful?) and “Estimators” / “Strategies” (which do overlapping things but are mutually exclusive!) and “tf.functions” with “GradientTapes” because the “Strategies” apparently require lazily-defined eagerly-executed computations instead of eagerly-defined lazily-executed computations, and “object-based checkpoints” which are the new official™ thing to do instead of the old Saver checkpoints except the equally official™ “Estimators” do the old checkpoints by default, and oh by the way if you have code that just defines tensorflow ops directly instead of getting them via tf.keras objects (which do all sorts of higher-level management and thus can’t serve as safe drop-in equivalents for “legacy” code using raw ops, and by “legacy” I mean “early 2019”) then fuck you because every code example of a correct™ feature gets its ops from tf.keras, and aaaaaaaaaaaaaargh!!

Yes, very much so. I had very similar difficulties (some of the names were different because I was doing it slightly earlier), not to mention that it was a nightmare to even install the cursed thing. Anyway, consider yourself warned. I hated tensorflow enough to abandon it. I now use julia or jax instead, and the advice here is not current. AFAICT the (only?) modern reason to use tensorflow is its remaining advantage: good tooling for Edge ML. Which is not my area.

Corollary: Some of the content below is obsolete and based on tensorflow 0.7-1.0, which is ancient now and far from current.

1 Abstractions

No idea if any of these are still current.

  • Keras supports tensorflow and Theano as a backend, for comfort and convenience. See below for some notes.

  • Tensor2Tensor

    Tensor2Tensor, or T2T for short, is a library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research. T2T is actively used and maintained by researchers and engineers within the Google Brain team and a community of users.

  • tensorflowslim eases some boring bits.

  • sonnet is Deepmind’s tensorflow library and shares with keras layer-like abstractions and some helpers to make recurrent neural nets bearable. It has been ported to jax as Haiku.

There are some other frontends, which seem a bit less useful to my mind:

  • tflearn wraps the tensorflow machine in scikit-learn (although the implementation is not very enlightening, nor the syntax especially clear).

  • estimator is a tensorflow generic estimator class. Relationship to other wrappers is not clear to me, but finding out would be tedious, so I will never know.

My objection to these latter abstractions is that they seem to make the easy bits not easier but different, and the hard bits no easier. tflearn might be useful if you need to plug into an existing scikit-learn workflow.

2 Tutorials

See also keras tutorials below.

3 Debugging

Google’s own Tensorflow without a phd.

Joonwook Choi recommends:

Basic ways:

  • Not in mid training? Explicitly fetch, and print (or do whatever you want) using Session.run()
  • Tensorboard Histogram and Image Summary (see next section)
  • tf.Print(input, data, message=None, first_n=None, summarize=None, name=None)
  • tf.Assert(condition, data, summarize=None, name=None)

Advanced ways:

  • Interpose any python codelet in the computation graph
  • A step-by-step debugger
  • tfdbg: The TensorFlow debugger

4 Getting data in

This is a depressingly complex topic; it is likely to take more lines of code than building your actual learning algorithm.

For example, things break differently if

  • you are inputting data of variable dimensions via python, which requires a “feed”, which in turn requires keeping references to a placeholder Op around and ALWAYS resubmitting the data every time you run an op, even if the data is not required for the current Op, or

  • you are inputting a Variable via C++ (Variables may also be feeds, just to mess with you, and they claim to support variable dimensions too, but that never works for me).

These interact in various different ways that seem irritating, but are probably to do with enabling large scale data reading workflows, so that you might accidentally solve a problem for Google and they can get your solution for cheap.

Here’s a walkthrough of some of the details. And here are the manual pages for feeding and queueing.

My experience is that stuff is so horribly messy that you should just build different graphs for the estimation and deployment phases of your model and implement them each according to convenience. This of course is asking for trouble with inconsistencies; you can mitigate that by building sub-graphs of the model and re-using them.

I’m not yet sure how to easily transmit the estimated parameters between graphs in these two separate phases… I’ll make notes about THAT when I come to it.

Idiom: Parsing text tags into Boolean feature vectors for tensorflow.
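
A minimal sketch of that idiom in plain Python, assuming the tags arrive as whitespace-separated strings. The helper names build_vocab and tags_to_bools are made up for illustration, and the resulting lists still have to be handed to tensorflow by whichever input route you have settled on.

```python
def build_vocab(tag_strings):
    """Map each distinct tag to a column index, sorted for reproducibility."""
    tags = sorted({tag for s in tag_strings for tag in s.split()})
    return {tag: i for i, tag in enumerate(tags)}


def tags_to_bools(tag_string, vocab):
    """Return a fixed-length Boolean vector, True wherever a known tag occurs."""
    row = [False] * len(vocab)
    for tag in tag_string.split():
        if tag in vocab:  # tags unseen when the vocabulary was built are ignored
            row[vocab[tag]] = True
    return row


# Made-up example data.
examples = ["jazz piano", "piano solo", "ambient"]
vocab = build_vocab(examples)
features = [tags_to_bools(s, vocab) for s in examples]
```

Fixing the vocabulary once and re-using it at deployment time keeps the column order consistent between the two phases.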

5 (Non-recurrent) convolutional networks

See CNNs for text classification.

NB CNN axis ordering is easy to mess up. The Theano guide to convolutions is clearer if you want to work out the actual dimensions your tensors should have. It also gives an intelligible account of how you invert convolutions for decoding.

The Tensorflow convolution guide is more lackadaisical, but it does get us there:

For the SAME padding, the output height and width are computed as:

out_height = ceil(float(in_height) / float(strides[1]))
out_width  = ceil(float(in_width) / float(strides[2]))

For the VALID padding, the output height and width are computed as:

out_height = ceil(float(in_height - filter_height + 1) / float(strides[1]))
out_width  = ceil(float(in_width - filter_width + 1) / float(strides[2]))
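
Those formulas are easy to get wrong by hand, so here is a small sketch that evaluates them for one spatial axis. The function name conv_output_size is my own, not a tensorflow API, and it covers only the two padding modes quoted above.

```python
import math


def conv_output_size(in_size, filter_size, stride, padding):
    """Output length along one spatial axis, per the SAME/VALID formulas above."""
    if padding == "SAME":
        return int(math.ceil(in_size / float(stride)))
    if padding == "VALID":
        return int(math.ceil((in_size - filter_size + 1) / float(stride)))
    raise ValueError("padding must be 'SAME' or 'VALID', got %r" % (padding,))


# A 28-pixel axis, a 5-pixel filter, stride 2.
same_size = conv_output_size(28, 5, 2, "SAME")
valid_size = conv_output_size(28, 5, 2, "VALID")
```

Note that the filter size does not enter the SAME case at all; only the stride does.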

Tensorflow’s 4d tensor packing for images?

TensorFlow supports NHWC (default) and NCHW (cuDNN default). The best practice is to build models that work with both NCHW and NHWC as it is common to train using NCHW on GPU, and then do inference with NHWC on CPU.

NCHW is, to be clear, (batch, channels, height, width).

Theano, by contrast, is AFAICT always NCHW.
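
To keep the two layouts straight, it helps to write down the axis permutations once. The sketch below only tracks shapes in plain Python; the tuples are what one would pass as the permutation argument of a transpose operation, and the names are mine.

```python
# Axis permutations between the two image layouts.
NHWC_TO_NCHW = (0, 3, 1, 2)
NCHW_TO_NHWC = (0, 2, 3, 1)


def permute_shape(shape, perm):
    """Shape after a transpose where output axis i is input axis perm[i]."""
    return tuple(shape[axis] for axis in perm)


nhwc = (32, 24, 28, 3)  # batch, height, width, channels
nchw = permute_shape(nhwc, NHWC_TO_NCHW)
round_trip = permute_shape(nchw, NCHW_TO_NHWC)
```

Using a non-square image in a test like this catches the common mistake of swapping height and width.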

6 Recurrent networks

The documentation for these is abysmal.

To write: How to create standard linear filters in Tensorflow.

For now, my recommendation is to simply use keras, which makes this easier inside tensorflow, or pytorch, which makes it easier overall.

tensorflow fold is a library which ingests structured data and simulates pytorch-style dynamic graphs dependent upon its structure.

6.1 Official documentation

The Tensorflow RNN documentation, as bad as it is, is not even easy to find, being scattered across several non-obvious locations without consistent crosslinks.

To make it actually make sense without unwarranted time wasting and guessing, you will then need to read other stuff.

6.2 Community guides

8 Getting models out

9 Training in the cloud because you don’t have NVIDIA sponsorship

See practical cloud computing, which has a couple of sections on that.

10 Extending

Tensorflow allows binary extensions but doesn’t really explain how they integrate with normal python builds. Here is an example from Uber.

11 Installing

Tedious. One could simply not install it but rather containerize via Tensorman. See also pop-os/tensorman: Utility for easy management of Tensorflow containers.

12 Misc HOWTOs

12.1 Nightly builds

Here (or build your own)

12.2 Dynamic graphs

Pytorch has JIT graphs and they are super hip, so now tensorflow has a dynamic graph mode, called Eager.

12.3 GPU selection

setGPU sets CUDA_VISIBLE_DEVICES to the least loaded GPU.
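
If you would rather not add a dependency, the same idea can be sketched by hand. This is not setGPU's implementation; it assumes "least loaded" means least memory in use, and the function names are hypothetical. CUDA_VISIBLE_DEVICES is the environment variable CUDA reads to restrict which devices a process can see.

```python
import os


def least_loaded_gpu(smi_csv):
    """Pick the GPU index with the least memory in use.

    smi_csv is text in the format printed by
    nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
    that is, one "index, used_MiB" pair per line.
    """
    usage = []
    for line in smi_csv.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage.append((int(used), int(index)))
    return min(usage)[1]


def select_gpu(smi_csv, environ=os.environ):
    """Restrict CUDA to the chosen device; call this before importing the framework."""
    environ["CUDA_VISIBLE_DEVICES"] = str(least_loaded_gpu(smi_csv))
    return environ["CUDA_VISIBLE_DEVICES"]


sample = "0, 10240\n1, 312\n2, 8051\n"  # made-up readings, not real output
env = {}  # stand-in for os.environ so the sketch has no side effects
chosen = select_gpu(sample, env)
```

In real use the text would come from running the nvidia-smi command via subprocess, which I have left out so that the sketch runs on a machine without a GPU.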

12.4 Silencing tensorflow

TF_CPP_MIN_LOG_LEVEL=1 primusrun python run_job.py args
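
The same variable can be set from inside Python instead of on the command line. A minimal sketch, assuming (as far as I know) that it must be set before tensorflow is imported to have any effect:

```python
import os

# 0 = show everything, 1 = hide INFO, 2 = also hide WARNING, 3 = also hide ERROR.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

# import tensorflow as tf  # left commented out so the sketch runs without tensorflow
```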

13 Hessians and higher order optimisation

Basic Newton method optimisation example. Basic example that also shows how to create a diagonal Hessian.
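
For orientation, here is what a Newton iteration with a diagonal Hessian does, sketched in plain Python rather than tensorflow. The function name and the toy objective are my own assumptions for illustration and are not taken from the linked examples.

```python
def newton_diag(grad, hess_diag, x, steps=10):
    """Newton iteration using only the diagonal of the Hessian.

    grad(x) returns the gradient as a list; hess_diag(x) returns the
    diagonal Hessian entries as a list of the same length.
    """
    for _ in range(steps):
        g, h = grad(x), hess_diag(x)
        x = [xi - gi / hi for xi, gi, hi in zip(x, g, h)]
    return x


# Toy objective: f(x) = sum_i a_i * (x_i - b_i)**2. Its Hessian really is
# diagonal, so a single Newton step lands on the minimiser b.
a = [1.0, 4.0, 0.5]
b = [3.0, -2.0, 7.0]


def grad(x):
    return [2 * ai * (xi - bi) for ai, xi, bi in zip(a, x, b)]


def hess_diag(x):
    return [2 * ai for ai in a]


x_min = newton_diag(grad, hess_diag, [0.0, 0.0, 0.0], steps=1)
```

For objectives whose Hessian is not diagonal, dropping the off-diagonal terms turns this into an approximation, and convergence in one step is no longer guaranteed.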

Slightly outdated, Hessian matrix. There is a discussion on Jacobians in TF, including, e.g. fancy examples by jjough:

here’s mine — works for high-dimensional Jacobians (numerator and denominator have >1 dimension), undefined batch sizes, and tensors that are not statically known.

import tensorflow as tf  # TensorFlow 1.x graph-mode API

def tf_jacobian(tensor2, tensor1, feed_dict, sess = None):
    """
    Computes the tensor d(tensor2)/d(tensor1) recursively.
    :param tensor2: numerator of Jacobian
    :param tensor1: denominator of Jacobian
    :param feed_dict: input data (need this if tensors are not statically known)
    :return: a tensor of dimension (dim_tensor2 x dim_tensor1)
    """
    # look the session up at call time; a default argument is evaluated only once
    if sess is None:
        sess = tf.get_default_session()
    # can't do tensor.get_shape() because it doesn't work for undefined batch size
    shape = list(sess.run(tf.shape(tensor2), feed_dict))
    if shape:
        # split tensor2 along first dimension and recur
        # int trick from https://github.com/tensorflow/tensorflow/issues/7754
        tensor2_split = tf.split(axis = 0, num_or_size_splits = int(shape[0]), value = tensor2)
        grad_split = [tf_jacobian(tf.squeeze(M, squeeze_dims = 0), tensor1, feed_dict, sess) for M in tensor2_split]
        return tf.stack(grad_split)
    else:
        # calculate gradient of scalar
        grad = tf.gradients(tensor2, tensor1)
        if grad[0] is not None:
            return tf.squeeze(grad, squeeze_dims = [0])
        else:
            # replace any undefined gradients with zeros
            return tf.zeros_like(tensor1)

And here’s one for batched tensors:

def batch_tf_jacobian(tensor2, tensor1, feed_dict, sess = None):
    """
    Computes the matrix d(tensor2)/d(tensor1) recursively.
    Tensorflow doesn't really have its own Jacobian operator
    (tf.gradients sums over all dims of tensor2).

    :param tensor2: numerator of Jacobian, first dimension is batch
    :param tensor1: denominator of Jacobian, first dimension is batch
    :param feed_dict: input data (need this if tensors are not statically known)
    :return: batch Jacobian tensor
    """
    # look the session up at call time; a default argument is evaluated only once
    if sess is None:
        sess = tf.get_default_session()
    shape2 = list(sess.run(tf.shape(tensor2), feed_dict))
    shape1 = list(sess.run(tf.shape(tensor1), feed_dict))

    jacobian = tf_jacobian(tensor2, tensor1, feed_dict, sess)
    batch_size = shape2[0]

    # keep only the blocks where output example i is differentiated by input example i
    batch_jacobian = [
      tf.slice(
          jacobian,
          [i] + [0]*(len(shape2)-1) + [i] + [0]*(len(shape1)-1),
          [1] + [-1]*(len(shape2)-1) + [1] + [-1]*(len(shape1)-1))
      for i in range(batch_size)]
    # drop the two singleton batch axes left behind by the slice
    batch_jacobian = [
      tf.squeeze(tensor, squeeze_dims = (0, len(shape2)))
      for tensor in batch_jacobian]
    batch_jacobian = tf.stack(batch_jacobian)
    return batch_jacobian
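
When checking graph-based Jacobians like the ones above, I find it useful to have an independent reference. The following is a plain-Python finite-difference sketch, not part of jjough's code; the function name and the toy function are assumptions for illustration. It also shows the point made in the docstring: summing the Jacobian over the output dimension gives the gradient of the summed output, which is a different object.

```python
def numeric_jacobian(f, x, eps=1e-6):
    """Central-difference Jacobian J[i][j] = d f(x)[i] / d x[j] for list-valued f."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi), f(lo)
        for i in range(n_out):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return jac


def f(x):
    # Toy function: f(x) = [x0 * x1, x0 + 3 * x1]
    return [x[0] * x[1], x[0] + 3 * x[1]]


J = numeric_jacobian(f, [2.0, 5.0])      # close to [[5, 2], [1, 3]]
summed = [sum(col) for col in zip(*J)]   # close to [6, 5], the gradient of sum(f)
```

Finite differences are only accurate to a few digits, so compare against them with a tolerance rather than exactly.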

13.1 Manage tensorflow environments

Tensorflow+pip+conda. Or see python packaging more generally.

13.2 Optimisation tricks

Using traditional/experimental optimisers rather than SGD-type ones.

Simplify distributed training using Horovod.

14 Probabilistic networks

Tensorflow probability for probabilistic programming. Probably other ones too, but I have not done much probabilistic programming in TF. I have done other probabilistic programming in Pytorch.