Semi/weakly-supervised learning
On extracting nutrition from bullshit
July 24, 2017 — July 5, 2021
I’m not yet sure what this is, but I’ve seen these words invoked in machine learning problems with a partially observed model, where you hope to simultaneously learn the parameters of the label-generation process and of the observation process. So if I have a bunch of crowd-sourced labels for my data and wish to use them to train a classifier, but I suspect that my crowd is a little unreliable, then I am doing “weakly supervised” learning when I learn both the true labels and the crowd-whimsy process, as a kind of hierarchical model of informative sampling. Alternatively, I might assume no explicit model for the crowd whimsy, but simply that similar data should not be labelled too differently, a.k.a. Label Propagation, which uses graph clustering to infer labels for the unlabelled data.
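For the label-propagation flavour, scikit-learn ships an implementation that makes the idea concrete. Here is a minimal sketch on synthetic data (the dataset and the handful of “known” labels are made up purely for illustration): unlabelled points are flagged with -1 and labels diffuse over a similarity graph.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelPropagation

# Synthetic data: 200 points, but pretend only 10 labels were ever observed.
X, y_true = make_moons(n_samples=200, noise=0.1, random_state=0)
y = np.full_like(y_true, -1)  # -1 marks "unlabelled" for scikit-learn
known = np.random.RandomState(0).choice(len(y), size=10, replace=False)
y[known] = y_true[known]

# Labels diffuse over an RBF similarity graph until they stabilise.
model = LabelPropagation(kernel="rbf", gamma=20)
model.fit(X, y)

# transduction_ holds the inferred label for every point, labelled or not.
print("transductive accuracy:", (model.transduction_ == y_true).mean())
```

scikit-learn’s LabelSpreading is a closely related variant with a clamping parameter that is often more robust to label noise.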
Other methods?
1 Self supervision
2 Data augmentation
Here are some tools; a usage sketch follows the list.
- facebookresearch/AugLy: A data augmentations library for audio, image, text, and video.
- AugLy: A new data augmentation library to help build more robust AI models
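For concreteness, this is roughly how AugLy’s image API composes augmentations, going by my reading of its README; the exact class names and keyword arguments may have drifted, and the input path `cat.png` is a placeholder, so treat this as a sketch rather than gospel.

```python
import augly.image as imaugs

# Functional interface: accepts an image path or a PIL Image and
# returns the augmented PIL Image.
aug_image = imaugs.overlay_emoji("cat.png", opacity=0.5, emoji_size=0.3)

# Class-based interface: compose augmentations, torchvision-style.
transform = imaugs.Compose([
    imaugs.Blur(),
    imaugs.ColorJitter(),
    imaugs.OneOf([imaugs.OverlayText(), imaugs.OverlayEmoji()]),
])
aug_image = transform(aug_image)
```

The library also provides analogous modules for audio, text, and video.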
3 Snorkel

Snorkel is a system for rapidly creating, modelling, and managing training data, currently focused on accelerating the development of structured or “dark” data extraction applications for domains in which large labelled training sets are not available or easy to obtain.
Today’s state-of-the-art machine learning models require massive labelled training sets — which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labelling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process—learning, essentially, which labelling functions are more accurate than others—and then uses this to train an end model (for example, a deep neural network in TensorFlow).
Surprisingly, by modelling a noisy training set creation process in this way, we can take potentially low-quality labelling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems.
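To make the data-programming idea concrete, here is a minimal sketch in the style of the Snorkel (v0.9-era) tutorials. The toy comments, the `text` column name, and the two heuristics are mine, and the API details are from memory, so check the current docs before copying.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

# Labelling functions: cheap, noisy heuristics written by the developer.
@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_comment(x):
    # Very short comments are weak evidence of ham.
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "check out http://spam.example", "great video",
    "buy now http://x.example", "thanks for this",
]})

# Apply every labelling function to every row -> an (n_rows, n_lfs) label matrix.
applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_comment])
L_train = applier.apply(df=df_train)

# The generative label model estimates the accuracies of the labelling
# functions and emits probabilistic training labels for an end model.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=42)
print(label_model.predict_proba(L_train))
```

The probabilistic labels from `predict_proba` are then used as (noise-aware) training targets for whatever discriminative end model you like.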