Intro to Semi-Supervised Learning

In a previous blog post, we showed that throwing more training data at a deep learning model has rapidly diminishing returns. If doubling your labeling budget won’t move the needle, what next? Consider semi-supervised learning (SSL) to unlock the information in unlabeled data.

What is Semi-Supervised Learning?

Semi-supervised learning (SSL) is a machine learning approach that combines labeled and unlabeled data during training. Traditionally, SSL is viewed as falling between unsupervised learning (no labeled training data) and supervised learning (no unlabeled training data).

However, you can also view semi-supervised learning as a superset of both: unsupervised learning is simply SSL with no labeled training data, and supervised learning is SSL with no unlabeled training data.
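
To make the distinction concrete, here is a small, self-contained TensorFlow sketch (illustrative only, not tied to Masterful) showing what labeled and unlabeled training data look like as tf.data datasets:

```python
# Illustrative sketch: labeled examples are (image, label) pairs,
# unlabeled examples are images alone.
import numpy as np
import tensorflow as tf

# 100 labeled images with class ids, e.g. from your annotation pipeline.
labeled = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(100, 32, 32, 3).astype("float32"),
     np.random.randint(0, 10, size=100)))

# 10,000 unlabeled images, e.g. raw frames you never sent for labeling.
unlabeled = tf.data.Dataset.from_tensor_slices(
    np.random.rand(10_000, 32, 32, 3).astype("float32"))

# Supervised training uses only `labeled`; SSL uses both.
```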

Semi-Supervised Learning in Masterful

Masterful offers two ways to apply SSL algorithms to improve your model's accuracy.

First, the training loop in the Masterful CLI Trainer, which is built on top of the Masterful Python Training API, implements both supervised training and an SSL technique based on Noisy Student Training to automatically improve your model using unlabeled data. The power of this technique is demonstrated in our blog post comparing Masterful to Google Vertex on the FGVC-Aircraft dataset. Learn more about this algorithm in the Guide to Training with Unlabeled Data.
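
For intuition, here is a minimal sketch of the Noisy Student idea in TensorFlow: a trained teacher pseudo-labels the unlabeled data, low-confidence predictions are discarded, and a noised student is trained on the combined data. This illustrates the general technique under our own assumptions (helper names, a 0.9 confidence threshold, flip augmentation as the noise), not Masterful's implementation:

```python
# Conceptual sketch of Noisy Student-style training, not Masterful's implementation.
import tensorflow as tf

def pseudo_label(teacher: tf.keras.Model,
                 unlabeled: tf.data.Dataset,
                 threshold: float = 0.9) -> tf.data.Dataset:
    """Pseudo-label unlabeled images, keeping only confident teacher predictions."""
    def predict(images):
        # Teacher outputs per-class probabilities for a batch of images.
        return images, teacher(images, training=False)

    def confident(image, probs):
        return tf.reduce_max(probs) >= threshold

    def to_label(image, probs):
        # Convert the probability vector to a hard pseudo-label.
        return image, tf.argmax(probs, axis=-1)

    return (unlabeled.batch(256)
            .map(predict)
            .unbatch()
            .filter(confident)
            .map(to_label))

def train_student(student: tf.keras.Model,
                  labeled: tf.data.Dataset,
                  pseudo_labeled: tf.data.Dataset,
                  epochs: int = 10) -> tf.keras.Model:
    """Train a noised student on labeled plus pseudo-labeled data."""
    def noise(image, label):
        # Input noise via simple augmentation; model noise (dropout, stochastic
        # depth) would live inside the student architecture itself.
        return tf.image.random_flip_left_right(image), label

    # Note: element shapes and dtypes must match for concatenate to succeed.
    combined = (labeled.concatenate(pseudo_labeled)
                .map(noise)
                .shuffle(10_000)
                .batch(64))
    student.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
    student.fit(combined, epochs=epochs)
    return student
```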

Second, if you want to keep your existing training loop and regularization scheme, you can generate Automatic Labels by following the SSL via Automatic Labeling guide. This is the quickest way to get started with SSL. Use the helper function masterful.ssl.analyze_data_then_save_to to analyze your datasets and save the analysis to disk, then call masterful.ssl.load_from to get back a tf.data.Dataset object you can pass to your training loop.
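
As a rough usage sketch, the workflow might look like the following. The two function names come from the guide above, but the keyword arguments, path, and surrounding code are illustrative assumptions; consult the SSL via Automatic Labeling guide for the exact signatures.

```python
# Hypothetical usage sketch only: argument names and the path below are
# assumptions for illustration, not the documented signatures.
import masterful

labeled = ...    # your labeled tf.data.Dataset of (image, label) pairs
unlabeled = ...  # your unlabeled tf.data.Dataset of images
model = ...      # your existing tf.keras.Model

# Step 1: analyze the labeled and unlabeled datasets and cache the result on disk.
masterful.ssl.analyze_data_then_save_to(
    labeled=labeled,                      # assumed argument name
    unlabeled=unlabeled,                  # assumed argument name
    path="/tmp/masterful_ssl_analysis",   # assumed argument name
)

# Step 2: load the automatically labeled data back as a tf.data.Dataset and
# feed it to your existing training loop alongside your labeled data.
auto_labeled = masterful.ssl.load_from("/tmp/masterful_ssl_analysis")
model.fit(labeled.concatenate(auto_labeled).batch(64), epochs=10)
```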