The Algorithms and Design
This doc provides more details on the algorithms and design of Masterful.
Where the API Fits
Masterful provides a new API, built on top of Keras (PyTorch coming soon), that focuses on an ML developer’s twin goals of training: maximum speed and maximum accuracy. This solves a common source of confusion when working with deep learning frameworks: they are primarily designed to make it easy to build complex architectures. This was appropriate when advancements in architectures drove most of the state-of-the-art improvements, but today data and training are far more relevant. For example, consider regularization. Using Keras directly, regularization might occur at the tf.data.Dataset object via map calls to image transforms; within the tf.keras.Model via dropout layers or kernel regularizers; or at the optimizer level via tfa.SGDW’s decoupled weight decay. By contrast, in the Masterful API, regularization is treated as a logical grouping.
Masterful: API that metalearns training and regularization policies (and some drop-in architectural choices). Built on Keras.
Keras / PyTorch Lightning: API that simplifies model architecture via deep neural network primitives like convolutions, rather than TensorFlow’s scientific computing primitives.
TensorFlow / PyTorch: API that simplifies the creation and compilation of vectorized scientific computing.
API that allows low level access to GPUs for scientific computing rather than computer graphics.
Underlying hardware to perform vectorized matrix math, useful for both computer graphics and neural networks.
Architecture is the structure of weights, biases, and activations that define a model. For example, a perceptron or logistic regression model’s architecture is a single multiplier weight per input, followed by a sum, followed by the addition of one bias weight, followed by a sigmoid activation. In computer vision, most practitioners are familiar with the basic ideas behind AlexNet, VGG, Inception, Resnet, EfficientNets. MobileNets, and Vision Transformers, as well as different heads for detection and classification like YOLO, SSD, U-Net, and Mask R-CNN. Architectures have arguably been the main source of excitement in deep learning.
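As a minimal sketch of the logistic regression architecture described above, the following plain-Python function applies one weight per input, a sum, one bias, and a sigmoid. The function name and the numeric values are illustrative assumptions, not part of Masterful.

```python
import math

def logistic_unit(inputs, weights, bias):
    """One weight per input, a sum, one bias, then a sigmoid activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(weighted_sum + bias)))

# Illustrative values only.
prediction = logistic_unit(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(prediction)  # 0.5, because the weighted sum plus bias is exactly 0
```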
Masterful treats model architecture as an input to the platform. However, Masterful controls some of a model’s training-specific layers and attributes: dropout, residual layers for stochastic depth, kernel regularization, and momentum of batch norms.
Masterful includes Knowledge Distillation. Training a small model directly is generally inferior to training a larger model and then distilling its knowledge into the smaller model’s architecture. Surprisingly, the smaller distilled model retains most of the improvements of the larger model.
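To make the idea of distillation concrete, here is a hedged sketch of the soft-target loss in the style of Hinton et al. (listed in the references): the student is trained to match the teacher's temperature-softened class probabilities. This is not Masterful's implementation; the function names, logits, and the temperature of 4.0 are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=4.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# Hypothetical logits for a three-class problem.
teacher = [6.0, 2.0, -1.0]
loss_match = distillation_loss(teacher, [6.0, 2.0, -1.0])  # student agrees with teacher
loss_poor = distillation_loss(teacher, [-1.0, 2.0, 6.0])   # student disagrees
```

The loss is lowest when the student reproduces the teacher's distribution, so `loss_match` is smaller than `loss_poor`.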
Semi-Supervised Learning (SSL)
SSL means learning from both labeled and unlabeled data. In computer vision, it is the ability of a model to extract information from both labeled and unlabeled images.
Masterful trains your model using unlabeled data through SSL techniques. Masterful’s approach draws from two of the three major lineages of SSL: feature reconstruction and contrastive learning of representations. (Masterful does not currently include techniques from the third major lineage: generative techniques, also known as image reconstruction.) Three state-of-the-art papers broadly define the techniques included in Masterful: Noisy Student Training, SimCLR, and Barlow Twins.
The central challenge of productizing SSL into Masterful’s platform is that research is narrow and fragile. Defining a narrow problem like “classification on ImageNet on ResNet50” allows many parameters to get baked in. Masterful generalizes the basic concepts from SOTA research to additional tasks like detection and segmentation, arbitrary data domains like overhead geospatial, and additional model types.
Regularization means helping a model generalize to data it has not yet seen. Put another way, regularization is about fighting overfitting.
As a thought experiment, it is actually quite easy to achieve 100% accuracy (or mAP or other goodness of fit measure) on training data: just memorize a lookup table. But that would be an extreme example of overfitting: such a model would have absolutely zero ability to generalize to data that the model has not seen.
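The thought experiment can be written out in a few lines. The toy dataset and function name below are made up for illustration: the "model" is a lookup table that scores perfectly on training data and has nothing to say about an unseen input.

```python
# Toy training set: feature tuple -> label (illustrative data only).
train_set = {(0, 0): "cat", (0, 1): "dog", (1, 0): "dog"}

def lookup_model(features):
    # Returns the memorized label, or None for anything never seen in training.
    return train_set.get(features)

train_accuracy = sum(lookup_model(x) == y for x, y in train_set.items()) / len(train_set)
unseen_prediction = lookup_model((1, 1))

print(train_accuracy)     # 1.0
print(unseen_prediction)  # None
```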
Within the regularization bucket, Masterful includes several categories of regularization. The central challenge is not the regularization technique itself, but rather, how to learn the optimal hyperparameters or policy of each technique, given that regularization techniques do not operate independently.
Several regularization techniques are implemented in code that touches architecture. For that reason, they are often mistaken for architecture. The key question to distinguish between architecture and regularization is to ask, “is it used at inference?” If the answer is yes, it is architecture. Otherwise, it is regularization.
An early and common technique for regularization is dropout. The original authors hypothesized that dropout approximates a form of Bayesian model averaging. In practice, the optimal intensity of dropout is not theoretically grounded or predictable; however, empirical studies indicate that it depends on attributes of the dataset. (Forthcoming: Masterful determines the intensity of dropout according to an adaptive algorithm during training.)
Decaying a model’s kernel weights brings a prior into a model’s weights. It is used in every state-of-the-art architecture, but it is typically implemented incompatibly with modern optimizers, leading to the poor performance of Adam in practice. Masterful includes a correct implementation and (forthcoming) meta-learns the optimal amount of weight decay.
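The mechanics of dropout, and why it counts as regularization rather than architecture, can be sketched in plain Python. This assumes the common "inverted dropout" formulation, where surviving activations are rescaled during training so that nothing needs to change at inference; the function name and the rate of 0.2 are illustrative.

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero each activation with probability `rate` during
    training and rescale survivors; act as the identity at inference."""
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, training=True, rng=rng)
at_inference = dropout(activations, rate=0.2, training=False, rng=rng)
```

Because of the rescaling, the mean of `dropped` stays close to the mean of the original activations, while `at_inference` is returned unchanged.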
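One way to see the incompatibility is to compare a single Adam step with the decay folded into the gradient (classic L2 regularization) against the same step with decoupled decay, as in AdamW-style optimizers. The sketch below handles one scalar weight at the first step only; the function name and all numeric values are assumptions for illustration, not Masterful's implementation.

```python
import math

def adam_step(weight, grad, lr=0.1, weight_decay=0.0, decoupled=False,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step (t=1) for one scalar weight, with zero-initialized moments."""
    if not decoupled:
        grad = grad + weight_decay * weight            # L2 penalty folded into the gradient
    m_hat = ((1 - beta1) * grad) / (1 - beta1)         # bias-corrected first moment
    v_hat = ((1 - beta2) * grad * grad) / (1 - beta2)  # bias-corrected second moment
    new_weight = weight - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_weight -= lr * weight_decay * weight       # decay applied outside the adaptive scaling
    return new_weight

no_decay = adam_step(10.0, 1.0)
coupled = adam_step(10.0, 1.0, weight_decay=0.1, decoupled=False)
decoupled = adam_step(10.0, 1.0, weight_decay=0.1, decoupled=True)
```

In this example the coupled variant lands almost exactly where the undecayed step does (about 9.9), because Adam's normalization rescales the combined gradient and washes out the penalty, whereas the decoupled variant shrinks the weight further (about 9.8).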
A simple way to improve the accuracy of a model is to train N of them and average their predictions. This is called ensembling and touches on ideas of Bayesian optimization. Put another way, divide a model into N subsets and progressively train each subset separately. From this perspective, the approach resembles dropout. Ensembling is generally superior to first constructing a complex model by repeating a base architecture N times and training it once. Masterful includes an API to easily conduct ensembling.
Residual networks include skip connections, which create two possible paths through a model. Stochastic Depth forces the model to switch between these two paths. It is a form of dropout and, like dropout, can be interpreted as a form of Bayesian approximation.
Transforming a dataset’s images is a form of regularization. Data augmentations encompass three major forms of adjusting pixel values: moving pixels according to geometric rules like zooming or rotating (spatial augmentations), changing the value of a pixel slightly like brightness or saturation (color jitter), and various forms of blurring. But because it relies on transforms solely in pixel space, data augmentation can only indirectly affect the intermediate feature maps of a model. Techniques include color, brightness, hue, contrast, solarize, posterize, equalization, blur, mirror, translation, rotation, and shear. Masterful’s transformations are correctly adapted to also operate on detection and segmentation.
A new generation of techniques regularizes models by treating the labels as a probability distribution (aka soft label) instead of a one-hot label. The resulting data is clearly out of distribution, and yet these techniques sometimes help regularize a model. These include cutmix, mixup, and label-smoothing regularization (LSR). LSR is notable as a regularization technique that can be viewed from three perspectives: a label regularization, a form of Bayesian maximum a posteriori (MAP) estimation, and a form of ensembling with a perfectly calibrated prior distribution.
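The averaging step of ensembling is simple enough to sketch directly. This is not Masterful's ensembling API; the function name and the per-member probabilities are hypothetical.

```python
def ensemble_predict(member_predictions):
    """Average class probabilities across N independently trained models."""
    n_members = len(member_predictions)
    n_classes = len(member_predictions[0])
    return [sum(p[c] for p in member_predictions) / n_members for c in range(n_classes)]

# Hypothetical class probabilities from three models for one image.
member_predictions = [
    [0.6, 0.4],
    [0.3, 0.7],
    [0.7, 0.3],
]
averaged = ensemble_predict(member_predictions)
predicted_class = max(range(len(averaged)), key=averaged.__getitem__)
```

Here the second model disagrees with the other two, but the averaged prediction still favors class 0.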
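A minimal sketch of Stochastic Depth on a single residual block, assuming the common formulation in which the residual branch is dropped at random during training and scaled by its survival probability at inference. The names and the stand-in `double` transform are illustrative.

```python
import random

def residual_block(x, transform, survival_prob, training, rng):
    """Residual block with stochastic depth on the transform branch."""
    if training:
        if rng.random() < survival_prob:
            return x + transform(x)   # full path: skip connection plus transform
        return x                      # skip path only: the transform is dropped
    # At inference the branch is always used, scaled by its survival probability.
    return x + survival_prob * transform(x)

def double(value):
    # Stand-in for the block's convolutional branch.
    return 2.0 * value

rng = random.Random(0)
training_outputs = [residual_block(1.0, double, 0.5, True, rng) for _ in range(20)]
inference_output = residual_block(1.0, double, 0.5, False, rng)
```

During training each call returns either 1.0 (skip path only) or 3.0 (both paths); at inference the output is deterministic.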
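To illustrate what adapting a transform to detection involves, the sketch below mirrors an image horizontally and updates its bounding boxes to match. It is not Masterful's implementation; it assumes boxes are given as `(x_min, y_min, x_max, y_max)` pixel-edge coordinates, and the function name is hypothetical.

```python
def mirror_image_and_boxes(image, boxes):
    """Horizontally mirror an image (a list of pixel rows) and its boxes.

    Boxes are (x_min, y_min, x_max, y_max) in pixel-edge coordinates, so a
    box covering only the first column of the image is (0, y_min, 1, y_max).
    """
    width = len(image[0])
    flipped_image = [list(reversed(row)) for row in image]
    flipped_boxes = [(width - x_max, y_min, width - x_min, y_max)
                     for (x_min, y_min, x_max, y_max) in boxes]
    return flipped_image, flipped_boxes

image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
boxes = [(0, 0, 1, 2)]  # covers the leftmost column
flipped_image, flipped_boxes = mirror_image_and_boxes(image, boxes)
```

After the flip, the box covers the rightmost column, which is where the pixels it originally enclosed now live. Applying the transform to the image alone would silently leave the labels pointing at the wrong region.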
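Two of these soft-label techniques can be sketched in a few lines each. The function names, the smoothing factor of 0.1, and the fixed mixing weight are illustrative assumptions; in mixup the mixing weight is normally sampled from a Beta distribution rather than fixed.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Label smoothing: move `epsilon` of the probability mass to a uniform prior."""
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]

def mixup(x_a, y_a, x_b, y_b, lam):
    """Mixup: blend two examples and their labels with the same weight `lam`."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

smoothed = smooth_labels([0.0, 1.0, 0.0])
mixed_x, mixed_y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.75)
```

In both cases the resulting label is still a valid probability distribution, just no longer one-hot.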
Optimization means finding the best weights for a model and training data. Optimization is different from regularization because optimization does not consider generalization to unseen data. The goal of optimization is speed: finding the best weights faster.
Plain old stochastic gradient descent with a very low learning rate is sufficient to find the best weights. But pure SGD is far slower than modern optimizers. Nearly every innovation in optimizers, including momentum, RMSProp, Adam, LARS, and LAMB, is essentially about getting the best weights faster by calculating weight updates with not only the current gradient, but also statistical information about past gradients.
Optimization on production datasets is very different from optimization of huge, high-entropy datasets like ImageNet.
Masterful applies multiple techniques to minimize wall-clock time and GPU hours, including pushing nearly every data augmentation technique to the GPU using Masterful’s purpose-built transformations; Masterful’s custom training loop and scaffold model concept; and metalearning of optimal batch size, learning rate schedule, and epochs for your hardware.
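A toy comparison shows how using past gradients speeds things up. The sketch below runs plain gradient descent and classical momentum on the one-dimensional problem f(w) = 0.5 * w**2; the function name, learning rate, momentum coefficient, and step count are all illustrative choices.

```python
def minimize(lr, momentum, steps, start=1.0):
    """Gradient descent on f(w) = 0.5 * w**2, whose gradient is simply w."""
    w, velocity = start, 0.0
    for _ in range(steps):
        grad = w
        # The velocity is a running summary of past gradients.
        velocity = momentum * velocity - lr * grad
        w += velocity
    return w

plain_w = minimize(lr=0.01, momentum=0.0, steps=100)
momentum_w = minimize(lr=0.01, momentum=0.9, steps=100)
```

With the same learning rate and step budget, `momentum_w` ends up far closer to the optimum at 0 than `plain_w` does.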
Taking Advantage of GPU Hardware
Tensorflow 2 and PyTorch Lightning offer simple-to-use APIs to logically group a set of GPUs into a single logical GPU using mirror strategies. However, taking advantage of a large logical GPU is non-trivial. A larger logical GPU only allows larger batch sizes, and there are information-theoretic limits to the usefulness of larger batch sizes. Even when batch sizes are scalable, a high learning rate is required to take advantage of large batch sizes. Masterful brings an information-theoretic approach to meta-learning these parameters, as well as the learning rate schedule, to ensure expensive GPU hardware is fully utilized.
Breaking the CPU Bottleneck
Augmentation implementations are typically based on Keras Image Preprocessing, cv2, or Pillow, meaning the operations run on the CPU, creating a bottleneck.
To greatly improve the speed of the conventional augmentation pipeline during training, Masterful pushes augmentation operations to the GPU. Internally, this requires a ground-up implementation of every image augmentation in pure Tensorflow. Pillow and cv2 are not used. Many core design problems of those libraries are also resolved, such as eliminating non-convex combinations of magnitudes. Then, a unique scaffold model approach is applied, whereby the base model is wrapped by custom Keras layers and trained with a custom training loop.
Meta-learning Optimal Hyperparameters
The hyperparameters controlling each regularization, semi-supervised learning, and optimization method are usually set using a heuristic, such as mirroring data 50% of the time. While such a heuristic is appropriate for ImageNet, mirroring street signs could literally be fatal. An alternative is meta-learning, but existing approaches in this space are generally impractical.
For example, one state-of-the-art metalearning approach, specifically for data augmentation, is AutoAugment. If a model takes 1 hour to converge, AutoAugment would require 625 days, rendering it unusable in practice.
Black box optimization (such as Bayesian optimization, Reinforcement Learning, or randomly sampled grid search) treats all hyperparameters as fungible dimensions in a search space. The way to search that space is simply to train a model to completion to get signal on the value of a given hyperparameter. While effective, black box approaches generally require many orders of magnitude more full training runs, whereas Masterful’s metalearning requires less than an order of magnitude more compute than a single full training run.
Masterful’s techniques are built to work with Masterful’s purpose-built meta-learner, relieving the developer of manual guessing and checking of hyperparameters. Masterful’s meta-learner applies individual algorithms that are aware of the technique, data, and model. The result is more accurate and faster meta-learning.
Masterful’s novel contribution to meta-learning is a meta-learner that is more accurate, more robust to different domains of data, and performant. Under the hood, each hyperparameter is grouped into a logical group and different metalearners explore each group using an appropriate algorithm.
Masterful’s meta-learner for regularization draws on the concepts from AutoAugment, Frechet Inception Distance, and adversarial learning. The result is a two-pass metalearning algorithm that can analyze two orders of magnitude of search space in very little wall-clock time because the first-pass analysis only requires inference to cluster transformations. The second-pass search, which requires full training runs, is then reduced to analyzing a single-digit number of clusters through a beam-search-based algorithm. The final result is a metalearning algorithm that would run in 2 hours, if a model takes 1 hour to converge.
Masterful’s meta-learner for optimization generally draws from information-theoretic analysis of the data and model, and empirical analysis of hardware performance.
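The two wall-clock figures quoted in this section (625 days for AutoAugment and 2 hours for Masterful, both for a model that takes 1 hour to converge) can be put on the same scale with a little arithmetic. The variable names are illustrative, and the calculation assumes the time is spent sequentially on the same hardware.

```python
HOURS_PER_DAY = 24
hours_per_training_run = 1   # convergence time assumed in the text

autoaugment_hours = 625 * HOURS_PER_DAY
masterful_hours = 2

autoaugment_run_equivalents = autoaugment_hours / hours_per_training_run
masterful_run_equivalents = masterful_hours / hours_per_training_run
speedup = autoaugment_hours / masterful_hours

print(autoaugment_run_equivalents, masterful_run_equivalents, speedup)  # 15000.0 2.0 7500.0
```

In other words, the AutoAugment figure corresponds to roughly 15,000 full training runs, against the equivalent of 2 for the two-pass approach.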
Andrew Ng, “MLOps: From Model-centric to Data-centric AI”, (2021). https://www.deeplearning.ai/wp-content/uploads/2021/06/MLOps-From-Model-centric-to-Data-centric-AI.pdf
Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, and Barret Zoph. “Revisiting resnets: Improved training and scaling strategies.” arXiv preprint arXiv:2103.07579 (2021).
Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. “Geometry of optimization and implicit regularization in deep learning.” arXiv preprint arXiv:1705.03071 (2017).
Alex Hernández-García and Peter König. “Data augmentation instead of explicit regularization.” arXiv preprint arXiv:1806.03852 (2018).
Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. “Self-training with noisy student improves imagenet classification.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687-10698. 2020.
Suman Ravuri, and Oriol Vinyals. “Seeing is not necessarily believing: Limitations of biggans for data augmentation.” (2019).
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. “A simple framework for contrastive learning of visual representations.” In International conference on machine learning, pp. 1597-1607. PMLR, 2020.
Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. “Autoaugment: Learning augmentation policies from data.” arXiv preprint arXiv:1805.09501 (2018).
Lars Kai Hansen and Peter Salamon. “Neural network ensembles.” IEEE transactions on pattern analysis and machine intelligence 12, no. 10 (1990): 993-1001.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
Zhilu Zhang and Mert R. Sabuncu. “Self-distillation as instance-specific label smoothing.” arXiv preprint arXiv:2006.05065 (2020).
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. “Gans trained by a two time-scale update rule converge to a local nash equilibrium.” Advances in neural information processing systems 30 (2017).
Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. “mixup: Beyond empirical risk minimization.” arXiv preprint arXiv:1710.09412 (2017).
Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. “Cutmix: Regularization strategy to train strong classifiers with localizable features.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032. 2019.
Spyros Gidaris and Andrei Bursuc. “Teacher-student feature prediction approaches.” (2021). https://gidariss.github.io/self-supervised-learning-cvpr2021/slides/teacher_student.pdf