Masterful CLI Trainer: YAML Config File
Introduction
The Masterful CLI Trainer configuration file is a low-code interface to the Masterful Training Platform. It allows you to focus on the high-level business constraints and lets Masterful handle the details of training your model given those constraints.
The configuration file is specified in YAML format, for ease of use and human readability.
The following sections will go into details on the contents of the file. For reference, below is a non-annotated, condensed version of the whole configuration file (this example is for a CIFAR10 dataset) to give you a sense of what is configurable inside:
dataset:
  root_path: s3://masterful-public/datasets/cifar10
  splits: [train, test]
  label_map: label_map
  optimize: True
model:
  architecture: efficientnetb0_v1_small
  num_classes: 10
  input_shape: [32,32,3]
training:
  task: classification
  training_split: train
output:
  formats: [saved_model, onnx]
  path: ~/model_output
evaluation:
  split: test
What is YAML
YAML is a human-friendly data serialization language for all programming languages. It is a common configuration file format used in a number of different research repositories and projects. Masterful uses YAML in its configuration file for its simplicity and flexibility in specifying basic constructs. The YAML wiki has a good overview on the format and pointers to different references and tutorials.
Configuration Sections
The Masterful YAML file has 5 top-level sections: `dataset`, `model`, `training`, `evaluation`, and `output`. Each section is described in more detail below.
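As a quick sanity check before launching a run, the section list above can be compared against a parsed configuration. The following is a minimal sketch, not part of the Masterful CLI: the helper name `missing_sections` is hypothetical, the configuration is assumed to have already been parsed into a Python dict by a YAML parser, and it assumes all five sections are expected to be present.

```python
# The five top-level sections named in this guide.
REQUIRED_SECTIONS = ("dataset", "model", "training", "evaluation", "output")


def missing_sections(config: dict) -> list:
    """Return the top-level sections that are absent from a parsed config.

    Hypothetical helper for illustration; assumes every section is required.
    """
    return [name for name in REQUIRED_SECTIONS if name not in config]


# A dict shaped like the condensed CIFAR10 example, as a YAML parser would produce it.
config = {
    "dataset": {"root_path": "s3://masterful-public/datasets/cifar10",
                "splits": ["train", "test"]},
    "model": {"architecture": "efficientnetb0_v1_small", "num_classes": 10},
    "training": {"task": "classification", "training_split": "train"},
    "output": {"formats": ["saved_model", "onnx"], "path": "~/model_output"},
    "evaluation": {"split": "test"},
}

print(missing_sections(config))  # prints []
```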
Dataset
The `dataset` section defines the data used in your project. Specifically, it defines where the images and labels for your data reside. The data must already be in Masterful format (see Dataset Format for more details).
The following is a complete example of the `dataset` section for the Oxford Flowers dataset, which has been converted to Masterful format and placed in the public S3 bucket `s3://masterful-public/datasets/flowers_nilsback_zisserman`.
#
# Dataset Specification
#
dataset:
  # The root directory of the dataset. This can be a local
  # file system directory, an S3 bucket, or a GCP bucket.
  root_path: s3://masterful-public/datasets/flowers_nilsback_zisserman

  # The names of the splits to use in training. Each name
  # points to a CSV file of the same name in "root_path",
  # such as <split name>.csv. Each split can be referenced below
  # in the "training" and "evaluation" sections. Splits defined
  # here can be either labeled or unlabeled.
  splits: [training, validation, test]

  # OPTIONAL: The name of the label map file. The label map file
  # is a CSV file where the first entry in each row is the integer
  # label and the second entry is the human-readable string class name.
  # The label map file is used to replace the class ids in the
  # evaluation metrics, for easier reading. If it does not exist,
  # then the class ids will be used. The label map file must end
  # in ".csv" and be located at '<root_path>/<label_map>.csv'
  label_map: label_map

  # OPTIONAL: True if we should save an optimized version of the dataset
  # locally, False otherwise. Optimizing the dataset locally
  # adds a small, one-time dataset processing cost in order
  # to convert the raw dataset into the optimized version, but doing
  # so will significantly improve training times. The optimization
  # conversion only happens once. Subsequent training runs
  # will use the optimized version of the dataset.
  optimize: True

  # OPTIONAL: By default, optimized datasets are stored in the
  # ~/.masterful/datasets directory. Set the 'cache_path' value
  # below in order to change where the optimized datasets are stored
  # locally.
  cache_path: ~/masterful_datasets
Formally, the `dataset` section contains the following attributes:

| Attribute | Optional? | Type | Description |
|---|---|---|---|
| `root_path` | N | String | The root directory of the dataset. This can be a local file system directory, an S3 bucket, or a GCP bucket. |
| `splits` | N | List | The names of the splits to use in training. Each name points to a CSV file of the same name in `root_path`, such as `<split name>.csv`. Each split can be referenced in the `training` and `evaluation` sections. Splits defined here can be either labeled or unlabeled. |
| `label_map` | Y | String | The name of the label map file. The label map file is a CSV file where the first entry in each row is the integer label and the second entry is the human-readable string class name. The label map file is used to replace the class ids in the evaluation metrics, for easier reading. If it does not exist, then the class ids will be used. The label map file must end in `.csv` and be located at `<root_path>/<label_map>.csv`. |
| `optimize` | Y | Boolean | True if we should save an optimized version of the dataset locally, False otherwise. Optimizing the dataset locally adds a small, one-time dataset processing cost in order to convert the raw dataset into the optimized version, but doing so will significantly improve training times. The optimization conversion only happens once. Subsequent training runs will use the optimized version of the dataset. |
| `cache_path` | Y | String | By default, optimized datasets are stored in the `~/.masterful/datasets` directory. Set `cache_path` to change where the optimized datasets are stored locally. |
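The file locations implied by `root_path`, `splits`, and `label_map` can be worked out by hand. The sketch below is illustrative only: `dataset_csv_paths` is a hypothetical helper, not a Masterful API, and it simply applies the `<root_path>/<split name>.csv` and `<root_path>/<label_map>.csv` conventions described above to a parsed `dataset` section.

```python
def dataset_csv_paths(dataset: dict):
    """Return ({split name: CSV location}, label map location or None).

    Hypothetical helper: applies the naming conventions from this guide
    to a `dataset` section that has been parsed into a dict.
    """
    root = dataset["root_path"].rstrip("/")
    split_paths = {name: f"{root}/{name}.csv" for name in dataset["splits"]}
    label_map = dataset.get("label_map")  # optional attribute
    label_map_path = f"{root}/{label_map}.csv" if label_map else None
    return split_paths, label_map_path


flowers = {
    "root_path": "s3://masterful-public/datasets/flowers_nilsback_zisserman",
    "splits": ["training", "validation", "test"],
    "label_map": "label_map",
}
split_paths, label_map_path = dataset_csv_paths(flowers)
for name, location in split_paths.items():
    print(name, "->", location)
print("label map ->", label_map_path)
```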
Dataset Optimization
As shown above, the `optimize` parameter is optional. However, Masterful highly recommends setting this value to `True`, as it greatly improves training times for the majority of configurations. The optimization process copies the dataset to the local machine (the machine you are training on) and converts it into a sharded and compressed format to maximize parallel input processing during training. This is especially important if your dataset is stored remotely, such as in an S3 bucket, because the data then does not need to be downloaded from the remote location on every batch. You must have enough local disk space to store the compressed dataset.
Note that the optimized dataset is only created once. Subsequent training runs reuse the optimized dataset as long as the source dataset has not changed. If the dataset has changed since the optimized version was generated, Masterful will regenerate it.
Label Maps
A label map file can be specified to provide human-readable output for things like evaluation metrics, where it is much easier to read `Precision (airplane)` than `Precision (0)` when scanning for a particular class. The label map file is a simple CSV file where the first column is the integer class id and the second column is the string class name. For example, here is a label map file for the CIFAR10 dataset:
0, airplane
1, automobile
2, bird
3, cat
4, deer
5, dog
6, frog
7, horse
8, ship
9, truck
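To inspect a label map outside of Masterful, the file can be read with Python's standard `csv` module. This is a minimal sketch, assuming the two-column layout shown above; `read_label_map` is a hypothetical helper name, and `skipinitialspace=True` is used to drop the space that follows each comma in the example.

```python
import csv
import io


def read_label_map(lines) -> dict:
    """Parse label map rows of the form `<class id>, <class name>`.

    `lines` is any iterable of text lines, such as an open file.
    """
    reader = csv.reader(lines, skipinitialspace=True)
    # Skip blank rows; the first column is the integer id, the second the name.
    return {int(row[0]): row[1].strip() for row in reader if row}


# In-memory stand-in for the first rows of the CIFAR10 label map file.
sample = io.StringIO("0, airplane\n1, automobile\n2, bird\n")
label_map = read_label_map(sample)
print(f"Precision ({label_map[0]})")  # prints "Precision (airplane)"
```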
Model
The `model` section defines the architecture of the model that will be trained by Masterful. Masterful uses a set of prebuilt architectures that have been extensively tested and cover a wide range of state-of-the-art model architectures.
The following is a complete example of the `model` section for a basic ResNet model architecture.
#
# Model Specification
#
model:
  # The name of the architecture to use. The model
  # returned at the end of training will be based on
  # the architecture specified below, and will be ready
  # to use for inference. The model will include all
  # preprocessing and standardization, and will expect
  # single, unresized 8-bit 3 channel RGB uint8/uint32 images
  # with pixel ranges [0,255].
  architecture: resnet50v2

  # The number of classes in the training dataset and model predictions.
  # For binary_classification, this should be set to 1.
  num_classes: 102

  # The input shape to use for training. All data will be transformed
  # using an aspect-ratio preserving resize to this shape, which is
  # in HWC format. Note that larger input shapes will take more memory
  # and take longer to train, but preserve the most detail. Smaller
  # input shapes will train faster, but could lose useful detail
  # in the image features. Input shape is specified as
  # [height, width, num_channels]. num_channels must be 3.
  input_shape: [224,224,3]
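Formally, the `model` section contains the following attributes:

| Attribute | Optional? | Type | Description |
|---|---|---|---|
| `architecture` | N | String | The name of the architecture to use. The model returned at the end of training will be based on the architecture specified, and will be ready to use for inference. The model will include all preprocessing and standardization, and will expect single, un-resized 8-bit 3 channel RGB uint8/uint32 images with pixel ranges `[0,255]`. |
| `num_classes` | N | Integer | The number of classes in the training dataset and model predictions. For `binary_classification`, this should be set to 1. |
| `input_shape` | Y | List[Integer] | The input shape to use for training. All data will be transformed using an aspect-ratio preserving resize to this shape, which is in HWC format. Larger input shapes take more memory and take longer to train, but preserve the most detail. Smaller input shapes train faster, but could lose useful detail in the image features. Input shape is specified as `[height, width, num_channels]`, and `num_channels` must be 3. |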
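The constraints in this table can be checked before a run is started. The following sketch is illustrative only: `check_model_section` is a hypothetical helper, not something the Masterful CLI provides, and it covers only the rules stated above (a three-element `[height, width, num_channels]` shape with 3 channels, and `num_classes` of 1 for `binary_classification`).

```python
def check_model_section(model: dict, task: str) -> list:
    """Return a list of human-readable problems found in a `model` section.

    Hypothetical helper; `task` is the value of `training.task`.
    """
    problems = []
    if not model.get("architecture"):
        problems.append("architecture is required")

    num_classes = model.get("num_classes")
    if not isinstance(num_classes, int) or num_classes < 1:
        problems.append("num_classes must be a positive integer")
    elif task == "binary_classification" and num_classes != 1:
        problems.append("num_classes should be 1 for binary_classification")

    shape = model.get("input_shape")  # optional attribute
    if shape is not None:
        if len(shape) != 3 or any(not isinstance(d, int) or d <= 0 for d in shape):
            problems.append("input_shape must be [height, width, num_channels]")
        elif shape[2] != 3:
            problems.append("num_channels must be 3")
    return problems


model = {"architecture": "resnet50v2", "num_classes": 102, "input_shape": [224, 224, 3]}
print(check_model_section(model, task="classification"))  # prints []
```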
Supported Models
The following models are currently supported by Masterful. If you need an architecture that is not listed below, please reach out directly on the Masterful Slack community and we can add support for it to the platform.
| Model Name | Year | Description |
|---|---|---|
| | 2015 | ResNet-50 architecture from the paper Deep Residual Learning for Image Recognition |
| | 2015 | ResNet-101 architecture from the paper Deep Residual Learning for Image Recognition |
| | 2015 | ResNet-152 architecture from the paper Deep Residual Learning for Image Recognition |
| | 2016 | ResNet-50 architecture from the paper Identity Mappings in Deep Residual Networks |
| | 2016 | ResNet-101 architecture from the paper Identity Mappings in Deep Residual Networks |
| | 2016 | ResNet-152 architecture from the paper Identity Mappings in Deep Residual Networks |
| | 2019 | |
| | 2019 | |
| | 2019 | |
| | 2019 | |
| | 2019 | |
| | 2019 | |
| | 2019 | |
| | 2019 | |
| | 2019 | |
| | 2019 | The |
| | 2019 | The |
| | 2015 | The |
| | 2015 | The |
| | 2017 | The |
| | 2016 | The |
Training
The `training` section defines how Masterful will use the data specified in the `dataset` section to train the model defined in the `model` section.
The following is an example of setting up the `training` section:
#
# Training Specification
#
training:
  # The task to perform. Currently, the trainer supports the
  # following tasks:
  #   classification - Multi-class classification task.
  #   binary_classification - Binary classification task.
  #   multilabel_classification - Multi-class Multi-label classification task.
  task: classification

  # The dataset split to use for training. This must be a labeled
  # dataset.
  training_split: training

  # OPTIONAL: The dataset split to use for validation. If no split
  # is set here, then a validation split will be created automatically
  # from the training dataset split.
  validation_split: validation

  # OPTIONAL: The unlabeled split to use for training.
  unlabeled_split: unlabeled
Formally, the `training` section contains the following attributes:

| Attribute | Optional? | Type | Description |
|---|---|---|---|
| `task` | N | String | The task to perform. Currently, the trainer supports the following tasks: `classification`, `binary_classification`, and `multilabel_classification`. |
| `training_split` | N | String | The dataset split to use for training. This must be a labeled dataset. |
| `validation_split` | Y | String | The dataset split to use for validation. If no split is set here, then a validation split will be created automatically from the training dataset split. This must be a labeled dataset. |
| `unlabeled_split` | Y | String | The unlabeled split to use for training. This must be an unlabeled dataset. |
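Because the `training` section refers to splits by name, a typo in a split name is an easy mistake to make. The sketch below cross-checks a parsed `training` section against the names in `dataset.splits`. It is a hypothetical helper, not Masterful's own validation, and it assumes every split referenced here must also be listed in the `dataset` section.

```python
SUPPORTED_TASKS = {"classification", "binary_classification", "multilabel_classification"}

# (attribute name, whether the attribute is required)
SPLIT_KEYS = (("training_split", True), ("validation_split", False), ("unlabeled_split", False))


def check_training_section(training: dict, dataset_splits: list) -> list:
    """Return a list of problems found in a parsed `training` section."""
    problems = []
    if training.get("task") not in SUPPORTED_TASKS:
        problems.append(f"unsupported task: {training.get('task')!r}")
    for key, required in SPLIT_KEYS:
        name = training.get(key)
        if name is None:
            if required:
                problems.append(f"{key} is required")
        elif name not in dataset_splits:
            problems.append(f"{key} '{name}' is not listed in dataset.splits")
    return problems


training = {
    "task": "classification",
    "training_split": "training",
    "validation_split": "validation",
    "unlabeled_split": "unlabeled",
}
# The dataset example in this guide lists only these three splits, so the
# unlabeled split is reported as missing.
print(check_training_section(training, ["training", "validation", "test"]))
```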
Evaluation
At the end of training, Masterful can automatically evaluate your model to give you insight into its performance on unseen data. The `evaluation` section allows you to specify which dataset split to use to evaluate the generalization performance of your model. Note that it is important to ensure that the split used in the `evaluation` section is not used in the `training` section.
The following is an example of the `evaluation` section:
#
# Evaluation Specification
#
evaluation:
  # The dataset split to use for evaluation. This should
  # be a dataset split that is not used in training, otherwise
  # your evaluation metrics will not be representative of the generalization
  # performance of your model.
  split: test
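Formally, the `evaluation` section has the following attributes:

| Attribute | Optional? | Type | Description |
|---|---|---|---|
| `split` | N | String | The dataset split to use for evaluation. This should be a dataset split that is not referenced in the `training` section, otherwise your evaluation metrics will not be representative of the generalization performance of your model. |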
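The requirement that the evaluation split stays out of training can be checked mechanically. This is a small sketch under the assumption that both sections have been parsed into dicts; `evaluation_split_is_held_out` is a hypothetical name and not part of Masterful.

```python
def evaluation_split_is_held_out(training: dict, evaluation: dict) -> bool:
    """Return True if the evaluation split is not referenced by `training`."""
    used_in_training = {
        training.get("training_split"),
        training.get("validation_split"),
        training.get("unlabeled_split"),
    }
    used_in_training.discard(None)  # optional splits may be absent
    return evaluation["split"] not in used_in_training


training = {"task": "classification", "training_split": "training",
            "validation_split": "validation"}
print(evaluation_split_is_held_out(training, {"split": "test"}))        # prints True
print(evaluation_split_is_held_out(training, {"split": "validation"}))  # prints False
```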
Output
The `output` section of the configuration file specifies the format of the artifacts to save after training. Masterful currently supports two model formats: Tensorflow Saved Model and Open Neural Network Exchange (ONNX).
The following is an example `output` section which saves the trained model in both ONNX format and Tensorflow Saved Model format:
#
# Output Specification
#
output:
  # A list of output formats for the trained model.
  #
  # Supported output formats are:
  #   saved_model - Tensorflow Saved Model format (https://www.tensorflow.org/guide/saved_model)
  #   onnx - Open Neural Network Exchange model format (https://onnx.ai/)
  formats: [saved_model, onnx]

  # The path to save the output into.
  path: ~/model_output
Formally, the `output` section supports the following attributes:

| Attribute | Optional? | Type | Description |
|---|---|---|---|
| `formats` | N | List[String] | A list of output formats for the trained model. Supported formats include `saved_model` and `onnx`. |
| `path` | N | String | The path to save the artifacts into. |
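When locating the saved artifacts from your own scripts, it can help to resolve the `output` section up front. The following is a minimal sketch, not Masterful's implementation: `resolve_output` is a hypothetical helper, and expanding `~` with `os.path.expanduser` is an assumption about how a path such as `~/model_output` should be interpreted.

```python
import os

SUPPORTED_FORMATS = {"saved_model", "onnx"}


def resolve_output(output: dict):
    """Return (expanded output path, requested formats) for an `output` section.

    Raises ValueError if a format outside the two documented ones is requested.
    """
    unknown = [fmt for fmt in output["formats"] if fmt not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported output formats: {unknown}")
    # Assumption: '~' refers to the current user's home directory.
    return os.path.expanduser(output["path"]), list(output["formats"])


path, formats = resolve_output({"formats": ["saved_model", "onnx"], "path": "~/model_output"})
print(path, formats)
```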
For more information on the supported output formats in Masterful, please see the Output Formats guide.
Inference vs Training Model
The model saved by Masterful is an inference model, which includes all of the preprocessing and standardization used during training. Therefore, the input to the model at inference time should be un-resized, 8-bit integer images in the range `[0,255]`. This also means the model only accepts unbatched data: specifically, single examples or examples with a batch size of 1.
For more information on how to use the saved models from Masterful, please see the Output guide.
Additional Examples
Additional examples can be found in the public AWS S3 bucket `s3://masterful-public/datasets/`.
>>> aws s3 ls s3://masterful-public/datasets/
>>> PRE flowers_nilsback_zisserman/
>>> PRE hot_dog/
>>> PRE quickstart/
>>> PRE svhn_cropped/
>>> PRE voc_2012/