Masterful CLI Trainer: Model Evaluation¶
Introduction¶
This guide walks you through the evaluation metrics calculated by Masterful at the end of training. After reading this guide, you should understand more about how Masterful measures the generalization performance of your model, and how well your model performs on the evaluation dataset you specify.
How To Evaluate Your Model¶
Masterful defines an evaluation section in the YAML configuration file used for training. This section specifies which dataset to evaluate your model on. Masterful will choose a representative set of metrics for your computer vision task and measure them on the dataset you specify. For details on the evaluation section of the configuration file, see Evaluation. For quick reference, the evaluation section specifies the following fields:
#
# Evaluation Specification
#
evaluation:
  # The dataset split to use for evaluation. This should
  # be a dataset split that is not used in training, otherwise
  # your evaluation metrics will not be representative of the generalization
  # performance of your model.
  split: test
Formally, the evaluation section has the following attributes:
Attribute | Optional? | Type | Description |
---|---|---|---|
split | N | String | The dataset split to use for evaluation. This should be a dataset split that is not used in training. |
Validation vs Test¶
A fundamental concept in machine learning is the relationship between a validation dataset and a test dataset (sometimes called a holdout dataset). A validation dataset is used to measure the performance of your model during training, and a test dataset is used to measure the generalization performance of your model after training. In general, these should be different datasets, so that you do not overfit your model to either one. For example, suppose your validation and test datasets are the same, and you stop training your model based on an overfitting measure (the validation loss diverges from the training loss, the validation loss fails to improve, etc.). You have then implicitly overfit your model to that validation/test dataset, and you have no measure of how well your model generalizes to unseen data, which is what it will encounter at inference. Even though your model never calculated gradients against this data, training still used that data to choose hyperparameters (such as the number of training epochs), and hyperparameters tuned this way can hurt your performance on unseen data.
Masterful will warn you if your evaluation dataset is used during training. Be very careful if you choose to ignore this warning, and make sure you understand the consequences of doing so.
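As a rough illustration of keeping the three datasets separate, the sketch below partitions example indices into disjoint training, validation, and test sets. This is not Masterful code: the helper name `three_way_split`, the 10% fractions, and the example count are all assumptions made for the illustration.

```python
import random


def three_way_split(num_examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Partition example indices into disjoint train, validation, and test lists.

    Hypothetical helper for illustration only; the fractions are arbitrary.
    """
    indices = list(range(num_examples))
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)

    num_test = int(num_examples * test_fraction)
    num_val = int(num_examples * val_fraction)

    test = indices[:num_test]
    val = indices[num_test:num_test + num_val]
    train = indices[num_test + num_val:]
    return train, val, test


train, val, test = three_way_split(1000)

# The test split shares no examples with the data used during training.
overlap = set(test) & (set(train) | set(val))
print(f"train={len(train)} val={len(val)} test={len(test)} overlap={len(overlap)}")
```

Early stopping and other training decisions would then look only at `val`, leaving `test` untouched until the final evaluation.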
Simple Example¶
The following is a simple example of the classification metrics generated from the QuickStart tutorial using the TensorFlow Flowers dataset.
MASTERFUL [17:52:35]: ************************************
MASTERFUL [17:52:35]: Evaluating model on 367 examples from the 'test' dataset split:
MASTERFUL [17:52:36]: Loss: 0.2015
MASTERFUL [17:52:36]: Categorical Accuracy: 0.9373
MASTERFUL [17:52:39]: Average Precision: 0.9354
MASTERFUL [17:52:39]: Average Recall: 0.9335
MASTERFUL [17:52:39]: Confusion Matrix:
MASTERFUL [17:52:39]: | daisy| dandelion| rose| sunflower| tulip|
MASTERFUL [17:52:39]: daisy| 60| 2| 0| 0| 2|
MASTERFUL [17:52:39]: dandelion| 1| 86| 0| 3| 0|
MASTERFUL [17:52:39]: rose| 1| 2| 49| 2| 3|
MASTERFUL [17:52:39]: sunflower| 0| 0| 1| 71| 0|
MASTERFUL [17:52:39]: tulip| 1| 0| 4| 1| 78|
MASTERFUL [17:52:39]: Confusion matrix columns represent the prediction labels and the rows represent the real labels.
MASTERFUL [17:52:39]:
MASTERFUL [17:52:39]: Per-Class Metrics:
MASTERFUL [17:52:39]: Class daisy:
MASTERFUL [17:52:39]: Precision: 0.9524
MASTERFUL [17:52:39]: Recall : 0.9375
MASTERFUL [17:52:39]: Class dandelion:
MASTERFUL [17:52:39]: Precision: 0.9556
MASTERFUL [17:52:39]: Recall : 0.9556
MASTERFUL [17:52:39]: Class rose:
MASTERFUL [17:52:39]: Precision: 0.9074
MASTERFUL [17:52:39]: Recall : 0.8596
MASTERFUL [17:52:39]: Class sunflower:
MASTERFUL [17:52:39]: Precision: 0.9221
MASTERFUL [17:52:39]: Recall : 0.9861
MASTERFUL [17:52:39]: Class tulip:
MASTERFUL [17:52:39]: Precision: 0.9398
MASTERFUL [17:52:39]: Recall : 0.9286
The QuickStart tutorial trains a multi-class classification model, which predicts a single class for each example, with a total of 5 possible classes in the dataset.
In this example, you can see that Masterful reports some standard metrics like loss and categorical accuracy on the test dataset split. It is critical to report the evaluation metrics on a dataset that was never seen during training. Masterful will warn you if the evaluation dataset was used during training. Take special care if you ignore this warning: you will most likely overfit your model to this dataset and see reduced performance in production inference.
Masterful also reports the average precision and recall of the model. Precision and recall are defined for each class as:
precision = True Positives / (True Positives + False Positives)
recall = True Positives / (True Positives + False Negatives)
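Average Precision and Average Recall are simply the unweighted means of the per-class precision and recall metrics. For example, the Average Precision of 0.9354 reported above is the mean of the five per-class precision values.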
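The definitions above can be checked against the example output. The following sketch recomputes the per-class and average metrics from the confusion matrix printed earlier. It is an independent illustration, not Masterful's implementation, and the function and variable names are assumptions.

```python
# Confusion matrix copied from the example output above.
# Rows are true labels, columns are predicted labels.
CLASSES = ["daisy", "dandelion", "rose", "sunflower", "tulip"]
CONFUSION = [
    [60, 2, 0, 0, 2],
    [1, 86, 0, 3, 0],
    [1, 2, 49, 2, 3],
    [0, 0, 1, 71, 0],
    [1, 0, 4, 1, 78],
]


def per_class_metrics(matrix):
    """Return (precisions, recalls), each with one entry per class."""
    n = len(matrix)
    precisions, recalls = [], []
    for k in range(n):
        true_positives = matrix[k][k]
        # Column sum: everything predicted as class k (TP + FP).
        predicted_as_k = sum(matrix[row][k] for row in range(n))
        # Row sum: everything whose true label is class k (TP + FN).
        actually_k = sum(matrix[k])
        precisions.append(true_positives / predicted_as_k if predicted_as_k else 0.0)
        recalls.append(true_positives / actually_k if actually_k else 0.0)
    return precisions, recalls


precisions, recalls = per_class_metrics(CONFUSION)
average_precision = sum(precisions) / len(precisions)
average_recall = sum(recalls) / len(recalls)

# Accuracy is the diagonal (correct predictions) over all examples.
total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[k][k] for k in range(len(CONFUSION))) / total

for name, p, r in zip(CLASSES, precisions, recalls):
    print(f"{name:>10}: precision={p:.4f} recall={r:.4f}")
print(f"average precision={average_precision:.4f}")
print(f"average recall={average_recall:.4f}")
print(f"accuracy={accuracy:.4f}")
```

Rounded to four decimal places, these values match the Categorical Accuracy, Average Precision, Average Recall, and Per-Class Metrics lines in the example output.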
As you can see above, for a multi-class classification task like this, Masterful reports the per-class metrics for precision and recall, as well as the confusion matrix for all labels. These can help narrow down issues in both your model and dataset, and see how your model performs on the individual classes you are interested in.
Task Specific Metrics¶
Masterful supports additional metrics for different computer vision tasks. For example, per-class metrics do not make sense for Binary Classification, since there is only one class. The following sections go into detail about the additional metrics supported on each computer vision task.
Multi-Class Classification¶
Supported metrics:
loss
accuracy
confusion matrix
average precision
average recall
Per-Class:
  Precision
  Recall
Multi-Label Classification¶
Supported metrics:
loss
accuracy
mAP (Mean Average Precision)
Mean Precision at Recall=0.5
Mean Recall at Precision=0.5
Per-Class:
  Precision at Recall=0.5
  Recall at Precision=0.5
  Average Precision
Confusion Matrix¶
The confusion matrix generated by Masterful works for both Binary and Multi-Class Classification. The following is a snippet of the confusion matrix output generated by Masterful:
MASTERFUL [17:52:39]: Confusion Matrix:
MASTERFUL [17:52:39]: | daisy| dandelion| rose| sunflower| tulip|
MASTERFUL [17:52:39]: daisy| 60| 2| 0| 0| 2|
MASTERFUL [17:52:39]: dandelion| 1| 86| 0| 3| 0|
MASTERFUL [17:52:39]: rose| 1| 2| 49| 2| 3|
MASTERFUL [17:52:39]: sunflower| 0| 0| 1| 71| 0|
MASTERFUL [17:52:39]: tulip| 1| 0| 4| 1| 78|
For a small number of labels, the console output is easy to read. The columns denote the predictions made by the model, and the rows denote the true labels for each example. For example, the first row of the above output covers the 64 instances of daisy in the test dataset: the model correctly predicted daisy 60 times, incorrectly predicted dandelion 2 times, and incorrectly predicted tulip 2 times.
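Reading the matrix row by row also shows which classes the model confuses most often. The sketch below lists one row as a breakdown by predicted label and then ranks the largest off-diagonal cells. The helper names `row_breakdown` and `top_confusions` are hypothetical and are not part of Masterful.

```python
# Confusion matrix copied from the console output above.
# Rows are true labels, columns are predicted labels.
CLASSES = ["daisy", "dandelion", "rose", "sunflower", "tulip"]
CONFUSION = [
    [60, 2, 0, 0, 2],
    [1, 86, 0, 3, 0],
    [1, 2, 49, 2, 3],
    [0, 0, 1, 71, 0],
    [1, 0, 4, 1, 78],
]


def row_breakdown(matrix, classes, true_label):
    """Return {predicted_label: count} for all examples of one true label."""
    row = matrix[classes.index(true_label)]
    return dict(zip(classes, row))


def top_confusions(matrix, classes, limit=3):
    """Return the largest off-diagonal cells as (true, predicted, count)."""
    errors = [
        (classes[i], classes[j], matrix[i][j])
        for i in range(len(classes))
        for j in range(len(classes))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(errors, key=lambda error: error[2], reverse=True)[:limit]


daisy = row_breakdown(CONFUSION, CLASSES, "daisy")
print(daisy)

for true_label, predicted_label, count in top_confusions(CONFUSION, CLASSES):
    print(f"{true_label} predicted as {predicted_label}: {count}")
```

In this example the largest single confusion is tulip predicted as rose, which suggests looking at those two classes first when inspecting the dataset.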
For a larger number of labels that don’t fit in the width of the console, Masterful saves the confusion matrix to a CSV file. You can open this CSV file in your favorite spreadsheet editor (Google Sheets, Excel, Numbers) and you will have access to all of the raw data shown in the above console output. For example, the CSV output in a spreadsheet editor is designed to look like the following table:
| | | Predictions | | | | |
|---|---|---|---|---|---|---|
| | | daisy | dandelion | rose | sunflower | tulip |
| Labels | daisy | 60 | 2 | 0 | 0 | 2 |
| | dandelion | 1 | 86 | 0 | 3 | 0 |
| | rose | 1 | 2 | 49 | 2 | 3 |
| | sunflower | 0 | 0 | 1 | 71 | 0 |
| | tulip | 1 | 0 | 4 | 1 | 78 |
Mean Average Precision (mAP)¶
For multi-label classification, Masterful calculates the Mean Average Precision metric for your model. This metric has many different meanings and implementations. Masterful currently uses the Pascal VOC 2007 definition of Mean Average Precision. This metric calculates the per-class precision values at 11 equally spaced recall values (0, 0.1, 0.2, ..., 1.0). The Average Precision for each class is the mean of these 11 precision values, and the Mean Average Precision (or mAP) is the mean of the per-class Average Precision values.
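The definition above can be sketched in a few lines of Python. This is a minimal illustration, not Masterful's implementation. It assumes the interpolated form used by Pascal VOC 2007, where the precision at a recall level is the highest precision reached at that recall or any higher recall. The function names and the toy scores are assumptions.

```python
def average_precision_11pt(scores, labels):
    """11-point interpolated Average Precision for a single class.

    scores: predicted confidence for the class, one per example.
    labels: 1 if the class is present in the example, else 0.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0

    # Rank examples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    # Precision and recall after each ranked prediction.
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precisions.append(true_positives / rank)
        recalls.append(true_positives / total_positives)

    # 11 equally spaced recall levels: 0, 0.1, ..., 1.0.
    recall_levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in recall_levels:
        # Best precision achieved at this recall level or higher.
        candidates = [p for p, r in zip(precisions, recalls) if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(recall_levels)


def mean_average_precision(scores_by_class, labels_by_class):
    """Mean of the per-class Average Precision values."""
    aps = [
        average_precision_11pt(scores, labels)
        for scores, labels in zip(scores_by_class, labels_by_class)
    ]
    return sum(aps) / len(aps)


# Toy data for two classes over four examples (made up for illustration).
scores_by_class = [[0.9, 0.8, 0.7, 0.6], [0.9, 0.8, 0.7, 0.6]]
labels_by_class = [[1, 1, 0, 0], [1, 0, 1, 0]]
print(f"mAP = {mean_average_precision(scores_by_class, labels_by_class):.4f}")
```

In the toy data, the first class is ranked perfectly, while the second class has a negative example ranked above a positive one, which lowers its Average Precision and therefore the mAP.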