Masterful CLI Trainer: Model Evaluation

Introduction

This guide walks you through the evaluation metrics calculated by Masterful at the end of training. After reading this guide, you should understand more about how Masterful measures the generalization performance of your model, and how well your model performs on the evaluation dataset you specify.

How To Evaluate Your Model

Masterful defines an evaluation section in the YAML configuration file used for training. This section specifies which dataset to evaluate your model on. Masterful will choose a representative set of metrics for your computer vision task and measure them on the dataset you specify. For details on the evaluation section of the configuration file, see Evaluation. For quick reference, the evaluation section specifies the following fields:

#
# Evaluation Specification
#
evaluation:
  # The dataset split to use for evaluation. This should
  # be a dataset split that is not used in training, otherwise
  # your evaluation metrics will not be representative of the generalization
  # performance of your model.
  split: test

Formally, the evaluation section has the following attributes:

Attribute | Optional? | Type   | Description
----------|-----------|--------|------------
split     | N         | String | The dataset split to use for evaluation. This should be a dataset split that is not referenced in the training section of the configuration file, otherwise your evaluation metrics will not be representative of the generalization performance of your model.

Validation vs Test

A fundamental concept in machine learning is the relationship between a validation dataset and a test dataset (sometimes called a holdout dataset). In general, a validation dataset is used to measure the performance of your model during training, and a test dataset is used to measure the generalization performance of your model after training. These should be different datasets, to avoid overfitting your model to either one. For example, if your validation and test datasets are the same, and you stop training your model based on an overfitting measure (the validation loss diverges from the training loss, the validation loss fails to improve, etc.), then you have implicitly overfit your model to that validation/test dataset, and you no longer have a measure of how well your model generalizes to unseen data, which is what determines how your model will perform at inference on new examples. Even though your model never calculated gradients against this data, you still used it to choose training hyperparameters (such as the number of training epochs), which will hurt your performance on unseen data.

Masterful will warn you if your evaluation dataset is used during training. Be very careful if you choose to ignore this warning, and make sure you understand the consequences of doing so.
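
For example, with TensorFlow Datasets you can carve three disjoint splits out of a single published split, so that the split you evaluate on is never touched during training. The following is a minimal sketch; the tf_flowers dataset and the 80/10/10 percentages are illustrative assumptions, not a Masterful requirement:

import tensorflow_datasets as tfds

# Carve three disjoint splits out of the single 'train' split that
# tf_flowers ships with: 80% train, 10% validation, 10% test.
# The dataset name and percentages here are illustrative assumptions.
(train_ds, val_ds, test_ds), info = tfds.load(
    "tf_flowers",
    split=["train[:80%]", "train[80%:90%]", "train[90%:]"],
    as_supervised=True,
    with_info=True,
)

# Use train_ds and val_ds during training, and reserve test_ds exclusively
# for evaluation so the reported metrics reflect truly unseen data.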

Simple Example

The following is a simple example of the classification metrics generated from the QuickStart tutorial using the TensorFlow Flowers dataset.

MASTERFUL [17:52:35]: ************************************
MASTERFUL [17:52:35]: Evaluating model on 367 examples from the 'test' dataset split:
MASTERFUL [17:52:36]:   Loss: 0.2015
MASTERFUL [17:52:36]:   Categorical Accuracy: 0.9373
MASTERFUL [17:52:39]:   Average Precision: 0.9354
MASTERFUL [17:52:39]:   Average Recall:    0.9335
MASTERFUL [17:52:39]:   Confusion Matrix:
MASTERFUL [17:52:39]:              |     daisy| dandelion|      rose| sunflower|     tulip|
MASTERFUL [17:52:39]:         daisy|        60|         2|         0|         0|         2|
MASTERFUL [17:52:39]:     dandelion|         1|        86|         0|         3|         0|
MASTERFUL [17:52:39]:          rose|         1|         2|        49|         2|         3|
MASTERFUL [17:52:39]:     sunflower|         0|         0|         1|        71|         0|
MASTERFUL [17:52:39]:         tulip|         1|         0|         4|         1|        78|
MASTERFUL [17:52:39]:     Confusion matrix columns represent the prediction labels and the rows represent the real labels.
MASTERFUL [17:52:39]:
MASTERFUL [17:52:39]:   Per-Class Metrics:
MASTERFUL [17:52:39]:     Class daisy:
MASTERFUL [17:52:39]:       Precision: 0.9524
MASTERFUL [17:52:39]:       Recall   : 0.9375
MASTERFUL [17:52:39]:     Class dandelion:
MASTERFUL [17:52:39]:       Precision: 0.9556
MASTERFUL [17:52:39]:       Recall   : 0.9556
MASTERFUL [17:52:39]:     Class rose:
MASTERFUL [17:52:39]:       Precision: 0.9074
MASTERFUL [17:52:39]:       Recall   : 0.8596
MASTERFUL [17:52:39]:     Class sunflower:
MASTERFUL [17:52:39]:       Precision: 0.9221
MASTERFUL [17:52:39]:       Recall   : 0.9861
MASTERFUL [17:52:39]:     Class tulip:
MASTERFUL [17:52:39]:       Precision: 0.9398
MASTERFUL [17:52:39]:       Recall   : 0.9286

The QuickStart tutorial trains a multi-class classification model, which predicts a single class for each example out of the 5 possible classes in the dataset.

In this example, you can see that Masterful reports standard metrics such as loss and categorical accuracy on the test dataset split. It is critical to report evaluation metrics on a dataset that was never seen during training. Masterful will warn you if the evaluation dataset was used during training; take special care if you ignore this warning, because you will most likely overfit your model to this dataset and see reduced performance in production inference.

Masterful also reports the average precision and recall of the model. Precision and recall are defined for each class as:

precision = True Positives / (True Positives + False Positives)
recall = True Positives / (True Positives + False Negatives)

Average Precision and Average Recall are simply the averages of the per-class precision and recall metrics.

As you can see above, for a multi-class classification task like this, Masterful reports the per-class metrics for precision and recall, as well as the confusion matrix for all labels. These can help you narrow down issues in both your model and your dataset, and show how your model performs on the individual classes you are interested in.
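
To make these definitions concrete, the per-class and average metrics reported above can be recomputed directly from the confusion matrix, where rows are the true labels and columns are the predictions. The following is a minimal NumPy sketch using the numbers from the QuickStart output; it reproduces the reported values but is not Masterful's internal implementation:

import numpy as np

# Confusion matrix from the QuickStart output above.
# Rows are the true labels, columns are the predictions.
classes = ["daisy", "dandelion", "rose", "sunflower", "tulip"]
cm = np.array([
    [60,  2,  0,  0,  2],   # daisy
    [ 1, 86,  0,  3,  0],   # dandelion
    [ 1,  2, 49,  2,  3],   # rose
    [ 0,  0,  1, 71,  0],   # sunflower
    [ 1,  0,  4,  1, 78],   # tulip
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # TP / (TP + FP), per column
recall = true_positives / cm.sum(axis=1)     # TP / (TP + FN), per row

for name, p, r in zip(classes, precision, recall):
    print(f"{name}: precision={p:.4f} recall={r:.4f}")

print(f"Average Precision: {precision.mean():.4f}")  # 0.9354
print(f"Average Recall:    {recall.mean():.4f}")     # 0.9335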

Task Specific Metrics

Masterful supports additional metrics for different computer vision tasks. For example, separate per-class metrics are not reported for Binary Classification, since precision and recall are already defined with respect to the single positive class. The following sections go into detail about the additional metrics supported for each computer vision task.

Binary Classification

Supported metrics:

  • loss

  • accuracy

  • confusion matrix

  • precision

  • recall

Multi-Class Classification

Supported metrics:

  • loss

  • accuracy

  • confusion matrix

  • average precision

  • average recall

  • Per-Class

    • Precision

    • Recall

Multi-Label Classification

Supported metrics:

  • loss

  • accuracy

  • mAP (Mean Average Precision)

  • Mean Precision at Recall=0.5 (see the sketch after this list)

  • Mean Recall at Precision=0.5

  • Per-Class:

    • Precision at Recall=0.5

    • Recall at Precision=0.5

    • Average Precision
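
The threshold-based metrics above, such as Precision at Recall=0.5 and Recall at Precision=0.5, can be reproduced with standard Keras metrics given multi-hot labels and per-class sigmoid scores. The following is a minimal sketch with made-up values; it is not necessarily how Masterful computes these metrics internally:

import numpy as np
import tensorflow as tf

# Illustrative (made-up) labels and sigmoid scores for a single class.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0], dtype=np.float32)
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.6, 0.8, 0.1, 0.3], dtype=np.float32)

# Best precision over the decision thresholds where recall is at least 0.5.
precision_at_recall = tf.keras.metrics.PrecisionAtRecall(recall=0.5)
precision_at_recall.update_state(y_true, y_score)
print("Precision at Recall=0.5:", precision_at_recall.result().numpy())

# Best recall over the decision thresholds where precision is at least 0.5.
recall_at_precision = tf.keras.metrics.RecallAtPrecision(precision=0.5)
recall_at_precision.update_state(y_true, y_score)
print("Recall at Precision=0.5:", recall_at_precision.result().numpy())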

Confusion Matrix

Masterful generates a confusion matrix for both Binary and Multi-Class Classification. The following is a snippet of the confusion matrix output:

MASTERFUL [17:52:39]:   Confusion Matrix:
MASTERFUL [17:52:39]:              |     daisy| dandelion|      rose| sunflower|     tulip|
MASTERFUL [17:52:39]:         daisy|        60|         2|         0|         0|         2|
MASTERFUL [17:52:39]:     dandelion|         1|        86|         0|         3|         0|
MASTERFUL [17:52:39]:          rose|         1|         2|        49|         2|         3|
MASTERFUL [17:52:39]:     sunflower|         0|         0|         1|        71|         0|
MASTERFUL [17:52:39]:         tulip|         1|         0|         4|         1|        78|

For a small number of labels, the console output is easy to read. The columns denote the predictions made by the model, and the rows denote the true labels for each example. For example, in the above output, you can see that out of the 64 daisy examples in the test dataset, the model correctly predicted daisy 60 times, incorrectly predicted dandelion 2 times, and incorrectly predicted tulip 2 times.

For a larger number of labels that don’t fit in the width of the console, Masterful saves the confusion matrix to a CSV file. You can open this CSV file in your favorite spreadsheet editor (Google Sheets, Excel, Numbers) to access all of the raw data shown in the above console output. Opened in a spreadsheet editor, the CSV output looks like the following table:

          |                      Predictions
   Labels |     daisy| dandelion|      rose| sunflower|     tulip
    daisy |        60|         2|         0|         0|         2
dandelion |         1|        86|         0|         3|         0
     rose |         1|         2|        49|         2|         3
sunflower |         0|         0|         1|        71|         0
    tulip |         1|         0|         4|         1|        78
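
If you prefer to work with the raw numbers programmatically rather than in a spreadsheet, the same CSV can be loaded with pandas. This is a minimal sketch; the file name is a hypothetical placeholder, and it assumes the first column holds the true-label names while the header row holds the prediction names:

import pandas as pd

# "confusion_matrix.csv" is a hypothetical placeholder; use the path that
# Masterful reports when it saves the confusion matrix for your run.
cm = pd.read_csv("confusion_matrix.csv", index_col=0)

# Rows are the true labels, columns are the predicted labels.
print(cm)
print(cm.loc["rose", "tulip"])  # e.g. how many roses were predicted as tulips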

Mean Average Precision (mAP)

For multi-label classification, Masterful calculates the Mean Average Precision metric for your model. This metric has many different meanings and implementations. Masterful currently uses the Pascal VOC 2007 definition of Mean Average Precision. This metric calculates the per-class precision values at 11 equally spaced recall values [0: 0.1: 1]. The Average Precision for each class is the mean of the 11 equally spaced precision values for each class, and the Mean Average Precision (or mAP) is the mean of the per-class Average Precision values.