Object Detection with the Pascal VOC Dataset and Masterful



In the Classification guide, you looked at a simple classification example to get up and running with the Masterful AutoML platform. In this guide, you will take a deeper look at object detection with Masterful. Specifically, you will learn how to train a model from the TensorFlow Object Detection API using Masterful.

The TensorFlow Object Detection API is an open source framework built on top of TensorFlow that makes it easy to construct, train, and deploy object detection models. The library provides many high-quality object detection models that can be used in TensorFlow. Normally you would train these models using the TensorFlow Object Detection API itself. However, there are many reasons why you might want to train them outside of the library. In particular, training these models with Masterful allows you to take advantage of any unlabeled data you might have, using semi-supervised learning.

For a complete list of the models supported by the TensorFlow Object Detection API for TensorFlow 2, see the TensorFlow 2 Detection Model Zoo.

In this guide, you will take an existing pipeline configuration file that you have created for the TensorFlow Object Detection API and use it directly with Masterful to train and evaluate the model. For simplicity, you will use the VOC 2007 dataset with object annotations to demonstrate how to set up the dataset and train the model with the data.

If you are familiar with the TensorFlow Object Detection API pipeline configuration protocol buffer, this guide demonstrates training with the model from the pipeline configuration and with a dataset from TensorFlow Datasets. The input configuration and eval configuration from the pipeline configuration are ignored in this example.


Please follow the Masterful installation instructions here in order to run this Quickstart.

In addition, this guide requires the installation of and familiarity with the Tensorflow Object Detection API for Tensorflow 2.0. See the installation instructions here.

import dataclasses
import object_detection
import tensorflow as tf
import masterful

masterful = masterful.register()
MASTERFUL: Your account has been successfully registered. Masterful v0.5.0 is loaded.

Prepare the Data

This guide will use the Pascal VOC 2007 dataset as a simple example of setting up an Object Detection workflow. The PASCAL Visual Object Classes Challenge contains both a Classification and Detection competition. In the Classification competition, the goal is to predict the set of labels contained in the image, while in the Detection competition the goal is to predict the bounding box and label of each individual object.

You will use the VOC 2007 dataset from the Tensorflow Datasets Catalog.

import tensorflow_datasets as tfds

# First step is to load the data from Tensorflow Datasets.
# You will use the training dataset to train the model, and the validation
# set to measure the progress of training. The test dataset
# is used at the end to measure the results of training the model.
# Importantly, Masterful will never see the test dataset,
# so you can be sure that your model is not overfit to any holdout datasets.
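# The dataset name and split names below are the standard
# Tensorflow Datasets identifiers for VOC 2007.
training_dataset = tfds.load("voc/2007", split="train")
validation_dataset = tfds.load("voc/2007", split="validation")
test_dataset = tfds.load("voc/2007", split="test")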

Convert Labels to Masterful Format

After you have loaded the datasets, it is important to convert the labels into a format Masterful understands. There are two steps involved here.

  • Step 1: Convert the labels to Masterful format

  • Step 2: Pad the labels and images to uniform sizes so they can be batched

Masterful understands several different label and bounding box formats. See DataParams for the specific formats supported. In this example, you are going to use the Tensorflow bounding box format, which defines bounding boxes in terms of min and max values, normalized into the range [0,1]. Specifically, the bounding boxes are of the form [ymin, xmin, ymax, xmax].
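As a minimal sketch of the box format described above, the snippet below converts pixel corner coordinates into normalized [ymin, xmin, ymax, xmax] values. The helper name to_tensorflow_box and the example numbers are hypothetical and are not part of Masterful or Tensorflow Datasets, which already delivers VOC boxes in this format.

```python
def to_tensorflow_box(xmin_px, ymin_px, xmax_px, ymax_px, image_height, image_width):
    """Convert pixel corners into normalized [ymin, xmin, ymax, xmax].

    Hypothetical helper for illustration only: y values are divided by
    the image height and x values by the image width, so every entry
    ends up in the range [0, 1].
    """
    return [
        ymin_px / image_height,
        xmin_px / image_width,
        ymax_px / image_height,
        xmax_px / image_width,
    ]


# A made-up box on a made-up 500 (height) x 400 (width) image.
box = to_tensorflow_box(48, 240, 195, 371, image_height=500, image_width=400)
print(box)
```

Note that the y coordinates come first, which is the opposite of the (x, y) ordering used by many annotation tools.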

Masterful extends this label format to support padding out the labels, as well as multiple bounding boxes per example. A Masterful Object Detection label for a single example has the shape [num_boxes, label_size], where each row is a tf.float32 vector of the form [valid, ymin, xmin, ymax, xmax, one_hot_class]. valid is a float value of either 1.0 or 0.0, and is used to mark padded bounding boxes: a value of 1.0 represents a “good” bounding box, and a value of 0.0 represents “padding” added to the labels in order to support batching. Labels whose valid value is 0.0 are ignored during training. For example, if you have 10 classes in your dataset, then the labels for a single example will have the shape [num_boxes, 1 + 4 + 10]. If you allow a maximum of 20 bounding boxes per example (max_bounding_boxes = 20) and use a batch size of 8 (batch_size = 8), then the per-batch labels will have the shape [batch_size, max_bounding_boxes, 1 + 4 + 10], which is [8, 20, 15].
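To make the label layout concrete, here is a hand-rolled sketch that builds one padded label in plain Python. It is not Masterful's conversion utility; the function name make_padded_labels and the choice to truncate extra boxes are assumptions made for illustration.

```python
def make_padded_labels(boxes, class_ids, num_classes, max_bounding_boxes):
    """Build rows of [valid, ymin, xmin, ymax, xmax, one_hot_class].

    Illustrative only. Real boxes get valid=1.0; rows of zeros
    (valid=0.0) are appended until there are max_bounding_boxes rows.
    Boxes beyond the maximum are dropped (an assumption of this sketch).
    """
    rows = []
    for box, class_id in zip(boxes, class_ids):
        one_hot = [0.0] * num_classes
        one_hot[class_id] = 1.0
        rows.append([1.0] + list(box) + one_hot)
    rows = rows[:max_bounding_boxes]
    row_size = 1 + 4 + num_classes
    while len(rows) < max_bounding_boxes:
        rows.append([0.0] * row_size)
    return rows


# Two real boxes, 10 classes, padded out to 20 rows.
labels = make_padded_labels(
    boxes=[[0.1, 0.2, 0.5, 0.6], [0.3, 0.3, 0.9, 0.8]],
    class_ids=[3, 7],
    num_classes=10,
    max_bounding_boxes=20,
)
print(len(labels), len(labels[0]))
```

Every example ends up with the same [max_bounding_boxes, 1 + 4 + num_classes] shape, which is what allows examples with different numbers of objects to be stacked into a batch.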

Masterful provides a utility to help you convert the labels into Masterful format, and prepare them for padding and batching. All you need to do is extract the bounding boxes and class labels from the dataset and Masterful will handle the conversion for you.

INPUT_SHAPE = (64, 64, 3)

from masterful.data.preprocessing import (
    convert_and_pad_boxes,
    resize_and_pad,
)

def convert_and_pad_labels(features_dict):
    image = features_dict["image"]
    classes = features_dict["objects"]["label"]
    boxes = features_dict["objects"]["bbox"]

    # First convert the labels and pad them to the
    # maximum number of bounding boxes, so that you
    # can batch them later. Tensorflow datasets bounding boxes
    # come in Tensorflow format (ymin, xmin, ymax, xmax)
    # so you specify that below.
    labels = convert_and_pad_boxes(

    # Normalize the size of all the input images to the expected input
    # size for the model. The below does a bounding box safe resize that
    # will pad the short edge to the final square input shape.
    # The model you are using for this guide expects input images
    # to be sized to (64, 64), so you specify that square image size below.
    image, labels = resize_and_pad(image, labels, size=INPUT_SHAPE[0])
    image = tf.clip_by_value(image, 0.0, 255.0)
    return image, labels

training_dataset = training_dataset.map(
    convert_and_pad_labels, num_parallel_calls=tf.data.AUTOTUNE
)
validation_dataset = validation_dataset.map(
    convert_and_pad_labels, num_parallel_calls=tf.data.AUTOTUNE
)
test_dataset = test_dataset.map(
    convert_and_pad_labels, num_parallel_calls=tf.data.AUTOTUNE
)
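
The "bounding box safe" resize mentioned in the comments above has to adjust the normalized box coordinates, because padding changes what fraction of the canvas the original image occupies. The sketch below shows the arithmetic under one assumption: the resized image is anchored at the top-left corner and the padding is added on the bottom or right. Masterful's resize_and_pad may place the padding differently, and the helper name here is hypothetical.

```python
def pad_to_square_box(box, image_height, image_width):
    """Map a normalized box through an aspect-preserving resize
    followed by padding of the short edge to a square canvas.

    Assumes the image is anchored at the top-left of the canvas. After
    the resize, the image covers height / long_edge of the canvas
    vertically and width / long_edge horizontally, so the normalized
    coordinates shrink by those factors. The target size cancels out.
    """
    long_edge = max(image_height, image_width)
    scale_y = image_height / long_edge
    scale_x = image_width / long_edge
    ymin, xmin, ymax, xmax = box
    return [ymin * scale_y, xmin * scale_x, ymax * scale_y, xmax * scale_x]


# A portrait image (500 high, 250 wide): x coordinates are halved,
# y coordinates are unchanged.
new_box = pad_to_square_box([0.2, 0.4, 0.6, 0.8], image_height=500, image_width=250)
print(new_box)
```

A plain resize to a square without padding would leave the normalized boxes unchanged but distort the aspect ratio of the objects.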

Build the Model

For this guide, you will adapt a model from the Tensorflow Object Detection API Model Zoo for Tensorflow 2. The list of available models can be found here.

The model used below is an SSD MobileNet v2 detector. Note that in this example, you are only using the model definition from the pipeline configuration. Other entries in the pipeline configuration are ignored.

PIPELINE_CONFIG = "https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/configs/tf2/ssd_mobilenet_v2_320x320_coco17_tpu-8.config"

# Load the pipeline configuration from the repository
# into a string
import urllib.request

with urllib.request.urlopen(PIPELINE_CONFIG) as url:
    pipeline_config_str = url.read()

# Parse the pipeline configuration proto string
# into a pipeline configuration proto object
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

pipeline_config = text_format.Parse(
    pipeline_config_str, pipeline_pb2.TrainEvalPipelineConfig()
)

# Update the config with your specific requirements, namely the number
# of classes and the model input size
pipeline_config.model.ssd.num_classes = NUM_CLASSES
pipeline_config.model.ssd.image_resizer.fixed_shape_resizer.height = INPUT_SHAPE[0]
pipeline_config.model.ssd.image_resizer.fixed_shape_resizer.width = INPUT_SHAPE[1]

# Next build the model. The Tensorflow Object Detection API
# provides a model builder class which can take a model config
# and return a `DetectionModel` instance.
from object_detection.builders import model_builder

object_detection_model = model_builder.build(pipeline_config.model, is_training=True)

Setup Masterful Training

The Masterful AutoML platform learns how to train your model by focusing on five core organizational principles in deep learning: architecture, data, optimization, regularization, and semi-supervision.

Architecture is the structure of weights, biases, and activations that define a model. In this example, the architecture is defined by the object detection model you created above.

Data is the input used to train the model. In this example, you are using a labeled training dataset from the VOC detection challenge. More advanced usages of the Masterful AutoML platform can take into account unlabeled and synthetic data as well, using a variety of different techniques.

Optimization means finding the best weights for a model and training data. Optimization is different from regularization because optimization does not consider generalization to unseen data. The central challenge of optimization is speed - find the best weights faster.

Regularization means helping a model generalize to data it has not yet seen. Another way of saying this is that regularization is about fighting overfitting.

Semi-Supervision is the process by which a model can be trained using both labeled and unlabeled data.

Architecture and Data Parameters

The first step when using Masterful is to learn the optimal set of parameters for each of the five buckets above. You start by learning the architecture and data parameters of the model and training dataset.

In the code below, you are telling Masterful that your model is performing a detection task (masterful.enums.Task.DETECTION) with 20 labels (num_classes=NUM_CLASSES), and that the image features going into your model are in the range [0,255] (input_range=masterful.enums.ImageRange.ZERO_255). Also, the model outputs logits rather than a softmax classification (prediction_logits=True).

Furthermore, in the training dataset, you are providing dense labels (label_sparse=False) rather than sparse labels.
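
As a small illustration of the difference between sparse and dense class labels, the sketch below converts an integer class id into a one-hot vector and back. The helper names are hypothetical and this is plain Python, not a Masterful API.

```python
def sparse_to_dense(class_id, num_classes):
    """Turn an integer class id (sparse) into a one-hot vector (dense)."""
    dense = [0.0] * num_classes
    dense[class_id] = 1.0
    return dense


def dense_to_sparse(one_hot):
    """Recover the class id as the index of the largest entry."""
    return max(range(len(one_hot)), key=lambda index: one_hot[index])


dense_label = sparse_to_dense(14, num_classes=20)
print(dense_label)
print(dense_to_sparse(dense_label))
```

In this guide the class portion of each label row is the dense, one-hot form, which is why the data parameters use label_sparse=False.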

For more details on architecture and data parameters, see the API specifications for ArchitectureParams and DataParams.

# Create the model parameters describing the model
# architecture.
model_params = masterful.architecture.ArchitectureParams(

# Create the data parameters describing the input data structure.
training_dataset_params = masterful.data.DataParams(
    label_shape=(MAX_BOUNDING_BOXES, 1 + 4 + NUM_CLASSES),

# The validation dataset parameters are the same as the training
# dataset parameters.
validation_dataset_params = dataclasses.replace(training_dataset_params)

Optimization Parameters

Next you learn the optimization parameters that will be used to train the model. Below, you use Masterful to learn the standard set of optimization parameters to train your model for a detection task.

For more details on the optimization parameters, please see the OptimizationParams API specification.

optimization_params = masterful.optimization.learn_optimization_params(
Batch Size:  20%|██████████████▌                                                          | 1/5 [00:00<00:00, 1669.04steps/s]
INFO:tensorflow:depth of additional conv before box predictor: 0
Callbacks: 100%|████████████████████████████████████████████████████████████████████████████| 5/5 [01:11<00:00, 14.34s/steps]

Semi-Supervised Learning Parameters

The next step before training is to learn the optimal set of semi-supervision parameters. For this guide, you are not using any unlabeled or synthetic data as part of training, so most forms of semi-supervision will be disabled by default.

For more details on the semi-supervision parameters, please see the SemiSupervisedParams API specification.

ssl_params = masterful.ssl.learn_ssl_params(training_dataset, training_dataset_params)

Regularization Parameters

The regularization parameters used can have a dramatic impact on the final performance of your trained model. Learning these parameters can be a time-consuming and domain-specific challenge. Masterful can speed up this process by learning these parameters for you. In general, this can be an expensive operation: as a rough order of magnitude, learning these parameters takes about 2x the time it takes to train your model. However, this is still dramatically faster than finding these parameters manually, and the learned parameters can be reused in future training sessions. In the example below, you will use the learn_regularization_params API to learn these parameters directly from your dataset and model.

For more details on the regularization parameters, please see the RegularizationParams API specification.

# In order to speed up the guide and demonstrate the full workflow,
# take only a small subset of the training and validation data.
# In a real training workflow, you would use the full datasets.
training_dataset = training_dataset.take(128)
validation_dataset = validation_dataset.take(128)

# Override the optimization parameters to only train for 1 epoch
# to demonstrate the workflow. A real training workflow should use the
# learned parameters directly.
optimization_params.epochs = 1
optimization_params.warmup_epochs = 0

regularization_params = masterful.regularization.learn_regularization_params(
MASTERFUL [13:40:13]: Meta-Learning Regularization Parameters...
MASTERFUL [13:40:18]: Warming up model for analysis.
MASTERFUL [13:40:18]: Analyzing baseline model performance. Training until validation loss stabilizes...
Baseline Training: 100%|████████████████████████████████████████████████████████████████████| 4/4 [00:57<00:00, 14.37s/steps]
MASTERFUL [13:41:28]: Baseline training complete.
MASTERFUL [13:41:28]: Meta-Learning Basic Data Augmentations...
Node 1/4: 100%|███████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.74steps/s]
Node 2/4: 100%|███████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.72steps/s]
Node 3/4: 100%|███████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.73steps/s]
Node 4/4: 100%|███████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.68steps/s]
MASTERFUL [13:42:12]: Meta-Learning Data Augmentation Clusters...
Distance Analysis: 100%|████████████████████████████████████████████████████████████████| 143/143 [03:19<00:00,  1.40s/steps]
Node 1/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.57steps/s]
Node 2/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.43steps/s]
Node 3/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.45steps/s]
Node 4/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.42steps/s]
Node 5/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.42steps/s]
Distance Analysis: 100%|██████████████████████████████████████████████████████████████████| 66/66 [01:31<00:00,  1.38s/steps]
Node 6/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.46steps/s]
Node 7/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.33steps/s]
Node 8/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.29steps/s]
Node 9/10: 100%|██████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.31steps/s]
Node 10/10: 100%|█████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.30steps/s]
MASTERFUL [13:49:17]: Meta-Learning Label Based Regularization...
Node 1/2: 100%|███████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.40steps/s]
Node 2/2: 100%|███████████████████████████████████████████████████████████████████████████| 80/80 [00:09<00:00,  8.34steps/s]
MASTERFUL [13:49:40]: Meta-Learning Weight Based Regularization...
MASTERFUL [13:49:41]: Analysis finished in 9.42441023985545 minutes.
MASTERFUL [13:49:41]: Learned parameters dresser-shore-honesty saved at /home/yaoshiang/.masterful/policies/dresser-shore-honesty.

Train the Model

Now, you are ready to train your model using the Masterful AutoML platform. In the next cell, you will see the call to masterful.training.train, which is the entry point to the training and meta-learning engine of the Masterful AutoML platform. Notice there is no need to batch your data (Masterful will find the optimal batch size for you). No need to shuffle your data (Masterful handles this for you). You hand Masterful a model and a dataset, and Masterful will figure the rest out for you.

Note that in the section above, you overrode the number of training epochs to be 1, to speed up this guide. For obvious reasons, this will not fully train your model, but instead is sufficient to demonstrate the training workflow.

training_report = masterful.training.train(
MASTERFUL [13:49:42]: Training model with semi-supervised learning disabled.
MASTERFUL [13:49:42]: Performing basic dataset analysis.
MASTERFUL [13:49:43]: Training model with:
MASTERFUL [13:49:43]:   128 labeled examples.
MASTERFUL [13:49:43]:   128 validation examples.
MASTERFUL [13:49:43]:   0 synthetic examples.
MASTERFUL [13:49:43]:   0 unlabeled examples.
MASTERFUL [13:49:43]: Training model with learned parameters dresser-shore-honesty in two phases.
MASTERFUL [13:49:43]: The first phase is supervised training with the learned parameters.
MASTERFUL [13:49:43]: The second phase is semi-supervised training to boost performance.
MASTERFUL [13:49:46]: Warming up model for supervised training.
MASTERFUL [13:49:46]: Starting Phase 1: Supervised training until the validation loss stabilizes...
Supervised Training: 100%|██████████████████████████████████████████████████████████████████| 4/4 [01:35<00:00, 23.78s/steps]
MASTERFUL [13:51:34]: Semi-Supervised training disabled in parameters.
MASTERFUL [13:51:36]: Training complete in 1.8629210988680522 minutes.

Evaluate the Model

Once you have trained your model, how do you know that it performs well? The next step is to evaluate your model. Typically, you do this through the TensorFlow Object Detection API, which can take your pipeline configuration and run it in evaluation mode instead of training mode. However, Masterful can also evaluate your model directly.

For example, the TrainingReport returned by Masterful provides a Keras model wrapper of your Tensorflow Object Detection API model, so you can use standard Keras evaluation metrics to look at some intrinsic metrics, like the classification and localization loss values.

# You can use the model returned in the training report
# to evaluate the loss of the model on the test dataset. This
# model is a Keras model that wraps the TF DetectionModel and
# allows you to use Keras model semantics.
    test_dataset.batch(optimization_params.batch_size, drop_remainder=True),
77/77 [==============================] - 3s 39ms/step - loss: 4.7236 - Loss/localization_loss: 1.2828 - Loss/classification_loss: 3.3777 - Loss/regularization_loss: 0.0631 - Loss/total_loss: 4.7236
{'loss': 4.686111927032471,
 'Loss/localization_loss': 1.245176076889038,
 'Loss/classification_loss': 3.377870798110962,
 'Loss/regularization_loss': 0.06306508183479309,
 'Loss/total_loss': 4.686111927032471}
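
The values in the dictionary above suggest that the total loss is the sum of the localization, classification, and regularization losses. The short check below uses only the numbers printed above; the tolerance is an assumption chosen to absorb float32 rounding.

```python
import math

# Values copied from the evaluation output above.
losses = {
    "Loss/localization_loss": 1.245176076889038,
    "Loss/classification_loss": 3.377870798110962,
    "Loss/regularization_loss": 0.06306508183479309,
    "Loss/total_loss": 4.686111927032471,
}

component_sum = (
    losses["Loss/localization_loss"]
    + losses["Loss/classification_loss"]
    + losses["Loss/regularization_loss"]
)

# The components add up to the total, within float32 rounding error.
matches = math.isclose(component_sum, losses["Loss/total_loss"], abs_tol=1e-6)
print(component_sum, matches)
```

In this run the classification loss is by far the largest component, which is what you would expect from a detector trained for a single epoch on 128 examples.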

COCO Evaluation Metrics

A more standard way of measuring object detection performance is to use the MS COCO evaluation metrics. The evaluation metrics are described here, and there is a common library, pycocotools, which provides implementations of these metrics. Masterful provides an easy wrapper for these tools in CocoEvaluationMetrics.
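
The COCO metrics are built on intersection over union (IoU), the overlap between a predicted box and a ground truth box. Average precision at IoU=0.50:0.95 averages over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below is a plain Python IoU for two boxes in [ymin, xmin, ymax, xmax] format; it is an illustration of the idea, not the pycocotools implementation.

```python
def iou(box_a, box_b):
    """Intersection over union of two [ymin, xmin, ymax, xmax] boxes."""
    inter_ymin = max(box_a[0], box_b[0])
    inter_xmin = max(box_a[1], box_b[1])
    inter_ymax = min(box_a[2], box_b[2])
    inter_xmax = min(box_a[3], box_b[3])
    # Clamp at zero so boxes that do not overlap give zero intersection.
    inter_area = max(0.0, inter_ymax - inter_ymin) * max(0.0, inter_xmax - inter_xmin)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union_area = area_a + area_b - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


overlap = iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.75, 0.75])
print(overlap)
```

The two example boxes overlap by a quarter of their area each, yet their IoU is only 1/7, so this prediction would not count as a match even at the most lenient 0.50 threshold.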

In order to use this evaluator, you need to tell the evaluator how to convert the predictions from the detection model into labels that can be used by the evaluator. Masterful provides a built-in prediction converter for TensorFlow Object Detection models in predictions_to_labels.

# Masterful also provides a COCO evaluator, to measure
# the performance of your model using the COCO evaluation
# metrics.
from masterful.evaluation.detection.coco import CocoEvaluationMetrics
from masterful.architecture.detection import predictions_to_labels

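# The COCO evaluation metrics need to know the mapping between
# your class ids and the semantic names, for a human readable
# output. You can put anything you want here, as long as you
# have an entry for each class label. Below are the class names
# for the 20 VOC labels.
VOC_CLASS_NAMES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]
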
# Categories is a list of dictionaries, each mapping a class id
# to the semantic label above.
categories = [
    {"id": class_id, "name": str(class_name)}
    for class_id, class_name in zip(range(NUM_CLASSES), VOC_CLASS_NAMES)
]
evaluator = CocoEvaluationMetrics(categories)

# Evaluate the model on the test dataset, which has
# never been seen by your model before. The predictions
# to labels function tells the evaluator how to interpret
# the predictions of the model.
100%|████████████████████████████████████████████████████████████████████████████████████| 4952/4952 [10:16<00:00,  8.03it/s]
creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=4.09s).
Accumulating evaluation results...
DONE (t=0.92s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.004
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.002
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.001
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.004
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = -1.000
{'DetectionBoxes_Precision/mAP': 1.5811807574565788e-05,
 'DetectionBoxes_Precision/mAP@.50IOU': 9.36131713204026e-05,
 'DetectionBoxes_Precision/mAP@.75IOU': 1.3159210657907898e-06,
 'DetectionBoxes_Precision/mAP (small)': 4.334404081056226e-06,
 'DetectionBoxes_Precision/mAP (medium)': 0.004429474737902552,
 'DetectionBoxes_Precision/mAP (large)': -1.0,
 'DetectionBoxes_Recall/AR@1': 0.0017551707664954534,
 'DetectionBoxes_Recall/AR@10': 0.0021644518540018024,
 'DetectionBoxes_Recall/AR@100': 0.0021644518540018024,
 'DetectionBoxes_Recall/AR@100 (small)': 0.0011612660192992603,
 'DetectionBoxes_Recall/AR@100 (medium)': 0.0044334654592250505,
 'DetectionBoxes_Recall/AR@100 (large)': -1.0}
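
A value of -1.0 in the COCO summary means there were no ground truth objects in that category, so the metric is undefined rather than zero. The sketch below is one likely explanation for the "large" rows, assuming the evaluator measures box areas in pixels of the 64 x 64 resized input (the output above does not state this). It uses the usual COCO size thresholds of 32² and 96² pixels; the helper name is hypothetical.

```python
def coco_area_bucket(box, image_height, image_width):
    """Classify a normalized [ymin, xmin, ymax, xmax] box by pixel area,
    using the common COCO thresholds: small below 32^2 pixels, medium
    below 96^2 pixels, and large otherwise."""
    ymin, xmin, ymax, xmax = box
    area = (ymax - ymin) * image_height * (xmax - xmin) * image_width
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


# Even a box covering the entire 64 x 64 input has an area of only
# 4096 pixels, which is below the 9216 pixel threshold for "large".
print(coco_area_bucket([0.0, 0.0, 1.0, 1.0], image_height=64, image_width=64))
print(coco_area_bucket([0.1, 0.1, 0.4, 0.4], image_height=64, image_width=64))
```

Under that assumption no object can fall into the large bucket at this input size, so the large metrics would only become meaningful with a bigger input resolution. The near-zero values elsewhere are expected after a single epoch on 128 training examples.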