Object Detection of Pedestrians with the Masterful CLI Trainer


In this guide, you will take a deeper look at Object Detection with the Masterful CLI Trainer to train a state-of-the-art model.

This guide uses the Caltech Pedestrians dataset, a large video dataset of pedestrians captured from a moving vehicle as it drives around different cities.

NOTE: Pedestrian detection is an advanced use case that requires a large model and long training times. If you are running this notebook in Colab, please make sure you are running on at least a V100 instance. For a gentler introduction to Object Detection with the Masterful CLI, see Detecting Digits, which is an MNIST-like introduction to Object Detection.

Define the Application

In this guide, your challenge is to build a CV model that can take an image captured from a car (aka Street Level Imagery, or SLI) and identify which parts of that image contain pedestrians. Such a model could become the heart of an Advanced Driver Assistance System: using cameras and deep learning, it could alert a driver if their car is about to hit a pedestrian, use pedestrian information to predict a future path, or even identify potential passengers to pick up.

Since you want to identify objects in the image, and their location, the CV task for this problem is object detection.

TL;DR

Don’t want to read the rest of this guide, and want to start training immediately? The following command shows you how to start training with Masterful, using a configuration file and dataset on S3.

masterful-train --config https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/caltech_pedestrians.yaml

Prepare the Data

For this guide, you will use the Caltech Pedestrians dataset, an early Street Level Imagery dataset containing richly annotated video recorded from a moving vehicle, with challenging, low-resolution images of frequently occluded people. In this section, you will take the raw dataset and convert it into a format that Masterful can understand. Along the way, you will visualize a few examples of the data to see both the input to the model and the predictions after training the model.

The original dataset can be downloaded from here. Note that this is the USA-only version of the dataset, which includes SLI from US-based cities only.

The dataset is distributed as a set of video archives in tar format. The videos are 640x480 at 30Hz and comprise roughly 10 hours of footage in total. By convention, the first six sets (set00 - set05) are used for training, and the remaining five sets (set06 - set10) are used for testing. Each set consists of multiple video sequence files (.seq), which can be extracted using an OpenCV VideoCapture object.

Along with the video captures, the dataset includes an annotations folder which contains the per-frame annotations for every frame in every video. These annotations contain the bounding box information for each of the pedestrians in the videos.

As is typical with most public datasets, your first challenge is to convert the raw data dump into a form that is understandable by an ML system. In this instance, you will convert the dataset into a set of PNG image files (one per frame) and a Masterful-formatted label file for each image.
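
As a rough illustration of the extraction step, the following sketch reads frames from a single .seq file with OpenCV and writes them out as PNGs. The paths and naming scheme are hypothetical, and the Masterful label file generation is omitted (the full conversion is covered in the Appendix).

# Minimal sketch: extract frames from one Caltech Pedestrians .seq file
# and save them as PNG images. Paths and file naming are hypothetical;
# label file generation is omitted (see the Appendix).
import os
import cv2

SEQ_PATH = "set00/V000.seq"   # hypothetical path to one video sequence
OUTPUT_DIR = "data/images"    # hypothetical output directory
os.makedirs(OUTPUT_DIR, exist_ok=True)

capture = cv2.VideoCapture(SEQ_PATH)
frame_index = 0
while True:
  success, frame = capture.read()
  if not success:
    break
  # Name each frame after its set, sequence, and frame index.
  cv2.imwrite(os.path.join(OUTPUT_DIR, f"set00_V000_{frame_index}.png"), frame)
  frame_index += 1
capture.release()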

Converting each dataset is typically a one-off operation that is different for every dataset you want to train with. In the Appendix, you can follow the detailed instructions on how to convert the raw Pedestrians dataset into Masterful format for training that you will use here. For brevity, you can use the already converted dataset located at the public S3 bucket s3://masterful-public/datasets/caltech_pedestrians_usa/. In this bucket, you will see the following files:

caltech_pedestrians_usa/
  label_map.csv
  test.csv
  train_1hz.csv
  data/

For the above training data, the original videos were captured at 30 frames per second. You don't need to train on all 30 frames per second, however, since consecutive video frames are largely redundant. In the above example, the train_1hz.csv file selects 1 frame per second from the original video to use as training data. This will give you a good estimate of the model's performance while keeping the training time reasonable. In future guides, you will explore how to improve the performance of this model through the addition of both labeled and unlabeled data, and how to select which data to add in order to improve performance efficiently.
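
For intuition, subsampling a 30Hz frame sequence down to 1Hz is just a matter of keeping every 30th frame. A minimal sketch (the frame list below is illustrative, not the actual contents of train_1hz.csv):

# Minimal sketch: subsample an ordered list of frame filenames from
# 30 frames per second down to 1 frame per second.
FRAMES_PER_SECOND = 30

# Hypothetical list of per-frame image names for 60 seconds of video.
all_frames = [f"set00_V000_{i}.png" for i in range(1800)]

frames_1hz = all_frames[::FRAMES_PER_SECOND]
print(len(frames_1hz))  # 60 frames, one per second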

One thing to note about the converted dataset above is that you are training on every 30th frame of data, but the test (or holdout) dataset keeps every frame, and so is dramatically larger than the training dataset. Why? Understanding the performance of your model, and trying to predict its generalization performance, is incredibly important. You should always strive to have the largest, most complete set of evaluation data possible. In this case, you have roughly 70k frames of evaluation data and only ~2k frames of training data. While this means that evaluating your trained model will take a lot longer, this is only done once at the end of training and provides a comprehensive understanding of how your model is likely to perform on real production data. And while 70k frames of data might seem like a lot, remember that it comes from only 5 different video sets, so the overall amount of diversity in your evaluation data is quite low.

Explore the Data

You should always visually inspect your dataset to get a sense of what the model will see, and to roughly verify that your dataset conversion routine worked properly and you are not training with corrupted or incorrect data. This is especially important for object detection since it is very easy to generate incorrect bounding boxes during dataset conversion.

[1]:
# Install dependencies necessary to run the following
# code.
!pip install opencv-python-headless --quiet
!pip install masterful --quiet

# Import the packages used below.
import matplotlib.pyplot as plt
import os
import requests
import tarfile
import tempfile
import tensorflow as tf
import urllib.request

# Import and activate the Masterful package.
import masterful
masterful = masterful.activate()

# Helper function to visualize bounding boxes from the dataset
# on the source image.
from masterful.data.visualize import create_annotated_image

# Helper function to display a progress when downloading
# a file using HTTP.
from masterful.utils.downloader import progress_bar_factory

# This is necessary for running inside of Colab/Jupyter,
# since the CLI trainer is run outside of the kernel
# (as a script command).
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
  tf.config.experimental.set_memory_growth(gpu, True)
MASTERFUL: Your account has been successfully activated. Masterful v0.5.3 is loaded.
[2]:
DATASET_ROOT = "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/"
TRAINING_CSV_URL = os.path.join(DATASET_ROOT, "train_1hz.csv")

with tempfile.TemporaryDirectory() as temp_directory:
  training_csv_name = os.path.join(temp_directory, "training.csv")
  _ =  urllib.request.urlretrieve(TRAINING_CSV_URL, training_csv_name)

  examples = []
  with open(training_csv_name) as training_csv:
    examples = training_csv.readlines()

f, axarr = plt.subplots(2, 2, figsize=(15,15))
_ = axarr[0, 0].imshow(create_annotated_image(examples[0], DATASET_ROOT))
_ = axarr[0, 1].imshow(create_annotated_image(examples[1], DATASET_ROOT))
_ = axarr[1, 0].imshow(create_annotated_image(examples[2], DATASET_ROOT))
_ = axarr[1, 1].imshow(create_annotated_image(examples[3], DATASET_ROOT))
[Image: four training examples annotated with their ground truth bounding boxes]

Configure the CLI Trainer

The Masterful CLI Trainer is a command-line tool that trains a production-quality model with no code required. The Masterful CLI Trainer takes a YAML configuration file as input. The configuration file fully specifies everything necessary for training, including the dataset, model, export formats, and evaluation metrics.

Choosing a Model

Masterful provides many different state-of-the-art object detection models that you can choose from. In general, choosing a model involves different constraints than training it (are you deploying on a server or at the edge? Do you have runtime or latency constraints? Memory constraints?). You generally want to choose a model that is large enough to fit your data, but not so large that it overfits the training data and "memorizes" the results, which can lead to poor generalization performance.

The Masterful Object Detection Model Zoo supports the following model architectures.

Model Name                      | Year | Description
--------------------------------|------|------------
ssd_mobilenet_v1_fpn            | 2018 | Single-Shot Detector with MobileNet V1 Feature Pyramid backbone, from the paper Focal Loss for Dense Object Detection
ssd_mobilenet_v2                | 2018 | Single-Shot Detector with MobileNet V2 backbone, from the paper Focal Loss for Dense Object Detection
ssd_mobilenet_v2_fpnlite        | 2018 | Single-Shot Detector with MobileNet V2 Feature Pyramid Lite backbone, from the paper Focal Loss for Dense Object Detection
ssd_resnet50_v1_fpn             | 2018 | Single-Shot Detector with ResNet-50 V1 Feature Pyramid backbone, from the paper Focal Loss for Dense Object Detection
ssd_resnet101_v1_fpn            | 2018 | Single-Shot Detector with ResNet-101 V1 Feature Pyramid backbone, from the paper Focal Loss for Dense Object Detection
ssd_resnet152_v1_fpn            | 2018 | Single-Shot Detector with ResNet-152 V1 Feature Pyramid backbone, from the paper Focal Loss for Dense Object Detection
retinanet_50                    | 2018 | RetinaNet with ResNet-50 V1 Feature Pyramid backbone, from the paper Focal Loss for Dense Object Detection
retinanet_101                   | 2018 | RetinaNet with ResNet-101 V1 Feature Pyramid backbone, from the paper Focal Loss for Dense Object Detection
retinanet_152                   | 2018 | RetinaNet with ResNet-152 V1 Feature Pyramid backbone, from the paper Focal Loss for Dense Object Detection
ssd_efficientdet_d0             | 2019 | EfficientDet-D0 model from the paper EfficientDet: Scalable and Efficient Object Detection
ssd_efficientdet_d1             | 2019 | EfficientDet-D1 model from the paper EfficientDet: Scalable and Efficient Object Detection
ssd_efficientdet_d2             | 2019 | EfficientDet-D2 model from the paper EfficientDet: Scalable and Efficient Object Detection
ssd_efficientdet_d3             | 2019 | EfficientDet-D3 model from the paper EfficientDet: Scalable and Efficient Object Detection
ssd_efficientdet_d4             | 2019 | EfficientDet-D4 model from the paper EfficientDet: Scalable and Efficient Object Detection
ssd_efficientdet_d5             | 2019 | EfficientDet-D5 model from the paper EfficientDet: Scalable and Efficient Object Detection
ssd_efficientdet_d6             | 2019 | EfficientDet-D6 model from the paper EfficientDet: Scalable and Efficient Object Detection
ssd_efficientdet_d7             | 2019 | EfficientDet-D7 model from the paper EfficientDet: Scalable and Efficient Object Detection
faster_rcnn_resnet50_v1         | 2016 | Faster R-CNN model with a ResNet-50 V1 backbone, from the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
faster_rcnn_resnet101_v1        | 2016 | Faster R-CNN model with a ResNet-101 V1 backbone, from the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
faster_rcnn_resnet152_v1        | 2016 | Faster R-CNN model with a ResNet-152 V1 backbone, from the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
faster_rcnn_inception_resnet_v2 | 2016 | Faster R-CNN model with an Inception ResNet V2 backbone, from the paper Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
centernet_hourglass_104         | 2019 | CenterNet model with an Hourglass-104 backbone, from the paper Objects as Points
centernet_resnet50_v1_fpn       | 2019 | CenterNet model with a ResNet-50 V1 backbone, from the paper Objects as Points
centernet_resnet101_v1_fpn      | 2019 | CenterNet model with a ResNet-101 V1 backbone, from the paper Objects as Points
centernet_resnet50_v2           | 2019 | CenterNet model with a ResNet-50 V2 backbone, from the paper Objects as Points
centernet_mobilenet_v2_fpn      | 2019 | CenterNet model with a MobileNet V2 backbone, from the paper Objects as Points

In the example below, you will select a Faster R-CNN model with a ResNet-50 backbone. This model has a good trade-off between training time and quality, and is a good place to start exploring.

The Configuration File

The following is a condensed YAML configuration file for training against this dataset. It points to the dataset on S3 described above and selects a faster_rcnn_resnet50_v1 model, which is a Faster R-CNN model with a ResNet-50 backbone.

[3]:
MODEL_OUTPUT_DIRECTORY = tempfile.TemporaryDirectory().name
TRAINING_CONFIG = f"""
dataset:
  root_path: https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/
  splits: [train_1hz, test]
  label_map: label_map
  optimize: True

model:
  architecture: faster_rcnn_resnet50_v1
  num_classes: 2
  input_shape: [640,640,3]

training:
  task: detection
  training_split: train_1hz

output:
  formats: [saved_model]
  path: {MODEL_OUTPUT_DIRECTORY}

evaluation:
  split: test
"""

Train the Model

Typically the YAML configuration file will exist on S3 or on your local disk. In this instance, you defined it in a local variable above, so you need to write the string into a local configuration file so that the Masterful CLI Trainer can see it.

Below, we will set up the CLI and configuration file to run in this notebook. However, if you want to train this model outside of the notebook, against the dataset on S3, you can run:

masterful-train --config https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/caltech_pedestrians.yaml
[4]:
# Write the configuration file locally to disk so
# that the CLI trainer can see it.
temp_directory = tempfile.TemporaryDirectory()
yaml_path = os.path.join(temp_directory.name, "training.yaml")
with open(yaml_path, "w") as yaml:
  yaml.write(TRAINING_CONFIG)
tf.keras.backend.clear_session()

Now that you have a configuration file, you can use the Masterful CLI Trainer to train the model.

NOTE: Object detection models are quite large. Depending on your GPU, this could take several hours to complete, so plan accordingly.

[5]:
# Use the Masterful CLI to train the model
!masterful-train --config $yaml_path
MASTERFUL: Your account has been successfully activated. Masterful v0.5.2 is loaded.
MASTERFUL [22:37:19]: Training with configuration '/tmp/tmpuxs7vele/training.yaml':
----------  ------------------------------------------------------------------------------------------------
dataset     root_path  https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/
            splits     ['train_1hz', 'test']
            label_map  label_map
            optimize   True
model       architecture  faster_rcnn_resnet50_v1
            num_classes   2
            input_shape   [640, 640, 3]
training    task            detection
            training_split  train_1hz
output      formats  ['saved_model']
            path     /tmp/tmplrymi9nr
evaluation  split  test
----------  ------------------------------------------------------------------------------------------------
MASTERFUL [22:37:22]: Building model 'faster_rcnn_resnet50_v1'...
MASTERFUL [22:37:26]: Using model faster_rcnn_resnet50_v1 with:
MASTERFUL [22:37:26]:     28331535 total parameters
MASTERFUL [22:37:26]:     28278415 trainable parameters
MASTERFUL [22:37:26]:     53120 untrainable parameters
MASTERFUL [22:37:38]: Dataset Summary:
MASTERFUL [22:37:38]:   Training Dataset: 4272 examples.
MASTERFUL [22:37:38]:   Validation Dataset: 0 examples.
MASTERFUL [22:37:38]:   Test Dataset: 120946 examples.
MASTERFUL [22:37:38]:   Unlabeled Dataset: 0 examples.
MASTERFUL [22:37:38]: Training Dataset Analysis:
100%|███████████████████████████████████████| 4272/4272 [01:25<00:00, 50.19it/s]

MASTERFUL [22:39:03]: Training dataset analysis finished at 22:39:03 in 85 seconds (1m 25s), returned:
-----------------------------  ----------------------------------------
Total Examples                 4272
Label Counts                   people  1181
                               person  5275
Label Distribution             people  0.182931
                               person  0.817069
Balanced                       No
Per Channel Mean               [106.19673805 109.58412932 107.36567739]
Per Channel StdDev             [66.35775249 67.29427702 69.57032057]
Min Height                     480
Min Width                      640
Max Height                     480
Max Width                      640
Average Height                 480
Average Width                  640
Largest Image                  (480, 640, 3)
Smallest Image                 (480, 640, 3)
Duplicates                     0
Average Object Area            3222
Smallest Object Area           0
Largest Object Area            88160
Num Small Objects              2853
Num Medium Objects             3146
Num Large Objects              457
Num Empty Images (No Labels):  2020
Total Objects                  6456
-----------------------------  ----------------------------------------
MASTERFUL [22:39:03]: Test Dataset Analysis:
100%|███████████████████████████████████| 120946/120946 [39:45<00:00, 50.70it/s]

MASTERFUL [23:18:50]: Test dataset analysis finished at 23:18:50 in 2387 seconds (39m 47s), returned:
-----------------------------  ----------------------------------------
Total Examples                 120946
Label Counts                   people   14483
                               person  139953
Label Distribution             people  0.0937799
                               person  0.90622
Balanced                       No
Per Channel Mean               [106.17528664 109.16589515 106.32206454]
Per Channel StdDev             [65.08241765 65.83417949 68.13062451]
Min Height                     480
Min Width                      640
Max Height                     480
Max Width                      640
Average Height                 480
Average Width                  640
Largest Image                  (480, 640, 3)
Smallest Image                 (480, 640, 3)
Duplicates                     130
Average Object Area            1511
Smallest Object Area           0
Largest Object Area            65675
Num Small Objects              90849
Num Medium Objects             60518
Num Large Objects              3069
Num Empty Images (No Labels):  55947
Total Objects                  154436
-----------------------------  ----------------------------------------
MASTERFUL [23:18:50]: WARNING: Duplicates detected in dataset split 'test'.
MASTERFUL [23:18:50]: WARNING: You can find the duplicate entries using the tool:
MASTERFUL [23:18:50]: WARNING: python -m masterful.data.duplicate_detector --config=/tmp/tmpuxs7vele/training.yaml
MASTERFUL [23:18:50]: Cross-Dataset Analysis:
MASTERFUL [23:18:51]: Cross-Dataset analysis finished at 23:18:51 in 0 seconds (0s), returned:
---------  --------------
train_1hz  train_1hz  0
           test       0
test       train_1hz    0
           test       130
---------  --------------
MASTERFUL [23:18:51]: Meta-Learning architecture parameters...
MASTERFUL [23:18:51]: Architecture learner finished at 23:18:51 in 0 seconds (0s), returned:
------------------------------  --------------------------------------------------
task                            Task.DETECTION
num_classes                     2
ensemble_multiplier             1
custom_objects
model_config                    faster_rcnn {
                                  num_classes: 2
                                  image_resizer {
                                    fixed_shape_resizer {
                                      height: 640
                                      width: 640
                                    }
                                  }
                                  feature_extractor {
                                    type: "faster_rcnn_resnet50_keras"
                                    batch_norm_trainable: true
                                  }
                                  first_stage_anchor_generator {
                                    grid_anchor_generator {
                                      height_stride: 16
                                      width_stride: 16
                                      scales: 0.25
                                      scales: 0.5
                                      scales: 1.0
                                      scales: 2.0
                                      aspect_ratios: 0.5
                                      aspect_ratios: 1.0
                                      aspect_ratios: 2.0
                                    }
                                  }
                                  first_stage_box_predictor_conv_hyperparams {
                                    op: CONV
                                    regularizer {
                                      l2_regularizer {
                                        weight: 0.0
                                      }
                                    }
                                    initializer {
                                      truncated_normal_initializer {
                                        stddev: 0.009999999776482582
                                      }
                                    }
                                  }
                                  first_stage_nms_score_threshold: 0.0
                                  first_stage_nms_iou_threshold: 0.699999988079071
                                  first_stage_max_proposals: 300
                                  first_stage_localization_loss_weight: 2.0
                                  first_stage_objectness_loss_weight: 1.0
                                  initial_crop_size: 14
                                  maxpool_kernel_size: 2
                                  maxpool_stride: 2
                                  second_stage_box_predictor {
                                    mask_rcnn_box_predictor {
                                      fc_hyperparams {
                                        op: FC
                                        regularizer {
                                          l2_regularizer {
                                            weight: 0.0
                                          }
                                        }
                                        initializer {
                                          variance_scaling_initializer {
                                            factor: 1.0
                                            uniform: true
                                            mode: FAN_AVG
                                          }
                                        }
                                      }
                                      use_dropout: false
                                      dropout_keep_probability: 1.0
                                      share_box_across_classes: true
                                    }
                                  }
                                  second_stage_post_processing {
                                    batch_non_max_suppression {
                                      score_threshold: 0.0
                                      iou_threshold: 0.6000000238418579
                                      max_detections_per_class: 100
                                      max_total_detections: 300
                                    }
                                    score_converter: SOFTMAX
                                  }
                                  second_stage_localization_loss_weight: 2.0
                                  second_stage_classification_loss_weight: 1.0
                                  use_matmul_crop_and_resize: true
                                  clip_anchors_to_image: true
                                  use_matmul_gather_in_matcher: true
                                  use_static_balanced_label_sampler: true
                                  use_static_shapes: true
                                }
backbone_only                   False
input_shape                     (640, 640, 3)
input_range                     ImageRange.ZERO_255
input_dtype                     <dtype: 'float32'>
input_channels_last             True
prediction_logits               True
prediction_dtype                <dtype: 'float32'>
prediction_structure            TensorStructure.DICT
prediction_shape
total_parameters
total_trainable_parameters
total_non_trainable_parameters
------------------------------  --------------------------------------------------
MASTERFUL [23:18:51]: Meta-learning training dataset parameters...
MASTERFUL [23:18:54]: Training dataset learner finished at 23:18:54 in 3 seconds (3s), returned:
-------------------------  -----------------------------
num_classes                2
task                       Task.DETECTION
image_shape                (640, 640, 3)
image_range                ImageRange.ZERO_255
image_dtype                <dtype: 'float32'>
image_channels_last        True
label_dtype                <dtype: 'float32'>
label_shape                (22, 7)
label_structure            TensorStructure.SINGLE_TENSOR
label_sparse               False
label_bounding_box_format  BoundingBoxFormat.TENSORFLOW
-------------------------  -----------------------------
MASTERFUL [23:18:54]: Meta-learning test dataset parameters...
MASTERFUL [23:18:55]: Test dataset learner finished at 23:18:55 in 1 seconds (1s), returned:
-------------------------  -----------------------------
num_classes                2
task                       Task.DETECTION
image_shape                (640, 640, 3)
image_range                ImageRange.ZERO_255
image_dtype                <dtype: 'float32'>
image_channels_last        True
label_dtype                <dtype: 'float32'>
label_shape                (14, 7)
label_structure            TensorStructure.SINGLE_TENSOR
label_sparse               False
label_bounding_box_format  BoundingBoxFormat.TENSORFLOW
-------------------------  -----------------------------
MASTERFUL [23:18:55]: Meta-Learning optimization parameters...
Callbacks: 100%|███████████████████████████████| 5/5 [03:16<00:00, 39.29s/steps]
MASTERFUL [23:22:12]: Optimization learner finished at 23:22:12 in 198 seconds (3m 18s), returned:
-----------------------  -----------------------------------------------------------------
batch_size               4
drop_remainder           True
epochs                   1000000
learning_rate            0.0004419417236931622
learning_rate_schedule
learning_rate_callback   <keras.callbacks.ReduceLROnPlateau object at 0x7faf94764fd0>
warmup_learning_rate     1e-06
warmup_epochs            5
optimizer                <tensorflow_addons.optimizers.lamb.LAMB object at 0x7faf94716320>
loss
loss_weights
early_stopping_callback  <keras.callbacks.EarlyStopping object at 0x7faf94716dd8>
metrics
readonly_callbacks
-----------------------  -----------------------------------------------------------------
MASTERFUL [23:22:18]: Meta-Learning Regularization Parameters...
MASTERFUL [23:22:54]: Warming up model for analysis.
MASTERFUL [23:23:02]:   Warming up batch norm statistics (this could take a few minutes).
MASTERFUL [23:35:35]:   Warming up training for 4805 steps.
100%|████████████████████████████████████| 4805/4805 [17:28<00:00,  4.58steps/s]
MASTERFUL [23:53:04]:   Validating batch norm statistics after warmup for stability (this could take a few minutes).
MASTERFUL [00:05:39]: Analyzing baseline model performance. Training until validation loss stabilizes...
Baseline Training: 100%|█████████████| 30943/30943 [1:33:26<00:00,  5.52steps/s]
MASTERFUL [01:42:57]: Baseline training complete.
MASTERFUL [01:42:57]: Meta-Learning Basic Data Augmentations...
Node 1/4: 100%|██████████████████████| 21340/21340 [1:03:43<00:00,  5.58steps/s]
Node 2/4: 100%|██████████████████████| 21340/21340 [1:03:37<00:00,  5.59steps/s]
Node 3/4: 100%|██████████████████████| 21340/21340 [1:06:11<00:00,  5.37steps/s]
Node 4/4: 100%|██████████████████████| 21340/21340 [1:05:01<00:00,  5.47steps/s]
MASTERFUL [06:17:04]: Meta-Learning Data Augmentation Clusters...
Distance Analysis: 100%|███████████████████| 143/143 [29:49<00:00, 12.52s/steps]
Node 1/10: 100%|█████████████████████| 21340/21340 [1:16:14<00:00,  4.66steps/s]
Node 2/10: 100%|█████████████████████| 21340/21340 [1:16:21<00:00,  4.66steps/s]
Node 3/10: 100%|█████████████████████| 21340/21340 [1:16:20<00:00,  4.66steps/s]
Node 4/10: 100%|█████████████████████| 21340/21340 [1:16:30<00:00,  4.65steps/s]
Node 5/10: 100%|█████████████████████| 21340/21340 [1:16:04<00:00,  4.68steps/s]
Distance Analysis: 100%|█████████████████████| 66/66 [13:44<00:00, 12.49s/steps]
Node 6/10: 100%|█████████████████████| 21340/21340 [1:21:00<00:00,  4.39steps/s]
Node 7/10: 100%|█████████████████████| 21340/21340 [1:20:40<00:00,  4.41steps/s]
Node 8/10: 100%|█████████████████████| 21340/21340 [1:20:52<00:00,  4.40steps/s]
Node 9/10: 100%|█████████████████████| 21340/21340 [1:21:00<00:00,  4.39steps/s]
Node 10/10: 100%|████████████████████| 21340/21340 [1:20:51<00:00,  4.40steps/s]
MASTERFUL [20:52:44]: Meta-Learning Label Based Regularization...
Node 1/2: 100%|██████████████████████| 21340/21340 [1:16:44<00:00,  4.64steps/s]
Node 2/2: 100%|██████████████████████| 21340/21340 [1:16:42<00:00,  4.64steps/s]
MASTERFUL [23:35:09]: Meta-Learning Weight Based Regularization...
MASTERFUL [23:35:10]: Analysis finished in 1452.4810159842173 minutes.
MASTERFUL [23:35:10]: Learned parameters court-excited-approval saved at /home/sam/.masterful/policies/court-excited-approval.
MASTERFUL [23:35:10]: Regularization learner finished at 23:35:10 in 87178 seconds (1d 0h 12m 58s), returned:
-------------------------  -----------------------------------------------
shuffle_buffer_size        3845
mirror                     0.0
rot90                      0.0
rotate                     0
mixup                      0.0
cutmix                     0.0
label_smoothing            0
hsv_cluster                1
hsv_cluster_to_index       [[ 1  3  5  6  8 11]
                            [ 2  3  5  6  8 11]
                            [ 2  2  2  5  7 11]
                            [ 3  3  5  6  9 11]
                            [ 2  4  6 10 11 11]]
hsv_magnitude_table        [[ 10   0  20  60 100  30  70  40  90  50  80]
                            [ 70  30  60  50  80  90  10 100   0  20  40]
                            [ 80 100  50  20   0  60  30  40  10  70  90]
                            [ 90  30  50  40  80  10   0 100  20  70  60]
                            [  0  70  80  30  60  20  10  50 100  90  40]]
contrast_cluster           1
contrast_cluster_to_index  [[ 1  1  5  8  8 11]
                            [ 2  4  4  6  9 11]
                            [ 2  5  8  8  9 11]
                            [ 1  4  5  6 10 11]
                            [ 3  6  7  9 11 11]
                            [ 1  3  4  8  9 11]]
contrast_magnitude_table   [[100  70  90   0  40  20  10  80  30  60  50]
                            [100  70  20  90  50  40   0  60  30  80  10]
                            [ 90  80  70  30   0  60  10  40 100  50  20]
                            [ 60  70  80  30  40   0  50 100  90  10  20]
                            [ 20  40  60   0 100  80  50  70  10  30  90]
                            [  0  20  70 100  90  60  30  10  50  80  40]]
blur_cluster               1
blur_cluster_to_index      [[ 2  3  8  9 10 11]
                            [ 2  7  8  8 10 11]]
blur_magnitude_table       [[  0 100  30  90  50  60  70  20  40  10  80]
                            [ 70  80 100  40  20  60  30  90  10  50   0]]
spatial_cluster            0
spatial_cluster_to_index   [[ 1  4  6  6 10 11]
                            [ 1  3  5  7  8 11]
                            [ 2  4  4  8  9 11]
                            [ 4  6  7  7  8 11]
                            [ 1  1  2  6  9 11]
                            [ 2  4  9 10 11 11]]
spatial_magnitude_table    [[ 30  20  70  50 100  10  60  80  90   0  40]
                            [ 90   0  10  30  60  70  80  50  40 100  20]
                            [ 10 100  40  50  20   0  60  90  30  80  70]
                            [ 70 100  90  50  30  40  80  10  60  20   0]
                            [100  40  70  60  50   0  90  10  30  20  80]
                            [ 90  20  50   0  10  60  80  30  70  40 100]]
synthetic_proportion       [0.0]
-------------------------  -----------------------------------------------
MASTERFUL [23:35:10]: Learning SSL parameters...
MASTERFUL [23:35:11]: SSL learner finished at 23:35:11 in 1 seconds (1s), returned:
----------  --
algorithms  []
----------  --
MASTERFUL [23:35:12]: Training model with semi-supervised learning disabled.
MASTERFUL [23:35:16]: Performing basic dataset analysis.
MASTERFUL [23:35:41]: Masterful will use 427 labeled examples as a validation set since no validation data was provided.
MASTERFUL [23:35:41]: Training model with:
MASTERFUL [23:35:41]:   3845 labeled examples.
MASTERFUL [23:35:41]:   427 validation examples.
MASTERFUL [23:35:41]:   0 synthetic examples.
MASTERFUL [23:35:41]:   0 unlabeled examples.
MASTERFUL [23:35:43]: Training model with learned parameters court-excited-approval in two phases.
MASTERFUL [23:35:43]: The first phase is supervised training with the learned parameters.
MASTERFUL [23:35:43]: The second phase is semi-supervised training to boost performance.
MASTERFUL [23:35:50]: Warming up model for supervised training.
MASTERFUL [23:35:57]:   Warming up batch norm statistics (this could take a few minutes).
MASTERFUL [23:49:12]:   Warming up training for 4805 steps.
100%|████████████████████████████████████| 4805/4805 [25:18<00:00,  3.17steps/s]
MASTERFUL [00:14:31]:   Validating batch norm statistics after warmup for stability (this could take a few minutes).
MASTERFUL [00:27:38]: Starting Phase 1: Supervised training until the validation loss stabilizes...
Supervised Training: 100%|███████████| 37345/37345 [2:48:30<00:00,  3.69steps/s]
MASTERFUL [03:21:43]: Semi-Supervised training disabled in parameters.
MASTERFUL [03:21:45]: Training complete in 226.01122649510702 minutes.
MASTERFUL [03:22:31]: ************************************
MASTERFUL [03:22:31]: Evaluating model on 120946 examples from the 'test' dataset split:
Evaluating: 100%|███████████████████████| 120946/120946 [58:21<00:00, 34.54it/s]
MASTERFUL [04:20:53]:   Loss: 0.0311
MASTERFUL [04:20:53]:   Loss/Rpnloss/Localization Loss: 0.0000
MASTERFUL [04:20:53]:   Loss/Rpnloss/Objectness Loss: 0.0001
MASTERFUL [04:20:53]:   Loss/Boxclassifierloss/Localization Loss: 0.0099
MASTERFUL [04:20:53]:   Loss/Boxclassifierloss/Classification Loss: 0.0211
MASTERFUL [04:20:53]:   Loss/Regularization Loss: 0.0000
MASTERFUL [04:20:53]:   Loss/Total Loss: 0.0311
MASTERFUL [04:20:53]: Calculating MSCOCO evaluation metrics:
MSCOCO Metrics: 100%|█████████████████| 120946/120946 [4:10:49<00:00,  8.04it/s]
MASTERFUL [08:33:39]:   DetectionBoxes_Precision/mAP: 0.0840
MASTERFUL [08:33:39]:   DetectionBoxes_Precision/mAP@.50IOU: 0.1916
MASTERFUL [08:33:39]:   DetectionBoxes_Precision/mAP@.75IOU: 0.0617
MASTERFUL [08:33:39]:   DetectionBoxes_Precision/mAP (small): 0.0328
MASTERFUL [08:33:39]:   DetectionBoxes_Precision/mAP (medium): 0.1510
MASTERFUL [08:33:39]:   DetectionBoxes_Precision/mAP (large): 0.2792
MASTERFUL [08:33:39]:   DetectionBoxes_Recall/AR@1: 0.0823
MASTERFUL [08:33:39]:   DetectionBoxes_Recall/AR@10: 0.1644
MASTERFUL [08:33:39]:   DetectionBoxes_Recall/AR@100: 0.1662
MASTERFUL [08:33:39]:   DetectionBoxes_Recall/AR@100 (small): 0.1098
MASTERFUL [08:33:39]:   DetectionBoxes_Recall/AR@100 (medium): 0.2380
MASTERFUL [08:33:39]:   DetectionBoxes_Recall/AR@100 (large): 0.4572
MASTERFUL [08:33:39]:   DetectionBoxes_PerformanceByCategory/mAP/person: 0.1607
MASTERFUL [08:33:39]:   DetectionBoxes_PerformanceByCategory/mAP/people: 0.0073
Confusion Matrix: 100%|███████████████| 120946/120946 [3:37:59<00:00,  9.25it/s]
MASTERFUL [12:11:38]:   Confusion Matrix:
MASTERFUL [12:11:38]:               |     person|     people| background|
MASTERFUL [12:11:38]:         person|    52001.0|     1451.0|    86501.0|
MASTERFUL [12:11:38]:         people|     3480.0|      854.0|    10149.0|
MASTERFUL [12:11:38]:     background|     7823.0|     1468.0|    53028.0|
MASTERFUL [12:11:38]:     Confusion matrix columns represent the prediction labels and the rows represent the real labels.
MASTERFUL [12:11:39]: Saving model output to /tmp/tmplrymi9nr/session-00000.
MASTERFUL [12:11:46]:     Saving saved_model output to /tmp/tmplrymi9nr/session-00000/saved_model
MASTERFUL [12:12:14]: Saving evaluation metrics to /tmp/tmplrymi9nr/session-00000/evaluation_metrics.csv.
MASTERFUL [12:12:14]: Saving regularization params to /tmp/tmplrymi9nr/session-00000/regularization.params.
MASTERFUL [12:12:14]: Saving confusion matrix to /tmp/tmplrymi9nr/session-00000/confusion_matrix.csv.
MASTERFUL [12:12:14]: Total elapsed training time: 2255 minutes (1d 13h 34m 54s).
MASTERFUL [12:12:14]: Launch masterful-gui to visualize the training results: policy name 'court-excited-approval'

Analyze the Results

At the end of training, Masterful will evaluate your model on the test (sometimes called holdout) dataset you specified in the evaluation section of your configuration file. The evaluation results are different for each task, since every computer vision task has both general metrics (such as loss) and task-specific metrics (IoU for semantic segmentation, for example). These results are printed to the console and saved into a CSV file in the output directory specified in your configuration file.
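
Once training finishes, you can load the saved metrics for further analysis. A minimal sketch, assuming the session directory name shown in the training log above (session-00000) and that pandas is installed:

# Minimal sketch: load the evaluation metrics CSV written by the trainer.
# The session directory name is taken from the training log above and may
# differ for your run.
import os
import pandas as pd

metrics_csv = os.path.join(MODEL_OUTPUT_DIRECTORY, "session-00000",
                           "evaluation_metrics.csv")
print(pd.read_csv(metrics_csv))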

Object Detection Evaluation Metrics

For Object Detection, Masterful reports three main categories of metrics: Model Loss, COCO Metrics, and the Confusion Matrix.

Model Loss

The first set of metrics Masterful reports is the overall loss of the model. For object detection models, Masterful reports the total loss as well as any sub-losses that are specific to the model architecture. For example, the loss is typically composed of a localization loss (which reflects the location and size of the bounding box) and a classification loss (which reflects which object is in the box). Different models have different sub-losses, and you can gauge how your model is doing on the different tasks based on the final results of these sub-metrics.

The following are the loss results from a training run of this model (your results may differ if you rerun the cells above):

MASTERFUL [13:34:37]: ************************************
MASTERFUL [13:34:38]: Evaluating model on 64999 examples from the 'test' dataset split:
64999/64999 [==============================] - 3277s 50ms/step - loss: 0.3804 - Loss/RPNLoss/localization_loss: 0.1288 - Loss/RPNLoss/objectness_loss: 0.0497 - Loss/BoxClassifierLoss/localization_loss: 0.0720 - Loss/BoxClassifierLoss/classification_loss: 0.1299 - Loss/regularization_loss: 0.0000e+00 - Loss/total_loss: 0.3804
MASTERFUL [14:29:15]:   Loss: 0.0647
MASTERFUL [14:29:15]:   Loss/Rpnloss/Localization Loss: 0.0000
MASTERFUL [14:29:15]:   Loss/Rpnloss/Objectness Loss: 0.0005
MASTERFUL [14:29:15]:   Loss/Boxclassifierloss/Localization Loss: 0.0144
MASTERFUL [14:29:15]:   Loss/Boxclassifierloss/Classification Loss: 0.0498
MASTERFUL [14:29:15]:   Loss/Regularization Loss: 0.0000
MASTERFUL [14:29:15]:   Loss/Total Loss: 0.0647

As you can see in the above output, the total loss of the model is Loss: 0.0647 (which matches the Loss/Total Loss output as well). You can check the localization and classification losses as well, and you can see that the classification loss (0.0498) is noticeably higher than the localization loss (0.0144). These seem like pretty good loss values: they are relatively low and close to zero, which is the goal of the optimizer during training. However, it is difficult to understand intuitively what these numbers mean for your model's overall performance. Are these good values or bad ones? Let's dive into the rest of the metrics to answer these questions.
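
As a quick sanity check (simple arithmetic, not a Masterful API), the reported total loss is just the sum of the reported sub-losses:

# The total loss is the sum of the sub-losses reported above.
rpn_localization = 0.0000
rpn_objectness = 0.0005
box_localization = 0.0144
box_classification = 0.0498
regularization = 0.0000

total_loss = (rpn_localization + rpn_objectness + box_localization +
              box_classification + regularization)
print(round(total_loss, 4))  # 0.0647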

COCO Metrics

Common Objects in Context (COCO) is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:

  • Object segmentation

  • Recognition in context

  • Superpixel stuff segmentation

  • 330K images (>200K labeled)

  • 1.5 million object instances

  • 80 object categories

  • 91 stuff categories

  • 5 captions per image

  • 250,000 people with keypoints

The dataset itself is used in many research papers as a standard object detection benchmark. Importantly, COCO also defines a standard set of metrics for evaluating object detection models, commonly called the COCO (MS-COCO) Evaluation Metrics.

The following 12 metrics are used for characterizing the performance of an object detector as part of the COCO detection challenge:

MSCOCO Metrics:

  • AP: Average Precision averaged over IoU thresholds .50:.05:.95 (the primary challenge metric)

  • AP@.50IOU and AP@.75IOU: AP at a single IoU threshold of 0.5 or 0.75

  • AP (small), AP (medium), AP (large): AP for small, medium, and large objects

  • AR@1, AR@10, AR@100: Average Recall given 1, 10, or 100 detections per image

  • AR (small), AR (medium), AR (large): AR@100 for small, medium, and large objects

  • Unless otherwise specified, AP and AR are averaged over multiple Intersection over Union (IoU) values. Specifically we use 10 IoU thresholds of .50:.05:.95. This is a break from tradition, where AP is computed at a single IoU of .50 (which corresponds to our metric AP@.50IOU). Averaging over IoUs rewards detectors with better localization.

  • AP is averaged over all categories. Traditionally, this is called “mean average precision” (mAP). We make no distinction between AP and mAP (and likewise AR and mAR) and assume the difference is clear from context.

  • AP (averaged across all 10 IoU thresholds and all 80 categories) will determine the challenge winner. This should be considered the single most important metric when considering performance on COCO. In COCO, there are more small objects than large objects. Specifically: approximately 41% of objects are small (area < 32²), 34% are medium (32² < area < 96²), and 24% are large (area > 96²). Area is measured as the number of pixels in the segmentation mask.

  • AR is the maximum recall given a fixed number of detections per image, averaged over categories and IoUs. AR is related to the metric of the same name used in proposal evaluation but is computed on a per-category basis.

  • All metrics are computed allowing for at most 100 top-scoring detections per image (across all categories).

  • The evaluation metrics for detection with bounding boxes and segmentation masks are identical in all respects except for the IoU computation (which is performed over boxes or masks, respectively).

For more information about concepts like Average Precision, Recall, and IoU in the context of Object Detection, there are several good walkthroughs with worked examples available online; they will give you a better intuitive sense of what these metrics are trying to convey.
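
To make the IoU thresholds above concrete, the following sketch computes the IoU between two boxes in [ymin, xmin, ymax, xmax] pixel coordinates. This is purely illustrative and not a Masterful utility:

# Minimal sketch: Intersection over Union (IoU) between two boxes in
# [ymin, xmin, ymax, xmax] pixel coordinates.
def intersection_over_union(box_a, box_b):
  ymin = max(box_a[0], box_b[0])
  xmin = max(box_a[1], box_b[1])
  ymax = min(box_a[2], box_b[2])
  xmax = min(box_a[3], box_b[3])

  intersection = max(0.0, ymax - ymin) * max(0.0, xmax - xmin)
  area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
  area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
  union = area_a + area_b - intersection
  return intersection / union if union > 0 else 0.0

# Two heavily overlapping boxes score well above the 0.5 threshold.
print(intersection_over_union([150, 590, 240, 630], [155, 595, 235, 625]))  # ~0.67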

Below, you will find the COCO metrics for the model you trained above:

MASTERFUL [12:01:29]: Calculating MSCOCO evaluation metrics:
MSCOCO Metrics: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 64999/64999 [2:13:00<00:00,  8.15it/s]
MASTERFUL [14:15:46]:   DetectionBoxes_Precision/mAP: 0.0823
MASTERFUL [14:15:46]:   DetectionBoxes_Precision/mAP@.50IOU: 0.1887
MASTERFUL [14:15:46]:   DetectionBoxes_Precision/mAP@.75IOU: 0.0580
MASTERFUL [14:15:46]:   DetectionBoxes_Precision/mAP (small): 0.0311
MASTERFUL [14:15:46]:   DetectionBoxes_Precision/mAP (medium): 0.1448
MASTERFUL [14:15:47]:   DetectionBoxes_Precision/mAP (large): 0.3358
MASTERFUL [14:15:47]:   DetectionBoxes_Recall/AR@1: 0.0805
MASTERFUL [14:15:47]:   DetectionBoxes_Recall/AR@10: 0.1607
MASTERFUL [14:15:47]:   DetectionBoxes_Recall/AR@100: 0.1636
MASTERFUL [14:15:47]:   DetectionBoxes_Recall/AR@100 (small): 0.1052
MASTERFUL [14:15:47]:   DetectionBoxes_Recall/AR@100 (medium): 0.2385
MASTERFUL [14:15:47]:   DetectionBoxes_Recall/AR@100 (large): 0.5039
MASTERFUL [14:15:47]:   DetectionBoxes_PerformanceByCategory/mAP/person: 0.1560
MASTERFUL [14:15:47]:   DetectionBoxes_PerformanceByCategory/mAP/people: 0.0087

The primary aggregated metric above is DetectionBoxes_Precision/mAP, which is the Average Precision calculated over all of the IoU thresholds. Secondarily, DetectionBoxes_Precision/mAP@.50IOU corresponds to the Average Precision at just the 0.5 IoU threshold, which is a good intuitive indicator of performance and corresponds to the older Pascal VOC object detection metric. The model scores relatively low on both of these metrics (0.0823 and 0.1887, respectively).

As you can see in the per-category metrics (DetectionBoxes_PerformanceByCategory), the model scores significantly higher on the person class than on the people class, which means the model is better at detecting individual people than groups of people. This illuminates a defect in the training dataset: the people class is really just a failure of the labelers to identify individual pedestrians in the image, and is a measure of a human labeler's confusion about when to group people together versus when to label them individually. Importantly, this confusion has transferred over to the model.

You can also see that the mAP@.50IOU metric is roughly 2.3x the overall mAP value, and the per-category metrics demonstrate that the overall numbers are dominated almost entirely by performance on the person class. So the model's performance on person at IoU=0.5 is likely in the 0.35-0.40 range, which is a respectable score. One way to verify this is to look at a few example predictions on the dataset and visually confirm the model is making good predictions.
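
One rough way to arrive at that estimate (an approximation, not a reported metric) is to scale the per-category person mAP by the overall ratio of mAP@.50IOU to mAP:

# Back-of-the-envelope estimate of the person AP at IoU=0.5, using the
# metrics reported above. This is an approximation, not an exact value.
overall_map = 0.0823
overall_map_50 = 0.1887
person_map = 0.1560

estimated_person_ap_50 = person_map * (overall_map_50 / overall_map)
print(round(estimated_person_ap_50, 2))  # roughly 0.36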

Confusion Matrix

The final set of metrics Masterful provides for Object Detection is the confusion matrix of predictions versus labels. The confusion matrix helps you figure out where and how the model makes mistakes.

Below is the output for the confusion matrix from the training process:

Confusion Matrix: 100%|████████████████████████████████████████████████████████████| 64999/64999 [2:05:53<00:00,  8.61it/s]
MASTERFUL [18:52:13]:   Confusion Matrix:
MASTERFUL [18:52:13]:            |  person|  people|
MASTERFUL [18:52:13]:      person| 61609.0|  2180.0|
MASTERFUL [18:52:13]:      people|  4578.0|   839.0|
MASTERFUL [18:52:13]:     Confusion matrix columns represent the prediction labels and the rows represent the real labels.

In the above output, the columns represent the predictions made by the model. So the person column represents all of the examples where the model predicted person. Within that column, each row represents the label in the dataset. So for the person column, the person row represents all of the examples where the model predicted person and the label in the dataset was also person. These are the true positives - the examples where the model made a correct positive prediction. In the person column, the people row represents the examples where the model predicted person but the true label was actually people. These are false positives - examples where the model correctly detected an object but predicted the wrong class.

However, because a detection task requires both correctly identifying the presence of an object and correctly classifying the type of that object, it is also possible for the model to predict the existence of an object that does not exist. This is quantified in the background row of the confusion matrix (the full matrix from the training run above includes explicit background entries). For example, in the person column, the background cell represents all of the examples where the model predicted a person, but there was no actual person in that location or image. These are false positives where the model predicted the incorrect existence of a person, rather than correctly predicting the existence but mis-categorizing the object instance.

But if the background row contains predictions where there should not have been any, what does the background column represent? The background column represents predictions that were not made. In other words, for the background column and person row, the cell represents the examples where a person existed but the model failed to predict a person at that position. These are the false negative predictions of the model.

As you can see, the diagonal of the confusion matrix represents all of the correct positive predictions of the model. The off-diagonals represent different mistakes the model made in its predictions.

In the above explanation, you have seen the location of the true positives (the diagonal) and the location of the false positives (off-diagonals) and false negatives (background column). One question you might ask is: where are the true negatives - the examples where there were no instances of an object and the model correctly made no predictions? There are two ways to interpret true negatives from the above matrix. The first is as true positives on the background class - the diagonal entry in the background column and row. These are the examples in the dataset with no objects where the model correctly predicted no objects. In this dataset however, every example has an object. You can see this in the test dataset analysis output for the Num Empty Images (No Labels) row:

MASTERFUL [10:13:32]: Test Dataset Analysis:
-----------------------------  ----------------------------------------
Total Examples                 100
Label Counts                   0  70
                               1  57
Label Distribution             0  0.112
                               1  0.0912
.
.
.
Num Empty Images (No Labels):  0
Total Objects                  625

Another way to categorize true negatives is to note that every true positive on one class is also a true negative for every other class - correctly predicting the existence of one class is also correctly predicting the absence of all other classes. However, neither of these definitions provides much value, and importantly they can inflate derived metrics. For example, accuracy is commonly defined as the number of "correct" predictions (true positives plus true negatives) over the total number of objects. In object detection, however, this would drastically over-estimate the accuracy of the model, because counting true negatives across all classes would most likely yield more true negatives than there are object instances in the dataset. This is historically why object detection has relied on precision and recall rather than accuracy: neither precision nor recall includes a true negative term in its definition.
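
As an illustration of how precision and recall fall out of the confusion matrix, the following sketch computes them for the person class using the full confusion matrix from the training run above (the one that includes the background row and column):

# Minimal sketch: derive precision and recall for the 'person' class from
# the confusion matrix printed during training (rows are true labels,
# columns are predictions: person, people, background).
confusion = {
    "person":     [52001.0, 1451.0, 86501.0],
    "people":     [3480.0,   854.0, 10149.0],
    "background": [7823.0,  1468.0, 53028.0],
}

true_positives = confusion["person"][0]
predicted_person = sum(row[0] for row in confusion.values())  # column sum
labeled_person = sum(confusion["person"])                     # row sum

precision = true_positives / predicted_person  # ~0.82
recall = true_positives / labeled_person       # ~0.37
print(f"person precision: {precision:.2f}, recall: {recall:.2f}")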

View the Predictions

While the mAP metrics are a good quantitative indicator of your model's performance, visualizing the predictions can help you get a qualitative sense of how well your trained model is performing. Below you can see the predictions the trained model makes on the examples you visualized above. The red boxes are the labels from the dataset, and the green boxes are the predictions made by the model.

[3]:
# Download the pretrained model so that you can
# visualize the results.
MODEL_URL = "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/caltech_pedestrians_saved_model.tar.gz"

model = None
with tempfile.TemporaryDirectory() as temp_directory:
  # Download the pretrained model from S3.
  saved_model_path = os.path.join(temp_directory, "saved_model.tar.gz")
  _ =  urllib.request.urlretrieve(MODEL_URL, saved_model_path, progress_bar_factory("Downloading Model: "))

  # Extract the model weights from the tar file.
  with tarfile.open(saved_model_path) as tar:
    tar.extractall(temp_directory)
  saved_model_path = os.path.join(temp_directory, "saved_model")

  # Load the trained tensorflow saved model.
  model = tf.saved_model.load(saved_model_path)

  # Plot a few different instances from the dataset.
  f, axarr = plt.subplots(2, 2, figsize=(15,15))
  _ = axarr[0, 0].imshow(create_annotated_image(examples[0], DATASET_ROOT, model))
  _ = axarr[0, 1].imshow(create_annotated_image(examples[1], DATASET_ROOT, model))
  _ = axarr[1, 0].imshow(create_annotated_image(examples[2], DATASET_ROOT, model))
  _ = axarr[1, 1].imshow(create_annotated_image(examples[3], DATASET_ROOT, model))
Downloading Model: 100% (106217119 of 106217119) || Elapsed Time: 0:00:04 Time:  0:00:04
[Image: the four example images annotated with dataset labels (red) and model predictions (green)]

As you can see in the above examples, the model does a decent job of finding the pedestrians in the first two images. In the third image, the model misses a highly occluded pedestrian behind the white truck on the left, but this is an incredibly challenging instance to detect, given the low resolution of the video and the fact that, at most, the head of the pedestrian is visible through the windshield of the truck. Likewise, in the fourth image, the model misses the group of people that are highly occluded behind the cars on the left - another challenging miss, given that the people are barely visible. So even though the overall mAP (0.0823) and mAP@.50IOU (0.1887) of the model are relatively low, it looks like this first attempt at training has done a pretty good job of detecting pedestrians in low-resolution video images!

Using the Model for Inference

The Output Formats guide has more information about how to use the models output by Masterful. Object Detection has a few additional intricacies that should be addressed, though, since the model outputs the location of an object, the class of that object, and a measure of the confidence in the prediction. First, the code below shows how to make a prediction using the model loaded above on the examples you visualized in the View the Predictions section:

[5]:
# This image corresponds to the first image (upper left)
# in the `View the Predictions` section above.
IMAGE_URL = "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/data/images/set00_V007_1770.png"

# Download the image to use for generating a prediction
# and convert it into a tensor.
http_response = requests.get(IMAGE_URL, stream=True)
image = tf.io.decode_image(http_response.raw.read())

image_height = image.shape[0]
image_width = image.shape[1]
print(f"Image Shape: height={image_height} width={image_width}.")

# The default inference function for tensorflow saved
# models is named `serving_default`.
inference_fn = model.signatures["serving_default"]

# The inference function is a one-arg callable, whose
# input argument is `image` - the image to predict on,
# and which returns a dictionary of outputs. The dictionary
# contains an item whose key is `prediction`, which is the
# predictions of the model.
predictions = inference_fn(image=image)['prediction']

# Convert the predictions to NumPy and remove the batch
# dimension for easier processing.
predictions = predictions.numpy()[0]

# The predictions from an object detection model
# have the shape [batch size, number of boxes, 6].
# The batch size will always be 1 for inference models,
# since they expect only one image to predict on.
# The number of boxes predicted depends on the model
# configuration and represents the maximum number of boxes
# predicted for each example. A typical value is 100.
print(f"Predictions shape is {predictions.shape}")
print(f"Maximum number of boxes predicted is {predictions.shape[0]}")

# The final dimension of the prediction represents the boxes,
# classes, and confidences predicted by the model. This dimension
# is of form: [ymin, xmin, ymax, xmax, class id, confidence].
# The box values are normalized against the image size, and are
# in the range [0,1].
#
# The confidence score is a value in the range [0,1] which represents
# the confidence of the prediction, similar to the confidence
# of a classification model represented by the softmax probabilities
# of the class predictions. The first step in determining the `true`
# predictions is to threshold the predictions by the confidence
# score. A typical scheme is to use only the predictions whose
# confidence is greater than 0.5 as the threshold, but you can
# experiment with different threshold values for your use case.
threshold = 0.5
true_predictions = predictions[predictions[:, 5] > threshold]
print(f"Found {true_predictions.shape[0]} confident predictions out of {predictions.shape[0]} total predictions.")

# Print the model predictions above the given threshold
label_map = {0: 'person', 1: 'people'}
for i in range(true_predictions.shape[0]):
  print(f"Prediction {i}:")
  print(f"  Class:  {label_map[true_predictions[i][4]]}")
  print(f"  Left:   {int(true_predictions[i][1]*image_width)}")
  print(f"  Top:    {int(true_predictions[i][0]*image_height)}")
  print(f"  Right:  {int(true_predictions[i][3]*image_width)}")
  print(f"  Bottom: {int(true_predictions[i][2]*image_height)}")
Image Shape: height=480 width=640.
Predictions shape is (300, 6)
Maximum number of boxes predicted is 300
Found 4 confident predictions out of 300 total predictions.
Prediction 0:
  Class:  person
  Left:   596
  Top:    148
  Right:  629
  Bottom: 240
Prediction 1:
  Class:  person
  Left:   193
  Top:    159
  Right:  209
  Bottom: 198
Prediction 2:
  Class:  person
  Left:   593
  Top:    155
  Right:  607
  Bottom: 214
Prediction 3:
  Class:  person
  Left:   595
  Top:    153
  Right:  611
  Bottom: 222

You can compare the raw prediction values above with the visual results for the first image in the View the Predictions section, and see that these predictions correspond to the two green boxes annotated on that image.
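
Note that three of the four confident predictions above (predictions 0, 2, and 3) cover roughly the same region on the right side of the image, which helps explain why only two distinct green boxes appear in the visualization. If overlapping boxes are a problem for your application, one option is to apply non-maximum suppression to the thresholded predictions. The sketch below is not required by Masterful; it simply uses tf.image.non_max_suppression on the true_predictions array from the cell above:

# Optional post-processing: collapse heavily overlapping predictions with
# non-maximum suppression (NMS). Assumes `true_predictions` from the cell
# above, with rows of [ymin, xmin, ymax, xmax, class id, confidence].
boxes = true_predictions[:, 0:4]
scores = true_predictions[:, 5]

# Keep at most 100 boxes, suppressing any box whose IoU with a
# higher-scoring box exceeds 0.5.
selected = tf.image.non_max_suppression(
    boxes, scores, max_output_size=100, iou_threshold=0.5)
nms_predictions = true_predictions[selected.numpy()]
print(f"{nms_predictions.shape[0]} predictions remain after NMS.")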

Caltech Pedestrians is a pretty old dataset by modern standards, captured with cameras that are obsolete today. So how well does the model perform on modern SLI imagery? Typically, modern camera captures would be “out of distribution” data for this model, and you would expect domain shift to impact the performance of the model when it is evaluated against them. Below, however, you can use the model to detect pedestrians in modern captures and see how well it performs.

[23]:
IMAGE_URLS = [
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/additional_examples/sli_1.png",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/additional_examples/sli_2.png",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/additional_examples/sli_3.png",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/additional_examples/sli_4.png",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/additional_examples/sli_5.png",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/additional_examples/sli_6.png",
]

# Download the images to use for generating a prediction
# and convert them into a tensor.
images = []
for image_url in IMAGE_URLS:
  http_response = requests.get(image_url, stream=True)
  image = tf.io.decode_image(http_response.raw.read())

  # The images are 4-channel PNGs; you only need the RGB channels
  # for prediction.
  images.append(image[..., 0:3])

# Use the model to predict on each image
predictions = []
for image in images:
  pred = inference_fn(image=image)['prediction']

  # Convert the predictions to NumPy for easier processing,
  # and remove the batch dimension.
  pred = pred.numpy()[0]

  # Threshold the predictions by confidence
  threshold = 0.5
  true_pred = pred[pred[:, 5] > threshold]
  predictions.append(true_pred)

import numpy as np
from PIL import Image, ImageDraw
def annotate_image_with_boxes(image, boxes):
  pil_image = Image.fromarray(image.numpy(), 'RGB')
  drawing = ImageDraw.Draw(pil_image)

  # Convert the normalized coordinates into pixel coordinates
  image_height = image.shape[0]
  image_width = image.shape[1]
  for i in range(boxes.shape[0]):
    left = boxes[i][1] * image_width
    top = boxes[i][0] * image_height
    right = boxes[i][3] * image_width
    bottom = boxes[i][2] * image_height
    drawing.rectangle((left, top, right, bottom), outline=(0, 255, 0), width=4)
  return np.array(pil_image)


f, axarr = plt.subplots(6, 1, figsize=(50,50))
for i in range(len(IMAGE_URLS)):
  _ = axarr[i].imshow(annotate_image_with_boxes(images[i], predictions[i]))

../_images/notebooks_guide_cli_object_detection_pedestrians_29_0.png

As you can see above, the model makes some good predictions, but it also fails to identify quite a few pedestrians, for example the pedestrian carrying a surfboard in the bottom image. The model actually makes some good predictions at night, even though the Caltech Pedestrians dataset you trained on contains no night time training images, which suggests that the model, even with the limited training data used, generalizes decently well when trained with the Masterful platform.

Next Steps

In the above sections, you have learned how to train the model using the Masterful CLI, evaluate the performance of the model, and use the model to make predictions on your production data.

In the next series of guides, you will learn how to improve the performance of this model without labeling any additional data - using the power of Semi-Supervised Learning inside of Masterful. This is a powerful technique to maximize your production models without increasing your labeling budget or waiting for additional data to come back from your labeling service.

Appendix

In this section you will learn how to convert the videos provided by Caltech into the representations used above. Converting datasets is different for every dataset and use case, so you will not be able to reuse this code verbatim anywhere else. But the steps are similar for each dataset, so you will get a sense for the overall process, and hopefully this will make it easier to convert your own datasets.

Extracting the Images from the Video

The video files in the Caltech dataset are stored as raw video sequences, which can be opened using the OpenCV library. In this section, you will extract the raw images from each video into separate PNG files.

[2]:
# Install dependencies necessary to run the following
# code.
!pip install progressbar2 --quiet

import cv2
import glob
import os
import tarfile
import tempfile
import urllib.request

# These are mirrors of the original data which is located
# in google drive at:
# https://drive.google.com/drive/folders/1IBlcJP8YsCaT81LwQ2YwQJac8bf1q8xF
SET_URLS = [
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set00.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set01.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set02.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set03.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set04.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set05.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set06.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set07.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set08.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set09.tar",
  "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/set10.tar",
]
SETS = { set_url: f"set{i:02}" for i, set_url in enumerate(SET_URLS) }

# Download the tar files to a local directory so that
# you can extract and process them.
data_directory = tempfile.TemporaryDirectory()

# The video sequence data will live under
# <data_directory>/video_sequences
sequence_directory = os.path.join(data_directory.name, "video_sequences")
os.makedirs(sequence_directory, exist_ok=True)

for url in SETS:
  tar_file = os.path.join(data_directory.name, f"{SETS[url]}.tar")
  _ = urllib.request.urlretrieve(url, tar_file, progress_bar_factory(f"Downloading {SETS[url]}: "))

  # Extract the downloaded tar file
  with tarfile.open(tar_file) as tar:
    tar.extractall(sequence_directory)

# Write all of the images to an 'images' directory.
images_directory = os.path.join(data_directory.name, "images")
os.makedirs(images_directory, exist_ok=True)

# Use OpenCV to save each image as a PNG.
def save_image(set, sequence, sequence_idx, sequence_frame):
  image_name = '{}/{}_{}_{}.png'.format(
    images_directory,
    os.path.basename(set),
    os.path.basename(sequence).split('.')[0],
    sequence_idx)
  cv2.imwrite(image_name, sequence_frame)

# Extract all of the images from the video sequences
total_images = 0
for set_name in sorted(glob.glob(os.path.join(sequence_directory, "set*"))):
  for sequence_name in sorted(glob.glob('{}/*.seq'.format(set_name))):
    video_capture = cv2.VideoCapture(sequence_name)
    i = 0
    while True:
        ret, frame = video_capture.read()
        if not ret:
            break
        save_image(set_name, sequence_name, i, frame)
        i += 1
    total_images += i
    print(f"Extracted: {sequence_name}")
print(f"Wrote {total_images} total images.")
set00100% (1154611200 of 1154611200) |###| Elapsed Time: 0:03:18 Time:  0:03:18
set01100% (531722240 of 531722240) |#####| Elapsed Time: 0:01:39 Time:  0:01:39
set02100% (1090877440 of 1090877440) |###| Elapsed Time: 0:03:18 Time:  0:03:18
set03100% (1211873280 of 1211873280) |###| Elapsed Time: 0:03:32 Time:  0:03:32
set04100% (1036124160 of 1036124160) |###| Elapsed Time: 0:03:12 Time:  0:03:12
set05100% (1208064000 of 1208064000) |###| Elapsed Time: 0:03:32 Time:  0:03:32
set06100% (1561139200 of 1561139200) |###| Elapsed Time: 0:04:19 Time:  0:04:19
set07100% (1000857600 of 1000857600) |###| Elapsed Time: 0:03:17 Time:  0:03:17
set08100% (946391040 of 946391040) |#####| Elapsed Time: 0:02:34 Time:  0:02:34
set09100% (1090088960 of 1090088960) |###| Elapsed Time: 0:02:55 Time:  0:02:55
set10100% (1099048960 of 1099048960) |###| Elapsed Time: 0:03:18 Time:  0:03:18
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V012.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V013.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set00/V014.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set01/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set01/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set01/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set01/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set01/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set01/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set02/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set03/V012.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set04/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set05/V012.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V012.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V013.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V014.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V015.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V016.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V017.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set06/V018.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set07/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set08/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set09/V011.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V000.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V001.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V002.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V003.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V004.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V005.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V006.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V007.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V008.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V009.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V010.seq
Extracted: /tmp/tmpkhsbed83/video_sequences/set10/V011.seq
Wrote 249884 total images.

The Masterful Label Format for Detection

Object detection labels consist of four integer coordinates for the bounding box and a single 0-indexed integer class identifier, similar to multi-class classification. The bounding boxes are given in pixel coordinates within the image, in [xmin, ymin, xmax, ymax] format, separated by commas. A single label therefore consists of [xmin, ymin, xmax, ymax, class_id], all separated by commas, and each image can contain multiple label instances. For example, here is a snippet of an object detection CSV file:

images/image1.jpg,25,156,47,180,0,314,125,328,139,6
images/image2.jpg,104,180,118,194,6
images/image3.jpg,258,161,280,183,0,174,111,196,133,9,28,59,50,81,3

In the above example, the image image1.jpg has two objects in it, one of class 0 and one of class 6. The object of class 0 has a bounding box with its upper left corner at pixel (25,156) and its lower right corner at pixel (47,180), which corresponds to a bounding box of width=22 and height=24. The object of class 6 is at upper left (314,125) and lower right (328,139), which corresponds to a bounding box of width=14 and height=14. The second image image2.jpg has a single instance of object class 6 in it, and the third image image3.jpg has 3 objects in it, of classes 0, 9, and 3.

More information about Masterful label formats can be found here.
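
To make the row layout concrete, here is a minimal, hypothetical parser (not part of the Masterful API) that splits one row of a detection CSV into the image path and a list of [xmin, ymin, xmax, ymax, class_id] boxes:

# A minimal, hypothetical helper (not part of the Masterful API) that
# parses one row of a Masterful detection CSV.
def parse_detection_row(row):
  fields = row.strip().split(",")
  image_path = fields[0]
  values = [int(v) for v in fields[1:]]
  # Every 5 values after the image path describe one object:
  # [xmin, ymin, xmax, ymax, class_id].
  boxes = [values[i:i + 5] for i in range(0, len(values), 5)]
  return image_path, boxes

path, boxes = parse_detection_row(
    "images/image1.jpg,25,156,47,180,0,314,125,328,139,6")
print(path)   # images/image1.jpg
print(boxes)  # [[25, 156, 47, 180, 0], [314, 125, 328, 139, 6]]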

Convert the Annotations to Masterful Labels

The following code converts the annotations.zip file from the Caltech Pedestrians dataset into Masterful formatted training and test CSV files, which can be used with the Masterful CLI. The raw annotation files are binary MATLAB files, and it takes quite a bit of internet sleuthing to understand the format and parse them correctly.

[7]:
!pip install scipy --quiet
import os
import glob
from scipy.io import loadmat
from collections import defaultdict

ANNOTATIONS_URL = "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/caltech_pedestrians_usa/mirror/annotations.zip"

# Download the annotations locally
zip_file = os.path.join(data_directory.name, f"annotations.zip")
_ = urllib.request.urlretrieve(ANNOTATIONS_URL, zip_file, progress_bar_factory("Annotations: "))

import zipfile
with zipfile.ZipFile(zip_file, 'r') as zip_file:
  zip_file.extractall(data_directory.name)
annotations_directory = os.path.join(data_directory.name, "annotations")

# Parse all of the video bounding box (vbb) information
# in every VBB file for every video set.
all_objects = 0
data = defaultdict(dict)

# The annotations directory consists of setXX directories,
# one for each video set.
for directory_name in sorted(glob.glob(os.path.join(annotations_directory, 'set*'))):
  set_name = os.path.basename(directory_name)
  data[set_name] = defaultdict(dict)
  for annotation_file_name in sorted(glob.glob('{}/*.vbb'.format(directory_name))):
    # Each VBB file is a binary MATLAB annotation file.
    vbb = loadmat(annotation_file_name)
    frame_number = int(vbb['A'][0][0][0][0][0])
    object_lists = vbb['A'][0][0][1][0]
    max_objects = int(vbb['A'][0][0][2][0][0])
    object_starting_point = vbb['A'][0][0][3][0]
    object_label = [str(v[0]) for v in vbb['A'][0][0][4][0]]
    object_str = vbb['A'][0][0][5][0]
    object_end_point = vbb['A'][0][0][6][0]
    object_hidden = vbb['A'][0][0][7][0]
    altered = int(vbb['A'][0][0][8][0][0])
    log = vbb['A'][0][0][9][0]
    log_length = int(vbb['A'][0][0][10][0][0])

    video_name = os.path.splitext(os.path.basename(annotation_file_name))[0]
    data[set_name][video_name]['frame_number'] = frame_number
    data[set_name][video_name]['max_objects'] = max_objects
    data[set_name][video_name]['log'] = log.tolist()
    data[set_name][video_name]['log_length'] = log_length
    data[set_name][video_name]['altered'] = altered
    data[set_name][video_name]['frames'] = defaultdict(list)

    num_objects = 0
    for frame_id, obj in enumerate(object_lists):
      if len(obj) > 0:
        for id, pos, occl, lock, posv in zip(obj['id'][0], obj['pos'][0],
                                             obj['occl'][0], obj['lock'][0],
                                             obj['posv'][0]):
          keys = obj.dtype.names
          id = int(id[0][0]) - 1  # MATLAB is 1-origin
          pos = pos[0].tolist()
          occl = int(occl[0][0])
          lock = int(lock[0][0])
          posv = posv[0].tolist()

          datum = dict(zip(keys, [id, pos, occl, lock, posv]))
          datum['lbl'] = str(object_label[datum['id']])
          datum['str'] = int(object_str[datum['id']])
          datum['end'] = int(object_end_point[datum['id']])
          datum['hide'] = int(object_hidden[datum['id']])
          datum['init'] = int(object_starting_point[datum['id']])
          data[set_name][video_name]['frames'][frame_id].append(datum)
          num_objects += 1
      else:
        # This frame is labeled but has no objects in it.
        data[set_name][video_name]['frames'][frame_id].append(None)

    print(directory_name, annotation_file_name, num_objects)
    all_objects += num_objects

print('Number of objects:', all_objects)

# Treat the three different `person` labels as the
# same class. `person?` and `person-fa` represent
# much harder instances to classify.
label_to_class = {
    "person": 0,
    "person?": 0,
    "person-fa": 0,
    "people": 1,
}

# From the original paper, sets 06-10 are used for testing,
# the rest are for training.
train_sets = {f"set{i:02}" for i in range(6)}
test_sets = {f"set{i:02}" for i in range(6, 11, 1)}

# Select every 30th frame (1 frame per second) in the
# train dataset.
every_n_frames = 30

# Convert the raw parsed data into individual CSV files.
# The images exist in the relative path "images".
root = "images"
train_rows = []
test_rows = []
for set_name in data.keys():
  for video_name in data[set_name].keys():
    for frame_id in data[set_name][video_name]['frames'].keys():
      row = f"{root}/{set_name}_{video_name}_{frame_id}.png"
      for object in data[set_name][video_name]['frames'][frame_id]:
        if object is not None:
          row += f",{object['pos'][0]},{object['pos'][1]},{object['pos'][0]+object['pos'][2]},{object['pos'][1]+object['pos'][3]},{label_to_class[object['lbl']]}"
      row += "\n"

      if set_name in train_sets:
        if (frame_id % every_n_frames) == 0:
          train_rows.append(row)
      elif set_name in test_sets:
        test_rows.append(row)
      else:
        raise ValueError(f"Unknown set name {set_name}")

# Shuffle the training set
import random

random.shuffle(train_rows)

# Write out the files
training_csv_path = os.path.join(data_directory.name, "train_1hz.csv")
with open(training_csv_path, "w") as csv:
  for row in train_rows:
    csv.write(row)
test_csv_path = os.path.join(data_directory.name, "test.csv")
with open(test_csv_path, "w") as csv:
  for row in test_rows:
    csv.write(row)
print(f"Dataset root path: {data_directory.name}")
print(f"Training CSV path: {training_csv_path}")
print(f"Test CSV path: {test_csv_path}")
print(f"Images directory: {images_directory}")
Annotations: 100% (13664674 of 13664674) || Elapsed Time: 0:00:03 Time:  0:00:03
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V000.vbb 4519
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V001.vbb 7303
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V002.vbb 2334
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V003.vbb 642
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V004.vbb 1879
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V005.vbb 97
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V006.vbb 7203
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V007.vbb 10449
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V008.vbb 4978
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V009.vbb 6472
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V010.vbb 7056
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V011.vbb 7132
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V012.vbb 3622
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V013.vbb 7748
/tmp/tmpkhsbed83/annotations/set00 /tmp/tmpkhsbed83/annotations/set00/V014.vbb 9031
/tmp/tmpkhsbed83/annotations/set01 /tmp/tmpkhsbed83/annotations/set01/V000.vbb 4488
/tmp/tmpkhsbed83/annotations/set01 /tmp/tmpkhsbed83/annotations/set01/V001.vbb 7511
/tmp/tmpkhsbed83/annotations/set01 /tmp/tmpkhsbed83/annotations/set01/V002.vbb 9224
/tmp/tmpkhsbed83/annotations/set01 /tmp/tmpkhsbed83/annotations/set01/V003.vbb 5982
/tmp/tmpkhsbed83/annotations/set01 /tmp/tmpkhsbed83/annotations/set01/V004.vbb 4067
/tmp/tmpkhsbed83/annotations/set01 /tmp/tmpkhsbed83/annotations/set01/V005.vbb 5297
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V000.vbb 0
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V001.vbb 405
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V002.vbb 0
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V003.vbb 554
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V004.vbb 0
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V005.vbb 0
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V006.vbb 0
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V007.vbb 465
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V008.vbb 578
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V009.vbb 2319
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V010.vbb 2378
/tmp/tmpkhsbed83/annotations/set02 /tmp/tmpkhsbed83/annotations/set02/V011.vbb 2103
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V000.vbb 0
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V001.vbb 231
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V002.vbb 329
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V003.vbb 2829
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V004.vbb 407
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V005.vbb 2262
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V006.vbb 339
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V007.vbb 179
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V008.vbb 13989
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V009.vbb 7018
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V010.vbb 1921
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V011.vbb 2975
/tmp/tmpkhsbed83/annotations/set03 /tmp/tmpkhsbed83/annotations/set03/V012.vbb 1474
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V000.vbb 1001
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V001.vbb 688
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V002.vbb 2468
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V003.vbb 3044
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V004.vbb 3042
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V005.vbb 1638
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V006.vbb 993
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V007.vbb 2207
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V008.vbb 521
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V009.vbb 0
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V010.vbb 1088
/tmp/tmpkhsbed83/annotations/set04 /tmp/tmpkhsbed83/annotations/set04/V011.vbb 773
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V000.vbb 1909
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V001.vbb 212
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V002.vbb 766
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V003.vbb 1121
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V004.vbb 856
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V005.vbb 1857
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V006.vbb 0
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V007.vbb 668
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V008.vbb 110
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V009.vbb 288
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V010.vbb 1257
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V011.vbb 4638
/tmp/tmpkhsbed83/annotations/set05 /tmp/tmpkhsbed83/annotations/set05/V012.vbb 1251
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V000.vbb 785
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V001.vbb 1790
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V002.vbb 6747
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V003.vbb 253
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V004.vbb 416
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V005.vbb 267
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V006.vbb 561
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V007.vbb 295
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V008.vbb 2765
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V009.vbb 163
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V010.vbb 5097
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V011.vbb 483
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V012.vbb 3140
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V013.vbb 2031
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V014.vbb 329
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V015.vbb 3745
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V016.vbb 777
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V017.vbb 2654
/tmp/tmpkhsbed83/annotations/set06 /tmp/tmpkhsbed83/annotations/set06/V018.vbb 124
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V000.vbb 9259
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V001.vbb 2708
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V002.vbb 4383
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V003.vbb 4933
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V004.vbb 2351
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V005.vbb 5638
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V006.vbb 3853
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V007.vbb 497
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V008.vbb 4641
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V009.vbb 3335
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V010.vbb 5780
/tmp/tmpkhsbed83/annotations/set07 /tmp/tmpkhsbed83/annotations/set07/V011.vbb 7213
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V000.vbb 2380
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V001.vbb 2887
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V002.vbb 1172
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V003.vbb 59
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V004.vbb 350
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V005.vbb 570
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V006.vbb 1695
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V007.vbb 329
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V008.vbb 992
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V009.vbb 2753
/tmp/tmpkhsbed83/annotations/set08 /tmp/tmpkhsbed83/annotations/set08/V010.vbb 245
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V000.vbb 1431
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V001.vbb 2321
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V002.vbb 4736
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V003.vbb 4280
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V004.vbb 1396
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V005.vbb 610
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V006.vbb 2064
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V007.vbb 241
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V008.vbb 485
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V009.vbb 1716
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V010.vbb 2123
/tmp/tmpkhsbed83/annotations/set09 /tmp/tmpkhsbed83/annotations/set09/V011.vbb 273
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V000.vbb 2451
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V001.vbb 3282
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V002.vbb 800
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V003.vbb 325
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V004.vbb 600
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V005.vbb 1078
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V006.vbb 1335
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V007.vbb 321
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V008.vbb 4129
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V009.vbb 5769
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V010.vbb 5970
/tmp/tmpkhsbed83/annotations/set10 /tmp/tmpkhsbed83/annotations/set10/V011.vbb 6255
Number of objects: 346621
Dataset root path: /tmp/tmpkhsbed83
Training CSV path: /tmp/tmpkhsbed83/train_1hz.csv
Test CSV path: /tmp/tmpkhsbed83/test.csv
Images directory: /tmp/tmpkhsbed83/images
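
As a quick sanity check on the conversion, you can count the images and boxes written to each CSV file. The snippet below assumes the training_csv_path and test_csv_path variables from the cell above are still in scope:

# Sanity check: count rows (images) and boxes in the generated CSV files.
# Assumes `training_csv_path` and `test_csv_path` from the cell above.
def summarize_csv(path):
  num_images = 0
  num_boxes = 0
  with open(path) as f:
    for line in f:
      fields = line.strip().split(",")
      num_images += 1
      # Each box contributes 5 fields after the image path.
      num_boxes += (len(fields) - 1) // 5
  return num_images, num_boxes

for name, path in [("train", training_csv_path), ("test", test_csv_path)]:
  images, boxes = summarize_csv(path)
  print(f"{name}: {images} images, {boxes} boxes")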