Masterful CLI Trainer: Data Directory Format

This guide walks you through the data directory format used by the Masterful command line interface. After reading this guide, you should be able to create, import, and train on your own custom datasets using Masterful.

Overview

In order to train a model, Masterful needs to know the location of your data and any labels associated with that data. Masterful has defined a simple CSV label format. Labels used by Masterful are defined by a single CSV file listing all of the examples contained in that dataset. Each example consists of an image (the path to the image is stored in the CSV file) and a label (binary and multi-class classification), set of labels (multi-label classification), or no label at all (unlabeled data).

Note: The Masterful Dataset Format currently supports only Binary Classification, Multi-Class Classification, and Multi-Label Classification. Support for Semantic Segmentation, Instance Segmentation, and Object Detection is coming soon.

CSV Format

The following is a snippet of the train.csv file located at https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/train.csv and used in the Quickstart Guide.

dandelion/19438516548_bbaf350664.jpg, 1
dandelion/5875763050_82f32f2eed_m.jpg, 1
roses/1446090416_f0cad5fde4.jpg, 2
tulips/6770436217_281da51e49_n.jpg, 4
dandelion/645330051_06b192b7e1.jpg, 1
daisy/7749368884_1fc58c67ff_n.jpg, 0
roses/2347579838_dd6d2aaefc_n.jpg, 2

In the above example, each row of the CSV file contains a single example in the dataset named train. The basic format of each row is <relative path to image>, [<label1>, <label2>, ...]. Only the path to the image is required. The label (or set of labels) is optional, since Masterful supports both labeled and unlabeled data.
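The row format described above can be read with a few lines of Python. The following is only a sketch, not Masterful's own parser: the function name parse_row is hypothetical, and it assumes that image paths contain no commas and that fields are never quoted.

```python
from typing import List, Tuple


def parse_row(line: str) -> Tuple[str, List[int]]:
    """Split one CSV row into (relative image path, list of integer labels)."""
    # Fields are comma-separated; whitespace around each field is ignored.
    # Assumption: paths contain no commas and fields are never quoted.
    fields = [field.strip() for field in line.strip().split(",")]
    image_path = fields[0]
    if not image_path:
        raise ValueError(f"Row has no image path: {line!r}")
    # Everything after the path is a label; an unlabeled row yields [].
    labels = [int(field) for field in fields[1:] if field]
    return image_path, labels


print(parse_row("dandelion/19438516548_bbaf350664.jpg, 1"))
print(parse_row("images/image4.jpg, 1, 3, 4"))
print(parse_row("images/image1.jpg"))
```

A single-label row produces a one-element list, a multi-label row produces several elements, and an unlabeled row produces an empty list.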

For each example, the path to the image is a relative path, and that path is relative to the location of the CSV file that references it. For example, the above CSV file is located at https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/train.csv. The first example is:

dandelion/19438516548_bbaf350664.jpg, 1

This means the image in the first example is located in the directory dandelion, relative to the directory that contains train.csv, which is https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/. Therefore, the full path to the image in the first example is https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/dandelion/19438516548_bbaf350664.jpg.

Note that the folder structure above is entirely arbitrary and has no meaning for creating the dataset. You can split the images across subdirectories, put them all into a single directory, or put them in the same directory as the CSV file. Only the relative path recorded in the CSV file matters.
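To make the "relative to the CSV file" rule concrete, here is a minimal sketch of how a relative image path could be resolved. The helper resolve_image_path and the bucket and directory names in the example calls are hypothetical. Plain string handling is used instead of urllib.parse.urljoin so that the same logic applies to bucket-style locations and POSIX-style local paths (Windows-style backslash paths are not handled).

```python
def resolve_image_path(csv_location: str, relative_image_path: str) -> str:
    """Resolve an image path relative to the directory holding the CSV file."""
    # Drop the CSV file name to get the directory that contains it.
    base, separator, _ = csv_location.rpartition("/")
    relative_image_path = relative_image_path.strip()
    if not separator:
        # The CSV location is a bare file name, so the path is already
        # relative to the current directory.
        return relative_image_path
    return f"{base}/{relative_image_path}"


# Hypothetical locations, used for illustration only.
print(resolve_image_path("s3://my-bucket/my-dataset/train.csv", "images/image1.jpg"))
print(resolve_image_path("/data/my-dataset/train.csv", "image1.jpg"))
```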

The labels for each example must be 0-based integer labels in the range [0, num_classes). Labels are optional, but a single CSV file cannot mix labeled and unlabeled data: either every example in the file has one or more labels, or no example has a label.
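The two rules above (labels in [0, num_classes), and no mixing of labeled and unlabeled rows) are easy to check before training. The following is a sketch of such a check, not part of the Masterful CLI. The name validate_rows and the (path, labels) tuple layout are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical in-memory form of a row: (relative image path, labels).
Row = Tuple[str, List[int]]


def validate_rows(rows: Sequence[Row], num_classes: int) -> None:
    """Raise ValueError if the rows break the label rules described above."""
    has_labels = [bool(labels) for _, labels in rows]
    # Either every row is labeled or no row is labeled.
    if any(has_labels) and not all(has_labels):
        raise ValueError("Labeled and unlabeled examples are mixed in one file.")
    for image_path, labels in rows:
        for label in labels:
            # Labels are 0-based integers in the half-open range [0, num_classes).
            if not 0 <= label < num_classes:
                raise ValueError(
                    f"{image_path}: label {label} is outside [0, {num_classes})."
                )


validate_rows([("images/image1.jpg", [0]), ("images/image2.jpg", [2])], num_classes=3)
print("rows are valid")
```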

Dataset Splits

Each CSV file defines a single dataset. However, the more accurate term for each CSV file is a dataset split, and the set of all CSV files (splits) defines your dataset. For example, in the dataset described above, the folder structure looks like:

quickstart/
  daisy/
  dandelion/
  roses/
  sunflowers/
  tulips/
  training.yaml
  test.csv
  train.csv
  validation.csv

There are three CSV files located in the folder: train.csv, test.csv, and validation.csv. The names of the CSV files are arbitrary, but in this example each one defines a dataset split that is used for a different purpose during training. train.csv contains the labeled data that is used to train the model. This is the data the model sees and uses for gradient-based back-propagation optimization. validation.csv is used to evaluate the model during training. Although the model does not see this data during optimization, the meta-learning engine uses it to measure the response of the model and the training loop, for example to detect overfitting and decide on early stopping. test.csv is never seen by your model during training. It is used only after training to evaluate the model and measure its generalization performance on unseen data. As mentioned before, the names of the CSV files are arbitrary and do not determine how the data is used. How each dataset is used is specified in the Masterful CLI Trainer Configuration File.
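If you start from a single list of examples, you need to produce the split files yourself. The sketch below shows one possible way to do that. The function names, the 10% validation and 10% test fractions, and the fixed seed are all assumptions chosen for illustration, not Masterful recommendations.

```python
import random
from pathlib import Path
from typing import Dict, List, Sequence, Tuple

# Hypothetical in-memory form of a row: (relative image path, labels).
Row = Tuple[str, List[int]]


def split_rows(
    rows: Sequence[Row],
    validation_fraction: float = 0.1,
    test_fraction: float = 0.1,
    seed: int = 0,
) -> Dict[str, List[Row]]:
    """Shuffle the rows and partition them into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    num_test = int(len(shuffled) * test_fraction)
    num_validation = int(len(shuffled) * validation_fraction)
    return {
        "test": shuffled[:num_test],
        "validation": shuffled[num_test:num_test + num_validation],
        "train": shuffled[num_test + num_validation:],
    }


def format_row(row: Row) -> str:
    """Render a row as '<relative path>, <label1>, <label2>, ...'."""
    image_path, labels = row
    return ", ".join([image_path, *(str(label) for label in labels)])


def write_split(rows: Sequence[Row], csv_path: Path) -> None:
    """Write one split to its own CSV file, one example per line."""
    csv_path.write_text("".join(format_row(row) + "\n" for row in rows))
```

Calling write_split once per entry of the dictionary returned by split_rows, with the CSV files placed next to the image directories, keeps every relative image path valid for all three splits.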

Labels

Each example in the CSV file supports 0 or more labels, in order to support Binary Classification, Multi-Class Classification, and Multi-Label Classification datasets. Examples without labels constitute an unlabeled dataset, which is used as part of the Semi-Supervised Learning packages inside of Masterful.

Binary Classification

In binary classification, each label is either 0 or 1 and indicates the absence or presence of a class. The following is an example of a binary classification dataset:

images/image1.jpg, 0
images/image2.jpg, 1
images/image3.jpg, 0

Any value other than 0 or 1 is invalid in a binary classification dataset, and there must be one and only one integer label following the relative image path.

Multi-Class Classification

Multi-Class Classification is the standard computer vision classification task and is used in the presence of 3 or more mutually exclusive classes (2-class multi-class classification is mathematically equivalent to binary classification). If we let N be the number of classes in the dataset, where N >= 3, then for each row in the CSV file, there must be one and only one 0-based integer label in the range [0,N). For example, here is a snippet from an N = 3 class classification dataset:

images/image1.jpg, 0
images/image2.jpg, 1
images/image3.jpg, 2
images/image4.jpg, 0
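Many image collections are stored with one folder per class and no label file at all. The sketch below shows one way to turn such a tree into rows in the format above. The helper names are hypothetical, and assigning indices in alphabetical order of the class names is an assumption made here (it is consistent with the label values in the quickstart snippet earlier in this guide, but any stable mapping works).

```python
from pathlib import Path
from typing import Dict, Iterable, List


def build_class_index(class_names: Iterable[str]) -> Dict[str, int]:
    """Map class names to 0-based integer labels in alphabetical order."""
    return {name: index for index, name in enumerate(sorted(class_names))}


def rows_from_folders(root: Path) -> List[str]:
    """Build CSV rows from a tree laid out as root/<class name>/<image file>."""
    class_names = [entry.name for entry in root.iterdir() if entry.is_dir()]
    class_index = build_class_index(class_names)
    rows = []
    for class_name in sorted(class_names):
        for image in sorted((root / class_name).iterdir()):
            if image.is_file():
                # Paths are relative to root, where the CSV file will live.
                rows.append(f"{class_name}/{image.name}, {class_index[class_name]}")
    return rows


print(build_class_index(["tulips", "daisy", "roses", "dandelion", "sunflowers"]))
```

rows_from_folders assumes the CSV file will be written into root itself, so that each path stays relative to the CSV file's location.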

Multi-Label Classification

Multi-Label Classification is similar to Multi-Class Classification, but the classes are not mutually exclusive and each example can have one or more labels. If we let N be the number of classes in the dataset, where N >= 2, then for each row in the CSV file, there must be 1 or more 0-based integer labels in the range [0,N). For example, here is a snippet from an N = 5 class multi-label classification dataset:

images/image1.jpg, 0
images/image2.jpg, 0, 1
images/image3.jpg, 2, 4
images/image4.jpg, 1, 3, 4

In the example above, you can see that each row can be annotated with 1 or more classes.
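A common way to consume a variable-length label list is to convert it into a fixed-length multi-hot vector with one slot per class. This guide does not describe how Masterful represents labels internally, so the following is only an illustration of what a row's label list means. The function name to_multi_hot is hypothetical.

```python
from typing import List, Sequence


def to_multi_hot(labels: Sequence[int], num_classes: int) -> List[int]:
    """Convert a list of class indices into a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"Label {label} is outside [0, {num_classes}).")
        vector[label] = 1  # mark this class as present
    return vector


# The last row of the snippet above: images/image4.jpg, 1, 3, 4
print(to_multi_hot([1, 3, 4], num_classes=5))  # [0, 1, 0, 1, 1]
```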

Unlabeled Data

The CSV file format also supports unlabeled data, in which case each row contains only the relative image path. Below is a snippet of an unlabeled dataset:

images/image1.jpg
images/image2.jpg
images/image3.jpg
images/image4.jpg

Supported Locations

The Masterful CLI Trainer supports datasets hosted on Google Cloud Storage, AWS S3, and local disks.

Additional Examples

Additional examples can be found at the public AWS S3 bucket s3://masterful-public/datasets/.

>>> aws s3 ls s3://masterful-public/datasets/
>>>                        PRE flowers_nilsback_zisserman/
>>>                        PRE hot_dog/
>>>                        PRE quickstart/
>>>                        PRE svhn_cropped/
>>>                        PRE voc_2012/