# Masterful CLI Trainer: Data Directory Format

This guide walks you through the data directory format used by the Masterful command line interface. After reading this guide, you should be able to create, import, and train on your own custom datasets using Masterful.

## Overview

In order to train a model, Masterful needs to know the location of your data and any labels associated with that data. Masterful has defined a simple CSV label format. Labels used by Masterful are defined by a single CSV file listing all of the examples contained in that dataset. Each example consists of an image (the path to the image is stored in the CSV file) and a label (binary and multi-class classification), set of labels (multi-label classification), bounding boxes (object detection), or no label at all (unlabeled data).

**Note**: The Masterful Dataset Format currently supports only Binary Classification, Multi-Class Classification, Multi-Label Classification, and Object Detection. Support for Semantic Segmentation and Instance Segmentation is coming soon.

## CSV Format

The following is a snippet of the `train.csv` CSV file located [here](https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/train.csv) and used in the [Quickstart Guide](../notebooks/tutorial_quickstart_cli). This snippet is for a Multi-Class Classification example.

```
dandelion/19438516548_bbaf350664.jpg, 1
dandelion/5875763050_82f32f2eed_m.jpg, 1
roses/1446090416_f0cad5fde4.jpg, 2
tulips/6770436217_281da51e49_n.jpg, 4
dandelion/645330051_06b192b7e1.jpg, 1
daisy/7749368884_1fc58c67ff_n.jpg, 0
roses/2347579838_dd6d2aaefc_n.jpg, 2
```

In the above example, each row of the CSV file contains a single example in the dataset named `train`. The basic format of each row is `<relative path to image>, [<label1>, <label2>, ...]`. Only the path to the image is required. The labels (or set of labels) are optional, since Masterful supports both labeled and unlabeled data.

For each example, the path to the image is a **relative** path, and that path is relative to the location of the CSV file that references it. For example, the above CSV file is located at `https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/train.csv`. The first example is:  

`dandelion/19438516548_bbaf350664.jpg, 1`  

This means the image in the first example is located in the directory `dandelion`, relative to the location of the `train.csv` file, which is `https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/`. Therefore, the **full** path to the image in the first example is `https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/dandelion/19438516548_bbaf350664.jpg`.
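
The resolution rule can be sketched in a few lines of Python. This is only an illustration of "relative to the location of the CSV file", not Masterful's internal implementation, and the helper name `resolve_image_path` is hypothetical.

```python
import posixpath


def resolve_image_path(csv_location: str, relative_image_path: str) -> str:
    """Resolve an image path from a CSV row against the CSV file's location."""
    # Drop the CSV file name to get the directory that contains it.
    base_dir = posixpath.dirname(csv_location)
    # Append the relative path from the row (ignoring surrounding whitespace).
    return posixpath.join(base_dir, relative_image_path.strip())


csv_url = "https://masterful-public.s3.us-west-1.amazonaws.com/datasets/quickstart/train.csv"
print(resolve_image_path(csv_url, "dandelion/19438516548_bbaf350664.jpg"))
```

Plain string handling is used here rather than `urllib.parse.urljoin` so that the same helper also behaves sensibly for `s3://` and `gs://` style locations and for local POSIX paths.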

Note that the folder structure above is entirely arbitrary and has no meaning for creating the dataset. You can put all images into a single directory, or put all images in the same directory as the CSV file. Either layout works.

The labels for each example **must** be 0-based integer labels, in the range `[0, num_classes)`. The label is optional, but all examples in the CSV file must either contain 1 or more labels, or have no label at all (you cannot mix and match labeled and unlabeled data).
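
Before training, it can be useful to check a CSV file against these rules yourself. The following is a minimal sketch for classification-style rows, assuming the file has no header row; the function name `parse_classification_rows` and its error messages are hypothetical and are not part of Masterful.

```python
import csv
import io


def parse_classification_rows(csv_text, num_classes):
    """Parse rows of `<relative path>, [<label>, ...]` into (path, labels) pairs."""
    examples = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        path = row[0].strip()
        labels = [int(value) for value in row[1:]]
        for label in labels:
            # Labels must be 0-based integers in the range [0, num_classes).
            if not 0 <= label < num_classes:
                raise ValueError(f"{path}: label {label} is outside [0, {num_classes})")
        examples.append((path, labels))

    # A CSV file is either fully labeled or fully unlabeled, never a mixture.
    has_labels = [bool(labels) for _, labels in examples]
    if any(has_labels) and not all(has_labels):
        raise ValueError("labeled and unlabeled examples cannot be mixed in one CSV file")
    return examples


snippet = "daisy/7749368884_1fc58c67ff_n.jpg, 0\nroses/1446090416_f0cad5fde4.jpg, 2\n"
for path, labels in parse_classification_rows(snippet, num_classes=5):
    print(path, labels)
```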

## Dataset Splits

Each CSV file defines a single dataset. However, the more accurate term for each CSV file is a dataset **split**, and the set of all CSV files (splits) defines your **dataset**. For example, in the dataset described above, the folder structure looks like:  

```
quickstart/
  daisy/
  dandelion/
  roses/
  sunflowers/
  tulips/
  training.yaml
  test.csv
  train.csv
  validation.csv
```

There are three CSV files located in the folder: `train.csv`, `test.csv`, and `validation.csv`. The names of the CSV files are arbitrary, but for this example, each one defines a dataset that will be used for a different purpose during training. For example, `train.csv` contains the labeled data that will be used to train the model. This is the data the model will see and use for gradient-based back-propagation optimization. `validation.csv` will be used to evaluate the model during training. While this data is not seen by the model during optimization, it is used by the meta-learning engine to measure the response of the model and the training loop, which informs decisions such as early stopping and overfitting detection. `test.csv` is never seen by your model during training, and is only used **after** the model is trained to evaluate it and measure its overall generalization performance on unseen data. As mentioned before, the names of the CSV files are arbitrary; they do **not** define how the data is used. How each dataset is used is specified in the [Masterful CLI Trainer Configuration File](../markdown/guide_cli_yaml_config).
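
If you start from a single list of labeled rows, one way to produce the three splits is a seeded shuffle followed by a partition. The sketch below is an assumption-laden illustration: the helper `split_rows` and the 10% validation and 10% test fractions are hypothetical choices, not values recommended by Masterful.

```python
import random


def split_rows(rows, validation_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle CSV rows and partition them into train/validation/test lists."""
    shuffled = list(rows)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * validation_fraction)
    return {
        "test": shuffled[:n_test],
        "validation": shuffled[n_test:n_test + n_val],
        "train": shuffled[n_test + n_val:],
    }


rows = [f"images/image{i}.jpg, {i % 3}" for i in range(20)]
splits = split_rows(rows)
for name, split in splits.items():
    # Each list would be written to its own CSV file, one row per line.
    print(name, len(split))
```

Because every row lands in exactly one list, no image in the test split is ever part of the training split.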

## Labels

Each example in the CSV file supports 0 or more labels, in order to support Binary Classification, Multi-Class Classification, and Multi-Label Classification datasets. Examples without labels constitute an unlabeled dataset, which is used as part of the Semi-Supervised Learning packages inside of Masterful. Object Detection labels consist of the bounding boxes and classes for each object in the image.

### Binary Classification

In binary classification, each label can be either 0 or 1, and indicates the absence or presence of a class. The following is an example of a binary classification dataset:

```
images/image1.jpg, 0
images/image2.jpg, 1
images/image3.jpg, 0
```

Any value other than 0 or 1 is invalid in a binary classification dataset, and there must be **one and only one** integer label following the relative image path.

### Multi-Class Classification

Multi-Class Classification is the standard computer vision classification task and is used in the presence of 3 or more mutually exclusive classes (2-class multi-class classification is mathematically equivalent to binary classification). If we let `N` be the number of classes in the dataset, where `N >= 3`, then for each row in the CSV file, there must be **one and only one** 0-based integer label in the range `[0,N)`. For example, here is a snippet from an `N = 3` class classification dataset:

```
images/image1.jpg, 0
images/image2.jpg, 1
images/image3.jpg, 2
images/image4.jpg, 0
```

### Multi-Label Classification

Multi-Label Classification is similar to Multi-Class Classification, but the classes are not mutually exclusive and each example can have one or more labels. If we let `N` be the number of classes in the dataset, where `N >= 2`, then for each row in the CSV file, there must be **1 or more** 0-based integer labels in the range `[0,N)`. For example, here is a snippet from an `N = 5` class multi-label classification dataset:

```
images/image1.jpg, 0
images/image2.jpg, 0, 1
images/image3.jpg, 2, 4
images/image4.jpg, 1, 3, 4
```

In the example above, you can see that each row can be annotated with 1 or more classes.
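
A common way to think about a multi-label row is as a multi-hot vector of length `N`, with a 1 at each listed class index. The sketch below shows that conversion; it is a general illustration, and the helper `to_multi_hot` is hypothetical rather than a description of how Masterful represents labels internally.

```python
def to_multi_hot(labels, num_classes):
    """Convert a list of 0-based class indices into a multi-hot vector."""
    vector = [0] * num_classes
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        vector[label] = 1
    return vector


# The row `images/image4.jpg, 1, 3, 4` from an N = 5 dataset:
print(to_multi_hot([1, 3, 4], num_classes=5))  # [0, 1, 0, 1, 1]
```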

### Object Detection

Object detection labels consist of four coordinates for the bounding box and a single 0-indexed integer class identifier, similar to multi-class classification. The bounding boxes are specified in pixel coordinates, in `[xmin, ymin, xmax, ymax]` format, separated by commas. A single label consists of `[xmin, ymin, xmax, ymax, class_id]`, all separated by commas, and each image can have multiple labels. For example, here is a snippet of an object detection CSV file:

```
images/image1.jpg,25,156,47,180,0,314,125,328,139,6
images/image2.jpg,104,180,118,194,6
images/image3.jpg,258,161,280,183,0,174,111,196,133,9,28,59,50,81,3
``` 

In the above example, the image `image1.jpg` has two objects in the image, one of class `0` and one of class `6`. Class `0` is defined by the bounding box coordinates with the upper left equal to pixel `(25,156)` and the lower right equal to pixel `(47,180)`, which corresponds to a bounding box of `width=22` and `height=24`. The object of class `6` is at coordinates upper left `(314,125)` and lower right `(328,139)`, which corresponds to a bounding box of `width=14` and `height=14`. The second image `image2.jpg` has a single instance of object class `6` in it, and the third image `image3.jpg` has 3 objects in it, of classes `0`, `9`, and `3`.  
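
The row layout can be made concrete with a small parser that groups the values after the image path into chunks of five. This is a sketch under the assumption that all coordinates are integers, as in the snippet above; the function `parse_detection_row` is hypothetical and not part of Masterful.

```python
def parse_detection_row(line):
    """Split an object detection row into its image path and a list of boxes."""
    fields = [field.strip() for field in line.split(",")]
    path = fields[0]
    values = [int(value) for value in fields[1:]]
    # Every object contributes exactly five values:
    # xmin, ymin, xmax, ymax, class_id.
    if len(values) % 5 != 0:
        raise ValueError(f"{path}: expected a multiple of 5 values, got {len(values)}")

    boxes = []
    for start in range(0, len(values), 5):
        xmin, ymin, xmax, ymax, class_id = values[start:start + 5]
        if xmax <= xmin or ymax <= ymin:
            raise ValueError(f"{path}: box has non-positive width or height")
        boxes.append({
            "class_id": class_id,
            "xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax,
            "width": xmax - xmin, "height": ymax - ymin,
        })
    return path, boxes


path, boxes = parse_detection_row("images/image1.jpg,25,156,47,180,0,314,125,328,139,6")
for box in boxes:
    print(path, box["class_id"], box["width"], box["height"])
```

For `image1.jpg`, this yields a class `0` box of `width=22` and `height=24`, and a class `6` box of `width=14` and `height=14`, matching the description above.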

### Semantic Segmentation

Semantic segmentation labels consist of pixel-level masks used to identify the class of each pixel. The Masterful Semantic Segmentation CSV format allows the masks to be specified next to the image the mask describes. Here is a snippet of a segmentation CSV file: 

```
images/image1.jpg, masks/mask1.png
images/image2.png, masks/mask2.png
images/image3.jpg, masks/mask3.png
```

Masks must be single-channel, 8-bit PNGs of the same resolution as the image they describe. For example, a 64x64 input image requires a 64x64 segmentation mask. The mask values are 0-indexed, so the classes referenced in the mask are numbered 0, 1, 2, and so on.
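
You can check these mask requirements without decoding the whole image, because a PNG file stores its dimensions, bit depth, and color type in the `IHDR` chunk at the start of the file. The sketch below reads that header using only the standard library. The helper names are hypothetical, and treating only grayscale (PNG color type 0) as "single channel" is an assumption; this guide does not say whether palette-indexed PNGs are accepted.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data):
    """Return (width, height, bit_depth, color_type) from the start of a PNG file."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then width (4 bytes), height (4 bytes), bit depth (1), color type (1).
    if len(data) < 26 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">IIBB", data[16:26])


def is_valid_mask(mask_png_bytes, image_width, image_height):
    """Check that a mask is an 8-bit grayscale PNG matching the image size."""
    width, height, bit_depth, color_type = read_png_header(mask_png_bytes)
    # Color type 0 means grayscale (one channel) in the PNG specification.
    same_size = (width, height) == (image_width, image_height)
    return same_size and bit_depth == 8 and color_type == 0
```

In practice you would pass the first bytes of the mask file, for example `open(mask_path, "rb").read(33)`, together with the width and height of the corresponding image.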

### Unlabeled Data

The CSV file format also supports unlabeled data, in which case each example only contains the relative image path for each example. Below is a snippet of an unlabeled dataset:  

```
images/image1.jpg
images/image2.jpg
images/image3.jpg
images/image4.jpg
```

## Supported Locations

The Masterful CLI Trainer supports datasets hosted on [Google Cloud Storage](https://cloud.google.com/storage), [AWS S3](https://aws.amazon.com/s3/), and local disks.

## Additional Examples

Additional examples can be found at the public AWS S3 bucket `s3://masterful-public/datasets/`.

```shell
>>> aws s3 ls s3://masterful-public/datasets/
>>>                        PRE flowers_nilsback_zisserman/
>>>                        PRE hot_dog/
>>>                        PRE quickstart/
>>>                        PRE svhn_cropped/
>>>                        PRE voc_2012/
```