Data Input Types¶
Title: Masterful supports many input data types
Author: sam
Date created: 2022/04/27
Last modified: 2022/04/1
Description: Data input types in Masterful.
Introduction¶
Masterful supports several different input data types for your training data. In this guide, you will learn about these different types, and how to use them with Masterful. This will make it easy to integrate existing training data into the Masterful platform.
Tensorflow Dataset¶
Masterful is built to support tf.data.Dataset objects natively.
The easiest way to create a dataset is to use the Dataset.from_tensor_slices API, as explained in the Tensorflow tf.data documentation. Masterful can consume the resulting Dataset object directly.
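import tensorflow as tf

# Create 32 random 3-channel images of 16x16 pixels.
images = tf.random.uniform(shape=(32, 16, 16, 3))
# An unlabeled dataset yields 1-tuples that contain only the image.
unlabeled_dataset = tf.data.Dataset.from_tensor_slices((images,))
# Create one random integer label in [0, 10) per image.
labels = tf.random.uniform(shape=(32,), minval=0, maxval=10, dtype=tf.int32)
# A labeled dataset yields (image, label) tuples.
labeled_dataset = tf.data.Dataset.from_tensor_slices((images, labels))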
In order to train with a Dataset, the dataset must not be batched (Masterful will batch the data for you).
# This unlabeled dataset can be used with Masterful
dataset = tf.data.Dataset.from_tensor_slices((tf.random.uniform((32, 16, 16, 3)),))
# This dataset cannot be used with Masterful.
dataset = dataset.batch(16) # DO NOT batch the dataset
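If an existing pipeline already produces batches, one way to meet this requirement is to undo the batching before handing the dataset over. The following is a minimal sketch using the standard tf.data.Dataset.unbatch method; the variable names are illustrative and not part of Masterful.

```python
import tensorflow as tf

# A dataset that was batched somewhere upstream in an existing pipeline.
images = tf.random.uniform(shape=(32, 16, 16, 3))
batched_dataset = tf.data.Dataset.from_tensor_slices((images,)).batch(16)

# unbatch() splits every batch back into individual examples.
unbatched_dataset = batched_dataset.unbatch()

# Each element is a 1-tuple again, and the image spec no longer
# has a leading batch dimension.
image_spec = unbatched_dataset.element_spec[0]
print(image_spec.shape)
```

Inspecting element_spec is a cheap way to confirm that no batch dimension remains, because it does not iterate over the data.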
Each element returned by the dataset must also be a tuple of exactly one or two items. For unlabeled data, the tuple contains only the image data. For labeled data, the tuple contains the features (images) followed by the labels.
images = tf.random.uniform(shape=(32,16,16,3))
labels = tf.random.uniform(shape=(32,), minval=0, maxval=10, dtype=tf.int32)
# This dataset is properly formatted for Masterful
labeled_dataset = tf.data.Dataset.from_tensor_slices((images, labels))
# This dataset will not work with Masterful, it has too many
# items per example.
extra_data = tf.ones_like(labels)
incorrect_dataset = tf.data.Dataset.from_tensor_slices((images, labels, extra_data)) # Too many items returned for each example
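The tuple requirement above can be checked before training by inspecting the dataset's element_spec. This is a sketch only: has_valid_structure is a hypothetical helper written for this guide, not a Masterful API, and it checks nothing beyond the tuple length described above.

```python
import tensorflow as tf


def has_valid_structure(dataset):
    """Returns True if each element is a tuple of one or two items."""
    spec = dataset.element_spec
    return isinstance(spec, tuple) and len(spec) in (1, 2)


images = tf.random.uniform(shape=(32, 16, 16, 3))
labels = tf.random.uniform(shape=(32,), minval=0, maxval=10, dtype=tf.int32)
extra_data = tf.ones_like(labels)

labeled = tf.data.Dataset.from_tensor_slices((images, labels))
too_many = tf.data.Dataset.from_tensor_slices((images, labels, extra_data))
as_dict = tf.data.Dataset.from_tensor_slices({"image": images, "label": labels})

ok_labeled = has_valid_structure(labeled)    # two-item tuple
ok_too_many = has_valid_structure(too_many)  # three-item tuple
ok_dict = has_valid_structure(as_dict)       # dictionary, not a tuple
```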
Tensorflow also provides a large catalog of built-in datasets, as part of the Tensorflow Datasets catalog.
import tensorflow as tf
import tensorflow_datasets as tfds
# Masterful can use TFDS datasets as well.
dataset = tfds.load('mnist', split='train', as_supervised=True)
Most Tensorflow Datasets in the catalog return a dictionary of items, so it is important to extract the features and labels from the dictionary before passing the dataset to Masterful.
dataset = tfds.load('mnist', split='train')
# Extract the images and labels from the feature dictionary.
dataset = dataset.map(lambda x: (x['image'], x['label']))
Most datasets in the catalog can automatically perform the above extraction for you, using the as_supervised argument.
# `as_supervised` will automatically extract the images
# and labels.
dataset = tfds.load('mnist', split='train', as_supervised=True)
Numpy (and Tensor) Arrays¶
Masterful can also consume Numpy and Tensor arrays directly. This works well if your dataset is small and fits entirely into memory.
# x_train and y_train are numpy arrays
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
To use the above arrays with Masterful, pass the tuple of images and labels to any Masterful API that requires a “dataset”.
import masterful

# Learn the dataset parameters directly from the tuple of arrays.
training_dataset_params = masterful.data.learn_data_params(
    dataset=(x_train, y_train),
    task=masterful.enums.Task.CLASSIFICATION,
    image_range=masterful.enums.ImageRange.ZERO_255,
    num_classes=10,
    sparse_labels=True,
)
training_report = masterful.training.train(
    ...
    training_dataset=(x_train, y_train),
    training_dataset_params=training_dataset_params,
    ...
)
Keras Sequence¶
Masterful also supports Keras Sequence objects. A Keras Sequence is a generator-like object that is safer to use with multiprocessing than a regular generator. Its structure guarantees that the network trains only once on each sample per epoch, which is not the case with generators.
Note that tf.data.Dataset objects are the preferred way of training in Tensorflow.
import numpy as np
import tensorflow as tf

input_shape = (16, 16, 3)

class DataGenerator(tf.keras.utils.Sequence):
    """Keras sequence that returns batches of dummy data."""

    def __init__(self):
        # Returns only a single batch of data.
        self._length = 1

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        """Returns batches of dummy data."""
        images = np.zeros((32,) + input_shape)
        labels = np.ones((32,))
        return (images, labels)

keras_sequence = DataGenerator()
# Even though Sequences generate *batches* of data,
# Masterful knows how to handle it correctly.
training_dataset_params = masterful.data.learn_data_params(
    dataset=keras_sequence,
    task=masterful.enums.Task.CLASSIFICATION,
    image_range=masterful.enums.ImageRange.ZERO_255,
    num_classes=10,
    sparse_labels=True,
)
training_report = masterful.training.train(
    ...
    training_dataset=keras_sequence,
    training_dataset_params=training_dataset_params,
    ...
)
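The once-per-epoch guarantee mentioned above follows from the indexing protocol (__len__ and __getitem__) that a Sequence implements: an epoch is a fixed number of batches, and each batch index maps to its own slice of the data. The sketch below illustrates this in plain Python without subclassing tf.keras.utils.Sequence. ListBatches is a hypothetical stand-in written for this guide, not a Keras or Masterful class.

```python
import math


class ListBatches:
    """Plain-Python stand-in for the Sequence indexing protocol."""

    def __init__(self, samples, batch_size):
        self._samples = list(samples)
        self._batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self._samples) / self._batch_size)

    def __getitem__(self, index):
        # Batch `index` is a fixed, non-overlapping slice of the samples.
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self._batch_size
        return self._samples[start:start + self._batch_size]


batches = ListBatches(range(10), batch_size=4)

# Visiting every batch index once visits every sample exactly once.
seen = [sample for i in range(len(batches)) for sample in batches[i]]
```

Because the batches are addressed by index, they can be fetched in any order or by several workers, and one pass over all indices still covers each sample once.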
Python Generators¶
Masterful also supports Python generators. To use a generator, you need to tell Masterful both the generator function to use and the output signature of the examples that the generator yields. Both are passed to Masterful together as a tuple.
import tensorflow as tf

input_shape = (16, 16, 3)

def generator():
    # A generator function is a zero-arg function that
    # yields images and labels
    i = 0
    max_items = 32
    while i < max_items:
        # The label dtype matches the int32 spec declared below.
        yield tf.zeros(input_shape), tf.ones((), dtype=tf.int32)
        i += 1

# The output signature is a tuple of tensor specs
output_signature = (tf.TensorSpec(input_shape, dtype=tf.float32),
                    tf.TensorSpec((), dtype=tf.int32))
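Before passing a generator and signature pair to Masterful, it can help to confirm that the two agree. One way to do this, sketched below, is to build a throwaway dataset with the standard tf.data.Dataset.from_generator API and read it once. This only illustrates a sanity check; how Masterful consumes the pair internally is not described in this guide. The generator here yields four examples to keep the check fast.

```python
import tensorflow as tf

input_shape = (16, 16, 3)


def generator():
    # Yields four (image, label) pairs whose dtypes match the signature.
    for _ in range(4):
        yield tf.zeros(input_shape), tf.ones((), dtype=tf.int32)


output_signature = (
    tf.TensorSpec(input_shape, dtype=tf.float32),
    tf.TensorSpec((), dtype=tf.int32),
)

# Build a dataset from the same pair and read every example once.
check_dataset = tf.data.Dataset.from_generator(
    generator, output_signature=output_signature)
num_examples = sum(1 for _ in check_dataset)
```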
# Pass the generator and the output signature to Masterful
# as a tuple.
training_dataset_params = masterful.data.learn_data_params(
    dataset=(generator, output_signature),
    task=masterful.enums.Task.CLASSIFICATION,
    image_range=masterful.enums.ImageRange.ZERO_255,
    num_classes=10,
    sparse_labels=True,
)
training_report = masterful.training.train(
    ...
    training_dataset=(generator, output_signature),
    training_dataset_params=training_dataset_params,
    ...
)
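As noted above, the generator function takes no arguments. If your generator needs parameters, one option is to bind them with functools.partial from the Python standard library, which produces a zero-argument callable. The sketch below uses plain Python values in place of tensors, and make_examples is a hypothetical function written for this guide. It also shows a likely reason a function is requested rather than a generator object: calling the function again restarts iteration, while a generator object can be read only once.

```python
import functools


def make_examples(num_items, label):
    """Hypothetical generator that needs arguments."""
    for i in range(num_items):
        yield [float(i)], label


# Bind the arguments to get a zero-argument generator function.
generator = functools.partial(make_examples, num_items=4, label=1)

# Each call returns a fresh iterator, so the data can be read repeatedly.
first_pass = list(generator())
second_pass = list(generator())

# A generator *object*, by contrast, is exhausted after a single pass.
generator_object = make_examples(num_items=4, label=1)
consumed = list(generator_object)
leftover = list(generator_object)
```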