How To Build A Simple Test Data Loader?

A data load task allows you to take various data from a source and load it into a destination data mart, data lake, or data warehouse. You have the option to transform the data before loading it. You create a data load task in a project or folder. Data Integration includes one default project to get you started, but if you’d like to create your own, see Using Projects and Folders.

Simple test data loader

Naive base

Before we get into parallel processing, we should build a simple, naive version of our data loader. To initialize it, we simply store the provided dataset, batch size, and collate function (collate_fn). We'll also create a self.index attribute that stores the next index to be retrieved from the dataset. This already gives us an essentially fully functional data loader; the only problem is that get() reads one dataset element at a time, in the same process that would be used for training.
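A minimal sketch of what this naive loader could look like; the class name NaiveDataLoader and the default collate function are illustrative assumptions, not a fixed API:

```python
# A minimal sketch of the naive loader described above; the class name
# NaiveDataLoader and the default collate function are illustrative assumptions.
import torch


def default_collate(batch):
    # Stack the individual samples into one batched tensor.
    return torch.stack([torch.as_tensor(item) for item in batch])


class NaiveDataLoader:
    def __init__(self, dataset, batch_size=64, collate_fn=default_collate):
        self.dataset = dataset
        self.batch_size = batch_size
        self.collate_fn = collate_fn
        self.index = 0  # next dataset index to be retrieved

    def get(self):
        # Read one batch, element by element, in the current process.
        end = min(self.index + self.batch_size, len(self.dataset))
        batch = [self.dataset[i] for i in range(self.index, end)]
        self.index = end
        return self.collate_fn(batch)

    def __iter__(self):
        self.index = 0
        return self

    def __next__(self):
        if self.index >= len(self.dataset):
            raise StopIteration
        return self.get()
```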

Introducing the workers

A simple way to parallelize loading is to give each worker an index queue holding the indices that worker should load, plus a shared output queue where the worker can put the retrieved data. All the worker needs to do is repeatedly check its index queue and, whenever the queue is not empty, retrieve the data for the next index.
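A possible worker loop matching this description; the function name and queue arguments are assumptions chosen for this sketch:

```python
# A possible worker loop matching the description above; the function name and
# queue arguments are assumptions, not a fixed API.
import queue


def worker_fn(dataset, index_queue, output_queue):
    while True:
        # Repeatedly check this worker's index queue for work.
        try:
            index = index_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        if index is None:
            # Sentinel value: no more work, shut the worker down.
            break
        # Retrieve the element and put it on the shared output queue.
        output_queue.put((index, dataset[index]))
```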

Multi-process data loader

Using our worker function, we can define a multi-process data loader that subclasses our naive data loader. When it initializes, this data loader spawns num_workers worker processes. A single output queue is shared by all worker processes, while each worker has its own index queue.
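A sketch of what this could look like, reusing the NaiveDataLoader and worker_fn sketches above; the round-robin index dispatch and the absence of prefetching and clean shutdown are simplifications:

```python
# A simplified multi-process loader built on the earlier sketches; exact
# dispatch and shutdown details vary between implementations.
import multiprocessing


class MultiProcessDataLoader(NaiveDataLoader):
    def __init__(self, dataset, batch_size=64, num_workers=2,
                 collate_fn=default_collate):
        super().__init__(dataset, batch_size, collate_fn)
        self.num_workers = num_workers
        self.output_queue = multiprocessing.Queue()  # shared by all workers
        self.index_queues = []                       # one queue per worker
        self.workers = []
        for _ in range(num_workers):
            index_queue = multiprocessing.Queue()
            worker = multiprocessing.Process(
                target=worker_fn,
                args=(dataset, index_queue, self.output_queue),
                daemon=True)
            worker.start()
            self.index_queues.append(index_queue)
            self.workers.append(worker)

    def get(self):
        # Hand the next batch of indices to the workers round-robin...
        for offset in range(self.batch_size):
            index = self.index + offset
            if index < len(self.dataset):
                self.index_queues[index % self.num_workers].put(index)
        # ...then collect the loaded elements from the shared output queue.
        expected = min(self.batch_size, len(self.dataset) - self.index)
        batch = {}
        while len(batch) < expected:
            index, data = self.output_queue.get()
            batch[index] = data
        self.index += self.batch_size
        return self.collate_fn([batch[i] for i in sorted(batch)])
```

A fuller version would also put a sentinel (e.g. None) on each index queue and join the workers when the loader is torn down, and would prefetch indices ahead of time so the workers stay busy while the main process trains.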

Testing

As a simple test, we can mock a dataset that takes some time to load an element simply by calling time.sleep() before returning the item. We can also mimic the training loop by iterating through the data loader and sleeping at each step, to mock the time it would take to forward propagate, backpropagate, and update the network weights.
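A test harness along those lines; SlowDataset and the sleep durations are arbitrary values chosen only to simulate slow I/O and training time:

```python
# A test harness along those lines; SlowDataset and the sleep durations are
# arbitrary values chosen only to simulate slow I/O and training time.
import time


class SlowDataset:
    def __init__(self, size=256):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        time.sleep(0.01)  # pretend loading a single element is slow
        return index


if __name__ == "__main__":
    loader = MultiProcessDataLoader(SlowDataset(), batch_size=32, num_workers=4)
    start = time.time()
    for batch in loader:
        time.sleep(0.05)  # pretend forward pass, backward pass, weight update
    print(f"epoch time: {time.time() - start:.2f}s")
```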

Dataset

We will now go through the details of how to set up a Python Dataset class that characterizes the key properties of the dataset you want to generate. We make this class inherit from torch.utils.data.Dataset so that we can use nice features such as multiprocessing later.

The data file was modified to have the following pattern (a Dataset sketch follows the list):

⦁ Train → Images | Labels (class1-class2-class3)
⦁ Test → Images | Labels (class1-class2-class3)
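
A sketch of such a Dataset for the Images | Labels layout above; the CSV column names ("image", "labels"), the image directory, and the label parsing are assumptions made for illustration:

```python
# A sketch of a Dataset for the Images | Labels layout above; the CSV columns
# ("image", "labels"), the image directory, and the label parsing are
# assumptions made for illustration.
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class MultiLabelImageDataset(Dataset):
    def __init__(self, csv_file, image_dir, transform=None):
        self.frame = pd.read_csv(csv_file)  # one row per image
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, index):
        row = self.frame.iloc[index]
        image = Image.open(os.path.join(self.image_dir, row["image"])).convert("RGB")
        # Labels are stored as a single "class1-class2-class3" string;
        # mapping class names to integer indices is left out of this sketch.
        labels = row["labels"].split("-")
        if self.transform is not None:
            image = self.transform(image)
        return image, labels
```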

Transforms

One problem we can see from the above is that the samples are not all the same size, while most neural networks expect images of a fixed size. We therefore apply a few preprocessing transforms (sketched after the list below):

⦁ Scale: to scale the image

⦁ RandomCrop: to randomly crop the image; this is a form of data augmentation

⦁ ToTensor: to convert NumPy images into PyTorch tensors
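
A sketch of a matching transform pipeline; the output sizes (256 and 224) are assumptions, and recent torchvision versions call the scaling transform Resize rather than Scale:

```python
# A sketch of a matching transform pipeline; the output sizes (256, 224) are
# assumptions, and recent torchvision versions call the scaling transform
# Resize rather than Scale.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize(256),      # scale the image
    transforms.RandomCrop(224),  # random crop: a simple data augmentation
    transforms.ToTensor(),       # convert the PIL/NumPy image to a torch tensor
])
```

This pipeline can be passed to the Dataset sketched earlier through its transform argument, so every sample is preprocessed as it is loaded.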

Conclusion

You can now run your PyTorch script from the command line, and you will see that during the training phase the data is generated in parallel by the CPU and then fed to the GPU for the neural network computations.