DataLoader (PyTorch)

The most important parameters (a usage sketch follows the list):

  • num_workers=4 (load batches in four parallel worker subprocesses)
  • pin_memory=True (allocate batches in page-locked memory for faster CPU-to-GPU copies)
  • persistent_workers=True (keep worker processes alive across epochs instead of respawning them)
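
A minimal sketch of a DataLoader built with these flags (the toy TensorDataset here is hypothetical, just to make the snippet self-contained):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 samples of 3 features each, with binary labels.
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))

loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    num_workers=4,            # load batches in 4 worker subprocesses
    pin_memory=True,          # page-locked host memory speeds up CPU-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs instead of respawning them
)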

When you specify a batch size of 10, how does the DataLoader organize the dataset?

First, the DataLoader generates a list of indices from your dataset:

  • If shuffle=True, it shuffles the indices.
  • Otherwise, it just uses [0, 1, 2, …, len(dataset)-1].

Example for a dataset of length 100:

indices = [0, 1, 2, ..., 99]
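
This is roughly what the built-in samplers produce (SequentialSampler when shuffle=False, RandomSampler when shuffle=True); a small sketch:

import torch

n = 100  # len(dataset)

# shuffle=False: indices come out in order, 0, 1, ..., n-1
sequential_indices = list(range(n))

# shuffle=True: a fresh random permutation of the indices
shuffled_indices = torch.randperm(n).tolist()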

Batch construction: the DataLoader then groups the indices into consecutive chunks of size 10:

[[0, 1, ..., 9], [10, 11, ..., 19], ..., [90, 91, ..., 99]]
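
A minimal sketch of this chunking step (the DataLoader delegates it to a BatchSampler internally; the list comprehension below just mimics that behavior with drop_last=False):

batch_size = 10
indices = list(range(100))

# Consecutive slices of batch_size indices each
batches = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]

print(batches[0])   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(batches[-1])  # [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]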

If you specify shuffle=True instead, the indices are permuted first:

shuffled_indices = [83, 5, 91, 23, 44, 67, 12, 18, 2, 30,  ...]

and the batches become

batch_1 = [83, 5, 91, 23, 44, 67, 12, 18, 2, 30]
batch_2 = [75, 96, 0, 10, 3, 66, 45, 7, 88, 59]
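
To see that the permutation is redrawn on every pass, iterate a DataLoader twice (the toy dataset below is hypothetical; each sample's value equals its index, so the printed batch exposes the order):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Each sample's value equals its index, so batches reveal the ordering.
dataset = TensorDataset(torch.arange(100))
loader = DataLoader(dataset, batch_size=10, shuffle=True)

for epoch in range(2):
    first_batch = next(iter(loader))[0]
    print(f"epoch {epoch}: {first_batch.tolist()}")
# Prints a different set of 10 indices each epoch, because the permutation
# is redrawn at the start of every iteration over the loader.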

Why would you ever want to shuffle your data?

shuffle=True is crucial when training machine learning models, especially neural networks. If samples arrive in a fixed order (say, sorted by class or by time), consecutive mini-batches are not representative of the full data distribution, gradient estimates become biased, and the model can overfit to the ordering itself and generalize poorly. Reshuffling the indices every epoch keeps each mini-batch a roughly random sample of the dataset.