DataLoader (PyTorch)
The parameters that matter most for loading performance:
- num_workers=4: load batches in parallel worker processes
- pin_memory=True: allocate batches in page-locked memory for faster CPU-to-GPU copies
- persistent_workers=True: keep worker processes alive between epochs instead of respawning them
A minimal construction sketch is shown below.
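Here is a minimal sketch of how these flags fit together; the toy TensorDataset and the training loop body are placeholders, not part of the original example:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 100 samples with 8 features each (placeholder).
dataset = TensorDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))

loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,             # reshuffle indices at the start of every epoch
    num_workers=4,            # 4 worker processes load batches in parallel
    pin_memory=True,          # page-locked memory speeds up CPU-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs (requires num_workers > 0)
)

for features, labels in loader:
    pass  # training step goes here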
When you specify a batch size of 10, how does the DataLoader organize the dataset?
First, the DataLoader generates a list of indices from your dataset:
- If shuffle=True, it shuffles the indices.
- Otherwise, it just uses [0, 1, 2, …, len(dataset)-1].
Example for a dataset of length 100:
indices = [0, 1, 2, ..., 99]
Batch construction: it groups the indices into chunks of size 10:
[[0, 1, ..., 9], [10, 11, ..., 19], ..., [90, 91, ..., 99]]
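You can reproduce this grouping with the sampler classes DataLoader uses internally, SequentialSampler and BatchSampler; a small sketch:

from torch.utils.data import BatchSampler, SequentialSampler

# Sequential indices for a dataset of length 100, chunked into batches of 10.
batches = list(BatchSampler(SequentialSampler(range(100)), batch_size=10, drop_last=False))
print(batches[0])   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(batches[-1])  # [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]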
If you specify shuffle=True instead, the indices are shuffled first:
shuffled_indices = [83, 5, 91, 23, 44, 67, 12, 18, 2, 30, ...]
Then your batches are:
batch_1 = [83, 5, 91, 23, 44, 67, 12, 18, 2, 30]
batch_2 = [75, 96, 0, 10, 3, 66, 45, 7, 88, 59]
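The index values above are illustrative; the actual permutation depends on the random seed. A seeded sketch that produces a reproducible shuffled batch, using the RandomSampler that shuffle=True enables under the hood:

import torch
from torch.utils.data import BatchSampler, RandomSampler

# Seeded generator so the shuffled order is reproducible across runs.
g = torch.Generator().manual_seed(0)
sampler = RandomSampler(range(100), generator=g)
batches = list(BatchSampler(sampler, batch_size=10, drop_last=False))
print(batches[0])  # first shuffled batch of 10 indices (depends on the seed)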
Why would you ever want to shuffle your batches?
shuffle=True is crucial when training machine learning models, especially neural networks. With a fixed order, consecutive gradient updates are correlated with whatever structure the ordering has (for example, data sorted by class), so the model can overfit to that order, pick up bias from it, and generalize poorly. Reshuffling every epoch breaks that structure.
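Because DataLoader draws a fresh permutation for each epoch, you can observe the effect directly; a toy sketch:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))
loader = DataLoader(dataset, batch_size=10, shuffle=True)

# Each new iterator draws a new permutation, so consecutive epochs
# see the samples in different orders.
first_batch_epoch_1 = next(iter(loader))[0]
first_batch_epoch_2 = next(iter(loader))[0]
print(first_batch_epoch_1)
print(first_batch_epoch_2)  # almost certainly differs from epoch 1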