Hyperparameter Tuning (Neural Network Training)
Below is everything I have learned about training neural networks in practice. Also see CNN to learn about how to use them in practice.
A new guide by Google Research on NN tuning came out: https://github.com/google-research/tuning_playbook
In practice, we can use model ensembles to get ~2% extra performance with the following steps:
- Train multiple independent models
- At test time, average their results
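A minimal sketch of the test-time averaging step, using toy stand-ins for trained models (the names and probabilities here are illustrative, not from the course):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the per-class probabilities of independently trained models."""
    probs = [m(x) for m in models]
    return np.mean(probs, axis=0)

# Hypothetical stand-ins for two trained models on a 2-class problem:
model_a = lambda x: np.array([0.7, 0.3])
model_b = lambda x: np.array([0.5, 0.5])

avg = ensemble_predict([model_a, model_b], x=None)  # → [0.6, 0.4]
```

Averaging probabilities (rather than picking the majority class) keeps the ensemble's confidence information, which also helps if you later calibrate or threshold the outputs.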
We can also perform Polyak averaging: instead of using the actual parameter vector at test time, we keep an exponential moving average of the parameters and use that instead:

```python
while True:
    data_batch = dataset.sample_data_batch()
    loss = network.forward(data_batch)
    dx = network.backward()
    x += -learning_rate * dx
    x_test = 0.995 * x_test + 0.005 * x  # use x_test at test time
```

Neural Net Recipes
Most Common Neural Net Mistakes:
- You didn’t try to overfit a single batch first
- You forgot to toggle train/eval mode for the net
- You forgot .zero_grad() (in PyTorch) before .backward()
- You passed softmaxed outputs to a loss that expects raw logits
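The last mistake is worth demonstrating: losses like PyTorch's `CrossEntropyLoss` expect raw logits and apply softmax internally, so passing already-softmaxed outputs applies softmax twice. A minimal NumPy sketch (numbers are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_from_logits(logits, target):
    # correct usage: the loss applies softmax internally to raw logits
    return -np.log(softmax(logits)[target])

logits = np.array([2.0, 0.5, -1.0])
target = 0

correct = cross_entropy_from_logits(logits, target)
# BUG: passing already-softmaxed probabilities squashes the score range,
# so the loss is silently too large and the gradients shrink
buggy = cross_entropy_from_logits(softmax(logits), target)
```

The buggy loss is noticeably larger than the correct one even though nothing crashes, which is exactly why this mistake is easy to miss.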
See A Recipe for Training Neural Networks by Andrej Karpathy, published in 2019.
I wrote an article on it on Medium, see Coding a Neural Network.
Hyperparameter Tuning
Hyperparameters are choices about the algorithm that we set rather than learn. See https://cs231n.github.io/neural-networks-3/
They are very problem-dependent and we must try them all out to see what works best.
For hyperparameter tuning, use the validation dataset.
Some parameters:
- Learning rate / step size (most important to figure out; it determines how fast gradient descent updates)
- setting it too small makes learning too slow
- setting it too big makes it overshoot and become unstable (i.e. the loss will sometimes get bigger, sometimes smaller between training steps)
- Maybe try learning rate decay
- Regularization Parameters such as Dropout
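One common form of learning rate decay is step decay: cut the LR by a fixed factor every so many epochs. A minimal sketch (the function name and defaults are my own, not from the course):

```python
def step_decay(lr0, epoch, decay_rate=0.5, decay_every=10):
    """Halve the learning rate every `decay_every` epochs (illustrative schedule)."""
    return lr0 * (decay_rate ** (epoch // decay_every))

# starting at 0.1, the LR halves every 10 epochs:
# step_decay(0.1, 0)  → 0.1
# step_decay(0.1, 10) → 0.05
```

In PyTorch the equivalent built-in is `torch.optim.lr_scheduler.StepLR`; other decay shapes (exponential, cosine) follow the same idea of shrinking the step size as training progresses.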
Andrej Karpathy says that our loss shouldn’t look like a hockey stick!! Omg, all my training used to always be like that. The reason it looks like a hockey stick is the easy gains from bad initialization. If you have good initialization, then your loss curve should be rather straight.
At initialization, each class should have about uniform probability of being selected. So you can compute approximately what the initial loss should be.
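Concretely, for softmax + cross-entropy with uniform predictions, each of the C classes gets probability 1/C, so the expected initial loss is -log(1/C) = log(C):

```python
import math

def expected_initial_loss(num_classes):
    """Cross-entropy loss when every class has uniform probability 1/C."""
    return math.log(num_classes)

# e.g. CIFAR-10: log(10) ≈ 2.303; ImageNet: log(1000) ≈ 6.908
```

If your first logged loss is far from this value, something (initialization, loss wiring, label encoding) is broken before training even starts.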
There are two ways to search for hyperparameters:
- Grid Search
- Random Search (recommended approach) → https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
Also see [[notes/Computer Vision#Tips for doing well on benchmarks/winning competitions|Computer Vision#Tips for doing well on benchmarks/winning competitions]]
CS231n Lec 6: 7-step workflow
- Check initial loss: at init, classes should be roughly uniform, so for a C-class softmax the loss should be about log(C) (e.g. log(1000) ≈ 6.9 for ImageNet). If it’s wildly off, your model is broken before you start.
- Overfit a small sample — ~5–20 examples. Turn off regularization. Loss should drive to ~0. If not, the model can’t learn anything.
- Find an LR that makes the loss go down: use full data + small weight decay. Try a few LRs on a log scale (e.g. 1e-1, 1e-2, 1e-3, 1e-4) and look for a significant loss drop within ~100 iterations.
- Coarse hyperparam grid, train ~1–5 epochs — find a working region for LR + weight decay.
- Refine grid, train longer.
- Look at loss/accuracy curves (see below).
- GOTO 5 — iterate.
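The "overfit a small sample" and "find an LR" steps above can be sketched on a toy problem (the linear model and synthetic data are illustrative stand-ins, not the course's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in "dataset": 8 examples, 3 features, exactly linear targets
X = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def train(lr, steps=500):
    """Plain gradient descent on MSE with a linear model; returns final loss."""
    w = np.zeros(3)
    loss = np.mean((X @ w - y) ** 2)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        loss = np.mean((X @ w - y) ** 2)
        if not np.isfinite(loss):  # diverged: LR too big
            return float("inf")
    return loss

# overfit the tiny sample: with a workable LR, loss should drive to ~0;
# coarse LR sweep on a log scale to find that workable region
results = {lr: train(lr) for lr in [1e-3, 1e-2, 1e-1, 1.0]}
```

The pattern to look for is the same as with a real network: too-small LRs barely move the loss, a good LR drives it to ~0 on the tiny sample, and too-large LRs make it blow up.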
Reading train/val accuracy curves
| Curve shape | Diagnosis | Action |
|---|---|---|
| Train and val both still climbing | Underfit on training time | Train longer |
| Huge train/val gap, val drops | Overfitting | More regularization or more data |
| Train ≈ val, both rising slowly | Underfitting capacity | Train longer or use a bigger model |
Random search > grid search
Bergstra & Bengio 2012: if some hyperparams matter more than others (almost always true — LR dominates), grid search wastes evaluations on the unimportant axis. Random search covers the important axis with more distinct values for the same budget.
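A minimal sketch of the random-search side of this argument (the ranges are illustrative): with 9 random trials you get 9 distinct values along the important LR axis, whereas a 3×3 grid would only probe 3.

```python
import random

random.seed(0)

def sample_config():
    """Sample LR and weight decay log-uniformly (illustrative ranges)."""
    lr = 10 ** random.uniform(-4, -1)            # log-uniform in [1e-4, 1e-1]
    weight_decay = 10 ** random.uniform(-6, -2)  # log-uniform in [1e-6, 1e-2]
    return {"lr": lr, "weight_decay": weight_decay}

trials = [sample_config() for _ in range(9)]
distinct_lrs = {t["lr"] for t in trials}  # 9 distinct LR values
```

Sampling on a log scale matters as much as the random part: LR and weight decay act multiplicatively, so uniform sampling in the exponent covers the useful range evenly.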
Source
CS231n Lec 6 slides 89–96 (hyperparam workflow, loss curves, random vs grid search).