Hyperparameter Tuning (Neural Network Training)
Below is everything I have learned about training neural networks in practice. Also see CNN to learn about how to use them in practice.
A new guide by Google Research on NN tuning came out: https://github.com/google-research/tuning_playbook
In practice, we can use model ensembles to get ~2% extra performance with the following steps:
- Train multiple independent models
- At test time, average their results
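A minimal sketch of the test-time averaging step, using toy stand-ins for trained models (the names and probabilities here are illustrative, not from the course):

```python
import numpy as np

def ensemble_predict(models, x):
    """Average the per-class probabilities of independently trained models."""
    probs = [m(x) for m in models]
    return np.mean(probs, axis=0)

# Hypothetical stand-ins for two trained models on a 2-class problem:
model_a = lambda x: np.array([0.7, 0.3])
model_b = lambda x: np.array([0.5, 0.5])

avg = ensemble_predict([model_a, model_b], x=None)  # → [0.6, 0.4]
```

Averaging probabilities (rather than picking the majority class) keeps the ensemble's confidence information, which also helps if you later calibrate or threshold the outputs.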
We can also perform Polyak averaging: instead of using the actual parameter vector at test time, we keep an exponential moving average of the parameters and use that instead:

```python
while True:
    data_batch = dataset.sample_data_batch()
    loss = network.forward(data_batch)
    dx = network.backward()
    x += -learning_rate * dx
    x_test = 0.995 * x_test + 0.005 * x  # use x_test at test time
```

Neural Net Recipes
Most Common Neural Net Mistakes:
- You didn’t try to overfit a single batch first
- You forgot to toggle train/eval mode for the net
- You forgot .zero_grad() (in PyTorch) before .backward()
- You passed softmaxed outputs to a loss that expects raw logits
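The last mistake is worth demonstrating: losses like PyTorch's `CrossEntropyLoss` expect raw logits and apply softmax internally, so passing already-softmaxed outputs applies softmax twice. A minimal NumPy sketch (numbers are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_from_logits(logits, target):
    # correct usage: the loss applies softmax internally to raw logits
    return -np.log(softmax(logits)[target])

logits = np.array([2.0, 0.5, -1.0])
target = 0

correct = cross_entropy_from_logits(logits, target)
# BUG: passing already-softmaxed probabilities squashes the score range,
# so the loss is silently too large and the gradients shrink
buggy = cross_entropy_from_logits(softmax(logits), target)
```

The buggy loss is noticeably larger than the correct one even though nothing crashes, which is exactly why this mistake is easy to miss.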
See A Recipe for Training Neural Networks by Andrej Karpathy, published in 2019.
I wrote an article on it on Medium, see Coding a Neural Network.
Hyperparameter Tuning
Hyperparameters are choices about the algorithm that we set rather than learn. See https://cs231n.github.io/neural-networks-3/
They are very problem-dependent and we must try them all out to see what works best.
For hyperparameter tuning, use the validation dataset.
Some parameters:
- Learning rate / step size (most important to figure out; it determines how fast gradient descent updates)
- setting it too small makes learning too slow
- setting it too big makes it overshoot and become unstable (i.e. the loss will sometimes get bigger, sometimes smaller between training steps)
- Maybe try learning rate decay
- Regularization Parameters such as Dropout
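One common form of learning rate decay is step decay: cut the LR by a fixed factor every so many epochs. A minimal sketch (the function name and defaults are my own, not from the course):

```python
def step_decay(lr0, epoch, decay_rate=0.5, decay_every=10):
    """Halve the learning rate every `decay_every` epochs (illustrative schedule)."""
    return lr0 * (decay_rate ** (epoch // decay_every))

# starting at 0.1, the LR halves every 10 epochs:
# step_decay(0.1, 0)  → 0.1
# step_decay(0.1, 10) → 0.05
```

In PyTorch the equivalent built-in is `torch.optim.lr_scheduler.StepLR`; other decay shapes (exponential, cosine) follow the same idea of shrinking the step size as training progresses.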
Andrej Karpathy says that our loss shouldn’t look like a hockey stick!! Omg, all my training used to always be like that. The reason it looks like a hockey stick is the easy gains from bad initialization. If you have good initialization, then your loss curve should be rather straight.
At initialization, each class should have about uniform probability of being selected. So you can compute approximately what the initial loss should be.
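Concretely, for softmax + cross-entropy with uniform predictions, each of the C classes gets probability 1/C, so the expected initial loss is -log(1/C) = log(C):

```python
import math

def expected_initial_loss(num_classes):
    """Cross-entropy loss when every class has uniform probability 1/C."""
    return math.log(num_classes)

# e.g. CIFAR-10: log(10) ≈ 2.303; ImageNet: log(1000) ≈ 6.908
```

If your first logged loss is far from this value, something (initialization, loss wiring, label encoding) is broken before training even starts.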
There are two ways to search for hyperparameters:
- Grid Search
- Random Search (recommended approach) → https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
Also see [[notes/Computer Vision#Tips for doing well on benchmarks/winning competitions|Computer Vision#Tips for doing well on benchmarks/winning competitions]]
CS231n Lec 6: 7-step workflow
- Check initial loss: at init, classes should be roughly uniform, so for a C-class softmax the loss should be about log(C) (e.g. log(1000) ≈ 6.9 for ImageNet). If it’s wildly off, your model is broken before you start.
- Overfit a small sample — ~5–20 examples. Turn off regularization. Loss should drive to ~0. If not, the model can’t learn anything.
- Find an LR that makes the loss go down: use full data + small weight decay. Try a few LRs on a log scale (e.g. 1e-1, 1e-2, 1e-3, 1e-4) and look for a significant loss drop within ~100 iterations.
- Coarse hyperparam grid, train ~1–5 epochs — find a working region for LR + weight decay.
- Refine grid, train longer.
- Look at loss/accuracy curves (see below).
- GOTO 5 — iterate.
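The "overfit a small sample" and "find an LR" steps above can be sketched on a toy problem (the linear model and synthetic data are illustrative stand-ins, not the course's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy stand-in "dataset": 8 examples, 3 features, exactly linear targets
X = rng.normal(size=(8, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def train(lr, steps=500):
    """Plain gradient descent on MSE with a linear model; returns final loss."""
    w = np.zeros(3)
    loss = np.mean((X @ w - y) ** 2)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
        loss = np.mean((X @ w - y) ** 2)
        if not np.isfinite(loss):  # diverged: LR too big
            return float("inf")
    return loss

# overfit the tiny sample: with a workable LR, loss should drive to ~0;
# coarse LR sweep on a log scale to find that workable region
results = {lr: train(lr) for lr in [1e-3, 1e-2, 1e-1, 1.0]}
```

The pattern to look for is the same as with a real network: too-small LRs barely move the loss, a good LR drives it to ~0 on the tiny sample, and too-large LRs make it blow up.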
Reading train/val accuracy curves
| Curve shape | Diagnosis | Action |
|---|---|---|
| Train and val both still climbing | Underfit on training time | Train longer |
| Huge train/val gap, val drops | Overfitting | More regularization or more data |
| Train ≈ val, both rising slowly | Underfitting capacity | Train longer or use a bigger model |
Random search > grid search
Bergstra & Bengio 2012: if some hyperparams matter more than others (almost always true — LR dominates), grid search wastes evaluations on the unimportant axis. Random search covers the important axis with more distinct values for the same budget.
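A minimal sketch of the random-search side of this argument (the ranges are illustrative): with 9 random trials you get 9 distinct values along the important LR axis, whereas a 3×3 grid would only probe 3.

```python
import random

random.seed(0)

def sample_config():
    """Sample LR and weight decay log-uniformly (illustrative ranges)."""
    lr = 10 ** random.uniform(-4, -1)            # log-uniform in [1e-4, 1e-1]
    weight_decay = 10 ** random.uniform(-6, -2)  # log-uniform in [1e-6, 1e-2]
    return {"lr": lr, "weight_decay": weight_decay}

trials = [sample_config() for _ in range(9)]
distinct_lrs = {t["lr"] for t in trials}  # 9 distinct LR values
```

Sampling on a log scale matters as much as the random part: LR and weight decay act multiplicatively, so uniform sampling in the exponent covers the useful range evenly.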
Source
CS231n Lec 6 slides 89–96 (hyperparam workflow, loss curves, random vs grid search).