Backpropagation is the most common training algorithm for neural networks.
It makes gradient descent feasible for multi-layer networks.
TensorFlow handles backpropagation automatically, so you don't need a deep
understanding of the algorithm. To get a sense of how it works, walk through
the backpropagation algorithm visual explanation.
As you scroll through the preceding explanation, note the following:
How data flows through the graph.
How dynamic programming lets us avoid computing exponentially many
paths through the graph. Here, "dynamic programming" just means recording
intermediate results on the forward and backward passes, as the sketch
after this list illustrates.
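For concreteness, here is a minimal TensorFlow sketch of that idea; the shapes and variable names are made up. The tape records each intermediate result on the forward pass and reuses it on the backward pass:

```python
import tensorflow as tf

# Toy two-layer network; all shapes and variable names here are illustrative.
w1 = tf.Variable(tf.random.normal([3, 4]))
w2 = tf.Variable(tf.random.normal([4, 1]))
x = tf.random.normal([8, 3])
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    hidden = tf.nn.relu(x @ w1)   # forward pass: intermediate result is recorded
    loss = tf.reduce_mean(tf.square(hidden @ w2 - y))

# Backward pass: the tape walks the recorded operations once, reusing the
# cached intermediates instead of recomputing every path through the graph.
grad_w1, grad_w2 = tape.gradient(loss, [w1, w2])
```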
Training Neural Nets
Backprop: What You Need To Know
Gradients are important: if it's differentiable, we can probably learn on it.
Gradients can vanish: each additional layer can successively reduce signal vs. noise; ReLUs are useful here.
Gradients can explode: learning rates are important here, and batch normalization (a useful knob) can help.
ReLU layers can die: keep calm and lower your learning rates.
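As one way you might watch for these failure modes, the sketch below inspects per-layer gradient norms after a single training step; the architecture, shapes, and learning rate are all assumptions, not a prescribed recipe:

```python
import tensorflow as tf

# Illustrative model: layer sizes and learning rate are made-up assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # the "useful knob" for exploding gradients
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # lower this if ReLUs die

x = tf.random.normal([64, 10])
y = tf.random.normal([64, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

# Per-layer gradient magnitudes: shrinking toward zero in early layers
# suggests vanishing gradients, growth without bound suggests exploding
# gradients, and values stuck at exactly zero can indicate dead ReLUs.
for var, grad in zip(model.trainable_variables, grads):
    tf.print(var.name, tf.norm(grad))
```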
Normalizing Feature Values
We'd like our features to have reasonable scales: roughly zero-centered values in the [-1, 1] range often work well.
This helps gradient descent converge and avoids the NaN trap.
Avoiding outlier values can also help.
We can use a few standard methods, sketched below:
Linear scaling
Hard cap (clipping) to a max and min
Log scaling
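A quick sketch of the three methods on made-up feature values:

```python
import numpy as np

values = np.array([0.5, 3.0, 12.0, 500.0])  # made-up feature values

# Linear scaling: map [min, max] onto roughly [-1, 1].
lo, hi = values.min(), values.max()
linear = 2.0 * (values - lo) / (hi - lo) - 1.0

# Hard cap (clipping): pin everything outside a chosen [min, max].
clipped = np.clip(values, 0.0, 100.0)

# Log scaling: compress long-tailed distributions.
logged = np.log1p(values)
```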
Dropout Regularization
Dropout: another form of regularization, useful for neural networks.
It works by randomly "dropping out" unit activations in the network for a single gradient step.
There's a connection to ensemble models here: each step effectively trains a different thinned subnetwork, and the full network acts like an average over them.
The more you drop out, the stronger the regularization.
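A minimal Keras sketch (the layer sizes are arbitrary); the rate passed to the dropout layer is the fraction of activations dropped on each training step, so raising it strengthens the regularization:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # drops 20% of activations, training only
    tf.keras.layers.Dense(1),
])

# With training=True, 20% of the hidden activations are zeroed this step;
# with training=False (inference), dropout is a no-op and the full network runs.
out = model(tf.random.normal([4, 8]), training=True)
```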