Backpropagation is the most common training algorithm for neural networks.
It makes gradient descent feasible for multi-layer networks.
TensorFlow handles backpropagation automatically, so you don't need a deep
understanding of the algorithm. To get a sense of how it works, walk through
the backpropagation algorithm visual explanation.
As you scroll through the preceding explanation, note the following:
How data flows through the graph.
How dynamic programming lets us avoid computing exponentially many
paths through the graph. Here, "dynamic programming" just means recording
intermediate results on the forward and backward passes, as the sketch
after this list illustrates.
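For concreteness, here is a minimal TensorFlow sketch of that idea; the shapes and variable names are made up. The tape records each intermediate result on the forward pass and reuses it on the backward pass:

```python
import tensorflow as tf

# Toy two-layer network; all shapes and variable names here are illustrative.
w1 = tf.Variable(tf.random.normal([3, 4]))
w2 = tf.Variable(tf.random.normal([4, 1]))
x = tf.random.normal([8, 3])
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    hidden = tf.nn.relu(x @ w1)   # forward pass: intermediate result is recorded
    loss = tf.reduce_mean(tf.square(hidden @ w2 - y))

# Backward pass: the tape walks the recorded operations once, reusing the
# cached intermediates instead of recomputing every path through the graph.
grad_w1, grad_w2 = tape.gradient(loss, [w1, w2])
```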
Training Neural Nets
Backprop: What You Need To Know
Gradients are important: if it's differentiable, we can probably learn on it.
Gradients can vanish: each additional layer can successively reduce signal vs. noise; ReLUs are useful here.
Gradients can explode: learning rates are important here, and batch normalization (a useful knob) can help.
ReLU layers can die: keep calm and lower your learning rates.
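As one way you might watch for these failure modes, the sketch below inspects per-layer gradient norms after a single training step; the architecture, shapes, and learning rate are all assumptions, not a prescribed recipe:

```python
import tensorflow as tf

# Illustrative model: layer sizes and learning rate are made-up assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.BatchNormalization(),   # the "useful knob" for exploding gradients
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # lower this if ReLUs die

x = tf.random.normal([64, 10])
y = tf.random.normal([64, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

# Per-layer gradient magnitudes: shrinking toward zero in early layers
# suggests vanishing gradients, growth without bound suggests exploding
# gradients, and values stuck at exactly zero can indicate dead ReLUs.
for var, grad in zip(model.trainable_variables, grads):
    tf.print(var.name, tf.norm(grad))
```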
Normalizing Feature Values
We'd like our features to have reasonable scales: roughly zero-centered values in the [-1, 1] range often work well.
This helps gradient descent converge and avoids the NaN trap.
Avoiding outlier values can also help.
We can use a few standard methods, sketched below:
Linear scaling
Hard cap (clipping) to a max and min
Log scaling
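A quick sketch of the three methods on made-up feature values:

```python
import numpy as np

values = np.array([0.5, 3.0, 12.0, 500.0])  # made-up feature values

# Linear scaling: map [min, max] onto roughly [-1, 1].
lo, hi = values.min(), values.max()
linear = 2.0 * (values - lo) / (hi - lo) - 1.0

# Hard cap (clipping): pin everything outside a chosen [min, max].
clipped = np.clip(values, 0.0, 100.0)

# Log scaling: compress long-tailed distributions.
logged = np.log1p(values)
```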
Dropout Regularization
Dropout: another form of regularization, useful for neural networks.
It works by randomly "dropping out" unit activations in the network for a single gradient step.
There's a connection to ensemble models here: each step effectively trains a different thinned subnetwork, and the full network acts like an average over them.
The more you drop out, the stronger the regularization.
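A minimal Keras sketch (the layer sizes are arbitrary); the rate passed to the dropout layer is the fraction of activations dropped on each training step, so raising it strengthens the regularization:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # drops 20% of activations, training only
    tf.keras.layers.Dense(1),
])

# With training=True, 20% of the hidden activations are zeroed this step;
# with training=False (inference), dropout is a no-op and the full network runs.
out = model(tf.random.normal([4, 8]), training=True)
```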