This appendix contains a few additional details about learning rates.
Learning rate decay schedule
The best learning rate decay schedule family is an open problem; it's not clear how to construct a set of rigorous experiments to confidently answer this question. Although we don't know the best schedule family, we're confident of the following:
- It's important to have some (non-constant) schedule.
- Tuning that schedule is important.
Different learning rates work best at different times during the optimization process. Having some sort of schedule makes it more likely for the model to hit a good learning rate.
Best default learning rate decay
We recommend either of the following learning rate decay families as a default:
- Linear decay
- Cosine decay
Many other schedule families are probably good, too.
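For concreteness, here is a minimal sketch of both recommended families in plain Python. The step counts and the peak/final learning rate values are arbitrary placeholders, and most frameworks also provide built-in versions of these schedules.

```python
import math

def linear_decay(step, total_steps, base_lr, final_lr=0.0):
    """Linearly interpolate from base_lr to final_lr over total_steps."""
    frac = min(step, total_steps) / total_steps
    return base_lr + frac * (final_lr - base_lr)

def cosine_decay(step, total_steps, base_lr, final_lr=0.0):
    """Follow a half-cosine from base_lr down to final_lr over total_steps."""
    frac = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * frac))
    return final_lr + cosine * (base_lr - final_lr)

# Example: a 10,000-step run with a peak learning rate of 0.1.
for step in (0, 2_500, 5_000, 10_000):
    print(step, linear_decay(step, 10_000, 0.1), cosine_decay(step, 10_000, 0.1))
```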
Why do some papers have complicated learning rate schedules?
Many academic papers use complicated piece-wise learning rate (LR) decay schedules. Readers often wonder how the authors arrived at such a complicated schedule. Many complicated LR decay schedules are the result of tuning the schedule as a function of the validation set performance in an ad hoc way. That is:
- Start a single training run with some simple LR decay (or a constant learning rate).
- Keep training until performance seems to stagnate, then pause training and resume it from that point with a steeper LR decay schedule (or a smaller constant learning rate). Repeat this process until the conference or launch deadline (a sketch of the kind of schedule this produces follows below).
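For illustration only, the end product of this process is usually a piecewise-constant schedule along the lines of the sketch below; the boundaries and values here are made up and simply mark wherever training happened to be paused.

```python
def piecewise_constant_lr(step):
    """Hypothetical hand-tuned schedule: each boundary marks a point where
    training appeared to stagnate and was restarted with a lower rate."""
    boundaries = [30_000, 60_000, 85_000]   # arbitrary, chosen by eye
    values = [0.1, 0.01, 0.001, 0.0001]     # one value per interval
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]
```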
Blithely copying the resulting schedule is generally not a good idea since the best particular schedule is sensitive to a host of other hyperparameter choices. We recommend copying the algorithm that produced the schedule, although this is rarely possible when arbitrary human judgment produced the schedule. This type of validation-error-sensitive schedule is fine to use if it can be fully automated, but human-in-the-loop schedules that are a function of validation error are brittle and not easily reproducible, so we recommend avoiding them. Before publishing results that used such a schedule, please try to make it fully reproducible.
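If a validation-error-sensitive schedule is desired, one way to automate it fully is a "reduce on plateau" rule. The sketch below is one hypothetical version: train_one_epoch and evaluate_validation_error are placeholder callables, and the patience and decay factor are arbitrary defaults, not recommendations.

```python
def train_with_plateau_decay(train_one_epoch, evaluate_validation_error,
                             base_lr=0.1, decay_factor=0.1,
                             patience=5, max_epochs=100):
    """Decay the learning rate by decay_factor whenever validation error has
    not improved for `patience` consecutive epochs. Because the rule is fully
    automated, the resulting schedule is reproducible from code and data alone."""
    lr = base_lr
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(lr)                  # placeholder training step
        error = evaluate_validation_error()  # placeholder evaluation step
        if error < best_error:
            best_error = error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                lr *= decay_factor
                epochs_without_improvement = 0
    return lr
```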
How should Adam's hyperparameters be tuned?
Not all the hyperparameters in Adam are equally important. The following rules of thumb correspond to different "budgets" for the number of trials in a study.
- If < 10 trials in a study, only tune the (base) learning rate.
- If 10-25 trials in a study, tune the learning rate and `beta_1`.
- If 25+ trials, tune the learning rate, `beta_1`, and `epsilon`.
- If substantially more than 25 trials, additionally tune `beta_2`.
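As a rough illustration of how these budgets could translate into search spaces, here is a hypothetical sketch; the parameter ranges are placeholders rather than recommendations.

```python
# Illustrative search spaces for Adam at different trial budgets.
# All ranges are placeholders, not tuned recommendations.
small_budget = {          # < 10 trials
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
}
medium_budget = {         # 10-25 trials
    **small_budget,
    "beta_1": ("uniform", 0.8, 0.99),
}
large_budget = {          # 25+ trials
    **medium_budget,
    "epsilon": ("log_uniform", 1e-10, 1e-3),
}
very_large_budget = {     # substantially more than 25 trials
    **large_budget,
    "beta_2": ("uniform", 0.9, 0.999),
}
```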
Given how difficult it is to provide general rules about search spaces and how many points to sample from them, view the rules of thumb in this section as rough guidelines.