Learning rate
This appendix contains a few additional details about the learning rate.
Learning rate decay schedule
The best family of learning rate decay schedules is an open problem; it's not clear how to construct a set of rigorous experiments that would confidently answer the question. Although we don't know the best schedule family, we are confident of the following:
- It's important to have some (non-constant) schedule.
- Tuning that schedule is important.
Different learning rates work best at different points in the optimization process. Having some sort of schedule makes it more likely that the model hits a good learning rate.
Best default learning rate decay
We recommend either of the following learning rate decay families as a default:
- Linear decay
- Cosine decay
Many other schedule families are probably good, too.
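As a minimal sketch (not from the original text), the following plain-Python functions implement linear and cosine decay; the names `base_lr`, `total_steps`, and `min_lr` are illustrative parameter names, not prescribed by the guide.

```python
import math

def linear_decay(step, base_lr, total_steps, min_lr=0.0):
    """Interpolate linearly from base_lr down to min_lr over total_steps."""
    frac = min(step, total_steps) / total_steps
    return base_lr + frac * (min_lr - base_lr)

def cosine_decay(step, base_lr, total_steps, min_lr=0.0):
    """Follow a half-cosine curve from base_lr down to min_lr over total_steps."""
    frac = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * frac))
    return min_lr + cosine * (base_lr - min_lr)

# Example: a peak learning rate of 1e-3 decayed over 10,000 steps.
print(linear_decay(5_000, base_lr=1e-3, total_steps=10_000))  # 5e-4
print(cosine_decay(5_000, base_lr=1e-3, total_steps=10_000))  # 5e-4
```

Optimizer libraries typically ship equivalent ready-made schedules, so in practice you would usually call those rather than hand-rolling the functions above.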
Why do some papers have complicated learning rate schedules?
Many academic papers use complicated piece-wise learning rate (LR) decay schedules. Readers often wonder how the authors arrived at such a complicated schedule. Many complicated LR decay schedules are the result of tuning the schedule as a function of validation set performance in an ad hoc way. That is:
1. Start a single training run with some simple LR decay (or a constant learning rate).
2. Keep training until performance seems to stagnate. If that happens, pause training, then resume it from that point with a steeper LR decay schedule (or a smaller constant learning rate). Repeat this process (until the conference or launch deadline).
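The result of this procedure is typically a piece-wise constant schedule. The sketch below is purely illustrative; the breakpoints and learning rate values are made up, not taken from any paper.

```python
def piecewise_constant_lr(step):
    """A hypothetical piece-wise schedule of the kind produced by ad hoc tuning.

    Each breakpoint corresponds to a point where training appeared to
    stagnate and the learning rate was manually lowered.
    """
    boundaries_and_values = [(0, 1e-3), (40_000, 3e-4), (70_000, 1e-4), (90_000, 3e-5)]
    lr = boundaries_and_values[0][1]
    for boundary, value in boundaries_and_values:
        if step >= boundary:
            lr = value
    return lr
```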
Blithely copying the resulting schedule is generally not a good idea, since the best particular schedule is sensitive to a host of other hyperparameter choices. We recommend copying the algorithm that produced the schedule, although this is rarely possible when the schedule was produced by arbitrary human judgment. This type of validation-error-sensitive schedule is fine to use if it can be fully automated, but human-in-the-loop schedules that are a function of validation error are brittle and not easily reproducible, so we recommend avoiding them. Before publishing results that used such a schedule, please try to make them fully reproducible.
How should Adam's hyperparameters be tuned?
Not all the hyperparameters in Adam are equally important. The following rules of thumb correspond to different "budgets" for the number of trials in a study.
- If there are fewer than 10 trials in a study, only tune the (base) learning rate.
- If there are 10 to 25 trials in a study, tune the learning rate and `beta_1`.
- If there are more than 25 trials, tune the learning rate, `beta_1`, and `epsilon`.
- If there are substantially more than 25 trials, additionally tune `beta_2`.
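As an illustration only (not part of the original guidance), the sketch below maps these trial budgets onto a hypothetical search-space definition. The hyperparameter ranges and the 50-trial threshold used for "substantially more than 25" are assumptions you should adapt to your workload.

```python
def adam_search_space(num_trials):
    """Return which Adam hyperparameters to tune for a given trial budget.

    The ranges below are illustrative placeholders, not recommendations.
    """
    space = {"learning_rate": ("log_uniform", 1e-5, 1e-1)}  # always tuned
    if num_trials >= 10:
        space["beta_1"] = ("uniform", 0.8, 0.99)
    if num_trials > 25:
        space["epsilon"] = ("log_uniform", 1e-10, 1e-3)
    if num_trials >= 50:  # assumed cutoff for "substantially more than 25"
        space["beta_2"] = ("uniform", 0.9, 0.999)
    return space

print(adam_search_space(30))
```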
Given how difficult it is to provide general rules about search spaces and how many points to sample from them, treat the rules of thumb in this section as rough guidelines.