Are the parameters Momentum Decay and beta1 related by the formula Momentum Decay = 1 - beta1? That is, if beta1 = 0.9, should Momentum Decay be set to 0.1?
This is from Google AI:
In the context of PyTorch (TorchSharp) optimizers, particularly Adam, Momentum Decay and beta1 are closely related concepts.
Momentum Decay: This refers to the exponential decay rate applied to the momentum term. It determines how much the optimizer's updates are influenced by past gradients. A higher momentum decay rate means that past gradients have a stronger influence, which can help the optimizer accelerate in the relevant direction and smooth out oscillations.
beta1: This is the specific parameter within optimizers like Adam that controls the momentum decay rate. It is the exponential decay rate for the first moment estimates, which represent the running average of the gradients. In PyTorch's implementation of Adam, the default value for beta1 is typically 0.9. This value is usually set close to 1 to allow the optimizer to build momentum and speed up the learning process.
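To make the relationship concrete, here is a minimal sketch of the first-moment (momentum) update as Adam-style optimizers compute it; the variable names are illustrative and not taken from any particular library:

```python
# Sketch of Adam's first-moment (momentum) accumulator.
# beta1 is the fraction of the previous momentum kept each step;
# (1 - beta1) is the fraction of the current gradient mixed in.

def update_momentum(m, grad, beta1=0.9):
    """One step of the exponential moving average used for the first moment."""
    return beta1 * m + (1.0 - beta1) * grad

m = 0.0  # running average of gradients
for grad in [1.0, 1.0, 0.5, -0.2]:
    m = update_momentum(m, grad)
    print(f"grad={grad:+.2f}  m={m:+.4f}")
```

In this framing, beta1 is the retention rate and 1 - beta1 is the mixing ("decay") rate; they are two descriptions of the same moving average rather than two independent parameters.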
For some reason, many sources write that Momentum Decay and beta1 are the same parameter.
Another question about DL: why does learning slow down so much when the Weight Decay parameter is non-zero? I read that to avoid overfitting, a weight decay of about 0.001 is recommended.
I'm not really an expert in all of these parameters, so I don't have an answer for that. I'm exposing the parameters of the various engines, but you'll need to consult their corresponding docs to learn the intricacies of their parameters.
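For reference, here is a minimal sketch of how weight_decay typically enters a PyTorch-style step (the classic L2-coupled form, where the decay term is added to the gradient; the plain-SGD update below is a simplification for illustration, not TorchSharp's exact code path):

```python
import torch

lr = 1e-3
weight_decay = 1e-3

param = torch.tensor([2.0])   # a single weight, for illustration
grad = torch.tensor([0.01])   # gradient coming from the loss

# L2-coupled weight decay: the penalty is folded into the gradient,
# so every step also pulls the weight toward zero.
effective_grad = grad + weight_decay * param
param = param - lr * effective_grad

print(param)  # slightly smaller than 2.0 - lr * 0.01 because of the decay term
```

When weight_decay * param is comparable in magnitude to the loss gradient (here 0.001 * 2.0 = 0.002 versus 0.01), a noticeable share of every update goes into shrinking the weights rather than fitting the data, which is one plausible reason training appears slower with non-zero weight decay.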
This is what Gemini answered me:
In the context of optimizers like Adam and NAdam, Momentum Decay is directly related to the β1 parameter. The β1 parameter dictates the exponential decay rate for the first moment estimate, which is essentially a moving average of the gradients.
β1: This coefficient determines how much the current gradients influence the accumulated momentum compared to past gradients. A β1 value close to 1 (e.g., 0.9 or 0.99) signifies that older gradients have a very strong influence, and the momentum is preserved for longer.
Momentum Decay (or 1 − β1): Can be viewed as the "forgetting" rate for old gradients. If β1 = 0.9, then the "momentum decay" would be 1 − 0.9 = 0.1. This means that 10% of the current gradient is added to the momentum, while 90% of the previous momentum is retained.
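A quick numeric check of the arithmetic above, under the same interpretation (i.e., the "momentum decay" of 0.1 is implied by β1 = 0.9, not a second value to set on top of it):

```python
beta1 = 0.9
prev_momentum = 1.0   # accumulated momentum so far
current_grad = 0.5    # gradient at this step

retained = beta1 * prev_momentum          # 90% of the old momentum
mixed_in = (1.0 - beta1) * current_grad   # 10% of the current gradient
new_momentum = retained + mixed_in

print(f"{retained:.2f} {mixed_in:.2f} {new_momentum:.2f}")  # 0.90 0.05 0.95
```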