CosineAnnealingLR

final class fairseq2.optim.lr_scheduler.CosineAnnealingLR(optimizer, cycle_len, num_warmup_steps, *, cycle_mul=1.0, lr_mul=1.0, start_lr=0.0, final_lr=0.0, last_epoch=-1, verbose=False)

Bases: LRSchedulerBase

Represents the learning rate schedule described in Loshchilov and Hutter [LH17].

During warmup:

\[\eta_t = \eta_{base} \frac{t}{T_{warmup}}\]

After warmup:

\[\eta_t = \eta_{final}^i + \frac{1}{2} (\eta_{base}^i - \eta_{final}^i) \left(1 + \cos\left(\pi \frac{t_i}{T_i}\right)\right)\]

where \(i\) is the number of the current annealing cycle, \(t_i\) is the number of steps taken since the last restart, and \(T_i\) is the total number of steps within the \(i\)-th cycle (i.e. length of the cycle).

Cosine annealing starts each cycle with a large learning rate, decreases it relatively rapidly to a minimum value, and then increases it rapidly again at the start of the next cycle.

Refer to the paper for further details.

In addition to the original schedule, this implementation also supports a warmup phase where the learning rate is linearly increased for the first \(T_{warmup}\) training steps to the base learning rate.
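As a rough illustration, the two formulas above can be evaluated in plain Python for a single parameter group. This is a minimal sketch and not the library's implementation: the function name `cosine_annealing_lr` is hypothetical, all cycles are assumed to have the same length (`cycle_mul=1.0`, `lr_mul=1.0`), and \(t_i\) is assumed to be counted from the end of warmup. The warmup branch follows the formula as written, which ramps up from zero and so corresponds to the default `start_lr=0.0`.

```python
import math


def cosine_annealing_lr(step, base_lr, final_lr, cycle_len, num_warmup_steps):
    """Hypothetical helper: evaluates the documented formulas for one group."""
    if step < num_warmup_steps:
        # Warmup: linear ramp from 0 to base_lr over num_warmup_steps steps.
        return base_lr * step / num_warmup_steps

    # Assumption: cycles start right after warmup and all have equal length,
    # so the position within the current cycle is a simple remainder.
    t_i = (step - num_warmup_steps) % cycle_len

    # Cosine decay from base_lr (t_i = 0) toward final_lr (t_i -> cycle_len).
    cos_term = 1 + math.cos(math.pi * t_i / cycle_len)
    return final_lr + 0.5 * (base_lr - final_lr) * cos_term
```

Under these assumptions, the value halfway through a cycle is the midpoint between the base and final learning rates, and the first step of every cycle returns to the base learning rate.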

Note

This scheduler is not chainable.

Parameters:
  • optimizer (Optimizer) – The associated optimizer.

  • cycle_len (int) – The number of steps within the first cycle.

  • num_warmup_steps (int) – The number of warmup steps.

  • cycle_mul (float) – The factor to grow the length of each cycle.

  • lr_mul (float) – The factor to scale the base and final learning rate at the end of each cycle.

  • start_lr (float | Sequence[float]) – The initial warmup learning rate of all parameter groups, or of each parameter group respectively.

  • final_lr (float | Sequence[float]) – The final learning rate of all parameter groups, or of each parameter group respectively, at the end of the first cycle.

  • last_epoch (int) – The index of the last epoch.

  • verbose (bool) – If True, prints a message to stdout for each update.
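To make the roles of cycle_mul and lr_mul more concrete, the sketch below tabulates the length and learning rate bounds of the first few cycles. It rests on an assumption that is not confirmed by the text above: cycle \(i\) (counting from zero) is taken to have length cycle_len · cycle_mul^i, truncated to an integer, and its base and final learning rates are taken to be scaled by lr_mul^i. The helper name `cycle_plan` is hypothetical.

```python
def cycle_plan(cycle_len, cycle_mul, lr_mul, base_lr, final_lr, num_cycles):
    """Hypothetical helper: per-cycle (length, base_lr, final_lr) tuples."""
    plan = []
    for i in range(num_cycles):
        # Assumption: cycle lengths grow geometrically by cycle_mul.
        length = int(cycle_len * cycle_mul**i)
        # Assumption: both learning rate bounds are scaled by lr_mul per cycle.
        scale = lr_mul**i
        plan.append((length, base_lr * scale, final_lr * scale))
    return plan
```

For example, under this reading `cycle_plan(10, 2.0, 0.5, 1.0, 0.1, 3)` describes cycles of 10, 20, and 40 steps whose base learning rate halves each time.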

get_last_lr()

Returns the last learning rate computed by the current scheduler.

load_state_dict(state_dict)

Loads the scheduler's state.

Parameters:
  • state_dict (dict) – The scheduler state. Should be an object returned from a call to state_dict().

print_lr(is_verbose, group, lr, epoch=None)

Display the current learning rate.

state_dict()

Returns the state of the scheduler as a dict.

It contains an entry for every variable in self.__dict__ which is not the optimizer.
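The behavior described for state_dict() and load_state_dict() can be sketched with a toy class. `ToyScheduler` is a hypothetical stand-in written only to show the "everything in self.__dict__ except the optimizer" rule; it is not the fairseq2 class, and restoring the state by updating `__dict__` is an assumption of this sketch.

```python
class ToyScheduler:
    """Hypothetical stand-in that mimics the documented state handling."""

    def __init__(self, optimizer, cycle_len):
        self.optimizer = optimizer
        self.cycle_len = cycle_len
        self.last_epoch = 0

    def state_dict(self):
        # Every attribute is saved except the reference to the optimizer.
        return {k: v for k, v in self.__dict__.items() if k != "optimizer"}

    def load_state_dict(self, state_dict):
        # Assumption: the saved attributes simply overwrite the current ones.
        self.__dict__.update(state_dict)
```

Because the optimizer is excluded, a checkpoint would need to store the optimizer's own state separately and restore both when resuming training.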