Quasi-hyperbolic optimizers for TensorFlow

Getting started

The TensorFlow optimizer classes are qhoptim.tf.QHMOptimizer and qhoptim.tf.QHAdamOptimizer.

Use these optimizers as you would any other TensorFlow optimizer:

>>> from qhoptim.tf import QHMOptimizer, QHAdamOptimizer

# something like this for QHM
>>> optimizer = QHMOptimizer(
...     learning_rate=1.0, nu=0.7, momentum=0.999)

# or something like this for QHAdam
>>> optimizer = QHAdamOptimizer(
...     learning_rate=1e-3, nu1=0.7, nu2=1.0, beta1=0.995, beta2=0.999)
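
Once constructed, the optimizer is used like any tf.train.Optimizer in a TF 1.x graph-mode training loop. A minimal sketch (the toy model and loss below are illustrative placeholders, not part of qhoptim):

>>> import tensorflow as tf
>>> from qhoptim.tf import QHMOptimizer

# toy linear model and squared-error loss (illustrative only)
>>> x = tf.placeholder(tf.float32, shape=[None, 10])
>>> y = tf.placeholder(tf.float32, shape=[None, 1])
>>> w = tf.Variable(tf.zeros([10, 1]))
>>> loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# QHM plugs in like any other TensorFlow optimizer
>>> optimizer = QHMOptimizer(learning_rate=1.0, nu=0.7, momentum=0.999)
>>> train_op = optimizer.minimize(loss)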

QHM API reference

class qhoptim.tf.QHMOptimizer(learning_rate, momentum, nu, use_locking=False, name='QHM')[source]

Implements the quasi-hyperbolic momentum (QHM) optimization algorithm (Ma and Yarats, 2019).

Note that many other optimization algorithms are accessible via specific parameterizations of QHM. See from_accsgd(), from_robust_momentum(), etc. for details.

Parameters
  • learning_rate (float) – learning rate (\(\alpha\) from the paper)

  • momentum (float) – momentum factor (\(\beta\) from the paper)

  • nu (float) – immediate discount factor (\(\nu\) from the paper)

  • use_locking (bool) – whether or not to use locks for parameter updates

  • name (str) – name of the optimizer

Example

>>> optimizer = qhoptim.tf.QHMOptimizer(
...     learning_rate=1.0, nu=0.7, momentum=0.999)

Note

Mathematically, QHM is a simple interpolation between plain SGD and momentum:

\[\begin{split}\begin{align*} g_{t + 1} &\leftarrow \beta \cdot g_t + (1 - \beta) \cdot \nabla_t \\ \theta_{t + 1} &\leftarrow \theta_t - \alpha \left[ (1 - \nu) \cdot \nabla_t + \nu \cdot g_{t + 1} \right] \end{align*}\end{split}\]

Here, \(\alpha\) is the learning rate, \(\beta\) is the momentum factor, and \(\nu\) is the “immediate discount” factor which controls the interpolation between plain SGD and momentum. \(g_t\) is the momentum buffer, \(\theta_t\) is the parameter vector, and \(\nabla_t\) is the gradient with respect to \(\theta_t\).
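
A short doctest-style sketch of this update in plain Python (illustrative only, not the library's implementation); setting \(\nu = 0\) recovers plain SGD and \(\nu = 1\) recovers (dampened) momentum:

>>> def qhm_step(theta, g, grad, alpha=1.0, beta=0.999, nu=0.7):
...     # momentum buffer: dampened exponential moving average of gradients
...     g = beta * g + (1.0 - beta) * grad
...     # parameter step: interpolate between the raw gradient and the buffer
...     theta = theta - alpha * ((1.0 - nu) * grad + nu * g)
...     return theta, g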

Note

QHM uses dampened momentum. This means that when converting from plain momentum to QHM, the learning rate must be scaled by \(\frac{1}{1 - \beta}\). For example, momentum with learning rate \(\alpha = 0.1\) and momentum \(\beta = 0.9\) should be converted to QHM with learning rate \(\alpha = 1.0\).
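
As a quick sanity check of that scaling (plain arithmetic, not library code):

>>> alpha_plain, beta = 0.1, 0.9
>>> alpha_qhm = alpha_plain / (1.0 - beta)  # scale by 1 / (1 - beta)
>>> round(alpha_qhm, 6)
1.0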

classmethod from_accsgd(delta, kappa, xi, eps=0.7)[source]

Calculates the QHM hyperparameters required to recover the AccSGD optimizer (Kidambi et al., 2018).

Parameters
  • delta (float) – short step (see reference)

  • kappa (float) – long step parameter (see reference)

  • xi (float) – statistical advantage parameter (see reference)

  • eps (float, optional) – arbitrary value strictly between 0 and 1 (see reference) (default: 0.7)

Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.

Example

>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_accsgd(
...         delta=0.1, kappa=1000.0, xi=10.0))

classmethod from_pid(k_p, k_i, k_d)[source]

Calculates the QHM hyperparameters required to recover a PID optimizer as described in Recht (2018).

Parameters
  • k_p (float) – proportional gain (see reference)

  • k_i (float) – integral gain (see reference)

  • k_d (float) – derivative gain (see reference)

Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.

Example

>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_pid(
...         k_p=-0.1, k_i=1.0, k_d=3.0))

classmethod from_robust_momentum(l, kappa, rho=None)[source]

Calculates the QHM hyperparameters required to recover the Robust Momentum (Cyrus et al., 2018) or Triple Momentum (Van Scoy et al., 2018) optimizers.

Parameters
  • l (float) – Lipschitz constant of gradient (see reference)

  • kappa (float) – condition ratio (see reference)

  • rho (float, optional) – noise-free convergence rate. If None, will return the parameters for the Triple Momentum optimizer.

Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.

Example

>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_robust_momentum(
...         l=5.0, kappa=15.0))

classmethod from_synthesized_nesterov(alpha, beta1, beta2)[source]

Calculates the QHM hyperparameters required to recover the synthesized Nesterov optimizer (Section 6 of Lessard et al. (2016)).

Parameters
  • alpha (float) – learning rate

  • beta1 (float) – first momentum (see reference)

  • beta2 (float) – second momentum (see reference)

Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.

Example

>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_synthesized_nesterov(
...         alpha=0.1, beta1=0.9, beta2=0.6))

classmethod from_two_state_optimizer(h, k, l, m, q, z)[source]

Calculates the QHM hyperparameters required to recover the following optimizer (named “TSO” in Ma and Yarats (2019)):

\[\begin{split}\begin{align*} a_{t + 1} &\leftarrow h \cdot a_t + k \cdot \theta_t + l \cdot \nabla_t \\ \theta_{t + 1} &\leftarrow m \cdot a_t + q \cdot \theta_t + z \cdot \nabla_t \end{align*}\end{split}\]

Here, \(a_t\) and \(\theta_t\) are the two states and \(\nabla_t\) is the gradient with respect to \(\theta_t\).

Be careful that your coefficients satisfy the regularity conditions from the reference.

Parameters
  • h (float) – see description

  • k (float) – see description

  • l (float) – see description

  • m (float) – see description

  • q (float) – see description

  • z (float) – see description

Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.

Example

>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_two_state_optimizer(
...         h=0.9, k=0.0, l=0.1, m=-0.09, q=1.0, z=-0.01))

QHAdam API reference

class qhoptim.tf.QHAdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, nu1=1.0, nu2=1.0, epsilon=1e-08, use_locking=False, name='QHAdam')[source]

Implements the QHAdam optimization algorithm (Ma and Yarats, 2019).

Note that the NAdam optimizer is accessible via a specific parameterization of QHAdam. See from_nadam() for details.

Parameters
  • learning_rate (float, optional) – learning rate (\(\alpha\) from the paper) (default: 1e-3)

  • beta1 (float, optional) – coefficient used for computing running average of gradient (default: 0.9)

  • beta2 (float, optional) – coefficient used for computing running average of squared gradient (default: 0.999)

  • nu1 (float, optional) – immediate discount factor used to estimate the gradient (default: 1.0)

  • nu2 (float, optional) – immediate discount factor used to estimate the squared gradient (default: 1.0)

  • epsilon (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)

  • use_locking (bool) – whether or not to use locks for parameter updates

  • name (str) – name of the optimizer

Example

>>> optimizer = qhoptim.tf.QHAdamOptimizer(
...     learning_rate=3e-4, nu1=0.8, nu2=1.0,
...     beta1=0.99, beta2=0.999)

classmethod from_nadam(learning_rate=0.001, beta1=0.9, beta2=0.999)[source]

Calculates the QHAdam hyperparameters required to recover the NAdam optimizer (Dozat, 2016).

This is not an identical recovery of the formulation in the paper, due to subtle differences in the application of the bias correction in the first moment estimator. However, in practice, this difference is almost certainly irrelevant.

Parameters
  • learning_rate (float, optional) – learning rate (\(\alpha\) from the paper) (default: 1e-3)

  • beta1 (float, optional) – coefficient used for computing running average of gradient (default: 0.9)

  • beta2 (float, optional) – coefficient used for computing running average of squared gradient (default: 0.999)

Returns

Five-element dict containing learning_rate, beta1, beta2, nu1, and nu2 to use in QHAdam.

Example

>>> optimizer = qhoptim.tf.QHAdamOptimizer(
...     **qhoptim.tf.QHAdamOptimizer.from_nadam(
...         learning_rate=1e-3, beta1=0.9, beta2=0.999))