Quasi-hyperbolic optimizers for TensorFlow
Getting started
The TensorFlow optimizer classes are qhoptim.tf.QHMOptimizer and qhoptim.tf.QHAdamOptimizer.
Use these optimizers as you would any other TensorFlow optimizer:
>>> from qhoptim.tf import QHMOptimizer, QHAdamOptimizer

>>> # something like this for QHM
>>> optimizer = QHMOptimizer(
...     learning_rate=1.0, nu=0.7, momentum=0.999)

>>> # or something like this for QHAdam
>>> optimizer = QHAdamOptimizer(
...     learning_rate=1e-3, nu1=0.7, nu2=1.0, beta1=0.995, beta2=0.999)
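Both classes follow the standard TensorFlow optimizer interface. As a rough sketch only (assuming TensorFlow 1.x graph mode; the variable w and the toy loss below are hypothetical), a training loop might look like:

>>> import tensorflow as tf

>>> # hypothetical toy objective: minimize w**2 for a single trainable scalar
>>> w = tf.Variable(5.0)
>>> loss = tf.square(w)

>>> # minimize() is the standard optimizer entry point
>>> train_op = optimizer.minimize(loss)

>>> with tf.Session() as sess:
...     sess.run(tf.global_variables_initializer())
...     for _ in range(100):
...         sess.run(train_op)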
QHM API reference
class qhoptim.tf.QHMOptimizer(learning_rate, momentum, nu, use_locking=False, name='QHM')

Implements the quasi-hyperbolic momentum (QHM) optimization algorithm (Ma and Yarats, 2019).

Note that many other optimization algorithms are accessible via specific parameterizations of QHM. See from_accsgd(), from_robust_momentum(), etc. for details.

Parameters
learning_rate (float) – learning rate (\(\alpha\) from the paper)
momentum (float) – momentum factor (\(\beta\) from the paper)
nu (float) – immediate discount factor (\(\nu\) from the paper)
use_locking (bool) – whether or not to use locking parameter updates
name (str) – name of the optimizer
Example
>>> optimizer = qhoptim.tf.QHMOptimizer(
...     learning_rate=1.0, nu=0.7, momentum=0.999)
Note
Mathematically, QHM is a simple interpolation between plain SGD and momentum:
\[\begin{split}\begin{align*}
g_{t + 1} &\leftarrow \beta \cdot g_t + (1 - \beta) \cdot \nabla_t \\
\theta_{t + 1} &\leftarrow \theta_t - \alpha \left[ (1 - \nu) \cdot \nabla_t + \nu \cdot g_{t + 1} \right]
\end{align*}\end{split}\]

Here, \(\alpha\) is the learning rate, \(\beta\) is the momentum factor, and \(\nu\) is the “immediate discount” factor which controls the interpolation between plain SGD and momentum. \(g_t\) is the momentum buffer, \(\theta_t\) is the parameter vector, and \(\nabla_t\) is the gradient with respect to \(\theta_t\).
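For illustration only, a single step of the update rule above can be sketched on scalars as follows (qhm_step is a hypothetical helper, not the library's implementation):

>>> # hypothetical helper illustrating the QHM update rule above
>>> def qhm_step(theta, g, grad, alpha=1.0, beta=0.999, nu=0.7):
...     g = beta * g + (1 - beta) * grad                    # momentum buffer update
...     theta = theta - alpha * ((1 - nu) * grad + nu * g)  # interpolated step
...     return theta, g

>>> theta, g = qhm_step(theta=1.0, g=0.0, grad=2.0)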
Note
QHM uses dampened momentum. This means that when converting from plain momentum to QHM, the learning rate must be scaled by \(\frac{1}{1 - \beta}\). For example, momentum with learning rate \(\alpha = 0.1\) and momentum \(\beta = 0.9\) should be converted to QHM with learning rate \(\alpha = 1.0\).
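As a concrete sketch of this conversion (the scaling follows directly from the note above; with \(\nu = 1\), the update rule reduces to dampened momentum):

>>> # plain (non-dampened) momentum with learning rate 0.1 and momentum 0.9 ...
>>> alpha_plain, beta = 0.1, 0.9

>>> # ... corresponds to QHM with the learning rate scaled by 1 / (1 - beta)
>>> optimizer = qhoptim.tf.QHMOptimizer(
...     learning_rate=alpha_plain / (1 - beta),  # = 1.0
...     momentum=beta,
...     nu=1.0)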
classmethod from_accsgd(delta, kappa, xi, eps=0.7)

Calculates the QHM hyperparameters required to recover the AccSGD optimizer (Kidambi et al., 2018).

Parameters
Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.
Example
>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_accsgd(
...         delta=0.1, kappa=1000.0, xi=10.0))
classmethod from_pid(k_p, k_i, k_d)

Calculates the QHM hyperparameters required to recover a PID optimizer as described in Recht (2018).

Parameters
k_p (float) – proportional gain
k_i (float) – integral gain
k_d (float) – derivative gain
Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.
Example
>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_pid(
...         k_p=-0.1, k_i=1.0, k_d=3.0))
classmethod from_robust_momentum(l, kappa, rho=None)

Calculates the QHM hyperparameters required to recover the Robust Momentum (Cyrus et al., 2018) or Triple Momentum (Scoy et al., 2018) optimizers.

Parameters
Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.
Example
>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_robust_momentum(
...         l=5.0, kappa=15.0))
classmethod from_synthesized_nesterov(alpha, beta1, beta2)

Calculates the QHM hyperparameters required to recover the synthesized Nesterov optimizer (Section 6 of Lessard et al. (2016)).

Parameters
Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.
Example
>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_synthesized_nesterov(
...         alpha=0.1, beta1=0.9, beta2=0.6))
classmethod from_two_state_optimizer(h, k, l, m, q, z)

Calculates the QHM hyperparameters required to recover the following optimizer (named “TSO” in Ma and Yarats (2019)):

\[\begin{split}\begin{align*}
a_{t + 1} &\leftarrow h \cdot a_t + k \cdot \theta_t + l \cdot \nabla_t \\
\theta_{t + 1} &\leftarrow m \cdot a_t + q \cdot \theta_t + z \cdot \nabla_t
\end{align*}\end{split}\]

Here, \(a_t\) and \(\theta_t\) are the two states and \(\nabla_t\) is the gradient with respect to \(\theta_t\). (A short numerical illustration of this recurrence follows the example below.)

Be careful that your coefficients satisfy the regularity conditions from the reference.

Parameters
Returns

Three-element dict containing learning_rate, momentum, and nu to use in QHM.
Example
>>> optimizer = qhoptim.tf.QHMOptimizer(
...     **qhoptim.tf.QHMOptimizer.from_two_state_optimizer(
...         h=0.9, k=0.0, l=0.1, m=-0.09, q=1.0, z=-0.01))
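As a purely illustrative sketch of the TSO recurrence itself (scalars only, reusing the coefficients from the example above; this is not how the conversion to QHM is computed):

>>> # one TSO step on a toy quadratic f(theta) = theta**2
>>> h, k, l, m, q, z = 0.9, 0.0, 0.1, -0.09, 1.0, -0.01
>>> a, theta = 0.0, 1.0
>>> grad = 2.0 * theta  # gradient of f at theta
>>> a, theta = (h * a + k * theta + l * grad,
...             m * a + q * theta + z * grad)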
QHAdam API reference
class qhoptim.tf.QHAdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, nu1=1.0, nu2=1.0, epsilon=1e-08, use_locking=False, name='QHAdam')

Implements the QHAdam optimization algorithm (Ma and Yarats, 2019).

Note that the NAdam optimizer is accessible via a specific parameterization of QHAdam. See from_nadam() for details.

Parameters
learning_rate (float, optional) – learning rate (\(\alpha\) from the paper) (default: 1e-3)
beta1 (float, optional) – coefficient used for computing running average of gradient (default: 0.9)
beta2 (float, optional) – coefficient used for computing running average of squared gradient (default: 0.999)
nu1 (float, optional) – immediate discount factor used to estimate the gradient (default: 1.0)
nu2 (float, optional) – immediate discount factor used to estimate the squared gradient (default: 1.0)
epsilon (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
use_locking (bool) – whether or not to use locking parameter updates
name (str) – name of the optimizer
Example
>>> optimizer = qhoptim.tf.QHAdamOptimizer(
...     learning_rate=3e-4, nu1=0.8, nu2=1.0,
...     beta1=0.99, beta2=0.999)
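Since setting \(\nu_1 = \nu_2 = 1\) recovers the Adam update (Ma and Yarats, 2019), the defaults already give an Adam-equivalent configuration; as a sketch:

>>> # with nu1 = nu2 = 1.0 (the defaults), QHAdam's update coincides with Adam's
>>> adam_equivalent = qhoptim.tf.QHAdamOptimizer(
...     learning_rate=1e-3, beta1=0.9, beta2=0.999,
...     nu1=1.0, nu2=1.0)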
classmethod from_nadam(learning_rate=0.001, beta1=0.9, beta2=0.999)

Calculates the QHAdam hyperparameters required to recover the NAdam optimizer (Dozat, 2016).
This is not an identical recovery of the formulation in the paper, due to subtle differences in the application of the bias correction in the first moment estimator. However, in practice, this difference is almost certainly irrelevant.
Parameters
learning_rate (float, optional) – learning rate (\(\alpha\) from the paper) (default: 1e-3)
beta1 (float, optional) – coefficient used for computing running average of gradient (default: 0.9)
beta2 (float, optional) – coefficient used for computing running average of squared gradient (default: 0.999)
Returns

Five-element dict containing learning_rate, beta1, beta2, nu1, and nu2 to use in QHAdam.
Example
>>> optimizer = qhoptim.tf.QHAdamOptimizer(
...     **qhoptim.tf.QHAdamOptimizer.from_nadam(
...         learning_rate=1e-3, beta1=0.9, beta2=0.999))