13 Gradient Descent Functions and Hyperparameters

The following types are used in this section.
  • accompanied? : (listof tensor?)

  • objective-fn? : (-> theta? tensor?), as defined in Loss Functions.

  • id? : The identity function

  • inflator? : (-> tensor? accompanied?)

  • deflator? : (-> accompanied? tensor?)

  • updator? : (-> accompanied? tensor? accompanied?)

  • id-updator? : (-> tensor? tensor? tensor?)

Hyperparameters can be given values using with-hypers as in Hyperparameters.

procedure

(revise f revs theta)  theta?

  f : (-> theta? theta?)
  revs : natural?
  theta : theta?
Returns the result of (f (f (f ... (f theta)))), where f is applied revs times.
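
For example, a minimal sketch (assuming scalar tensors and Malt's ref on lists) in which each revision decrements a one-element θ:

(revise (λ (θ) (list (- (ref θ 0) 1.0))) 5 (list 10.0))

which produces (list 5.0).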

procedure

(gradient-descent inflate deflate update)  (-> objective-fn? theta? theta?)

  inflate : id?
  deflate : id?
  update : id-updator?

(gradient-descent inflate deflate update)  (-> objective-fn? theta? theta?)

  inflate : inflator?
  deflate : deflator?
  update : updator?
Generates a gradient descent function from three control functions:
  • inflate injects a parameter tensor into an accompanied parameter.

  • deflate projects a parameter tensor out of an accompanied parameter.

  • update produces a new accompanied? from a given accompanied parameter and a gradient tensor.

The generated gradient descent function accepts an objective function and a θ, and returns the revised θ that results after revs revisions of gradient descent.
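
As a sketch, a descent function equivalent to naked-gradient-descent below can be generated by passing the identity function for inflate and deflate; the names id-fn and plain-u here are illustrative:

(define id-fn
  (λ (p) p))

(define plain-u
  (λ (pa g)
    (- pa (* alpha g))))

(define plain-gradient-descent
  (gradient-descent id-fn id-fn plain-u))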

value

revs : scalar?

Hyperparameter that defines the number of revisions a gradient descent function generated by gradient-descent will perform.

value

alpha : scalar?

Hyperparameter that defines the learning rate for the different types of gradient descent functions.

procedure

(naked-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function where inflate and deflate are the identity function, and update is
(λ (pa g)
  (- pa (* alpha g)))
where alpha is the learning rate hyperparameter.
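
A sketch of a typical invocation, using with-hypers as described in Hyperparameters; here l2-loss, line, line-xs, and line-ys are assumed from Loss Functions and elsewhere in Malt, and the hyperparameter values are illustrative only:

(with-hypers ((revs 1000)
              (alpha 0.01))
  (naked-gradient-descent ((l2-loss line) line-xs line-ys)
                          (list 0.0 0.0)))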

procedure

(velocity-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function generated with the following functions as inflate, deflate, and update:
(define velocity-i
  (λ (p)
    (list p (zeroes p))))
 
(define velocity-d
  (λ (pa)
    (ref pa 0)))
 
(define velocity-u
  (λ (pa g)
    (let ((v (- (* mu (ref pa 1)) (* alpha g))))
      (list (+ (ref pa 0) v) v))))

Here mu is the hyperparameter defining the fraction of the velocity from the past revision that is transferred to the current revision.
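
Taken together, velocity-gradient-descent behaves as if it were generated by (a sketch, assuming the three definitions above):

(gradient-descent velocity-i velocity-d velocity-u)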

value

mu : scalar?

Hyperparameter that defines the fraction of the accumulated velocity carried over from one revision to the next in velocity-gradient-descent and adam-gradient-descent.

procedure

(rms-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function generated with the following functions as inflate, deflate, and update:
(define rms-i
  (λ (p)
    (list p (zeroes p))))
 
(define rms-d
  (λ (pa)
    (ref pa 0)))
 
(define rms-u
  (λ (pa g)
    (let ((r (smooth beta (ref pa 1) (sqr g))))
      (let ((alpha-hat (/ alpha (+ (sqrt r) epsilon))))
        (list (- (ref pa 0) (* alpha-hat g)) r)))))

Here beta is the hyperparameter defining the decay rate for smoothing the square of the gradients.
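
A sketch of an invocation with beta set alongside the other hyperparameters; obj and θ are assumed to be bound, and the values are illustrative only:

(with-hypers ((revs 3000)
              (alpha 0.01)
              (beta 0.9))
  (rms-gradient-descent obj θ))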

value

beta : scalar?

Hyperparameter that defines the decay rate used to smooth the square of the gradients in rms-gradient-descent and adam-gradient-descent.

procedure

(adam-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function generated with the following functions as inflate, deflate, and update:
(define adam-i
  (λ (p)
    (let ((zeroed (zeroes p)))
      (list p zeroed zeroed))))
 
(define adam-d
  (λ (pa)
    (ref pa 0)))
 
(define adam-u
  (λ (pa g)
    (let ((r (smooth beta (ref pa 2) (sqr g))))
      (let ((alpha-hat (/ alpha (+ (sqrt r) epsilon)))
            (v (smooth mu (ref pa 1) g)))
        (list (- (ref pa 0) (* alpha-hat v)) v r)))))

Here beta and mu are the decay rates for smoothing the square of the gradients and the gradient, respectively.
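
A sketch of an invocation with both decay rates set; obj and θ are assumed to be bound, and the values (including the conventional Adam-style 0.9 and 0.999) are illustrative only:

(with-hypers ((revs 3000)
              (alpha 0.001)
              (mu 0.9)
              (beta 0.999))
  (adam-gradient-descent obj θ))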

procedure

(smooth decay-rate average g)  real?

  decay-rate : real?
  average : real?
  g : real?
Returns a blend of average and g using decay-rate, as follows:
(+ (* decay-rate average)
   (* (- 1.0 decay-rate) g))
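
A worked example with decay rate 0.9:

(smooth 0.9 0.8 0.5)

which evaluates to (+ (* 0.9 0.8) (* 0.1 0.5)), i.e. 0.77.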

value

epsilon : real?

A numerical stabilizer set to the value 1e-8.