13 Gradient Descent Functions and Hyperparameters

The following types are used in this section.
  • accompanied? : (listof tensor?)

  • objective-fn? : (-> theta? tensor?), as defined in Loss Functions.

  • id? : The identity function

  • inflator? : (-> tensor? accompanied?)

  • deflator? : (-> accompanied? tensor?)

  • updator? : (-> accompanied? tensor? accompanied?)

  • id-updator? : (-> tensor? tensor? tensor?)

Hyperparameters can be given values using with-hypers as in Hyperparameters.

procedure

(revise f revs theta)  theta?

  f : (-> theta? theta?)
  revs : natural?
  theta : theta?
Returns the result of (f (f (f ... (f theta)))), where f is applied revs times.
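
For example, a minimal sketch (assuming scalar tensors and Malt's ref on lists) in which each revision decrements a one-element θ:

(revise (λ (θ) (list (- (ref θ 0) 1.0))) 5 (list 10.0))

which produces (list 5.0).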

procedure

(gradient-descent inflate deflate update)  (-> objective-fn? theta? theta?)

  inflate : id?
  deflate : id?
  update : id-updator?

(gradient-descent inflate deflate update)  (-> objective-fn? theta? theta?)

  inflate : inflator?
  deflate : deflator?
  update : updator?
Generates a gradient descent function from three control functions:
  • inflate injects a parameter tensor into an accompanied parameter.

  • deflate projects a parameter tensor out of an accompanied parameter.

  • update produces a new accompanied? from a given accompanied parameter and a gradient tensor.

The generated gradient descent function accepts an objective function and a θ, and returns the revised θ that results after revs revisions of gradient descent.
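
As a sketch, a descent function equivalent to naked-gradient-descent below can be generated by passing the identity function for inflate and deflate; the names id-fn and plain-u here are illustrative:

(define id-fn
  (λ (p) p))

(define plain-u
  (λ (pa g)
    (- pa (* alpha g))))

(define plain-gradient-descent
  (gradient-descent id-fn id-fn plain-u))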

value

revs : scalar?

Hyperparameter that defines the number of revisions a gradient descent function generated by gradient-descent will perform.

value

alpha : scalar?

Hyperparameter that defines the learning rate for the different types of gradient descent functions.

procedure

(naked-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function where inflate and deflate are the identity function, and update is
(λ (pa g)
  (- pa (* alpha g)))
where alpha is the learning rate hyperparameter.
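
A sketch of a typical invocation, using with-hypers as described in Hyperparameters; here l2-loss, line, line-xs, and line-ys are assumed from Loss Functions and elsewhere in Malt, and the hyperparameter values are illustrative only:

(with-hypers ((revs 1000)
              (alpha 0.01))
  (naked-gradient-descent ((l2-loss line) line-xs line-ys)
                          (list 0.0 0.0)))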

procedure

(velocity-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function generated with the following functions as inflate, deflate, and update:
(define velocity-i
  (λ (p)
    (list p (zeroes p))))
 
(define velocity-d
  (λ (pa)
    (ref pa 0)))
 
(define velocity-u
  (λ (pa g)
    (let ((v (- (* mu (ref pa 1)) (* alpha g))))
      (list (+ (ref pa 0) v) v))))

Here mu is the hyperparameter defining the fraction of the velocity from the past revision that is transferred to the current revision.
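
Taken together, velocity-gradient-descent behaves as if it were generated by (a sketch, assuming the three definitions above):

(gradient-descent velocity-i velocity-d velocity-u)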

value

mu : scalar?

Hyperparameter that defines the fraction of the accumulated velocity carried over from one revision to the next in velocity-gradient-descent and adam-gradient-descent.

procedure

(rms-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function generated with the following functions as inflate, deflate, and update:
(define rms-i
  (λ (p)
    (list p (zeroes p))))
 
(define rms-d
  (λ (pa)
    (ref pa 0)))
 
(define rms-u
  (λ (pa g)
    (let ((r (smooth beta (ref pa 1) (sqr g))))
      (let ((alpha-hat (/ alpha (+ (sqrt r) epsilon))))
        (list (- (ref pa 0) (* alpha-hat g)) r)))))

Here beta is the hyperparameter defining the decay rate for smoothing the square of the gradients.
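
A sketch of an invocation with beta set alongside the other hyperparameters; obj and θ are assumed to be bound, and the values are illustrative only:

(with-hypers ((revs 3000)
              (alpha 0.01)
              (beta 0.9))
  (rms-gradient-descent obj θ))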

value

beta : scalar?

Hyperparameter that defines the decay rate used to smooth the square of the gradients in rms-gradient-descent and adam-gradient-descent.

procedure

(adam-gradient-descent obj? θ)  (listof tensor?)

  obj? : (-> (listof tensor?) scalar?)
  θ : (listof tensor?)
Gradient descent function generated with the following functions as inflate, deflate, and update:
(define adam-i
  (λ (p)
    (let ((zeroed (zeroes p)))
      (list p zeroed zeroed))))
 
(define adam-d
  (λ (pa)
    (ref pa 0)))
 
(define adam-u
  (λ (pa g)
    (let ((r (smooth beta (ref pa 2) (sqr g))))
      (let ((alpha-hat (/ alpha (+ (sqrt r) epsilon)))
            (v (smooth mu (ref pa 1) g)))
        (list (- (ref pa 0) (* alpha-hat v)) v r)))))

Here beta and mu are the decay rates for smoothing the square of the gradients and the gradient, respectively.
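
A sketch of an invocation with both decay rates set; obj and θ are assumed to be bound, and the values (including the conventional Adam-style 0.9 and 0.999) are illustrative only:

(with-hypers ((revs 3000)
              (alpha 0.001)
              (mu 0.9)
              (beta 0.999))
  (adam-gradient-descent obj θ))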

procedure

(smooth decay-rate average g)  real?

  decay-rate : real?
  average : real?
  g : real?
Returns a blend of average and g using decay-rate, as follows:
(+ (* decay-rate average)
   (* (- 1.0 decay-rate) g))
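
A worked example with decay rate 0.9:

(smooth 0.9 0.8 0.5)

which evaluates to (+ (* 0.9 0.8) (* 0.1 0.5)), i.e. 0.77.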

value

epsilon : real?

A numerical stabilizer set to the value 1e-8.