Information about Steepest Descent

Gradient descent is an optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or the approximate gradient) of the function at the current point. If instead one takes steps proportional to the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

Gradient descent is also known as steepest descent, or the method of steepest descent. When known as the latter, gradient descent should not be confused with the method of steepest descent for approximating integrals.

Description

Gradient descent is based on the observation that if the real-valued function is defined and differentiable in a neighborhood of a point , then decreases fastest if one goes from in the direction of the negative gradient of at , . It follows that, if



for a small enough number, then . With this observation in mind, one starts with a guess for a local minimum of , and considers the sequence such that



We have



so hopefully the sequence converges to the desired local minimum. Note that the value of the step size is allowed to change at every iteration.

This process is illustrated in the following picture.
Here is assumed to be defined on the plane, and that its graph has a bowl shape. The blue curves are the contour lines, that is, the regions on which the value of is constant. A red arrow originating at a point shows the direction of the negative gradient at that point. Note that the (negative) gradient at a point is perpendicular to the contour line going through that point. We see that gradient descent leads us to the bottom of the bowl, that is, to the point where the value of the function is minimal.

Some examples

Gradient descent has problems with pathological functions such as the Rosenbrock function shown here:


The gradient descent method applied to :
Enlarge picture
The gradient descent algorithm in action. (1: contour)
Enlarge picture
The gradient descent algorithm in action. (2: surface)

Comments

Note that gradient descent works in spaces of any number of dimensions, even in infinite-dimensional ones. In the latter case the search space is typically a function space, and one calculates the Gâteaux derivative of the functional to be minimized to determine the descent direction.

Two weaknesses of gradient descent are:
  1. The algorithm can take many iterations to converge towards a local minimum, if the curvature in different directions is very different.
  2. Finding the optimal per step can be time-consuming. Conversely, using a fixed can yield poor results. Methods based on Newton's method and inversion of the Hessian using conjugate gradient techniques are often a better alternative.


A more powerful algorithm is given by the BFGS method which consists in calculating on every step a matrix by which the gradient vector is multiplied to go into a "better" direction, combined with a more sophisticated line search algorithm, to find the "best" value of

See also

References

  • Mordecai Avriel (2003). Nonlinear Programming: Analysis and Methods. Dover Publishing. ISBN 0-486-43227-0.
  • Jan A. Snyman (2005). Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms. Springer Publishing. ISBN 0-387-24348-8
In mathematics, the term optimization, or mathematical programming, refers to the study of problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set.
..... Click the link for more information.
In mathematics, computing, linguistics, and related disciplines, an algorithm is a finite list of well-defined instructions for accomplishing some task that, given an initial state, will proceed through a well-defined series of successive states, eventually terminating in an
..... Click the link for more information.
maxima and minima, known collectively as extrema, are the largest value (maximum) or smallest value (minimum), that a function takes in a point either within a given neighbourhood (local extremum) or on the function domain in its entirety (global
..... Click the link for more information.
gradient of a scalar field is a vector field which points in the direction of the greatest rate of increase of the scalar field, and whose magnitude is the greatest rate of change.
..... Click the link for more information.
maxima and minima, known collectively as extrema, are the largest value (maximum) or smallest value (minimum), that a function takes in a point either within a given neighbourhood (local extremum) or on the function domain in its entirety (global
..... Click the link for more information.
steepest descent method or saddle-point approximation is a method used to approximate integrals of the form
where f(x) is some twice-differentiable function, M is a large number, and the integral endpoints a and
..... Click the link for more information.
bowl, a common open-top container in many cultures, is used to serve food, and is sometimes also used for drinking and storing other items. They are generally small and shallow, although some, such as punch bowls and salad bowls, are larger and are sometimes intended to serve many
..... Click the link for more information.
contour map (topographic map) uses contour lines (often just called a "contour") to join points of equal elevation (height) and thus show valleys and hills, and the steepness of slopes.
..... Click the link for more information.
Rosenbrock function is a non-convex function used as a test problem for optimization algorithms. It is also known as Rosenbrock's valley or Rosenbrock's banana function.

This function is often used to test performance of optimisation algorithms.
..... Click the link for more information.
In mathematics, a function space is a set of functions of a given kind from a set X to a set Y. It is called a space because in many applications, it is a topological space or a vector space or both.
..... Click the link for more information.
In mathematics, the Gâteaux derivative is a generalisation of the concept of directional derivative in differential calculus. Named after René Gâteaux, a French mathematician who died young in World War I, it is defined for locally convex topological vector spaces, contrasting with
..... Click the link for more information.
In mathematics, Newton's method is a well-known algorithm for finding roots of equations in one or more dimensions. It can also be used to find local maxima and local minima of functions by noticing that if a real number is a stationary point of a function , then is a root of the
..... Click the link for more information.
In mathematics, the Hessian matrix is the square matrix of second-order partial derivatives of a function. Given the real-valued function



if all second partial derivatives of f exist, then the Hessian matrix of f is the matrix
..... Click the link for more information.
conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive definite.
..... Click the link for more information.
In mathematics, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is a method to solve an unconstrained nonlinear optimization problem.

The BFGS method is derived from the Newton's method in optimization, a class of hill-climbing optimization techniques that
..... Click the link for more information.
In (unconstrained) optimization, the linesearch strategy is one of two basic iterative approaches to finding a local minimum of an objective function . The other method is that of trust regions.
..... Click the link for more information.
conjugate gradient method is an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive definite.
..... Click the link for more information.
Stochastic gradient descent is a general optimization algorithm, but is typically used to fit the parameters of a machine learning model.

In standard (or "batch") gradient descent, the true gradient is used to update the parameters of the model.
..... Click the link for more information.
In mathematics, Newton's method is a well-known algorithm for finding roots of equations in one or more dimensions. It can also be used to find local maxima and local minima of functions by noticing that if a real number is a stationary point of a function , then is a root of the
..... Click the link for more information.
In mathematics, the term optimization, or mathematical programming, refers to the study of problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set.
..... Click the link for more information.
In (unconstrained) optimization, the linesearch strategy is one of two basic iterative approaches to finding a local minimum of an objective function . The other method is that of trust regions.
..... Click the link for more information.
The delta rule is a gradient descent learning rule for updating the weights of the artificial neurons in a single-layer perceptron. For a neuron with activation function the delta rule for 's th weight is given by

,

..... Click the link for more information.


This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.
Herod_Archelaus


page counter