$ \newcommand{\R}{\mathbb{R}} \newcommand{\costR}{\mathcal{R}} $
This post is a part of The NAG Optimization Corner series.
Derivatives play an important role throughout nonlinear optimization, as the majority of algorithms require derivative information in one form or another. This post describes several ways to compute derivatives and focuses on the well-known finite difference approximation in detail.
Throughout the text we assume that the objective function $f(x)$ is sufficiently smooth and is minimized without constraints.
Why are derivatives so important?
Recall how a typical nonlinear optimization solver works: it gradually improves its estimate of the solution step by step. At each iteration the solver needs to decide where to place the new estimate; however, it has only very local information about the landscape of the function – the current function value and the derivatives. The derivatives express the slope of the function at a point, so it is natural to use them, for example, to define a descent search direction along which the solver looks for a new (better) point. In addition, derivatives close to zero indicate that the function is locally flat and thus signal (together with other pieces of information) that a local solution has been reached. It is easy to imagine that mistakes in the derivatives might mislead the solver, which then fails. It is therefore crucial to have a reliable way to provide derivatives whenever possible.
Approaches to compute derivatives
Evaluating derivatives written by hand is a possibility only for problems where the function can be expressed as an analytical formula. For anything other than small problems it is a tedious and error-prone exercise, and for more complex functions, such as simulations, it is impossible. Mistakes in hand-written derivatives are very common, so a derivative checker is highly recommended!
Finite differences (FD) (also called bumping) approximate the slope of the function from the change in function values at two points close to each other. It is probably the most common approach when derivatives are not written by hand and is discussed in detail below.
Alternatively, the computer can compute the derivatives via Algorithmic Differentiation (AD). No matter how complicated the computation of the function is, very often it is "only" a (very) long series of basic operations and known mathematical functions. If the computer has access to the source code of the function, it can apply the standard differentiation rules (e.g., the chain rule) and evaluate the derivatives. However, AD cannot be applied to functions where a part or the whole of the function is a black box. Naive applications of AD might become too slow or too memory demanding, so access to AD expertise is always useful.
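To illustrate the idea behind AD (this is an illustrative Python sketch, not the MATLAB setup used for the experiments below), here is a minimal forward-mode AD implementation using dual numbers: each value carries its derivative along, and the product and chain rules are applied operation by operation. The `Dual` class and the sample function are purely hypothetical examples.

```python
import math

class Dual:
    """Minimal forward-mode AD value: carries the value and its derivative together."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)   # sum rule
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule
    __rmul__ = __mul__

    def sin(self):
        return Dual(math.sin(self.val), math.cos(self.val) * self.dot)  # chain rule

# Derivative of x*sin(x) + 2x at x = 1.5, propagated alongside the function value
x = Dual(1.5, 1.0)                 # seed dx/dx = 1
y = x * x.sin() + 2.0 * x
print(y.val, y.dot)                # value and exact derivative sin(x) + x*cos(x) + 2
```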
There are algorithms which neither require derivatives nor approximate them internally via FD. This class of methods is called Derivative-free Optimization (DFO) and the following blog post introduces them.
Bumping - how difficult can it be?
Let's perform a small experiment: We try to minimize Rosenbrock's banana function $$ f(x_1,x_2) = 100(x_2-x_1^2)^2 + (1 - x_1)^2 $$ with a quasi-Newton algorithm e04ky which requires first derivatives. We will approximate them by forward finite differences: $$ \frac{\partial}{\partial x_i} f(x) \approx \frac{f(x+he_i) - f(x)}{h} $$ where $e_i$ is a standard basis vector. It turns out that the right choice of the FD interval $h$ is crucial.
The solution is $x_1=x_2=1.0$ with the optimal function value $0.0$. We will rerun the solver several times from starting point $[-2.75, 1.5]$ and compare the number of function evaluations, the final objective value, the exit ifail code and the norm of the true gradient at the final point as a measure of optimality.
$h$ | status (ifail) | evals | $f(x)$ | $\|\nabla f(x)\|$ |
---|---|---|---|---|
exact derivatives | optimal (0) | 57 | 1.2346e-16 | 4.9014e-07 |
1e-13 | failed (10) | 3 | 3.6894e+03 | 6.7854e+03 |
1e-12 | optimal (0) | 61 | 8.7011e-20 | 2.0029e-09 |
1e-10 | optimal (0) | 60 | 4.6257e-16 | 6.1138e-07 |
1e-08 | optimal (0) | 63 | 8.3944e-12 | 2.5924e-06 |
1e-07 | suboptimal (7) | 188 | 2.5827e-10 | 1.4908e-05 |
1e-06 | suboptimal (7) | 89 | 9.0222e-08 | 3.8266e-04 |
1e-05 | suboptimal (3) | 88 | 8.7574e-06 | 3.5074e-03 |
1e-04 | suboptimal (3) | 79 | 8.1961e-04 | 3.8085e-02 |
1e-03 | failed (10) | 3 | 3.6894e+03 | 6.7854e+03 |
It can be observed that for too small or too large $h$ the algorithm fails. For $h \in [10^{-12}, 10^{-8}]$ the solver fully satisfies its stopping criteria and for $h \in [10^{-7}, 10^{-4}]$ a reasonable approximation of the solution was obtained but with increasing $h$ the reliability decreases.
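A similar experiment can be sketched outside the NAG Toolbox. The snippet below is an illustrative Python sketch that uses SciPy's BFGS quasi-Newton method as a stand-in for e04ky, so the figures will not match the table exactly; it minimizes the Rosenbrock function with a forward-FD gradient for a range of intervals $h$.

```python
import numpy as np
from scipy.optimize import minimize

def rosenbrock(x):
    """Rosenbrock's banana function f(x1,x2) = 100(x2 - x1^2)^2 + (1 - x1)^2."""
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def fd_grad(f, x, h):
    """Forward finite-difference gradient: (f(x + h*e_i) - f(x)) / h."""
    g = np.empty_like(x)
    fx = f(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h
        g[i] = (f(xp) - fx) / h
    return g

x0 = np.array([-2.75, 1.5])
for h in [1e-13, 1e-12, 1e-10, 1e-8, 1e-6, 1e-4, 1e-3]:
    res = minimize(rosenbrock, x0, method="BFGS",
                   jac=lambda x, h=h: fd_grad(rosenbrock, x, h))
    print(f"h = {h:.0e}  f = {res.fun:.4e}  nfev = {res.nfev}  success = {res.success}")
```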
It is also interesting to look closely at the precision of the gradient computation at individual evaluations. The following table lists selected points from the run with $h=10^{-6}$ and reports the absolute and relative precision of the FD approximation, the norms of the exact and FD gradients, and the angle between the two vectors (in degrees):
eval | $\| \nabla f(x) - \nabla f_{FD}(x) \|$ | $\| \nabla f(x) - \nabla f_{FD}(x) \| / \| \nabla f(x)\|$ | $\| \nabla f(x)\|$ | $\| \nabla f_{FD}(x)\|$ | angle (deg) |
---|---|---|---|---|---|
1 | 4.2385e-03 | 6.2464e-07 | 6.7855e+03 | 6.7855e+03 | 0.0 |
5 | 1.8397e-03 | 1.5818e-05 | 1.1630e+02 | 1.1630e+02 | 0.0 |
10 | 6.7621e-04 | 2.0091e-06 | 3.3658e+02 | 3.3658e+02 | 0.0 |
20 | 1.7706e-04 | 1.3523e-06 | 1.3093e+02 | 1.3093e+02 | 0.0 |
30 | 1.0067e-04 | 1.3220e-05 | 7.6146e+00 | 7.6145e+00 | 0.0 |
40 | 1.3910e-04 | 2.2372e-05 | 6.2177e+00 | 6.2176e+00 | 0.0 |
50 | 3.8461e-04 | 8.3636e-04 | 4.5986e-01 | 4.6013e-01 | 0.0 |
51 | 4.0201e-04 | 6.3257e-04 | 6.3552e-01 | 6.3581e-01 | 0.0 |
52 | 4.0894e-04 | 2.6149e-03 | 1.5638e-01 | 1.5607e-01 | 0.1 |
53 | 4.1319e-04 | 9.2076e-03 | 4.4875e-02 | 4.5189e-02 | 0.3 |
54 | 4.1295e-04 | 1.6156e-01 | 2.5561e-03 | 2.8432e-03 | 6.3 |
55 | 4.1304e-04 | 1.3006e+00 | 3.1758e-04 | 1.7052e-04 | 67.9 |
56 | 4.1305e-04 | 9.9867e-01 | 4.1360e-04 | 6.9688e-07 | 37.8 |
57 | 4.1305e-04 | 9.9998e-01 | 4.1306e-04 | 9.0793e-09 | 39.7 |
58 | 4.1305e-04 | 1.0017e+00 | 4.1234e-04 | 9.2678e-07 | 39.8 |
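The degradation visible in the last rows can also be reproduced directly by comparing the forward-FD gradient with $h=10^{-6}$ against the analytic gradient of the Rosenbrock function at points approaching the solution. This is again an illustrative Python sketch, not the original MATLAB run; the chosen test points are arbitrary.

```python
import numpy as np

def rosenbrock(x):
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def rosenbrock_grad(x):
    """Exact gradient, used only to measure the FD error."""
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                      200.0 * (x[1] - x[0]**2)])

def fd_grad(f, x, h):
    """Forward finite-difference gradient."""
    fx, g = f(x), np.zeros_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += h
        g[i] = (f(xp) - fx) / h
    return g

h = 1e-6
for t in [1e0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]:
    x = np.array([1.0 + t, 1.0 + t])      # points approaching the solution (1, 1)
    g, g_fd = rosenbrock_grad(x), fd_grad(rosenbrock, x, h)
    rel = np.linalg.norm(g - g_fd) / np.linalg.norm(g)
    cosang = g @ g_fd / (np.linalg.norm(g) * np.linalg.norm(g_fd))
    print(f"||g|| = {np.linalg.norm(g):.3e}   rel. FD error = {rel:.3e}   "
          f"angle = {np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))):.2f} deg")
```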
The effect of the finite difference interval $h$
The observed behaviour can easily be explained by the theory. The finite difference approximation has its roots in Taylor's theorem, which says that a sufficiently smooth function $f$ at the point $x+d$ can be expressed via the derivatives and the function value at the point $x$ as: $$ f(x+d) = f(x) + \nabla f(x)^T d + \frac{1}{2} d^T \nabla^2 f(x) d + \cdots $$ It is easy to see that the forward finite difference $$ \frac{\partial}{\partial x_i} f(x) = \frac{f(x+he_i) - f(x)}{h} + \delta_h, \qquad \text{where}\quad \delta_h = O(h), $$ approximates the exact derivative by neglecting the higher order terms. This introduces the truncation error $\delta_h$ of order $h$, which suggests that the smaller $h$ is, the better the approximation. However, there is a second competing factor which needs to be accounted for. The function values $f(x)$ and $f(x+he_i)$ are evaluated in finite precision arithmetic, so an error of the order of the machine precision $\varepsilon$ needs to be considered. The resulting cancellation error after the division is of order $O(\varepsilon/h)$. Thus, although a smaller $h$ reduces the truncation error, it increases the cancellation error.
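The interplay of the two errors is easy to observe on a simple one-dimensional example (illustrative only, not part of the original experiments): for $f(x)=e^x$ at $x=1$, where the exact derivative is also $e^x$, the forward-FD error first decreases with $h$ and then grows again once cancellation dominates.

```python
import numpy as np

f, df, x = np.exp, np.exp, 1.0          # f and its exact derivative
for h in 10.0 ** np.arange(-1.0, -16.0, -1.0):
    fd = (f(x + h) - f(x)) / h          # forward finite difference
    print(f"h = {h:.0e}   error = {abs(fd - df(x)):.3e}")
```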
The optimal interval
The right balance between the truncation and the cancellation errors is needed. It can be estimated (see [1], [2]) that, under the assumption of a well-scaled function, the optimal forward finite difference interval is $h_{opt} \approx \sqrt{\varepsilon}$ with an overall approximation error of order $\sqrt{\varepsilon}$. In standard double precision floating-point arithmetic $\varepsilon = 1.1e-16$, thus $h_{opt}$ is roughly 1e-8.
It is easy to imagine a function that does not fit these assumptions, for example, one where the variables have noticeably different scales. In such a case the finite difference interval $h$ needs to be chosen with care. NAG provides a numerical estimation of $h$, see e04xa. The routine requires at least three function evaluations per coordinate; it is best called at a representative point (such as the starting point), keeping the interval $h$ unchanged throughout the computation.
The effect of a different scale in each coordinate is demonstrated on the following quadratic function: $$ f(x_1,x_2) = (x_1-1)^2 + \alpha (x_2-1)^2 $$ with $\alpha \in \{1, 100, 10^4, 10^6\}$. Routine e04xa was called at the point $[10,10]$ to estimate the optimal interval $h$ for each coordinate. The results show how the change of scale influences the choice of $h$:
$\alpha$ | 1 | 100 | 1e+04 | 1e+06 |
---|---|---|---|---|
$h_{opt}$ for $x_1$ | 1.1941e-06 | 8.4602e-06 | 8.4181e-05 | 8.4177e-04 |
$h_{opt}$ for $x_2$ | 1.1941e-06 | 8.4602e-07 | 8.4178e-07 | 8.4176e-07 |
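The trend in the table can be mimicked with a simple heuristic in the spirit of [1]: balance the truncation error $\tfrac{h}{2}|f''|$ against the cancellation error $\tfrac{2\varepsilon|f|}{h}$ using a crude second-derivative estimate, giving $h \approx 2\sqrt{\varepsilon|f|/|f''|}$. The Python sketch below is purely illustrative and is not the algorithm implemented in e04xa, so it reproduces only the trend, not the table entries.

```python
import numpy as np

def estimate_interval(f, x, i, h_trial=1e-4):
    """Rough per-coordinate FD interval estimate (illustrative heuristic only).

    Balances truncation error ~ h*|f''|/2 against cancellation error ~ 2*eps*|f|/h,
    with the second derivative estimated by a central second difference.
    """
    eps = np.finfo(float).eps
    e = np.zeros_like(x)
    e[i] = 1.0
    f0 = f(x)
    d2 = (f(x + h_trial * e) - 2.0 * f0 + f(x - h_trial * e)) / h_trial**2
    d2 = max(abs(d2), np.finfo(float).tiny)          # guard against division by zero
    return 2.0 * np.sqrt(eps * max(abs(f0), 1.0) / d2)

# badly scaled quadratic from the text: f = (x1-1)^2 + alpha*(x2-1)^2 at [10, 10]
for alpha in [1.0, 100.0, 1e4, 1e6]:
    f = lambda x, a=alpha: (x[0] - 1.0)**2 + a * (x[1] - 1.0)**2
    x = np.array([10.0, 10.0])
    print(f"alpha = {alpha:.0e}   h1 = {estimate_interval(f, x, 0):.3e}"
          f"   h2 = {estimate_interval(f, x, 1):.3e}")
```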
Possible complications
Despite all the efforts to choose the optimal FD interval $h$, there might be factors which damage the FD approximation. The forward finite difference has the clear advantage that it requires only one extra function evaluation per coordinate at each point. However, as was demonstrated, the relative error of the approximation grows as the algorithm approaches the solution ($\|\nabla f(x)\|$ decreases), which might stop the convergence altogether. In such a case, the central finite difference $$ \frac{\partial}{\partial x_i} f(x) = \frac{f(x+he_i) - f(x-he_i)}{2h} + \delta_c, \qquad \text{where}\quad \delta_c = O(h^2), $$ might be a suitable replacement because its truncation error is of order $h^2$. A similar analysis can be performed to find the optimal FD interval: under the same assumption of a well-scaled $f$, $h_{opt}=\varepsilon^{1/3}$ with an overall approximation error of order $\varepsilon^{2/3}$ (roughly 5e-6 and 2e-11, respectively, in double precision). The disadvantage is that two extra function evaluations per coordinate are required, which makes the central difference fairly expensive and thus an alternative only for the last few iterations when the forward FD approximation becomes insufficient.
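The gain from central differences can be seen on the same one-dimensional example as before (illustrative Python sketch, $f(x)=e^x$ at $x=1$): near its optimal interval the central difference is several orders of magnitude more accurate than the forward difference ever gets.

```python
import numpy as np

f, df, x = np.exp, np.exp, 1.0
for h in [1e-4, 1e-5, 5e-6, 1e-6, 1e-7, 1e-8]:
    fwd = (f(x + h) - f(x)) / h                   # truncation error O(h)
    ctr = (f(x + h) - f(x - h)) / (2.0 * h)       # truncation error O(h^2)
    print(f"h = {h:.0e}   forward error = {abs(fwd - df(x)):.2e}   "
          f"central error = {abs(ctr - df(x)):.2e}")
```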
There are even higher order derivative approximations; however, they would typically be prohibitively expensive for a standard optimization run and are not discussed further.
The estimate of the optimal interval $h_{opt}$ heavily relies on the precision of evaluations of $f(\cdot)$. If the function evaluation is noisy (for example, a complex simulation is performed as a part of the objective function and cannot reach the full $\varepsilon$ precision), the FD approximation degrades very quickly and the only remedy is to choose a specialized solver which is less sensitive to the noise, such as DFO.
We showed that even the finite difference approximation has its caveats. In particular, the choice of the FD interval $h$ is crucial for the precision, and the forward FD might be insufficient close to the solution. Ideally the solver itself should drive the FD approximations, as it can adjust its other options accordingly and possibly switch to central differences closer to the solution. In any case, special care should be taken when finite differences are applied to imprecise function evaluations; in such cases there might be better suited approaches, such as derivative-free optimization.
To see all the previous blogs, please go here. You can also find various examples through our GitHub Local optimization page. See you in the next blog.
[1] Gill P E, Murray W and Wright M H (1981) Practical Optimization Academic Press
[2] Nocedal J and Wright S J (2006) Numerical Optimization (2nd Edition) Springer Series in Operations Research, Springer, New York
The experiments presented above were performed in MATLAB® with the NAG Toolbox using the following files, click here to download (3,228 bytes).