Limited-Memory Nonlinear Conjugate Gradient Solver

At Mark 27 NAG introduced a novel solver, handle_solve_bounds_foas (e04kf), for large-scale unconstrained or bound-constrained nonlinear optimization problems. It offers a significant performance improvement over the available alternatives. The solver is based on the nonlinear conjugate gradient method and, thanks to its low memory requirements, it is suitable even for problems with millions of variables and beyond. It is part of NAG's ongoing effort to expand and improve its offering in mathematical optimization.

Large-scale nonlinear optimization problems of the form

\[ \begin{aligned} &\min_{x\in\mathbb{R}^n}\ f(x)\\ &\text{subject to}\quad\ell\leq x\leq u, \end{aligned} \tag{1} \]

where \(f(x)\) is smooth and first-order derivatives are available, are ubiquitous in real-life applications. Such problems become challenging to solve when the search space is large (\(n=10^5\) or more). It is often the case that evaluating the second-order derivatives (the Hessian) is prohibitively expensive (in time or resources) or outright impossible. First-order methods are designed to rely only on first-order derivatives (the gradient), making clever use of gradient information to build a model of \(f(x)\) and find good search directions. Methods such as conjugate gradient (CG) and quasi-Newton (QN) have been well studied and are commonly used to solve (1).
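To make the idea concrete, here is a minimal nonlinear CG iteration in pure Python, using a Polak-Ribière+ update with a simple Armijo backtracking line search. This is an illustrative textbook sketch, not NAG's implementation; the function names, tolerances, and line-search constants are our own choices:

```python
import math

def nonlinear_cg(f, grad, x0, tol=1e-8, max_iter=1000):
    """Minimal nonlinear CG: Polak-Ribiere+ update with Armijo
    backtracking line search (illustrative sketch only)."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]                      # start with steepest descent
    for _ in range(max_iter):
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        gTd = sum(gi * di for gi, di in zip(g, d))
        if gTd >= 0:                           # safeguard: reset to steepest descent
            d = [-gi for gi in g]
            gTd = -sum(gi * gi for gi in g)
        # Armijo backtracking: shrink the step until sufficient decrease holds
        t, fx = 1.0, f(x)
        while f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * gTd:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x)
        # Polak-Ribiere+ beta; the max(0, .) acts as an automatic restart
        beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g))
                   / sum(go * go for go in g))
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        g = g_new
    return x
```

Note that only a handful of length-\(n\) vectors are kept at any time, which is exactly why methods of this family scale to very large \(n\).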

Improving on Well-established Methods

CG methods have their origin in linear optimization and have been successfully extended to nonlinear optimization. The main aspects that characterize these methods are their low memory requirements and convergence speed. They have gained renewed popularity with the introduction of new ways of dealing with numerical ill-conditioning and their coupling with very fast linesearch algorithms. CG methods represent an attractive compromise between the simple steepest-descent method and the more memory-hungry QN methods.

Implementations of first-order methods are not only in widespread use, but have also demonstrated that they endure the challenges of the ever-growing problem sizes imposed by industry. Most notable are applications in statistics and machine learning, e.g. parameter calibration for log-linear models and conditional random fields (\(\ell_2\)-regularization).

One of the most widely used first-order methods is L-BFGS-B [2], a limited-memory bound-constrained version of the BFGS QN algorithm [1]. The main idea is to approximate the inverse Hessian matrix implicitly, using only a limited number of stored direction and gradient vectors. Recent CG methods incorporate QN ideas [5]: they retain the convergence speed and low memory requirements of CG while exploiting the curvature information provided by the QN approximation to the Hessian. This combination has made limited-memory CG a competitive alternative to L-BFGS-B.
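The limited-memory idea can be sketched with the textbook two-loop recursion of Liu and Nocedal [1]: the product of the inverse-Hessian approximation with a gradient is computed from a short history of step vectors \(s\) and gradient-difference vectors \(y\), without ever forming a matrix. This is a generic illustration, not NAG's code; the lists are assumed to be ordered newest pair first:

```python
def lbfgs_direction(g, s_list, y_list):
    """Two-loop recursion: apply the limited-memory inverse-Hessian
    approximation to gradient g implicitly (illustrative sketch).
    s_list: recent steps, y_list: gradient differences, newest first."""
    q = list(g)
    rhos = [1.0 / sum(si * yi for si, yi in zip(s, y))
            for s, y in zip(s_list, y_list)]
    alphas = []
    # first loop: newest to oldest stored pair
    for s, y, rho in zip(s_list, y_list, rhos):
        a = rho * sum(si * qi for si, qi in zip(s, q))
        q = [qi - a * yi for qi, yi in zip(q, y)]
        alphas.append(a)
    # initial scaling gamma = s.y / y.y from the most recent pair
    s0, y0 = s_list[0], y_list[0]
    gamma = sum(si * yi for si, yi in zip(s0, y0)) / sum(yi * yi for yi in y0)
    r = [gamma * qi for qi in q]
    # second loop: oldest to newest
    for s, y, rho, a in zip(reversed(s_list), reversed(y_list),
                            reversed(rhos), reversed(alphas)):
        b = rho * sum(yi * ri for yi, ri in zip(y, r))
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]   # quasi-Newton search direction -H g
```

With a single stored pair on a one-dimensional quadratic \(f(x)=x^2\) (so \(y = 2s\)), the recursion reproduces the exact Newton direction, which is a handy sanity check.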

NAG's e04kf is a modern implementation of a bound-constrained nonlinear CG method that incorporates a limited-memory QN approximation similar in spirit to L-BFGS-B. The methods implemented in this new solver have a very low memory footprint and are based on groundbreaking research in the field of first-order methods [3, 4, 5].

Features of the New Solver

  • Designed to solve large-scale nonlinear problems subject to box bounds;
  • Uses only first-order information;
  • Only approximates the Hessian when numerical difficulties arise due to ill-conditioned problems;
  • Incorporates a state-of-the-art bracketed linesearch [3];
  • Ability to recover when it is not possible to evaluate the function \(f(x)\) or its gradient at a given point;
  • Memory requirements are half those of L-BFGS-B. While that algorithm must store direction and gradient information separately, e04kf only stores recent directions, thus halving the memory footprint. It can be used as a one-to-one replacement for L-BFGS-B.
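For the bound-constrained case, the basic building block is the componentwise projection onto the box \(\ell\leq x\leq u\). The sketch below shows a hypothetical projected-gradient step and the active set it identifies, in the spirit of the active-set algorithm of [4]; it is an illustration of the concept, not e04kf's actual mechanism:

```python
def project_box(x, lower, upper):
    """Project a point onto the box {l <= x <= u}, componentwise."""
    return [min(max(xi, li), ui) for xi, li, ui in zip(x, lower, upper)]

def projected_gradient_step(x, g, t, lower, upper):
    """Take one gradient step of size t and project back onto the box.
    Variables that land exactly on a bound form the active set."""
    x_new = project_box([xi - t * gi for xi, gi in zip(x, g)], lower, upper)
    active = [i for i, (xi, li, ui) in enumerate(zip(x_new, lower, upper))
              if xi == li or xi == ui]
    return x_new, active
```

Fixing the active variables at their bounds and running an unconstrained method (such as CG) on the remaining free variables is the standard way such methods handle box constraints.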

e04kf was designed to be a competitive alternative to QN methods. A benchmark on 130 problems of the CUTEst set shows that e04kf is considerably faster than L-BFGS-B (version 3.0). Figure 1 shows the performance profiles in terms of time and gradient calls. Contrasting the solvers on plots (a) and (b) shows that both use roughly the same number of gradient calls (b), yet e04kf solves 70% of the problems faster (a).
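For readers who want to reproduce such comparisons, a performance profile in the sense of Dolan and Moré can be computed from a table of solve times. This small sketch (names and data layout are our own) returns, for each solver and each factor tau, the fraction of problems it solved within tau times the best solver's time:

```python
def performance_profile(times, taus):
    """Dolan-More performance profile (illustrative sketch).
    times: dict mapping solver name -> list of solve times per problem.
    Returns dict mapping solver -> fraction solved within tau * best,
    for each tau in taus."""
    n_prob = len(next(iter(times.values())))
    # best observed time per problem across all solvers
    best = [min(times[s][p] for s in times) for p in range(n_prob)]
    return {s: [sum(times[s][p] <= tau * best[p] for p in range(n_prob)) / n_prob
                for tau in taus]
            for s in times}
```

The value of a solver's profile at \(\tau=1\) is the fraction of problems on which it was the fastest, which is exactly the 70% figure quoted above for e04kf.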

[Figure 1: performance profile plots, panels (a) and (b)]

 

Figure 1: Performance profiles for 130 CUTEst nonlinear bound-constrained and unconstrained problems; a higher line represents a faster solver. In (a) the comparison metric is time, while in (b) it is the number of gradient calls.

 

Replacement for NAG Solver uncon_conjgrd_comp (e04dg)

One of the main design objectives for e04kf was to provide a modern and attractive replacement for the CG solver uncon_conjgrd_comp (e04dg) introduced at Mark 12. While e04dg targeted unconstrained NLPs, e04kf has not only been extended to solve bound-constrained NLPs but also offers noticeable performance gains. Figure 2 reports a benchmark using performance profiles over a set of CUTEst unconstrained NLP problems for both solvers. Contrasting the two plots, it can be seen that the new solver is more efficient in time and in terms of user call-backs: it solves 45% of the problems faster (left plot), and 60% of the problems require fewer gradient evaluations (right plot). The benchmark timing data also showed that 13% of the problems were solved 10 times faster or more, including 7% that were solved at least 100 times faster. These results show a clear advantage of e04kf over e04dg, and current users of uncon_conjgrd_comp (e04dg) are highly encouraged to upgrade.

 

[Figure 2: performance profile plots, panels (a) and (b)]

 

Figure 2: Performance profiles comparing solvers e04kf and e04dg over 114 CUTEst unconstrained NLP problems. The performance measures are time (a) and number of gradient calls (b). In (a) a higher line indicates a faster solver; in (b) a higher line represents fewer gradient calls.

 

Choosing the Right Solver

A frequent issue when it comes to solving optimization problems is which solver to use. Reaching for the most powerful solver is not always the best choice. In general, simpler first-order (gradient) methods such as e04kf require more iterations than their second-order (Hessian) counterparts such as e04st. However, for problems where evaluating the gradient has negligible cost compared to evaluating the Hessian, first-order methods can be very competitive: despite the higher iteration count, the overall cost can be considerably lower. As an illustrative example, e04kf and e04st are used to solve the CUTEst COSINE(\(n\)) problem for increasing sizes of the decision space, \(n\). Figure 3 plots the time required by e04kf and e04st to solve each instance, for \(n\) from 10 to 1,000,000. It can be observed that beyond a certain size, evaluating and factorizing the Hessian matrix has a non-negligible cost: at \(n=10^6\), e04kf is \(18/5=3.6\) times faster than e04st. This is one reason why it is important to know solver strengths and match them to the characteristics of the problem to be solved.
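This trade-off can be caricatured with a toy cost model. The iteration counts and per-iteration cost exponents below are assumptions chosen purely for illustration, not measurements of e04kf or e04st:

```python
def total_cost(n, iters, per_iter_cost):
    """Toy model: total work = iteration count x per-iteration cost."""
    return iters * per_iter_cost(n)

# Hypothetical cost models: a first-order method does many cheap
# iterations (gradient ~ O(n)); a second-order method does few
# expensive ones (Hessian evaluation + factorization ~ O(n**1.5) here).
first_order  = lambda n: total_cost(n, iters=500, per_iter_cost=lambda m: m)
second_order = lambda n: total_cost(n, iters=20,  per_iter_cost=lambda m: m ** 1.5)
```

Under these assumed constants the second-order model wins for small \(n\), while the first-order model wins for very large \(n\), which mirrors the crossover behaviour described above.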

[Figure 3: time comparison of e04kf and e04st on COSINE(\(n\))]

 

Figure 3: Performance comparison in time (s) of e04kf and e04st when solving instances of the CUTEst COSINE problem of increasing size \(n\).

e04kf is part of the NAG Optimization Modelling Suite, which offers clarity and consistency across the interfaces of all the solvers in the suite.

 

References

[1] D. C. Liu and J. Nocedal. (1989). On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming B. 45 (3) 503–528.

[2] R. H. Byrd, P. Lu and J. Nocedal. (1995). A Limited Memory Algorithm for Bound Constrained Optimization. SIAM Journal on Scientific and Statistical Computing, 16(5) 1190–1208.

[3] W. W. Hager and H. Zhang. (2005). A New Conjugate Gradient Method with Guaranteed Descent and an Efficient Line Search. SIAM J. Optim. 16(1) 170–192.

[4] W. W. Hager and H. Zhang. (2006). A New Active Set Algorithm for Box Constrained Optimization. SIAM J. Optim. 17(2) 525–557.

[5] W. W. Hager and H. Zhang. (2013). The Limited Memory Conjugate Gradient Method. SIAM J. Optim. 23(4) 2150–2168.