'On Vectorization of Algorithmic Adjoint C++ Code' is written by NAG Collaborators – Johannes Lotz, Klaus Leppkes, Uwe Naumann (RWTH Aachen University, Germany), and was originally published on QuantMinds 


Exploitation of vectorization capabilities in modern CPUs is crucial for getting the highest possible performance. For example, it is widely known that suitable financial Monte Carlo simulation code can benefit from vectorization immensely. On the other hand, the efficient computation of gradients as well as of higherorder parameter sensitivities for a given (also: primal) C++ code by adjoint algorithmic differentiation (AAD) has become essential in computational finance. While auto-vectorization built into modern compilers performs reasonably well on the primal code it typically fails on adjoint code obtained by implementations of AAD by overloading. Vectorization needs to be built explicitly into the corresponding AAD tools. In this article we discuss vectorized AAD by overloading based on recent advances in the upcoming version 3.4 of the NAG AAD tool dco/c++.

Modern C++ compilers are able to automatically vectorize suitable C++ code, that is they can create vector instructions in the object code based on the results of data-dependence analysis applied to the scalar source code. This capability relies on the programmer taking care of things like data alignment and pointer aliasing. In addition, compilers can be given hints in form of vendor-specific keywords or pragmas. For example, the GNU compiler supports __builtin_assume_aligned and the OpenMP standard provides #pragma simd to mark loops operating in SIMD (single instruction multiple data) mode. The respective loop bodies need to be simple; for example, Intel suggests to use "straight-line code (a single basic block)" only and to avoid "function calls".

Vectorization of AAD is essential unless one is willing to accept a substantial increase in run time of the adjoint code relative to the primal code. Without vectorization and depending on the properties of the primal target code as well as on the level of maturity of the AAD solution/tool this ratio often ranges somewhere between  two and ten. Failure to vectorize the adjoint can easily add an order of magnitude. The additional code complexity due to AAD implemented by operator and function overloading and possibly based on sophisticated template meta-programming methodology almost certainly prevents the compiler from autovectorization as illustrated by the following image.

Intel Compiler on Haswell i5-7200

We plot absolute run times for a Monte Carlo solution of a simple stochastic differential equation compiled with the Intel C++ compiler (version 19.0) on AVX2 hardware. The relative run time of the scalar adjoint generated by dco/c++ turns out to be approximately equal to two (blue columns). Auto-vectorization performs adequately for the primal. It fails on the adjoint (orange columns). Use of the dco/c++ vector capabilities not only regains vectorization for the adjoint. It even decreases the primal run time (yellow columns).

Consider the following simplified Monte Carlo code for computing N paths with a set of given parameters p and random variable vector Z of matching size. The output of each individual path is stored in the vector y .

std::vector<double> /* inputs  */ p(M),
                    /* outputs */ y(N);
for (size_t i = 0; i < N; ++i) {
  /* single Monte Carlo path */
  y[i] = f(p, Z[i]);

The path-wise adjoint method yields the following adjoint code with new input ya (adjoint of primal output y ) and new output pa (adjoint of primal input p ); only additional declarations are shown.

std::vector<double> ya(N), pa(M);
for (size_t i = 0; i < N; ++i) {
/* adjoint of single Monte Carlo path */
fa(p, pa, Z[i], ya[i]);

dco/c++ provides vector types in combination with corresponding vector instructions for use within the primal code. The SIMD length needs to be supplied as a compile time constant in form of a template argument. The memory occupied by variables of this type is required to be 32-byte aligned; a specialized allocator for std::vector comes with dco/c++. Moreover, malloc -like platform independent allocation and deallocation functions are available.

/* vector type with SIMD length L */
using vtype = dco::gv<double, L>::type;
std::vector<vtype> p(M);
/* => p is initialized here <= */
/* iterate chunk-wise */
for (auto range : dco::subranges<L>(N)) {
/* vectorized Monte Carlo paths */
vtype yi = f(p, Z[range.global_index]);
/* unpack chunk into y */
dco::vunpack(range, y, yi);

The dco/c++ vector type can be used as base type for tangent (directional) as well as adjoint derivative types yielding vectorized computation of tangents and adjoints. Recursive nesting of derivative types enables vectorized second- and higher-order tangents and adjoints. The following code shows a combination of a standard dco/c++ adjoint driver with vectorization.

using vtype = dco::gv<double, L>::type;
/* the adjoint mode and data type */
using amode = dco::ga1s<vtype>;
using atype = amode::type;
/* create tape data structure */
amode::global_tape = amode::tape_t::create();
std::vector<atype> p(M);
for (auto range : dco::subranges<L>(N)) {
/* define which derivatives we want */
amode::global_tape->register_variable(p, M);
/* record vectorized Monte Carlo paths */
atype yi = f(p, Z[range.global_index]);
for (auto i : range) {
y[i.global_index] = dco::value(yi)[i.sub_index];
dco::derivative(yi)[i.sub_index] = ya[i.global_index];

The vector types in dco/c++ are based on operator overloading making their integration into existing vectorizable code relatively straight forward. No restrictions apply to call tree depth nor to flow of control for as long as the latter does not depend on the vectorized values. Vector masking needs to be used otherwise. 

Obviously, the concepts discussed in this article are not limited to implementation of AAD by overloading. Both manual and automated source transformation AAD can benefit from vectorization. Still the ease of use of custom vector types and their seamless integration into the hierarchical AAD type system make an overloading method the preferred choice for large C++ code bases.

Leave a Comment

This form has an automated anti-spam system running (Recaptcha). If it suspects you are not a valid visitor a backup challenge will appear here.