Compiled functions

Added in version 4.0.0.

heyoka can compile just-in-time (JIT) multivariate vector functions defined via the expression system. This feature is described and explored in detail in a dedicated tutorial for heyoka.py, the Python bindings of heyoka.

In Python, just-in-time compilation can lead to substantial speedups for function evaluation. In C++, the performance argument is less compelling, as C++ does not suffer from Python’s performance pitfalls: if you need fast evaluation of a function, you can simply implement it directly in C++ without resorting to just-in-time compilation.

Nevertheless, even in C++ heyoka’s compiled functions offer a few advantages over plain C++ functions:

  • because functions are compiled just-in-time, they can take advantage of all the features of the host CPU. Most importantly, heyoka’s compiled functions support batch evaluation via SIMD instructions which can provide a multifold speed boost over plain (scalar) C++ functions;

  • batch mode evaluation of compiled functions also supports multithreaded parallelisation, which can provide another substantial performance boost on modern multicore machines;

  • heyoka’s functions support automatic differentiation up to arbitrary order, so the derivatives of a compiled function can be evaluated without any additional effort;

  • heyoka’s functions can be defined at runtime, whereas C++ functions need to be defined and available at compilation time. This means that it is possible to create a compiled function at runtime from user-supplied data (e.g., a configuration file) and evaluate it with optimal performance.

The main downside of compiled functions is that the compilation process is computationally expensive. Just-in-time compilation is thus most useful when a function needs to be evaluated many times with different input values, so that the initial compilation overhead is amortised by the increased evaluation performance.

A simple example

As an initial example, we will JIT compile the simple bivariate function

\[f\left(x, y \right) = x^2 - y^2.\]

We begin with the definition of the symbolic variables and of the symbolic function to be compiled:

    // Init the symbolic variables.
    auto [x, y] = make_vars("x", "y");

    // Create the symbolic function.
    auto sym_func = x * x - y * y;

Next, we create a compiled function via the cfunc class:

    // Create the compiled function.
    cfunc<double> cf{{sym_func}, {x, y}};

Note how sym_func was passed to the constructor of cfunc enclosed in curly brackets: this is because, in general, cfunc expects as input a vector function, that is, a list of expressions representing the function components. In this specific case, we are compiling a vector function with a single component.

Like many other heyoka classes, cfunc is a class template parametrised over a single type T representing the floating-point type to be used for function evaluation. In this case, we are operating in standard double precision.

Let us inspect the compiled function object by printing it to screen:

    // Print the compiled function object to screen.
    fmt::println("{}", cf);
C++ datatype: double
Variables: [x, y]
Output #0: (x**2.0000000000000000 - y**2.0000000000000000)

We can now proceed to evaluate the compiled function. In order to do so, we need to store the input values in a memory buffer and prepare a memory buffer to store the result of the evaluation. We can use std::array for both:

    // Prepare the input-output buffers.
    std::array<double, 2> in{1, 2};
    std::array<double, 1> out{};

We stored the values \(1\) and \(2\) in the input buffer, which means that the function will be evaluated for \(x=1\) and \(y=2\).

We can now proceed to invoke the call operator of cfunc, which will write the result of the evaluation into out:

    // Invoke the compiled function.
    cf(out, in);

Let us print the contents of out to screen in order to confirm that the evaluation was successful:

    // Print the output.
    fmt::println("Output: {}", out);
Output: [-3]

Batch evaluation

The simple example we have just seen consisted of the evaluation of a function over a single value for each variable. cfunc also supports evaluation of a function over batches of input values for each variable.

In order to perform batch evaluation, we first have to define new memory buffers to store the inputs and outputs of the evaluation. We select a batch size of \(2\), which means we need storage for \(2 \times 2 = 4\) input values and \(2\) output values:

    // Prepare input-output buffers for batch evaluation.
    std::array<double, 4> in_batch{1, 1.1, 2, 2.2};
    std::array<double, 2> out_batch{};

In batch evaluations, the batch of input values for each variable is expected to be stored contiguously. That is, the input buffer is interpreted as a row-major bidimensional array in which each row contains the batch of input values for a single variable. In this specific example, we will be evaluating the function for \(x=\left[ 1, 1.1 \right]\) and \(y=\left[ 2, 2.2 \right]\).

In the next step, we create bidimensional views over the input/output buffers with the help of mdspan and the convenience typedefs cfunc::in_2d and cfunc::out_2d:

    // Prepare the views onto the input-output buffers.
    cfunc<double>::in_2d in_view{in_batch.data(), 2, 2};
    cfunc<double>::out_2d out_view{out_batch.data(), 1, 2};

As we just explained, the input data is interpreted as a \(2 \times 2\) array, while the output data is interpreted as a \(1 \times 2\) array.

We are now ready to perform a batch evaluation:

    // Invoke the compiled function.
    cf(out_view, in_view);

Finally, we can print to screen the result of the evaluation:

    // Print the output.
    fmt::println("Output: {}", out_batch);
Output: [-3, -3.6300000000000003]

For this simple example, we used a batch size of \(2\), but arbitrarily large batch sizes are possible. If the batch size is large enough, heyoka will parallelise the computation using multiple threads of execution, leading to substantial speedups on multicore machines.

While in this tutorial we operated in standard double precision for simplicity, compiled functions can also operate in single, extended, quadruple and multiple precision.