Compiled functions#

Added in version 0.19.

Starting from version 0.19, heyoka.py can compile just-in-time (JIT) multivariate vector functions defined via the expression system. On the one hand, JIT compilation can greatly increase the performance of function evaluation with respect to a Python implementation of the same function. On the other hand, the JIT compilation process is computationally expensive and thus JIT compilation is most useful when a function needs to be evaluated repeatedly with different input values (so that the initial overhead of JIT compilation can be absorbed by the evaluation performance increase).

A simple example#

As an initial example, we will JIT compile the simple function

\[ f\left(x, y \right) = x^2 - y^2. \]

Let us begin, as usual, with the introduction of the symbolic variables:

import heyoka as hy

x,y = hy.make_vars("x", "y")

Next, we define the function to be compiled:

sym_func = x**2-y**2

We can now proceed to JIT compile sym_func via the cfunc() function. cfunc() takes two mandatory input arguments:

  • the list of symbolic expressions representing the outputs of the vector function, and

  • the list of symbolic variables representing the inputs of the function.

In this case, we only have a single output, sym_func, and two inputs, x and y:

cf = hy.cfunc([sym_func], [x, y])

The value returned by cfunc() is a callable function object which accepts as input a NumPy array representing the values to use in the evaluation of sym_func:

# Evaluate for x=1 and y=5.
cf([1, 5])
array([-24.])

Note that we passed as first element in the list the value for x and then, as second element, the value for y. This is because when we created cf we specified [x, y] (in that order) as the inputs for the function. This can be confirmed by examining the screen output of the cf object:

cf
C++ datatype: double
Variables: [x, y]
Output #0: (x**2.0000000000000000 - y**2.0000000000000000)

Any other ordering for the input variables is possible:

cf2 = hy.cfunc([sym_func], vars=[y,x])

# Evaluate for x=1 and y=5.
cf2([5,1])
array([-24.])

Let’s inspect the cf2 object:

cf2
C++ datatype: double
Variables: [y, x]
Output #0: (x**2.0000000000000000 - y**2.0000000000000000)

cfunc() accepts several additional keyword arguments, the most important of which is the boolean flag compact_mode (defaulting to False). Similarly to the adaptive Taylor integrators, you should enable compact_mode if you want to compile extremely large symbolic expressions that result in excessively long compilation times. The downside of compact_mode is a slight performance degradation due to the different code generation model adopted during the JIT compilation process.

The function object returned by cfunc() also accepts several optional keyword arguments. It is possible, for instance, to pass as outputs argument a pre-allocated NumPy array into which the result of the evaluation will be written. This is useful to avoid the overhead of allocating new memory for the return value, if such memory is already available:

import numpy as np

# Pre-allocate a NumPy array to store the
# result of the evaluation.
ret_arr = np.zeros((1,))

# Evaluate, specifying that the result
# will be written into ret_arr.
cf([1,5], outputs=ret_arr)

ret_arr
array([-24.])

Warning

The ability to pass lists and other iterables as input/output arguments to a compiled function is offered as a convenience, but it incurs into a runtime cost as compiled functions need to convert the iterables into NumPy arrays on-the-fly before performing the evaluation.

For optimal performance, consider passing NumPy arrays to the call operator of compiled functions in order to avoid this hidden cost.

Functions with parameters#

It the compiled function references external parameters, the parameters array will have to be supplied during evaluation via the pars keyword argument:

# A function with 3 parameters.
sym_func_par = hy.par[0]*x**2-hy.par[1]*y**2+hy.par[2]

# Compile it.
cf_par = hy.cfunc([sym_func_par], [x, y])

# Evaluate, specifying the parameter values
cf_par([1,5], pars=[-1, -2, -3])
array([46.])

Time-dependent functions#

As seen in the tutorial on non-autonomous ODEs, heyoka.py uses a special placeholder variable called heyoka.time to represent time (i.e., the independent variable) in ODEs.

Usually outside the context of ODE solving there are no compelling reasons to use the special heyoka.time variable instead of a regular variable. However, it can sometimes be useful to numerically evaluate the right-hand side of ODEs (e.g., for validation/sanity checking). Thus, starting with heyoka.py 0.21, expressions containing heyoka.time can also be compiled.

In a time-dependent compiled function, it is mandatory to supply the time value via the time keyword argument:

# Create a time-dependent function.
sym_func_tm = x**2-y**2+hy.time

# Compile it.
cf_tm = hy.cfunc([sym_func_tm], [x, y])

# Evaluate for x=1, y=5 and time=6.
cf_tm([1,5], time=6.)
array([-18.])

Batched evaluations#

An important feature of compiled functions is the ability to be evaluated over batches of input variables in a single evaluation. Let us see a simple example:

# Evaluate cf for x=[1,2,3] and y=[5,6,7].
cf([[1,2,3],[5,6,7]])
array([[-24., -32., -40.]])

Because we now passed the two-dimensional array \([[1,2,3],[5,6,7]]\) as input argument (rather than a one-dimensional array, like in the previous examples), cf will be evaluated for multiple values of \(x\) (\(\left[1,2,3\right]\)) and \(y\) (\(\left[5,6,7\right]\)). The result also consists of a two-dimensional array in which the first dimension is the number of outputs (1 in this case), and the second dimension is the number of evaluation points (3 in this case).

Because heyoka.py makes extensive use of the SIMD instructions available in modern processors, a single batched evaluation will perform considerably better than multiple unbatched evaluations.

Batched evaluations of functions containing parameters and/or time are also supported:

# Batched evaluation with parameters.
cf_par([[1,2,3],[5,6,7]], pars=[[-1,-1.1,-1.2],[-2,-2.1,-2.2],[-3,-3.1,-3.2]])
array([[46. , 68.1, 93.8]])
# Batched evaluation with time.
cf_tm([[1,2,3],[5,6,7]], time=[6,6.1,6.2])
array([[-18. , -25.9, -33.8]])

Performance analysis#

In order to assess the performance of heyoka.py’s compiled functions, we will consider the evaluation of the Hamiltonian of the restricted three-body problem, a 6-dimensional scalar function defined as:

\[ \mathcal{H}\left(p_x,p_y,p_z,x,y,z\right) = \frac{1}{2}\left( p_x^2+p_y^2+p_z^2 \right) +yp_x-xp_y-\frac{1-\mu}{r_{PS}}-\frac{\mu}{r_{PJ}}, \]

where

\[\begin{split} \begin{align} \mu &= 0.1, \\ r_{PS} &=\sqrt{\left( x-\mu \right)^2+y^2+z^2}, \\ r_{PJ} &=\sqrt{\left( x -\mu + 1 \right)^2+y^2+z^2}. \end{align} \end{split}\]

\(\mathcal{H}\) will be evaluated on a grid of \(10^8\) randomly-generated evaluation points:

# Deterministic seeding.
rng = np.random.default_rng(42)

# Generate 10**8 evaluation points randomly.
nevals = 100000000
inputs = rng.uniform(size=(6, nevals))

# The mu parameter.
mu = .1

NumPy#

Let us begin with a NumPy-based evaluation:

# Evaluation of H via NumPy.
def Ham_np(px,py,pz,x,y,z):
    rps = np.sqrt((x-mu)**2 + (y**2+z**2))
    rpj = np.sqrt((x-mu+1)**2 + (y**2+z**2))
    
    return .5*(px**2+py**2+pz**2) + y*px - x*py - (1-mu)/rps - mu/rpj

# Extract the function arguments from
# the inputs array.
px_arr = inputs[0,:]
py_arr = inputs[1,:]
pz_arr = inputs[2,:]
x_arr = inputs[3,:]
y_arr = inputs[4,:]
z_arr = inputs[5,:]

# Time it.
%timeit Ham_np(px_arr,py_arr,pz_arr,x_arr,y_arr,z_arr)
2.16 s ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

heyoka.py#

Let us now try with heyoka.py. First we define the symbolic variables and the mathematical expression of \(\mathcal{H}\):

px,py,pz,x,y,z=hy.make_vars("px", "py", "pz", "x", "y", "z")

rps = hy.sqrt((x-mu)**2 + (y**2+z**2))
rpj = hy.sqrt((x-mu+1)**2 + (y**2+z**2))

Ham_sym = .5*(px**2+py**2+pz**2) + y*px - x*py - (1-mu)/rps - mu/rpj

Then we compile Ham_sym:

Ham_cf = hy.cfunc([Ham_sym], vars=[px,py,pz,x,y,z])

heyoka.py’s compiled functions support multithreaded parallelisation for batched evaluations. However, for this simple test, we will be disabling multithreaded parallelisation:

# Disable parallel batched evaluations by
# setting the number of threads in use by
# heyoka.py to 1.
hy.set_nthreads(1)

We can now time Ham_cf:

%timeit Ham_cf(inputs)
213 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We can see how heyoka.py’s compiled function is about 10 times faster than the NumPy based implementation. We can also appreciate the effect of providing an externally-allocated outputs array:

# Pre-allocate the outputs array.
outputs = np.empty((1, nevals))
%timeit Ham_cf(inputs,outputs=outputs)
179 ms ± 3.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The effect is not dramatic but measurable nevertheless.

JAX#

As a last benchmark, we will be performing the same evaluation with JAX. Similarly to heyoka.py, JAX offers the possibility to JIT compile Python functions, so we expect similar performance to heyoka.py. Note that, in order to perform a fair comparison, for the execution of this notebook we enabled 64-bit floats in JAX and we used JAX’s CPU backend forcing a single thread of execution.

Let us see the jax code:

import jax
import jax.numpy as jnp

# Turn inputs into a JAX array of float64.
jinputs = jnp.array(inputs, dtype=jnp.float64)

# Fetch the function arguments from jinputs.
jpx_arr = jinputs[0,:]
jpy_arr = jinputs[1,:]
jpz_arr = jinputs[2,:]
jx_arr = jinputs[3,:]
jy_arr = jinputs[4,:]
jz_arr = jinputs[5,:]

# The function for the evaluation of the Hamiltonian.
def Ham_jnp(jpx,jpy,jpz,jx,jy,jz):
    rps = jnp.sqrt((jx-mu)**2 + (jy**2+jz**2))
    rpj = jnp.sqrt((jx-mu+1)**2 + (jy**2+jz**2))
    
    return .5*(jpx**2+jpy**2+jpz**2) + jy*jpx - jx*jpy - (1-mu)/rps - mu/rpj

# Compile it.
Ham_jnp_jit = jax.jit(Ham_jnp)

# Warm up.
Ham_jnp_jit(jpx_arr,jpy_arr,jpz_arr,jx_arr,jy_arr,jz_arr).block_until_ready();

We are now ready to run the benchmark:

%timeit Ham_jnp_jit(jpx_arr,jpy_arr,jpz_arr,jx_arr,jy_arr,jz_arr).block_until_ready()
313 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We can indeed see how JAX’s performance is similar to heyoka.py, although heyoka.py retains a performance edge.