JIT compilation and caching

Added in version 2.0.0.

heyoka.py makes extensive use of just-in-time (JIT) compilation techniques, implemented via the LLVM compiler infrastructure. JIT compilation is used not only in the implementation of the adaptive integrator, but also in compiled functions and in the implementation of dense/continuous output.

JIT compilation can provide a noticeable performance boost with respect to the usual ahead-of-time (AOT) compilation, because it takes advantage of all the features available on the target CPU. The downside is that JIT compilation is computationally expensive, and thus in some cases the compilation overhead can end up dominating the total runtime of the program.

Starting from version 2.0.0, heyoka.py implements an in-memory cache that alleviates the JIT compilation overhead by avoiding re-compilation of code that has already been compiled during the program execution.

Let us see the cache in action. We start off by timing the construction of an adaptive integrator:

import heyoka as hy

%time ta = hy.taylor_adaptive(hy.model.pendulum(), [0., 1.])
CPU times: user 52.9 ms, sys: 3 ms, total: 55.9 ms
Wall time: 56 ms

Now we construct the same integrator a second time, again timing the call:

%time ta = hy.taylor_adaptive(hy.model.pendulum(), [0., 1.])
CPU times: user 2.73 ms, sys: 0 ns, total: 2.73 ms
Wall time: 2.44 ms

The construction time dropped from about 56 ms to about 2.4 ms of wall time, because heyoka.py reused the cached result of compiling the first integrator.
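The `%time` magic is only available in IPython and Jupyter. In a plain Python script, a comparable wall-time measurement can be sketched with the standard library. The `time_call` helper below is a hypothetical name introduced for illustration and is not part of heyoka.py; the demonstration uses a trivial stand-in function so that the snippet runs on its own.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed wall time in seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload, used here only to show how the helper is called.
result, elapsed = time_call(sum, range(1000))
print(f"result={result}, wall time: {elapsed * 1e3:.3f} ms")
```

Calling `time_call(hy.taylor_adaptive, hy.model.pendulum(), [0., 1.])` twice in the same process should show the same slow-then-fast pattern as above, although the exact figures depend on the machine.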

Let us see another example, this time involving continuous output. We propagate the system for a very short timespan, and we ask for the continuous output function object via the c_output=True flag:

%time ta.propagate_until(0.01, c_output=True)
CPU times: user 11.6 ms, sys: 0 ns, total: 11.6 ms
Wall time: 11.3 ms
(<taylor_outcome.time_limit: -4294967299>,
 inf,
 0.0,
 1,
 C++ datatype: double
 Direction   : forward
 Time range  : [0, 0.01)
 N of steps  : 1,
 None)

Even such a short integration took about 11 ms of wall time. Most of that time was spent compiling the function that evaluates the continuous output, rather than in the numerical integration itself.

Let us now repeat the same computation:

# Reset time and state.
ta.time = 0.
ta.state[:] = [0., 1.]

%time ta.propagate_until(0.01, c_output=True)
CPU times: user 1.04 ms, sys: 5 μs, total: 1.04 ms
Wall time: 909 μs
(<taylor_outcome.time_limit: -4294967299>,
 inf,
 0.0,
 1,
 C++ datatype: double
 Direction   : forward
 Time range  : [0, 0.01)
 N of steps  : 1,
 None)

The runtime again dropped sharply, from about 11 ms to under 1 ms, because the code for evaluating the continuous output had already been compiled earlier.

Functions to query and interact with the cache are available as static methods and attributes of the llvm_state class. For instance, we can fetch the current cache size:

f"Current cache size: {hy.llvm_state.memcache_size} bytes"
'Current cache size: 130298 bytes'

By default, the maximum cache size is set to 2 GiB:

f"Current cache limit: {hy.llvm_state.memcache_limit} bytes"
'Current cache limit: 2147483648 bytes'

If the cache size exceeds the limit, items in the cache are removed following a least-recently-used (LRU) policy. The cache limit can be changed at will:

# Set the maximum cache size to 1 MiB.
hy.llvm_state.memcache_limit = 1024*1024

f"New cache limit: {hy.llvm_state.memcache_limit} bytes"
'New cache limit: 1048576 bytes'
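To make the LRU eviction policy concrete, here is a toy size-bounded cache in pure Python. It is only a sketch of the general technique and makes no claim about how heyoka.py implements its cache internally; the class name `SizeBoundedLRU` and its methods are hypothetical.

```python
from collections import OrderedDict


class SizeBoundedLRU:
    """Toy LRU cache bounded by total payload size in bytes (illustrative only)."""

    def __init__(self, limit):
        self.limit = limit           # maximum total size in bytes
        self.size = 0                # current total size in bytes
        self._items = OrderedDict()  # key -> bytes, least recently used first

    def get(self, key):
        blob = self._items.get(key)
        if blob is not None:
            # A hit makes the entry the most recently used one.
            self._items.move_to_end(key)
        return blob

    def put(self, key, blob):
        # Replace any existing entry so that the size accounting stays exact.
        if key in self._items:
            self.size -= len(self._items.pop(key))
        self._items[key] = blob
        self.size += len(blob)
        # Evict least recently used entries until the total fits the limit.
        while self.size > self.limit and self._items:
            _, evicted = self._items.popitem(last=False)
            self.size -= len(evicted)


cache = SizeBoundedLRU(limit=10)
cache.put("a", b"xxxx")  # total size: 4
cache.put("b", b"yyyy")  # total size: 8
cache.get("a")           # "a" becomes the most recently used entry
cache.put("c", b"zzzz")  # total size would be 12, so "b" is evicted
```

After the last call, `"b"` is gone while `"a"` survives, because the lookup of `"a"` made `"b"` the least recently used entry.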

The cache can be cleared:

# Clear the cache.
hy.llvm_state.clear_memcache()

f"Current cache size: {hy.llvm_state.memcache_size} bytes"
'Current cache size: 0 bytes'

All the methods and attributes to query and interact with the cache are thread-safe.

Note that in multi-processing scenarios (e.g., in process-based ensemble propagations) each process gets its own cache, and thus any custom cache setup (e.g., changing the default cache limit) needs to be performed in each and every process.
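One way to apply the same setup in every worker is a pool initializer. The sketch below uses the standard library's `ProcessPoolExecutor` as a stand-in for whatever process-based machinery is in use, and it is not heyoka.py's own ensemble API. The names `CACHE_LIMIT_BYTES`, `init_worker`, `worker_task` and `run` are hypothetical; only `llvm_state.memcache_limit`, `taylor_adaptive` and `model.pendulum` come from the text above.

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical limit chosen for illustration (1 MiB, as in the example above).
CACHE_LIMIT_BYTES = 1024 * 1024


def init_worker(limit_bytes):
    # Runs once in each worker process at startup, so the setting is applied
    # to that worker's own cache rather than to the parent's.
    import heyoka as hy

    hy.llvm_state.memcache_limit = limit_bytes


def worker_task(_):
    # Hypothetical task: build an integrator inside the worker and report
    # the cache limit seen by that process.
    import heyoka as hy

    hy.taylor_adaptive(hy.model.pendulum(), [0.0, 1.0])
    return hy.llvm_state.memcache_limit


def run(n_workers=2, n_tasks=4):
    # Every worker calls init_worker(CACHE_LIMIT_BYTES) before taking tasks.
    with ProcessPoolExecutor(
        max_workers=n_workers,
        initializer=init_worker,
        initargs=(CACHE_LIMIT_BYTES,),
    ) as pool:
        return list(pool.map(worker_task, range(n_tasks)))
```

On platforms that start workers with the spawn method, `run()` should be called from under an `if __name__ == "__main__":` guard. Each returned value is expected to equal the configured limit, since the initializer runs in every worker before any task.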