Inference Engines

An inference engine is a shared object (library) that executes a machine learning model on a particular platform. The Celantur SDK supports multiple inference engines, each with its own strengths and weaknesses, and the choice of engine can significantly impact the performance and efficiency of model execution. Since each inference engine has a different set of prerequisites, they are implemented as plugins to the SDK that can be loaded at runtime. This gives the end user the most flexibility, but translates into a slightly more complex setup.

For each inference engine, two sets of parameters can be set: compilation parameters and runtime parameters. The sections below cover each inference engine and describe the parameters it accepts.

Currently, the following inference engines are supported:

  • ONNX Runtime: A cross-platform, high-performance scoring engine for Open Neural Network Exchange (ONNX) models. It is optimized for both CPU and GPU execution and supports a wide range of hardware and software platforms. It is not the fastest option, but it has the easiest setup and is the most portable.
  • NVIDIA TensorRT: A high-performance deep learning inference optimizer and runtime library for NVIDIA GPUs. It is designed to deliver low latency and high throughput for deep learning applications. It is recommended for use cases where the target platform has an NVIDIA GPU.
  • Intel OpenVINO: A toolkit for optimizing and deploying AI inference on CPU (it is capable of more, but we currently use it only to get the best CPU performance). It is designed to deliver high performance and low latency for deep learning applications.
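
For orientation only, the sketch below shows what loading a shared-object plugin at runtime generally looks like on Linux, using the standard dlopen API. This is not the Celantur SDK's actual loading interface, and the library file name is a placeholder rather than a real SDK artifact.

    #include <dlfcn.h>
    #include <iostream>

    int main() {
        // Placeholder library name; the real plugin file ships with the SDK.
        void* handle = dlopen("libcelantur_onnx_plugin.so", RTLD_NOW);
        if (!handle) {
            std::cerr << "Could not load plugin: " << dlerror() << "\n";
            return 1;
        }
        // ... resolve the plugin's entry points with dlsym() and create the engine ...
        dlclose(handle);
        return 0;
    }

Shipping the engines as separate shared objects is what lets each engine keep its own prerequisites without forcing them on every installation.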

ONNX

ONNX Runtime is the only supported inference engine that does not require model compilation: the default model received from Celantur can be used as-is.

Compilation parameters

NONE

Runtime parameters

Key                    Value type
"n_intra_threads"      int
"n_outer_threads"      int
"optimisation_level"   celantur::OptimisationLevel
"log_severity"         celantur::LogSeverity

The n_intra_threads and n_outer_threads parameters control the threading model of ONNX Runtime: n_intra_threads sets the number of threads used to parallelize operations within a single inference request, while n_outer_threads sets the number of threads used to parallelize multiple inference requests. The optimal values depend on the specific hardware and workload, and may require some experimentation to determine.

The optimisation_level controls the level of optimization applied to the model during inference. Higher levels can improve performance, but may also increase memory usage and latency. The best setting depends on the specific model and hardware. In our experiments, this parameter made little practical difference.

The log_severity controls the verbosity of the logging output from ONNX Runtime. A more verbose setting produces more detailed logging, which can be useful for debugging and troubleshooting.
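
As a rough illustration, here is how the documented runtime parameters could be collected as key/value pairs before being handed to the ONNX Runtime plugin. The ParamMap alias is an assumption made for this sketch; the SDK's actual parameter container, and the call that consumes it, are not shown here.

    #include <any>
    #include <map>
    #include <string>

    // Hypothetical parameter container for this sketch; the SDK's actual type may differ.
    using ParamMap = std::map<std::string, std::any>;

    int main() {
        ParamMap ort_runtime_params;
        // Keys and value types follow the runtime-parameter table above.
        ort_runtime_params["n_intra_threads"] = 4;  // threads within a single inference request
        ort_runtime_params["n_outer_threads"] = 2;  // threads across concurrent inference requests
        // "optimisation_level" (celantur::OptimisationLevel) and "log_severity"
        // (celantur::LogSeverity) take SDK enums whose values are not listed in this
        // document, so they are left at their defaults here.
        return 0;
    }

A reasonable starting point is to benchmark a few combinations of n_intra_threads and n_outer_threads on the target machine, since the best split depends on the hardware and workload.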

TensorRT

TensorRT runs on NVIDIA GPUs and requires model compilation before it can be used for inference. The compilation process optimizes the model for the specific hardware and software configuration of the target platform.

Compilation parameters

Key                    Value type
"precision"            celantur::CompilePrecision
"optimisation_level"   celantur::OptimisationLevel
"min_opt_max_dims"     celantur::MinOptMaxDims

The precision parameter controls the numerical precision used for computations during inference. With the regular Celantur model, only CompilePrecision::FP32 and CompilePrecision::FP16 are supported. We are working on enabling CompilePrecision::INT8, which can significantly improve performance on supported hardware.

The optimisation_level parameter controls the level of optimization applied to the model during compilation. For deployment, one probably wants to set the level to OptimisationLevel::Full.

Work in progress: The min_opt_max_dims parameter specifies the minimum, optimal, and maximum input dimensions for the model. This is for dynamic model support and is not yet production ready.
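
As a sketch of how the compilation parameters fit together, the snippet below fills a key/value map with the two production-ready parameters. The ParamMap alias and the stand-in enum declarations exist only so the example is self-contained; in a real project the celantur enums come from the SDK headers, and the call that triggers the compilation is not shown.

    #include <any>
    #include <map>
    #include <string>

    // Stand-in declarations so this sketch compiles on its own; in a real project
    // these enums are provided by the Celantur SDK headers.
    namespace celantur {
        enum class CompilePrecision { FP32, FP16 };        // INT8 support is work in progress
        enum class OptimisationLevel { Full /*, ... */ };  // further levels exist in the SDK
    }

    // Hypothetical parameter container for this sketch; the SDK's actual type may differ.
    using ParamMap = std::map<std::string, std::any>;

    int main() {
        ParamMap trt_compile_params;
        // Keys and value types follow the compilation-parameter table above.
        trt_compile_params["precision"]          = celantur::CompilePrecision::FP16;  // usually faster than FP32 on NVIDIA GPUs
        trt_compile_params["optimisation_level"] = celantur::OptimisationLevel::Full; // recommended for deployment
        // "min_opt_max_dims" (dynamic input shapes) is work in progress and omitted here.
        return 0;
    }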

Runtime parameters

NONE

OpenVINO

OpenVINO runs on CPUs and requires model compilation before it can be used for inference. Note that when running models on the CPU, OpenVINO might require the number of threads to be set manually, as it does not always detect the optimal thread count automatically on AMD machines.

Compilation parameters

Key              Value type
"num_threads"    std::optional<int>
"device_name"    std::string

The num_threads parameter specifies the number of threads to be used for inference. It is useful either to limit the number of threads or to make OpenVINO behave correctly on AMD CPUs. Celantur software tries to determine the optimal number of threads automatically, but if that fails, this parameter can be set manually. To let OpenVINO decide, leave this parameter as std::nullopt.

Work in progress: The device_name parameter specifies the target device for inference. Currently, only CPU is supported, so leave it at the default "CPU".
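
The sketch below shows both documented compilation parameters in the same key/value style. As above, the ParamMap alias is an assumption made for this example, and the SDK call that consumes the map is omitted.

    #include <any>
    #include <map>
    #include <optional>
    #include <string>

    // Hypothetical parameter container for this sketch; the SDK's actual type may differ.
    using ParamMap = std::map<std::string, std::any>;

    int main() {
        ParamMap ov_compile_params;
        // Keys and value types follow the compilation-parameter table above.
        // Pin the thread count explicitly, e.g. on an AMD CPU where auto-detection fails:
        ov_compile_params["num_threads"] = std::optional<int>(16);
        // Or let OpenVINO decide by passing std::optional<int>(std::nullopt) instead.
        ov_compile_params["device_name"] = std::string("CPU");  // only "CPU" is supported for now
        return 0;
    }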

Runtime parameters

NONE