Network Deployment

TensorRT

TensorRT is a library developed by NVIDIA for faster inference on NVIDIA graphics processing units (GPUs), built on CUDA. It can only be used with NVIDIA GPUs.

Is TensorRT individual-GPU specific, or architecture specific?

For example, if I built model weights for an NVIDIA Jetson Orin Nano and you also have a Jetson Orin Nano, do you need to rebuild the model from the ONNX weights, or can I just give you the model? Why or why not?

  • It is architecture-specific: the optimized engine is tuned to a GPU architecture (and TensorRT version), not to an individual card. An engine built on one Jetson Orin Nano can therefore normally be reused on another Jetson Orin Nano running the same TensorRT/JetPack release, without rebuilding from the ONNX weights; see the loading sketch below.
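
As a concrete illustration, here is a minimal sketch of loading a prebuilt, serialized engine with the TensorRT Python API (the file name model.engine is a placeholder). Deserialization is expected to succeed when the target GPU architecture and TensorRT version match those the engine was built with, and to fail otherwise:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)

# "model.engine" is a hypothetical prebuilt engine file received from another
# machine with the same GPU architecture and TensorRT version.
with open("model.engine", "rb") as f:
    engine_bytes = f.read()

runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(engine_bytes)

if engine is None:
    # Typical failure mode when the engine was built for a different GPU
    # architecture or with a different TensorRT version.
    print("Engine is not compatible with this GPU / TensorRT installation")
else:
    print("Engine loaded; ready to create an execution context")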

Installation (apt package)

sudo apt update
sudo apt install nvidia-tensorrt
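
A quick sanity check is to import the Python bindings and print the version (this assumes the TensorRT Python bindings are installed alongside the libraries, e.g. via the python3-libnvinfer package on JetPack):

import tensorrt as trt

# Prints the TensorRT version the Python bindings report, e.g. an 8.x
# release on current JetPack images.
print(trt.__version__)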

What does the optimization process look like?

The optimization process involves several key steps (a build example follows the list):

  1. Precision Calibration: Converts model weights and activations to lower-precision formats (e.g., FP32 to FP16 or INT8) to accelerate computation.
  2. Layer and Tensor Fusion: Combines multiple layers and operations into a single operation to reduce memory accesses and improve execution speed.
  3. Kernel Auto-Tuning: Selects the most efficient algorithms and kernels for the specific GPU architecture.
  4. Dynamic Tensor Memory Management: Optimizes memory usage for the model's intermediate tensors to reduce the memory footprint and increase throughput.
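
To make these steps concrete, here is a minimal build sketch using the TensorRT Python API (TensorRT 8.x-style calls; model.onnx and the 1 GiB workspace size are placeholder choices, not values from these notes). Reduced precision is requested through builder flags, while layer/tensor fusion, kernel auto-tuning, and memory planning are carried out by TensorRT itself during the build:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)

# Parse the ONNX model into a TensorRT network definition.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()

# Step 4: cap the scratch memory TensorRT may use for intermediate tensors
# and tactic selection (1 GiB here, an arbitrary example value).
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# Step 1: allow reduced precision where the hardware supports it.
# INT8 would additionally require a calibrator (omitted here).
if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)

# Steps 2 and 3 (layer/tensor fusion and kernel auto-tuning) happen inside
# this call, which optimizes the network for the GPU it runs on.
engine_bytes = builder.build_serialized_network(network, config)
if engine_bytes is None:
    raise RuntimeError("engine build failed")
with open("model.engine", "wb") as f:
    f.write(engine_bytes)

A roughly equivalent command-line route is the trtexec tool that ships with TensorRT (e.g. trtexec --onnx=model.onnx --saveEngine=model.engine --fp16).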