TensorRT
TensorRT is a library developed by NVIDIA, built on CUDA, for faster inference on NVIDIA graphics processing units (GPUs); it only runs on NVIDIA hardware.
Is TensorRT specific to an individual GPU, or to a GPU architecture?
For example, if I built a TensorRT engine on an NVIDIA Jetson Orin Nano and you also have a Jetson Orin Nano, do you need to rebuild the engine from the ONNX weights, or can I just give you the serialized engine? Why or why not?
- It is architecture-specific: the build step auto-tunes kernels for the exact GPU it runs on, so a serialized engine is tied to that GPU model and to the TensorRT version used to build it. Two Jetson Orin Nanos running the same TensorRT version can share the engine file; a different GPU model (or TensorRT version) requires rebuilding from the ONNX weights.
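A minimal sketch of the build-and-serialize flow, assuming the TensorRT Python bindings (TensorRT 8.x-style API); the file names `model.onnx` and `model.engine` are placeholders:

```python
# Minimal sketch: build a TensorRT engine from an ONNX file and serialize it.
# Assumes the TensorRT Python bindings (8.x-style API); file names are placeholders.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB

# The build step times candidate kernels on *this* GPU, which is why the
# resulting engine is tied to the GPU model and the TensorRT version.
serialized_engine = builder.build_serialized_network(network, config)

with open("model.engine", "wb") as f:
    f.write(serialized_engine)
```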
Installation (source)
What does the optimization process look like?
The optimization process involves several key steps:
- Precision Calibration: Converts model weights and activations to lower-precision formats (e.g., FP32 to FP16 or INT8) to accelerate computation; INT8 additionally requires calibration on representative inputs to preserve accuracy (see the sketch after this list)
- Layer and Tensor Fusion: Combines multiple layers and operations into a single operation to reduce memory access and improve execution speed
- Kernel Auto-Tuning: Selects the most efficient algorithms and kernels based on the specific GPU architecture
- Dynamic Tensor Memory Management: Optimizes memory usage for the model’s intermediate tensors to reduce the memory footprint and increase throughput
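As a rough illustration of the precision step, here is a hedged sketch using the TensorRT Python API; the function name `configure_precision` is made up, and the INT8 calibrator argument is assumed to be a user-provided `IInt8EntropyCalibrator2` subclass instance:

```python
import tensorrt as trt

def configure_precision(builder: trt.Builder,
                        config: trt.IBuilderConfig,
                        int8_calibrator=None) -> None:
    """Enable lower-precision execution where the GPU supports it.

    `int8_calibrator` is assumed to be an instance of a user-written
    IInt8EntropyCalibrator2 subclass that feeds representative input
    batches; without it, INT8 is left disabled.
    """
    # FP16 needs no calibration, only a flag (on GPUs with fast FP16 units).
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # INT8 needs a calibrator so TensorRT can measure activation ranges.
    if int8_calibrator is not None and builder.platform_has_fast_int8:
        config.set_flag(trt.BuilderFlag.INT8)
        config.int8_calibrator = int8_calibrator

    # Layer/tensor fusion and kernel auto-tuning happen automatically when the
    # engine is built; the workspace limit bounds the scratch memory the
    # auto-tuner may use while timing candidate kernels.
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)  # 1 GiB
```

In the build sketch above, this would be called after creating the builder config and before build_serialized_network; fusion, auto-tuning, and tensor memory management need no explicit configuration.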