Jetson-inference
To walk through the requirements for running a vAccel-enabled workload with the jetson-inference plugin, we will use a number of NVIDIA GPU-enabled systems (an RTX 2060 SUPER, a Jetson Nano and a Xavier AGX) and a common distribution like Ubuntu.
The jetson-inference vAccel plugin is based on jetson-inference, a TensorRT frontend developed by NVIDIA. This intro section serves as a guide to installing TensorRT, CUDA and jetson-inference on an Ubuntu 20.04 system.
Install prerequisites
The first step is to prepare the system for building and installing jetson-inference.
Install common tools
apt update && TZ=Etc/UTC apt install -y build-essential \
git \
cmake \
python3 \
python3-venv \
libpython3-dev \
python3-numpy \
gcc-8 \
g++-8 \
lsb-release \
wget \
software-properties-common
Set gcc-8 as the default compiler
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 8 && \
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 8 && \
update-alternatives --set gcc /usr/bin/gcc-8
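Optionally, verify that gcc-8 is now the default compiler:
gcc --version    # should report gcc 8.x
g++ --version    # should report g++ 8.x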
Set up NVIDIA's custom repositories
export OS=ubuntu2004
wget http://developer.download.nvidia.com/compute/machine-learning/repos/${OS}/x86_64/nvidia-machine-learning-repo-${OS}_1.0.0-1_amd64.deb
dpkg -i nvidia-machine-learning-repo-${OS}_1.0.0-1_amd64.deb
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/machine-learning/repos/${OS}/x86_64/7fa2af80.pub
wget https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/cuda-${OS}.pin
mv cuda-${OS}.pin /etc/apt/preferences.d/cuda-repository-pin-600
apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/3bf863cc.pub
add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/ /"
Install NVIDIA CUDA, cuDNN and TensorRT
apt-get update
apt-get install -y libcudnn8 libcudnn8-dev tensorrt nvidia-cuda-toolkit
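Optionally, a quick sanity check that the toolkit and libraries are in place (nvidia-smi assumes the NVIDIA driver is already installed on the host):
nvcc --version                              # CUDA compiler from nvidia-cuda-toolkit
nvidia-smi                                  # driver / GPU visibility
dpkg -l | grep -E 'libcudnn8|tensorrt'      # cuDNN and TensorRT packages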
Prepare for building jetson-inference
We clone jetson-inference:
git clone --recursive https://github.com/nubificus/jetson-inference
cd jetson-inference
Build & install jetson-inference
We create a build dir, enter it and prepare the Makefiles:
mkdir build
cd build
BUILD_DEPS=YES cmake -DBUILD_INTERACTIVE=NO ../
Finally, we issue the build command and install jetson-inference to our system:
make -j$(nproc)
make install
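After installing, it is usually worth refreshing the dynamic linker cache and checking that the jetson-inference binaries ended up on the PATH (a quick sanity check; the exact install prefix may differ on your system):
ldconfig
which imagenet-console    # typically /usr/local/bin/imagenet-console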
Note: For aarch64 hosts, this process is slightly different (one would say easier!), as it assumes the host system is an L4T distro. Just clone the repo and follow the Build & install step.
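For reference, a minimal sketch of the aarch64/L4T flow (CUDA, cuDNN and TensorRT already ship with JetPack, so the repository setup above is not needed):
git clone --recursive https://github.com/nubificus/jetson-inference
cd jetson-inference
mkdir build && cd build
BUILD_DEPS=YES cmake -DBUILD_INTERACTIVE=NO ../
make -j$(nproc)
make install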
Build a jetson-inference container image
We use a container file to capture the individual steps to install jetson-inference. Assuming the host is Debian-based (we tried this on Ubuntu 20.04) and has a recent NVIDIA driver (520.61.05), we follow the steps below:
- clone the container repo:
git clone https://github.com/nubificus/jetson-inference-container
- build the container image:
docker build -t nubificus/jetson-inference-updated:x86_64 -f Dockerfile .
or just get the one we've built (this could take some time, it's 12GB...):
docker pull nubificus/jetson-inference-updated:x86_64
- run the jetson-inference example:
Run the container:
# docker run --gpus all --rm -it -v/data/code:/data/code -w $PWD nubificus/jetson-inference-updated:x86_64 /bin/bash
root@9f5224cb28cc:/data/code#
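Optionally, check that the GPU is visible from inside the container (nvidia-smi should be available when running with --gpus all, assuming the NVIDIA container toolkit is configured on the host):
root@9f5224cb28cc:/data/code# nvidia-smi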
Use pre-installed example images and models to do image inference:
root@9f5224cb28cc:/data/code# ln -s /usr/local/data/images .
root@9f5224cb28cc:/data/code# ln -s /usr/local/data/networks .
root@9f5224cb28cc:/data/code# imagenet-console images/dog_0.jpg
[video] created imageLoader from file:///data/code/images/dog_0.jpg
------------------------------------------------
imageLoader video options:
------------------------------------------------
-- URI: file:///data/code/images/dog_0.jpg
- protocol: file
- location: images/dog_0.jpg
- extension: jpg
-- deviceType: file
-- ioType: input
-- codec: unknown
-- width: 0
-- height: 0
-- frameRate: 0.000000
-- bitRate: 0
-- numBuffers: 4
-- zeroCopy: true
-- flipMethod: none
-- loop: 0
-- rtspLatency 2000
------------------------------------------------
[video] videoOptions -- failed to parse output resource URI (null)
[video] videoOutput -- failed to parse command line options
imagenet: failed to create output stream
imageNet -- loading classification network model from:
-- prototxt networks/googlenet.prototxt
-- model networks/bvlc_googlenet.caffemodel
-- class_labels networks/ilsvrc12_synset_words.txt
-- input_blob 'data'
-- output_blob 'prob'
-- batch_size 1
[TRT] TensorRT version 8.5.1
[TRT] loading NVIDIA plugins...
[TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[TRT] Registered plugin creator - ::Clip_TRT version 1
[TRT] Registered plugin creator - ::CoordConvAC version 1
[TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[TRT] Registered plugin creator - ::CropAndResize version 1
[TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[TRT] Could not register plugin creator - ::FlattenConcat_TRT version 1
[TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[TRT] Registered plugin creator - ::LReLU_TRT version 1
[TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[TRT] Registered plugin creator - ::NMS_TRT version 1
[TRT] Registered plugin creator - ::Normalize_TRT version 1
[TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[TRT] Registered plugin creator - ::PriorBox_TRT version 1
[TRT] Registered plugin creator - ::ProposalDynamic version 1
[TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[TRT] Registered plugin creator - ::Proposal version 1
[TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[TRT] Registered plugin creator - ::Region_TRT version 1
[TRT] Registered plugin creator - ::Reorg_TRT version 1
[TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[TRT] Registered plugin creator - ::RPROI_TRT version 1
[TRT] Registered plugin creator - ::ScatterND version 1
[TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[TRT] Registered plugin creator - ::Split version 1
[TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[TRT] detected model format - caffe (extension '.caffemodel')
[TRT] desired precision specified for GPU: FASTEST
[TRT] requested fasted precision for device GPU without providing valid calibrator, disabling INT8
[TRT] [MemUsageChange] Init CUDA: CPU +298, GPU +0, now: CPU 321, GPU 223 (MiB)
[TRT] Trying to load shared library libnvinfer_builder_resource.so.8.5.1
[TRT] Loaded shared library libnvinfer_builder_resource.so.8.5.1
[TRT] [MemUsageChange] Init builder kernel library: CPU +262, GPU +76, now: CPU 637, GPU 299 (MiB)
[TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[TRT] native precisions detected for GPU: FP32, FP16, INT8
[TRT] selecting fastest native precision for GPU: FP16
[TRT] attempting to open engine cache file networks/bvlc_googlenet.caffemodel.1.1.8501.GPU.FP16.engine
[TRT] loading network plan from engine cache... networks/bvlc_googlenet.caffemodel.1.1.8501.GPU.FP16.engine
[TRT] device GPU, loaded networks/bvlc_googlenet.caffemodel
[TRT] Loaded engine size: 15 MiB
[TRT] Trying to load shared library libcudnn.so.8
[TRT] Loaded shared library libcudnn.so.8
[TRT] Using cuDNN as plugin tactic source
[TRT] Using cuDNN as core library tactic source
[TRT] [MemUsageChange] Init cuDNN: CPU +576, GPU +236, now: CPU 978, GPU 477 (MiB)
[TRT] Deserialization required 506628 microseconds.
[TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +13, now: CPU 0, GPU 13 (MiB)
[TRT] Trying to load shared library libcudnn.so.8
[TRT] Loaded shared library libcudnn.so.8
[TRT] Using cuDNN as plugin tactic source
[TRT] Using cuDNN as core library tactic source
[TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 979, GPU 477 (MiB)
[TRT] Total per-runner device persistent memory is 94720
[TRT] Total per-runner host persistent memory is 147808
[TRT] Allocated activation device memory of size 3612672
[TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +3, now: CPU 0, GPU 16 (MiB)
[TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[TRT]
[TRT] CUDA engine context initialized on device GPU:
[TRT] -- layers 75
[TRT] -- maxBatchSize 1
[TRT] -- deviceMemory 3612672
[TRT] -- bindings 2
[TRT] binding 0
-- index 0
-- name 'data'
-- type FP32
-- in/out INPUT
-- # dims 3
-- dim #0 3
-- dim #1 224
-- dim #2 224
[TRT] binding 1
-- index 1
-- name 'prob'
-- type FP32
-- in/out OUTPUT
-- # dims 3
-- dim #0 1000
-- dim #1 1
-- dim #2 1
[TRT]
[TRT] binding to input 0 data binding index: 0
[TRT] binding to input 0 data dims (b=1 c=3 h=224 w=224) size=602112
[TRT] binding to output 0 prob binding index: 1
[TRT] binding to output 0 prob dims (b=1 c=1000 h=1 w=1) size=4000
[TRT]
[TRT] device GPU, networks/bvlc_googlenet.caffemodel initialized.
[TRT] imageNet -- loaded 1000 class info entries
[TRT] imageNet -- networks/bvlc_googlenet.caffemodel initialized.
[image] loaded 'images/dog_0.jpg' (500x375, 3 channels)
class 0248 - 0.229980 (Eskimo dog, husky)
class 0249 - 0.605469 (malamute, malemute, Alaskan malamute)
class 0250 - 0.160400 (Siberian husky)
imagenet: 60.54688% class #249 (malamute, malemute, Alaskan malamute)
[TRT] ------------------------------------------------
[TRT] Timing Report networks/bvlc_googlenet.caffemodel
[TRT] ------------------------------------------------
[TRT] Pre-Process CPU 0.02498ms CUDA 0.21392ms
[TRT] Network CPU 1.04583ms CUDA 0.86154ms
[TRT] Post-Process CPU 0.02265ms CUDA 0.02288ms
[TRT] Total CPU 1.09346ms CUDA 1.09834ms
[TRT] ------------------------------------------------
[TRT] note -- when processing a single image, run 'sudo jetson_clocks' before
to disable DVFS for more accurate profiling/timing measurements
[image] imageLoader -- End of Stream (EOS) has been reached, stream has been closed
imagenet: shutting down...
imagenet: shutdown complete.
Note: The first time the example runs, TensorRT needs to autotune the engine, so it will take some time and produce output similar to the one below:
...
[TRT] Tactic: 0x89c2d153627e52ba Time: 0.0134678
[TRT] loss3/classifier Set Tactic Name: volta_h884cudnn_256x128_ldg8_relu_exp_small_nhwc_tn_v1 Tactic: 0xc110e19c9f5aa36e
[TRT] Tactic: 0xc110e19c9f5aa36e Time: 0.0752844
[TRT] loss3/classifier Set Tactic Name: turing_h1688cudnn_256x128_ldg8_relu_exp_medium_nhwc_tn_v1 Tactic: 0xdc1c841ef1cd3e8e
[TRT] Tactic: 0xdc1c841ef1cd3e8e Time: 0.0422516
[TRT] loss3/classifier Set Tactic Name: sm70_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x64x32_stage1_warpsize2x2x1_g1_tensor8x8x4_t1r1s1 Tactic: 0x4c17dc9d992e6a1d
[TRT] Tactic: 0x4c17dc9d992e6a1d Time: 0.0231476
[TRT] loss3/classifier Set Tactic Name: sm75_xmma_fprop_implicit_gemm_f16f16_f16f16_f16_nhwckrsc_nhwc_tilesize128x128x64_stage1_warpsize2x2x1_g1_tensor16x8x8_t1r1s1 Tactic: 0xc399fdbffdc34032
[TRT] Tactic: 0xc399fdbffdc34032 Time: 0.0262451
[TRT] loss3/classifier Set Tactic Name: turing_h1688cudnn_256x64_sliced1x2_ldg8_relu_exp_interior_nhwc_tn_v1 Tactic: 0x105f56cf03ee5549
[TRT] Tactic: 0x105f56cf03ee5549 Time: 0.0241811
[TRT] Fastest Tactic: 0x017a89ce2d82b850 Time: 0.00878863
[TRT] >>>>>>>>>>>>>>> Chose Runner Type: CaskGemmConvolution Tactic: 0x0000000000020318
[TRT] =============== Computing costs for
[TRT] *************** Autotuning format combination: Float(1000,1,1,1) -> Float(1000,1,1,1) ***************
[TRT] --------------- Timing Runner: prob (CudaSoftMax)
[TRT] Tactic: 0x00000000000003ea Time: 0.00364102
[TRT] Fastest Tactic: 0x00000000000003ea Time: 0.00364102
[TRT] --------------- Timing Runner: prob (CaskSoftMax)
[TRT] CaskSoftMax has no valid tactics for this config, skipping
[TRT] >>>>>>>>>>>>>>> Chose Runner Type: CudaSoftMax Tactic: 0x00000000000003ea
[TRT] *************** Autotuning format combination: Half(1000,1,1,1) -> Half(1000,1,1,1) ***************
[TRT] --------------- Timing Runner: prob (CudaSoftMax)
[TRT] Tactic: 0x00000000000003ea Time: 0.00370118
[TRT] Fastest Tactic: 0x00000000000003ea Time: 0.00370118
[TRT] --------------- Timing Runner: prob (CaskSoftMax)
[TRT] CaskSoftMax has no valid tactics for this config, skipping
[TRT] >>>>>>>>>>>>>>> Chose Runner Type: CudaSoftMax Tactic: 0x00000000000003ea
[TRT] *************** Autotuning format combination: Half(500,1:2,1,1) -> Half(500,1:2,1,1) ***************
[TRT] --------------- Timing Runner: prob (CudaSoftMax)
[TRT] Tactic: 0x0000000000000012 Time: 0.00347464
[TRT] Fastest Tactic: 0x0000000000000012 Time: 0.00347464
[TRT] >>>>>>>>>>>>>>> Chose Runner Type: CudaSoftMax Tactic: 0x0000000000000012
[TRT] Adding reformat layer: Reformatted Input Tensor 0 to conv1/7x7_s2 + conv1/relu_7x7 (data) from Float(150528,50176,224,1) to Half(100352,50176:2,224,1)
[TRT] Adding reformat layer: Reformatted Input Tensor 0 to conv2/3x3_reduce + conv2/relu_3x3_reduce (pool1/norm1) from Half(100352,3136:2,56,1) to Half(25088,1:8,448,8)
[TRT] Adding reformat layer: Reformatted Input Tensor 0 to conv2/norm2 (conv2/3x3) from Half(75264,1:8,1344,24) to Half(301056,3136:2,56,1)
[TRT] Adding reformat layer: Reformatted Output Tensor 0 to pool2/3x3_s2 (pool2/3x3_s2) from Half(75264,784:2,28,1) to Half(18816,1:8,672,24)
[TRT] Adding reformat layer: Reformatted Output Tensor 0 to inception_4a/1x1 + inception_4a/relu_1x1 || inception_4a/3x3_reduce + inception_4a/relu_3x3_reduce || inception_4a/5x5_reduce + inception_4a/relu_5x5_reduce (inception_4a/1x1 + inception_4a/relu_1x1 || inception_4a/3x3_reduce + inception_4a/relu_3x3_reduce || inception_4a/5x5_reduce + inception_4a/relu_5x5_reduce) from Half(7448,1:8,532,38) to Half(29792,196:2,14,1)
...
Note: If you want to avoid that every time you run the container, keep the networks folder outside the container and bind mount it (e.g. in the /data/code path). That is, instead of ln -s /usr/local/data/networks ., use cp -avf /usr/local/data/networks . so the models are actually copied into the bind-mounted directory. Thus, every time you re-run the example using this folder, the auto-tuned engine will already be there.
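A minimal sketch of that workflow, assuming /data/code on the host is the bind-mounted directory used above:
# one-off, inside the container: copy (instead of symlinking) the models into the bind mount
cp -avf /usr/local/data/networks /data/code/
# subsequent runs pick up the auto-tuned engine cached under /data/code/networks
docker run --gpus all --rm -it -v/data/code:/data/code -w /data/code \
    nubificus/jetson-inference-updated:x86_64 /bin/bash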