Before you start, make sure that you have created an account with Genesis Cloud and finished the on-boarding steps (phone verification, adding SSH key, providing credit card). You can create an account here and get $15 in free credits. Furthermore, ensure that you have access to one NVIDIA RTX 3080 GPU instance (if not, request quota here). Important: this is the second part of our ML inference article series. Please make sure to check our tutorials’ article first if you haven’t done so.
NVIDIA TensorRT is an SDK for high-performance deep learning inference on NVIDIA GPU devices. It includes the inference engine and parsers for handling various input network specification formats. TensorRT provides application programmer interfaces (API) for C++ and Python. This article will present example programs using both these languages.
To deploy PyTorch models using TensorRT, we will export them in ONNX format. ONNX stands for Open Neural Network Exchange and is an open format built to represent deep learning models in a framework-agnostic way. TensorRT provides a specialized parser for importing ONNX models.
We assume that you will continue using the Genesis Cloud GPU-enabled instance that you created and configured while studying Article 1.
In particular, the following software must be installed and configured as described in that article:
pip
Various assets (source code, shell scripts, and data files) used in this article can be found in the supporting GitHub repository.
To run the examples described in this article, we recommend cloning the entire repository to your Genesis Cloud instance. The subdirectory art03 must be made your current directory.
The version of TensorRT must be compatible with the chosen versions of CUDA and cuDNN. For our choice of CUDA 11.3.1 and cuDNN 8.2.1 we will need TensorRT 8.0.3. (The actual support matrix for TensorRT 8.x is available here.)
To access TensorRT, you should register as a member of the NVIDIA Developer Program.
To download the TensorRT distribution, visit the official download site. Choose “TensorRT 8”, then agree to the “NVIDIA TensorRT License Agreement” and choose “TensorRT 8.0 GA Update 1” (“GA” stands for “General Availability”). Select and download “TensorRT 8.0.3 GA for Ubuntu 20.04 and CUDA 11.3 DEB local repo package”. You will get a DEB repo file; at the time of writing this article its name was:
nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831_1-1_amd64.deb
Place it in a scratch directory on your instance (we use ~/transit in this series of articles), then proceed with the installation by entering these commands:
sudo dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831_1-1_amd64.deb
sudo apt-key add /var/nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831/7fa2af80.pub
sudo apt-get update
sudo apt-get install tensorrt
Then install Python bindings for TensorRT API:
python3 -m pip install numpy
sudo apt-get install python3-libnvinfer-dev
Verify the installation using the command:
dpkg -l | grep TensorRT
Detailed installation instructions can be found on the official “Installing TensorRT” page.
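The dpkg -l output can also be checked programmatically. A minimal sketch, using hypothetical sample lines (on a real instance, the text would come from actually running dpkg -l, e.g. via subprocess.run(["dpkg", "-l"], capture_output=True, text=True)):

```python
def tensorrt_packages(dpkg_output: str):
    """Extract (name, version) pairs for installed TensorRT-related
    packages from the output of `dpkg -l`."""
    packages = []
    for line in dpkg_output.splitlines():
        # installed packages are listed with status "ii"
        if line.startswith("ii") and "TensorRT" in line:
            fields = line.split()
            packages.append((fields[1], fields[2]))
    return packages

# hypothetical sample of `dpkg -l` output for illustration only
sample = """\
ii  libnvinfer8        8.0.3-1+cuda11.3  amd64  TensorRT runtime libraries
ii  python3-libnvinfer 8.0.3-1+cuda11.3  amd64  Python 3 bindings for TensorRT
"""
print(tensorrt_packages(sample))
```

If the list comes back empty, the installation did not complete successfully.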
PyCUDA is a Python package implementing access to the CUDA API from Python. The Python programs described in this article require PyCUDA for accessing the basic CUDA functionality like managing CUDA device memory buffers.
Before starting the PyCUDA installation make sure that the NVIDIA CUDA compiler
driver nvcc
is accessible by entering the command:
nvcc --version
If this command fails, update the PATH
environment variable:
export PATH=/usr/local/cuda/bin:$PATH
To install PyCUDA, enter the command:
python3 -m pip install pycuda
We will continue using the torchvision image classification models for our examples. As the first step, we will demonstrate conversion of the already familiar ResNet50 model to ONNX format.
The Python program generate_onnx_resnet50.py
serves this purpose.
import torch
import torchvision.models as models
input = torch.rand(1, 3, 224, 224)
model = models.resnet50(pretrained=True)
model.eval()
output = model(input)
torch.onnx.export(model, input, "./onnx/resnet50.onnx", export_params=True)
This program creates a random input tensor of shape (1, 3, 224, 224), constructs the pretrained ResNet50 model, switches it to evaluation mode, runs a forward pass to trace the model, and exports the model in ONNX format.
We store the generated ONNX files in the subdirectory onnx, which must be created before running the program:
mkdir -p onnx
To run this program, use the command:
python3 generate_onnx_resnet50.py
The program will produce a file resnet50.onnx
containing the ONNX model representation.
The Python program generate_onnx_all.py
can be used to produce ONNX descriptions
for all considered torchvision image classification models.
import torch
import torchvision.models as models
MODELS = [
('alexnet', models.alexnet),
('densenet121', models.densenet121),
('densenet161', models.densenet161),
('densenet169', models.densenet169),
('densenet201', models.densenet201),
('mnasnet0_5', models.mnasnet0_5),
('mnasnet1_0', models.mnasnet1_0),
('mobilenet_v2', models.mobilenet_v2),
('mobilenet_v3_large', models.mobilenet_v3_large),
('mobilenet_v3_small', models.mobilenet_v3_small),
('resnet18', models.resnet18),
('resnet34', models.resnet34),
('resnet50', models.resnet50),
('resnet101', models.resnet101),
('resnet152', models.resnet152),
('resnext50_32x4d', models.resnext50_32x4d),
('resnext101_32x8d', models.resnext101_32x8d),
('shufflenet_v2_x0_5', models.shufflenet_v2_x0_5),
('shufflenet_v2_x1_0', models.shufflenet_v2_x1_0),
('squeezenet1_0', models.squeezenet1_0),
('squeezenet1_1', models.squeezenet1_1),
('vgg11', models.vgg11),
('vgg11_bn', models.vgg11_bn),
('vgg13', models.vgg13),
('vgg13_bn', models.vgg13_bn),
('vgg16', models.vgg16),
('vgg16_bn', models.vgg16_bn),
('vgg19', models.vgg19),
('vgg19_bn', models.vgg19_bn),
('wide_resnet50_2', models.wide_resnet50_2),
('wide_resnet101_2', models.wide_resnet101_2),
]
def generate_model(name, builder):
    print('Generate', name)
    input = torch.rand(1, 3, 224, 224)
    model = builder(pretrained=True)
    model.eval()
    output = model(input)
    onnx_path = './onnx/' + name + '.onnx'
    torch.onnx.export(model, input, onnx_path, export_params=True)
for name, model in MODELS:
    generate_model(name, model)
To run this program, enter the following commands:
mkdir -p onnx
python3 generate_onnx_all.py
To perform inference of the ONNX model using TensorRT, it must be pre-processed using the TensorRT ONNX parser. We will start with conversion of the ONNX representation to the TensorRT plan. The TensorRT plan is a serialized form of a TensorRT engine. The TensorRT engine represents the model optimized for execution on a chosen CUDA device.
The Python program trt_onnx_parser.py
serves this purpose.
import sys
import tensorrt as trt
def main():
    if len(sys.argv) != 3:
        sys.exit("Usage: python3 trt_onnx_parser.py <input_onnx_path> <output_plan_path>")
    onnx_path = sys.argv[1]
    plan_path = sys.argv[2]
    logger = trt.Logger()
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    config.max_workspace_size = 256 * 1024 * 1024
    config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)
    parser = trt.OnnxParser(network, logger)
    ok = parser.parse_from_file(onnx_path)
    if not ok:
        sys.exit("ONNX parse error")
    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as fp:
        fp.write(plan)
    print("DONE")
main()
The Python package tensorrt implements the TensorRT Python API and provides a collection of Python object classes used to handle various aspects of TensorRT inference and model parsing. This program uses the following TensorRT API object classes:
Logger - logger used by several other object classes
Builder - a factory used to create several other classes
INetworkDefinition - representation of TensorRT networks (models)
IBuilderConfig - a class used to hold configuration parameters for Builder
OnnxParser - a class used for parsing ONNX models into TensorRT network definitions
IHostMemory - representation of buffers in a host memory

The program performs the following steps:
creates logger: Logger representing a logger instance
creates builder: Builder representing a builder instance
uses builder to create network: INetworkDefinition representing an empty network instance
uses builder to create config: IBuilderConfig representing a builder configuration instance
sets the max_workspace_size configuration parameter representing the maximum workspace size that can be used by inference algorithms
creates parser: OnnxParser representing an ONNX parser instance; a reference to the previously created empty network definition is attached to the parser
uses parser to parse the input ONNX file and convert it to the TensorRT network definition; the parsing result is assigned to the attached network definition object
uses builder to create plan: IHostMemory representing a serialized network (plan) stored in a host memory buffer

The program has two command line arguments: a path to the input ONNX file and a path to the output TensorRT plan file.
We store the generated plan files in the subdirectory plan, which must be created before running the program:
mkdir -p plan
To run this program for conversion of ResNet50 ONNX representation, use the command:
python3 trt_onnx_parser.py ./onnx/resnet50.onnx ./plan/resnet50.plan
The Python program trt_onnx_parser_all.py can be used to produce TensorRT plans for all considered torchvision image classification models.
import sys
import tensorrt as trt
MODELS = [
'alexnet',
'densenet121',
'densenet161',
'densenet169',
'densenet201',
'mnasnet0_5',
'mnasnet1_0',
'mobilenet_v2',
'mobilenet_v3_large',
'mobilenet_v3_small',
'resnet18',
'resnet34',
'resnet50',
'resnet101',
'resnet152',
'resnext50_32x4d',
'resnext101_32x8d',
'shufflenet_v2_x0_5',
'shufflenet_v2_x1_0',
'squeezenet1_0',
'squeezenet1_1',
'vgg11',
'vgg11_bn',
'vgg13',
'vgg13_bn',
'vgg16',
'vgg16_bn',
'vgg19',
'vgg19_bn',
'wide_resnet50_2',
'wide_resnet101_2',
]
def setup_builder():
    logger = trt.Logger()
    builder = trt.Builder(logger)
    return (logger, builder)
def generate_plan(logger, builder, name):
    print('Generate TensorRT plan for ' + name)
    onnx_path = './onnx/' + name + '.onnx'
    plan_path = './plan/' + name + '.plan'
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    config.max_workspace_size = 256 * 1024 * 1024
    config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)
    parser = trt.OnnxParser(network, logger)
    ok = parser.parse_from_file(onnx_path)
    if not ok:
        sys.exit('ONNX parse error')
    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as fp:
        fp.write(plan)
def main():
    logger, builder = setup_builder()
    for name in MODELS:
        generate_plan(logger, builder, name)
    print('DONE')
main()
To run this program, enter the following commands:
mkdir -p plan
python3 trt_onnx_parser_all.py
Conversion of the ONNX representation to a TensorRT plan can also be implemented using the TensorRT C++ API.
The C++ program trt_onnx_parser.cpp
serves this purpose.
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include "common.h"
// wrapper class for ONNX parser
class OnnxParser {
public:
OnnxParser();
~OnnxParser();
public:
void Init();
void Parse(const char *onnxPath, const char *planPath);
private:
bool m_active;
Logger m_logger;
UniquePtr<nvinfer1::IBuilder> m_builder;
UniquePtr<nvinfer1::INetworkDefinition> m_network;
UniquePtr<nvinfer1::IBuilderConfig> m_config;
UniquePtr<nvonnxparser::IParser> m_parser;
};
OnnxParser::OnnxParser(): m_active(false) { }
OnnxParser::~OnnxParser() { }
void OnnxParser::Init() {
assert(!m_active);
m_builder.reset(nvinfer1::createInferBuilder(m_logger));
if (m_builder == nullptr) {
Error("Error creating infer builder");
}
auto networkFlags = 1 << int(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
m_network.reset(m_builder->createNetworkV2(networkFlags));
if (m_network == nullptr) {
Error("Error creating network");
}
m_config.reset(m_builder->createBuilderConfig());
if (m_config == nullptr) {
Error("Error creating builder config");
}
m_config->setMaxWorkspaceSize(256 * 1024 * 1024);
m_config->setFlag(nvinfer1::BuilderFlag::kDISABLE_TIMING_CACHE);
m_parser.reset(nvonnxparser::createParser(*m_network, m_logger));
if (m_parser == nullptr) {
Error("Error creating ONNX parser");
}
}
void OnnxParser::Parse(const char *onnxPath, const char *planPath) {
bool ok = m_parser->parseFromFile(onnxPath, static_cast<int>(m_logger.SeverityLevel()));
if (!ok) {
Error("ONNX parse error");
}
UniquePtr<nvinfer1::IHostMemory> plan(m_builder->buildSerializedNetwork(*m_network, *m_config));
if (plan == nullptr) {
Error("Network serialization error");
}
const void *data = plan->data();
size_t size = plan->size();
FILE *fp = fopen(planPath, "wb");
if (fp == nullptr) {
Error("Failed to create file %s", planPath);
}
fwrite(data, 1, size, fp);
fclose(fp);
}
// main program
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "Usage: trt_onnx_parser <input_onnx_path> <output_plan_path>\n");
return 1;
}
const char *onnxPath = argv[1];
const char *planPath = argv[2];
printf("Generate TensorRT plan for %s\n", onnxPath);
OnnxParser parser;
parser.Init();
parser.Parse(onnxPath, planPath);
return 0;
}
The program is functionally similar to the previously described Python program trt_onnx_parser.py. Plans generated with the Python and C++ versions are interchangeable; either can be used for subsequent inference with the Python and C++ programs described in this article.
The program uses the TensorRT C++ API specified in two header files:

NvInfer.h - defines the interface to the TensorRT inference engine, encapsulated in the nvinfer1 namespace
NvOnnxParser.h - defines the interface to the TensorRT ONNX parser, encapsulated in the nvonnxparser namespace

This program uses the following TensorRT API object classes:
nvinfer1::ILogger - logger used by several other object classes
nvinfer1::IBuilder - a factory used to create several other classes
nvinfer1::INetworkDefinition - representation of TensorRT networks (models)
nvinfer1::IBuilderConfig - a class used to hold configuration parameters for IBuilder
nvonnxparser::IParser - a class used for parsing ONNX models into TensorRT network definitions
nvinfer1::IHostMemory - representation of buffers in a host memory

The class OnnxParser holds smart pointers to instances of these objects. It exposes two principal public methods: Init and Parse.
The Init method performs the following steps:

creates m_builder representing a builder instance
uses m_builder to create m_network representing an empty network instance
uses m_builder to create m_config representing a builder configuration instance
sets the maxWorkspaceSize configuration parameter representing the maximum workspace size that can be used by inference algorithms
creates m_parser representing an ONNX parser instance; a reference to the previously created empty network definition is attached to the parser

The Parse method performs the following steps:

uses m_parser to parse the input ONNX file and convert it to the TensorRT network definition; the parsing result is assigned to the attached network definition object
uses m_builder to create plan representing a serialized network (plan) stored in a host memory buffer

The shell script build_trt_onnx_parser.sh must be used to compile and link this program:
#!/bin/bash
mkdir -p ./bin
g++ -o ./bin/trt_onnx_parser \
-I /usr/local/cuda/include \
trt_onnx_parser.cpp common.cpp \
-L /usr/local/cuda/lib64 -lnvonnxparser -lnvinfer -lcudart
Running this script is straightforward:
./build_trt_onnx_parser.sh
The program has two command line arguments: a path to the input ONNX file and a path to the output TensorRT plan file.
To run this program for conversion of ResNet50 ONNX representation, use the command:
./bin/trt_onnx_parser ./onnx/resnet50.onnx ./plan/resnet50.plan
The shell script trt_onnx_parser_all.sh uses the C++ program to generate TensorRT plans for all considered torchvision image classification models:
#!/bin/bash
./bin/trt_onnx_parser ./onnx/alexnet.onnx ./plan/alexnet.plan
./bin/trt_onnx_parser ./onnx/densenet121.onnx ./plan/densenet121.plan
./bin/trt_onnx_parser ./onnx/densenet161.onnx ./plan/densenet161.plan
./bin/trt_onnx_parser ./onnx/densenet169.onnx ./plan/densenet169.plan
./bin/trt_onnx_parser ./onnx/densenet201.onnx ./plan/densenet201.plan
./bin/trt_onnx_parser ./onnx/mnasnet0_5.onnx ./plan/mnasnet0_5.plan
./bin/trt_onnx_parser ./onnx/mnasnet1_0.onnx ./plan/mnasnet1_0.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v2.onnx ./plan/mobilenet_v2.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v3_large.onnx ./plan/mobilenet_v3_large.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v3_small.onnx ./plan/mobilenet_v3_small.plan
./bin/trt_onnx_parser ./onnx/resnet101.onnx ./plan/resnet101.plan
./bin/trt_onnx_parser ./onnx/resnet152.onnx ./plan/resnet152.plan
./bin/trt_onnx_parser ./onnx/resnet18.onnx ./plan/resnet18.plan
./bin/trt_onnx_parser ./onnx/resnet34.onnx ./plan/resnet34.plan
./bin/trt_onnx_parser ./onnx/resnet50.onnx ./plan/resnet50.plan
./bin/trt_onnx_parser ./onnx/resnext101_32x8d.onnx ./plan/resnext101_32x8d.plan
./bin/trt_onnx_parser ./onnx/resnext50_32x4d.onnx ./plan/resnext50_32x4d.plan
./bin/trt_onnx_parser ./onnx/shufflenet_v2_x0_5.onnx ./plan/shufflenet_v2_x0_5.plan
./bin/trt_onnx_parser ./onnx/shufflenet_v2_x1_0.onnx ./plan/shufflenet_v2_x1_0.plan
./bin/trt_onnx_parser ./onnx/squeezenet1_0.onnx ./plan/squeezenet1_0.plan
./bin/trt_onnx_parser ./onnx/squeezenet1_1.onnx ./plan/squeezenet1_1.plan
./bin/trt_onnx_parser ./onnx/vgg11.onnx ./plan/vgg11.plan
./bin/trt_onnx_parser ./onnx/vgg11_bn.onnx ./plan/vgg11_bn.plan
./bin/trt_onnx_parser ./onnx/vgg13.onnx ./plan/vgg13.plan
./bin/trt_onnx_parser ./onnx/vgg13_bn.onnx ./plan/vgg13_bn.plan
./bin/trt_onnx_parser ./onnx/vgg16.onnx ./plan/vgg16.plan
./bin/trt_onnx_parser ./onnx/vgg16_bn.onnx ./plan/vgg16_bn.plan
./bin/trt_onnx_parser ./onnx/vgg19.onnx ./plan/vgg19.plan
./bin/trt_onnx_parser ./onnx/vgg19_bn.onnx ./plan/vgg19_bn.plan
./bin/trt_onnx_parser ./onnx/wide_resnet101_2.onnx ./plan/wide_resnet101_2.plan
./bin/trt_onnx_parser ./onnx/wide_resnet50_2.onnx ./plan/wide_resnet50_2.plan
Running this script is straightforward:
mkdir -p ./plan
./trt_onnx_parser_all.sh
The inference programs in Python and C++ described in the rest of this article reuse several files introduced in Articles 1 and 2. These include:

imagenet_classes.txt - class descriptions for ImageNet labels (Article 1)
./data/husky01.dat - pre-processed input tensor for the husky image (Article 2)

See the respective articles for details on obtaining these files.
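Input files such as ./data/husky01.dat are raw binary buffers of 3 * 224 * 224 little-endian float32 values, which is exactly what np.fromfile(path, np.float32) expects. A minimal stdlib-only sketch of writing and reading back such a buffer, using a hypothetical temporary file and a tiny stand-in tensor:

```python
import array
import tempfile

# a tiny stand-in tensor; a real input file holds 3 * 224 * 224 floats
values = [0.485, 0.456, 0.406, 0.229, 0.224, 0.225]
tensor = array.array("f", values)  # "f" = 4-byte float32

# write the raw bytes to a scratch file, as the pre-processing step does
with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as fp:
    fp.write(tensor.tobytes())
    path = fp.name

# read it back, mirroring what np.fromfile does in the inference programs
restored = array.array("f")
with open(path, "rb") as fp:
    restored.frombytes(fp.read())

print(len(restored))  # number of float32 elements read back
```

There is no header or metadata in these files; the reader must know the element count and dtype in advance.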
The Python program trt_infer_plan.py
implements TensorRT inference using
the previously generated TensorRT plan and a pre-processed input image.
import sys
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
def softmax(x):
    y = np.exp(x)
    sum = np.sum(y)
    y /= sum
    return y
def topk(x, k):
    idx = np.argsort(x)
    idx = idx[::-1][:k]
    return (idx, x[idx])
def main():
    if len(sys.argv) != 3:
        sys.exit("Usage: python3 trt_infer_plan.py <plan_path> <input_path>")
    plan_path = sys.argv[1]
    input_path = sys.argv[2]
    print("Start " + plan_path)
    # read the plan
    with open(plan_path, "rb") as fp:
        plan = fp.read()
    # read the pre-processed image
    input = np.fromfile(input_path, np.float32)
    # read the categories
    with open("imagenet_classes.txt", "r") as f:
        categories = [s.strip() for s in f.readlines()]
    # initialize the TensorRT objects
    logger = trt.Logger()
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(plan)
    context = engine.create_execution_context()
    # create device buffers and TensorRT bindings
    output = np.zeros((1000), dtype=np.float32)
    d_input = cuda.mem_alloc(input.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    bindings = [int(d_input), int(d_output)]
    # copy input to device, run inference, copy output to host
    cuda.memcpy_htod(d_input, input)
    context.execute_v2(bindings=bindings)
    cuda.memcpy_dtoh(output, d_output)
    # apply softmax and get Top-5 results
    output = softmax(output)
    top5p, top5v = topk(output, 5)
    # print results
    print("Top-5 results")
    for ind, val in zip(top5p, top5v):
        print("  {0} {1:.2f}%".format(categories[ind], val * 100))
main()
This program uses the following TensorRT API object classes:

Logger - logger used by several other object classes
Runtime - used to deserialize TensorRT plans into TensorRT CUDA engines
ICudaEngine - engine for executing inference on built networks
IExecutionContext - context for executing inference using a CUDA engine

The program performs the following steps:

creates logger: Logger representing a logger instance
creates runtime: Runtime representing a runtime instance
uses runtime to deserialize the plan into engine: ICudaEngine
creates context: IExecutionContext for the engine
executes inference on context with the specified bindings and CUDA stream handle

The program has two command line arguments: a path to the TensorRT plan file and a path to the file containing the pre-processed input image.
To run this program for the previously created ResNet50 plan and husky image, use the command:
python3 trt_infer_plan.py ./plan/resnet50.plan ./data/husky01.dat
The program output will look like:

Top-5 results
  Siberian husky 49.52%
  Eskimo dog 42.90%
  malamute 5.87%
  dogsled 1.22%
  Saint Bernard 0.32%
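The softmax and Top-5 post-processing used above does not depend on TensorRT at all; it can be reproduced without NumPy as a small pure-Python sketch:

```python
import math
import heapq

def softmax(xs):
    # subtract the max for numerical stability before exponentiating
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topk(xs, k):
    # indices and values of the k largest entries, in descending order
    idx = heapq.nlargest(k, range(len(xs)), key=lambda i: xs[i])
    return idx, [xs[i] for i in idx]

# toy logits standing in for the 1000-element model output
logits = [1.0, 3.0, 0.5, 2.0]
probs = softmax(logits)
top2_idx, top2_val = topk(probs, 2)
print(top2_idx)  # -> [1, 3]
```

In the real programs, the indices returned by topk are used to look up class labels in imagenet_classes.txt.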
Inference with TensorRT models can also be implemented using the TensorRT C++ API.
The C++ program trt_infer_plan.cpp
serves this purpose.
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <string>
#include <vector>
#include <iostream>
#include <fstream>
#include <NvInfer.h>
#include "common.h"
// wrapper class for inference engine
class Engine {
public:
Engine();
~Engine();
public:
void Init(const std::vector<char> &plan);
void Infer(const std::vector<float> &input, std::vector<float> &output);
void DiagBindings();
private:
bool m_active;
Logger m_logger;
UniquePtr<nvinfer1::IRuntime> m_runtime;
UniquePtr<nvinfer1::ICudaEngine> m_engine;
};
Engine::Engine(): m_active(false) { }
Engine::~Engine() { }
void Engine::Init(const std::vector<char> &plan) {
assert(!m_active);
m_runtime.reset(nvinfer1::createInferRuntime(m_logger));
if (m_runtime == nullptr) {
Error("Error creating infer runtime");
}
m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));
if (m_engine == nullptr) {
Error("Error deserializing CUDA engine");
}
m_active = true;
}
void Engine::Infer(const std::vector<float> &input, std::vector<float> &output) {
assert(m_active);
UniquePtr<nvinfer1::IExecutionContext> context;
context.reset(m_engine->createExecutionContext());
if (context == nullptr) {
Error("Error creating execution context");
}
CudaBuffer<float> inputBuffer;
inputBuffer.Init(3 * 224 * 224);
assert(inputBuffer.Size() == input.size());
inputBuffer.Put(input.data());
CudaBuffer<float> outputBuffer;
outputBuffer.Init(1000);
void *bindings[2];
bindings[0] = inputBuffer.Data();
bindings[1] = outputBuffer.Data();
bool ok = context->executeV2(bindings);
if (!ok) {
Error("Error executing inference");
}
output.resize(outputBuffer.Size());
outputBuffer.Get(output.data());
}
void Engine::DiagBindings() {
int nbBindings = static_cast<int>(m_engine->getNbBindings());
printf("Bindings: %d\n", nbBindings);
for (int i = 0; i < nbBindings; i++) {
const char *name = m_engine->getBindingName(i);
bool isInput = m_engine->bindingIsInput(i);
nvinfer1::Dims dims = m_engine->getBindingDimensions(i);
std::string fmtDims = FormatDims(dims);
printf(" [%d] \"%s\" %s [%s]\n", i, name, isInput ? "input" : "output", fmtDims.c_str());
}
}
// I/O utilities
void ReadClasses(const char *path, std::vector<std::string> &classes) {
std::string line;
std::ifstream ifs(path, std::ios::in);
if (!ifs.is_open()) {
Error("Cannot open %s", path);
}
while (std::getline(ifs, line)) {
classes.push_back(line);
}
ifs.close();
}
void ReadPlan(const char *path, std::vector<char> &plan) {
std::ifstream ifs(path, std::ios::in | std::ios::binary);
if (!ifs.is_open()) {
Error("Cannot open %s", path);
}
ifs.seekg(0, ifs.end);
size_t size = ifs.tellg();
plan.resize(size);
ifs.seekg(0, ifs.beg);
ifs.read(plan.data(), size);
ifs.close();
}
void ReadInput(const char *path, std::vector<float> &input) {
std::ifstream ifs(path, std::ios::in | std::ios::binary);
if (!ifs.is_open()) {
Error("Cannot open %s", path);
}
size_t size = 3 * 224 * 224;
input.resize(size);
ifs.read(reinterpret_cast<char *>(input.data()), size * sizeof(float));
ifs.close();
}
void PrintOutput(const std::vector<float> &output, const std::vector<std::string> &classes) {
int top5p[5];
float top5v[5];
TopK(static_cast<int>(output.size()), output.data(), 5, top5p, top5v);
printf("Top-5 results\n");
for (int i = 0; i < 5; i++) {
std::string label = classes[top5p[i]];
float prob = 100.0f * top5v[i];
printf(" [%d] %s %.2f%%\n", i, label.c_str(), prob);
}
}
// main program
int main(int argc, char *argv[]) {
if (argc != 3) {
fprintf(stderr, "Usage: trt_infer_plan <plan_path> <input_path>\n");
return 1;
}
const char *planPath = argv[1];
const char *inputPath = argv[2];
printf("Start %s\n", planPath);
std::vector<std::string> classes;
ReadClasses("imagenet_classes.txt", classes);
std::vector<char> plan;
ReadPlan(planPath, plan);
std::vector<float> input;
ReadInput(inputPath, input);
std::vector<float> output;
Engine engine;
engine.Init(plan);
engine.DiagBindings();
engine.Infer(input, output);
Softmax(static_cast<int>(output.size()), output.data());
PrintOutput(output, classes);
return 0;
}
This program uses the following TensorRT API object classes:

nvinfer1::ILogger - logger used by several other object classes
nvinfer1::IRuntime - used to deserialize TensorRT plans into TensorRT CUDA engines
nvinfer1::ICudaEngine - engine for executing inference on built networks
nvinfer1::IExecutionContext - context for executing inference using a CUDA engine

The class Engine holds smart pointers to instances of these objects. It exposes two principal public methods: Init and Infer.

The Init method performs the following steps:

creates m_runtime representing a runtime instance
uses m_runtime to deserialize the plan into m_engine

The Infer method performs the following steps:

creates context for the m_engine
creates CUDA device buffers for the input and output tensors and copies the input data to the device
executes inference on context with the specified bindings
copies the output data from the device to the host

The program performs the following steps:

reads the class descriptions, the TensorRT plan, and the pre-processed input image
creates engine and initializes it using the Init method
runs inference on engine using the Infer method
applies softmax to the output and prints the Top-5 results

NOTE: In this program we intentionally use the deprecated version of the IRuntime::deserializeCudaEngine method requiring the last nullptr argument because, at the time of writing, using the new version without this argument sometimes caused unexpected program behavior on the considered GPU devices. The root cause of this problem is not yet clarified; there might be an undocumented bug in the TensorRT inference library.

The shell script build_trt_infer_plan.sh must be used to compile and link this program:
#!/bin/bash
mkdir -p ./bin
g++ -o ./bin/trt_infer_plan \
-I /usr/local/cuda/include \
trt_infer_plan.cpp common.cpp \
-L /usr/local/cuda/lib64 -lnvinfer -lcudart
Running this script is straightforward:
./build_trt_infer_plan.sh
The program has two command line arguments: a path to the TensorRT plan file and a path to the file containing the pre-processed input image.
To run this program for the previously created ResNet50 plan and husky image, use the command:
./bin/trt_infer_plan ./plan/resnet50.plan ./data/husky01.dat
The program output will look like:
Bindings: 2
[0] "input.1" input [1 3 224 224]
[1] "495" output [1 1000]
Top-5 results
[0] Siberian husky 49.53%
[1] Eskimo dog 42.90%
[2] malamute 5.87%
[3] dogsled 1.22%
[4] Saint Bernard 0.32%
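The buffer sizes used throughout the inference programs (3 * 224 * 224 input elements, 1000 output elements) follow directly from the binding shapes reported above. A quick sketch of the arithmetic, with no TensorRT calls involved:

```python
from math import prod

input_dims = (1, 3, 224, 224)   # binding 0: "input.1"
output_dims = (1, 1000)         # binding 1: "495"

input_elems = prod(input_dims)
output_elems = prod(output_dims)
print(input_elems, output_elems)          # element counts
print(input_elems * 4, output_elems * 4)  # byte sizes for float32
```

These element counts are exactly the values passed to CudaBuffer::Init in the C++ code and implied by input.nbytes / output.nbytes in the Python code.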
The Python program trt_bench_plan.py
implements inference benchmarking using
the previously generated TensorRT plan and a pre-processed input image.
import sys
from time import perf_counter
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
def softmax(x):
    y = np.exp(x)
    sum = np.sum(y)
    y /= sum
    return y
def topk(x, k):
    idx = np.argsort(x)
    idx = idx[::-1][:k]
    return (idx, x[idx])
def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python3 trt_bench_plan.py <plan_path>")
    plan_path = sys.argv[1]
    print("Start " + plan_path)
    # read the plan
    with open(plan_path, "rb") as fp:
        plan = fp.read()
    # generate random input
    np.random.seed(1234)
    input = np.random.random(3 * 224 * 224)
    input = input.astype(np.float32)
    # initialize the TensorRT objects
    logger = trt.Logger()
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(plan)
    context = engine.create_execution_context()
    # create device buffers and TensorRT bindings
    output = np.zeros((1000), dtype=np.float32)
    d_input = cuda.mem_alloc(input.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    bindings = [int(d_input), int(d_output)]
    # copy input to device
    cuda.memcpy_htod(d_input, input)
    # warm up
    for i in range(10):
        context.execute_v2(bindings=bindings)
    # benchmark: average over 100 iterations
    start = perf_counter()
    for i in range(100):
        context.execute_v2(bindings=bindings)
    end = perf_counter()
    elapsed = ((end - start) / 100) * 1000
    print('Model {0}: elapsed time {1:.2f} ms'.format(plan_path, elapsed))
    # record for automated extraction
    print('#{0};{1:f}'.format(plan_path, elapsed))
    # copy output to host
    cuda.memcpy_dtoh(output, d_output)
    # apply softmax and get Top-5 results
    output = softmax(output)
    top5p, top5v = topk(output, 5)
    # print results
    print("Top-5 results")
    for ind, val in zip(top5p, top5v):
        print("  {0} {1:.2f}%".format(ind, val * 100))
main()
This program uses the following TensorRT API object classes:

Logger - logger used by several other object classes
Runtime - used to deserialize TensorRT plans into TensorRT CUDA engines
ICudaEngine - engine for executing inference on built networks
IExecutionContext - context for executing inference using a CUDA engine

The program performs the following steps:

creates logger: Logger representing a logger instance
creates runtime: Runtime representing a runtime instance
uses runtime to deserialize the plan into engine: ICudaEngine
creates context: IExecutionContext for the engine
executes inference on context with the specified bindings and CUDA stream handle

The program prints a special formatted line starting with "#" that will later be used for automated extraction of performance metrics.
The program uses a path to the TensorRT plan file as its single command line argument.
To run this program for the previously created ResNet50 plan, use the command:
python3 trt_bench_plan.py ./plan/resnet50.plan
The program output will look like:
Model ./plan/resnet50.plan: elapsed time 1.59 ms
Top-5 results
610 6.29%
549 5.21%
446 5.00%
783 3.20%
892 2.93%
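The warm-up plus averaged-loop timing pattern used by trt_bench_plan.py applies to any callable, not just TensorRT inference. A minimal stdlib sketch, with a hypothetical dummy workload standing in for context.execute_v2:

```python
from time import perf_counter

def benchmark(fn, warmup=10, repeat=100):
    """Run fn a few times to warm up, then report the mean latency in ms."""
    for _ in range(warmup):
        fn()
    start = perf_counter()
    for _ in range(repeat):
        fn()
    end = perf_counter()
    return ((end - start) / repeat) * 1000.0

# dummy CPU workload standing in for a TensorRT inference call
def workload():
    return sum(i * i for i in range(1000))

elapsed = benchmark(workload)
print('elapsed time {0:.2f} ms'.format(elapsed))
```

The warm-up iterations matter for GPU benchmarking in particular: the first few inference calls absorb one-time costs (kernel loading, memory allocator warm-up) that would otherwise distort the average.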
The shell script bench_plan_all_py.sh performs benchmarking of all supported torchvision models:
#!/bin/bash
echo "#head;TensorRT (Python)"
python3 trt_bench_plan.py ./plan/alexnet.plan
python3 trt_bench_plan.py ./plan/densenet121.plan
python3 trt_bench_plan.py ./plan/densenet161.plan
python3 trt_bench_plan.py ./plan/densenet169.plan
python3 trt_bench_plan.py ./plan/densenet201.plan
python3 trt_bench_plan.py ./plan/mnasnet0_5.plan
python3 trt_bench_plan.py ./plan/mnasnet1_0.plan
python3 trt_bench_plan.py ./plan/mobilenet_v2.plan
python3 trt_bench_plan.py ./plan/mobilenet_v3_large.plan
python3 trt_bench_plan.py ./plan/mobilenet_v3_small.plan
python3 trt_bench_plan.py ./plan/resnet101.plan
python3 trt_bench_plan.py ./plan/resnet152.plan
python3 trt_bench_plan.py ./plan/resnet18.plan
python3 trt_bench_plan.py ./plan/resnet34.plan
python3 trt_bench_plan.py ./plan/resnet50.plan
python3 trt_bench_plan.py ./plan/resnext101_32x8d.plan
python3 trt_bench_plan.py ./plan/resnext50_32x4d.plan
python3 trt_bench_plan.py ./plan/shufflenet_v2_x0_5.plan
python3 trt_bench_plan.py ./plan/shufflenet_v2_x1_0.plan
python3 trt_bench_plan.py ./plan/squeezenet1_0.plan
python3 trt_bench_plan.py ./plan/squeezenet1_1.plan
python3 trt_bench_plan.py ./plan/vgg11.plan
python3 trt_bench_plan.py ./plan/vgg11_bn.plan
python3 trt_bench_plan.py ./plan/vgg13.plan
python3 trt_bench_plan.py ./plan/vgg13_bn.plan
python3 trt_bench_plan.py ./plan/vgg16.plan
python3 trt_bench_plan.py ./plan/vgg16_bn.plan
python3 trt_bench_plan.py ./plan/vgg19.plan
python3 trt_bench_plan.py ./plan/vgg19_bn.plan
python3 trt_bench_plan.py ./plan/wide_resnet101_2.plan
python3 trt_bench_plan.py ./plan/wide_resnet50_2.plan
Running this script is straightforward:
./bench_plan_all_py.sh >bench_trt_py.log
The benchmarking log will be saved in bench_trt_py.log,
which will later be used for performance comparison of the various deployment methods.
The benchmarking of TensorRT models can also be implemented using the TensorRT C++ API.
The C++ program trt_bench_plan.cpp
serves this purpose.
#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <vector>
#include <iostream>
#include <fstream>
#include <NvInfer.h>
#include "common.h"
// wrapper class for inference engine
class Engine {
public:
    Engine();
    ~Engine();
public:
    void Init(const std::vector<char> &plan);
    void StartInfer(const std::vector<float> &input);
    void RunInfer();
    void EndInfer(std::vector<float> &output);
private:
    bool m_active;
    Logger m_logger;
    UniquePtr<nvinfer1::IRuntime> m_runtime;
    UniquePtr<nvinfer1::ICudaEngine> m_engine;
    UniquePtr<nvinfer1::IExecutionContext> m_context;
    CudaBuffer<float> m_inputBuffer;
    CudaBuffer<float> m_outputBuffer;
};
Engine::Engine(): m_active(false) { }
Engine::~Engine() { }
void Engine::Init(const std::vector<char> &plan) {
    assert(!m_active);
    m_runtime.reset(nvinfer1::createInferRuntime(m_logger));
    if (m_runtime == nullptr) {
        Error("Error creating infer runtime");
    }
    m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));
    if (m_engine == nullptr) {
        Error("Error deserializing CUDA engine");
    }
    m_active = true;
}

void Engine::StartInfer(const std::vector<float> &input) {
    assert(m_active);
    m_context.reset(m_engine->createExecutionContext());
    if (m_context == nullptr) {
        Error("Error creating execution context");
    }
    m_inputBuffer.Init(3 * 224 * 224);
    assert(m_inputBuffer.Size() == input.size());
    m_inputBuffer.Put(input.data());
    m_outputBuffer.Init(1000);
}

void Engine::RunInfer() {
    void *bindings[2];
    bindings[0] = m_inputBuffer.Data();
    bindings[1] = m_outputBuffer.Data();
    bool ok = m_context->executeV2(bindings);
    if (!ok) {
        Error("Error executing inference");
    }
}

void Engine::EndInfer(std::vector<float> &output) {
    output.resize(m_outputBuffer.Size());
    m_outputBuffer.Get(output.data());
}
// I/O utilities

void ReadPlan(const char *path, std::vector<char> &plan) {
    std::ifstream ifs(path, std::ios::in | std::ios::binary);
    if (!ifs.is_open()) {
        Error("Cannot open %s", path);
    }
    ifs.seekg(0, ifs.end);
    size_t size = ifs.tellg();
    plan.resize(size);
    ifs.seekg(0, ifs.beg);
    ifs.read(plan.data(), size);
    ifs.close();
}

void GenerateInput(std::vector<float> &input) {
    int size = 3 * 224 * 224;
    input.resize(size);
    float *p = input.data();
    std::srand(1234);
    for (int i = 0; i < size; i++) {
        p[i] = static_cast<float>(std::rand()) / RAND_MAX;
    }
}

void PrintOutput(const std::vector<float> &output) {
    int top5p[5];
    float top5v[5];
    TopK(static_cast<int>(output.size()), output.data(), 5, top5p, top5v);
    printf("Top-5 results\n");
    for (int i = 0; i < 5; i++) {
        int label = top5p[i];
        float prob = 100.0f * top5v[i];
        printf("  [%d] %d %.2f%%\n", i, label, prob);
    }
}
// main program

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: trt_bench_plan <plan_path>\n");
        return 1;
    }
    const char *planPath = argv[1];
    printf("Start %s\n", planPath);
    int repeat = 100;
    std::vector<char> plan;
    ReadPlan(planPath, plan);
    std::vector<float> input;
    GenerateInput(input);
    std::vector<float> output;
    Engine engine;
    engine.Init(plan);
    engine.StartInfer(input);
    for (int i = 0; i < 10; i++) {
        engine.RunInfer();
    }
    Timer timer;
    timer.Start();
    for (int i = 0; i < repeat; i++) {
        engine.RunInfer();
    }
    timer.Stop();
    float t = timer.Elapsed();
    printf("Model %s: elapsed time %f ms / %d = %f\n", planPath, t, repeat, t / float(repeat));
    // record for automated extraction
    printf("#%s;%f\n", planPath, t / float(repeat));
    engine.EndInfer(output);
    Softmax(static_cast<int>(output.size()), output.data());
    PrintOutput(output);
    return 0;
}
This program uses the following TensorRT API object classes:
- nvinfer1::ILogger - logger used by several other object classes
- nvinfer1::IRuntime - used to deserialize TensorRT plans into TensorRT CUDA engines
- nvinfer1::ICudaEngine - engine for executing inference on a built network
- nvinfer1::IExecutionContext - context for executing inference using a CUDA engine
Class Engine holds smart pointers to instances of these objects.
It exposes four principal public methods: Init, StartInfer, RunInfer, and EndInfer.
The Init method performs the following steps:
- create m_runtime representing a runtime instance
- use m_runtime to deserialize the plan into m_engine
The StartInfer method performs the following steps:
- create the execution context m_context for the m_engine
- initialize the input and output device buffers and copy the input data to the input buffer
The RunInfer method performs the following step:
- execute inference via m_context with the specified bindings
The EndInfer method performs the following step:
- copy the contents of the output device buffer to the output vector
The program performs the following steps:
- create an instance of the Engine class named engine
- initialize engine using the Init method
- run inference on engine using the RunInfer method
- retrieve the inference output from engine using the EndInfer method
The program prints a specially formatted line starting with "#"
that will later be used for automated extraction of performance metrics.
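The Softmax and TopK helpers from common.cpp have straightforward semantics: convert raw logits into probabilities and select the k most probable class labels. A hypothetical Python equivalent (for illustration only, not the code actually used by the programs) might look like this:

```python
import math

def softmax(x):
    # numerically stable softmax: subtract the maximum before exponentiation
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def top_k(x, k):
    # return the indices and values of the k largest elements, in descending order
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:k]
    return order, [x[i] for i in order]

probs = softmax([1.0, 3.0, 2.0, 0.5])
idx, vals = top_k(probs, 2)
print(idx)  # -> [1, 2]: the two most probable class indices
```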
NOTE: In this program we intentionally use the deprecated version of the
IRuntime::deserializeCudaEngine
method, which requires the last nullptr
argument,
because, at the time of writing, using the new version without this
argument sometimes caused unexpected program behavior on the considered
GPU devices. The root cause of this problem has not yet been clarified;
there might be an undocumented bug in the TensorRT inference library.
The shell script build_trt_bench_plan.sh
must be used to compile and link this program:
#!/bin/bash
mkdir -p ./bin
g++ -o ./bin/trt_bench_plan \
    -I /usr/local/cuda/include \
    trt_bench_plan.cpp common.cpp \
    -L /usr/local/cuda/lib64 -lnvinfer -lcudart
Running this script is straightforward:
./build_trt_bench_plan.sh
The program uses a path to the TensorRT plan file as its single command line argument.
To run this program for the previously created ResNet50 plan, use the command:
./bin/trt_bench_plan ./plan/resnet50.plan
The program output will look like:
Model resnet50.plan: elapsed time 179.491653 ms / 100 = 1.794917
Top-5 results
[0] 610 4.25%
[1] 549 3.90%
[2] 783 3.64%
[3] 892 3.51%
[4] 446 3.18%
The shell script bench_plan_all.sh
performs benchmarking of all supported torchvision
models:
#!/bin/bash
echo "#head;TensorRT (C++)"
./bin/trt_bench_plan ./plan/alexnet.plan
./bin/trt_bench_plan ./plan/densenet121.plan
./bin/trt_bench_plan ./plan/densenet161.plan
./bin/trt_bench_plan ./plan/densenet169.plan
./bin/trt_bench_plan ./plan/densenet201.plan
./bin/trt_bench_plan ./plan/mnasnet0_5.plan
./bin/trt_bench_plan ./plan/mnasnet1_0.plan
./bin/trt_bench_plan ./plan/mobilenet_v2.plan
./bin/trt_bench_plan ./plan/mobilenet_v3_large.plan
./bin/trt_bench_plan ./plan/mobilenet_v3_small.plan
./bin/trt_bench_plan ./plan/resnet101.plan
./bin/trt_bench_plan ./plan/resnet152.plan
./bin/trt_bench_plan ./plan/resnet18.plan
./bin/trt_bench_plan ./plan/resnet34.plan
./bin/trt_bench_plan ./plan/resnet50.plan
./bin/trt_bench_plan ./plan/resnext101_32x8d.plan
./bin/trt_bench_plan ./plan/resnext50_32x4d.plan
./bin/trt_bench_plan ./plan/shufflenet_v2_x0_5.plan
./bin/trt_bench_plan ./plan/shufflenet_v2_x1_0.plan
./bin/trt_bench_plan ./plan/squeezenet1_0.plan
./bin/trt_bench_plan ./plan/squeezenet1_1.plan
./bin/trt_bench_plan ./plan/vgg11.plan
./bin/trt_bench_plan ./plan/vgg11_bn.plan
./bin/trt_bench_plan ./plan/vgg13.plan
./bin/trt_bench_plan ./plan/vgg13_bn.plan
./bin/trt_bench_plan ./plan/vgg16.plan
./bin/trt_bench_plan ./plan/vgg16_bn.plan
./bin/trt_bench_plan ./plan/vgg19.plan
./bin/trt_bench_plan ./plan/vgg19_bn.plan
./bin/trt_bench_plan ./plan/wide_resnet101_2.plan
./bin/trt_bench_plan ./plan/wide_resnet50_2.plan
Running this script is straightforward:
./bench_plan_all.sh >bench_trt.log
The benchmarking log will be saved in bench_trt.log,
which will later be used for performance comparison of the various deployment methods.
The Python program merge_perf.py
introduced in Article 2 extracts
performance metrics from multiple benchmarking log files and merges them
into a single CSV file in a form suitable for further analysis.
The program takes two or more command line arguments, each specifying a path to a log file.
The program extracts the special records starting with "#"
from all input files,
merges the extracted information, and saves it as a single CSV file.
Each line of the output file corresponds to one model and each column corresponds to
one deployment method.
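The record format makes the merge logic simple: each log contributes one column header (its "#head" record) and one latency value per model. A minimal sketch of such a merge is shown below; this is an illustration only, and the actual merge_perf.py may differ in details such as model-name normalization:

```python
import sys

def parse_log(path):
    # returns (column_header, {model: latency}) extracted from one benchmark log
    head = None
    data = {}
    with open(path) as f:
        for line in f:
            if not line.startswith("#"):
                continue
            key, value = line[1:].rstrip("\n").split(";")
            if key == "head":
                head = value
            else:
                # strip directory and extension from the plan path to get the model name
                model = key.split("/")[-1].rsplit(".", 1)[0]
                data[model] = value
    return head, data

def merge(paths):
    heads, tables = zip(*(parse_log(p) for p in paths))
    models = sorted(set().union(*tables))
    lines = ["Model;" + ";".join(heads)]
    for m in models:
        lines.append(m + ";" + ";".join(t.get(m, "") for t in tables))
    return "\n".join(lines)

if __name__ == "__main__":
    print(merge(sys.argv[1:]))
```

Missing models in a given log simply yield empty cells, so logs covering different model subsets can still be merged.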
Assuming that the benchmarking described in Articles 1, 2, and 3 has been
performed in the sibling directories art01
, art02
, and art03
respectively,
and that the current directory is art03
, the following command can be used to merge the five log
files considered so far:
python3 merge_perf.py ../art01/bench_torch.log ../art02/bench_ts_py.log ../art02/bench_ts.log bench_trt_py.log bench_trt.log >perf03.csv
The output file perf03.csv
will look like:
Model;PyTorch;TorchScript (Python);TorchScript (C++);TensorRT (Python);TensorRT (C++)
alexnet;1.23;1.05;1.04;0.58;0.60
densenet121;19.79;13.65;13.34;3.73;3.67
densenet161;29.43;20.83;20.70;7.99;7.40
densenet169;28.47;19.33;20.11;8.17;7.32
densenet201;33.48;22.44;22.70;12.24;10.96
mnasnet0_5;5.45;3.63;3.67;0.64;0.61
mnasnet1_0;5.66;3.79;3.95;0.80;0.80
mobilenet_v2;6.19;4.12;4.02;0.77;0.76
mobilenet_v3_large;8.07;5.22;5.18;0.98;0.91
mobilenet_v3_small;6.37;4.20;4.19;0.74;0.67
resnet101;15.80;11.01;10.81;3.12;3.18
resnet152;23.66;16.65;16.37;4.57;4.57
resnet18;3.39;2.39;2.30;1.08;1.04
resnet34;6.11;4.22;4.11;1.84;1.79
resnet50;7.99;5.53;5.47;1.75;1.75
resnext101_32x8d;21.69;17.34;16.66;8.06;8.11
resnext50_32x4d;6.45;4.32;4.41;2.13;2.08
shufflenet_v2_x0_5;6.33;4.03;4.01;0.47;0.49
shufflenet_v2_x1_0;6.84;4.58;4.44;0.88;0.86
squeezenet1_0;3.05;2.28;2.33;0.41;0.42
squeezenet1_1;3.03;2.28;2.31;0.31;0.31
vgg11;1.91;1.81;1.84;1.74;1.75
vgg11_bn;2.37;1.93;1.96;1.75;1.75
vgg13;2.26;2.31;2.27;2.16;2.15
vgg13_bn;2.62;2.45;2.43;2.14;2.17
vgg16;2.82;2.75;2.88;2.64;2.61
vgg16_bn;3.23;3.10;3.06;2.61;2.65
vgg19;3.29;3.40;3.40;3.17;3.14
vgg19_bn;3.72;3.64;3.64;3.07;3.13
wide_resnet101_2;15.50;10.89;10.55;5.58;5.45
wide_resnet50_2;7.88;5.52;5.35;2.83;2.95
The Python program tab_perf.py
introduced in Article 2 can be used to display
the CSV data in tabular format.
To run this program, use the following command line:
python3 tab_perf.py perf03.csv >perf03.txt
The output file perf03.txt
will look like:
Model PyTorch TorchScript (Python) TorchScript (C++) TensorRT (Python) TensorRT (C++)
----------------------------------------------------------------------------------------------------------------------
alexnet 1.23 1.05 1.04 0.58 0.60
densenet121 19.79 13.65 13.34 3.73 3.67
densenet161 29.43 20.83 20.70 7.99 7.40
densenet169 28.47 19.33 20.11 8.17 7.32
densenet201 33.48 22.44 22.70 12.24 10.96
mnasnet0_5 5.45 3.63 3.67 0.64 0.61
mnasnet1_0 5.66 3.79 3.95 0.80 0.80
mobilenet_v2 6.19 4.12 4.02 0.77 0.76
mobilenet_v3_large 8.07 5.22 5.18 0.98 0.91
mobilenet_v3_small 6.37 4.20 4.19 0.74 0.67
resnet101 15.80 11.01 10.81 3.12 3.18
resnet152 23.66 16.65 16.37 4.57 4.57
resnet18 3.39 2.39 2.30 1.08 1.04
resnet34 6.11 4.22 4.11 1.84 1.79
resnet50 7.99 5.53 5.47 1.75 1.75
resnext101_32x8d 21.69 17.34 16.66 8.06 8.11
resnext50_32x4d 6.45 4.32 4.41 2.13 2.08
shufflenet_v2_x0_5 6.33 4.03 4.01 0.47 0.49
shufflenet_v2_x1_0 6.84 4.58 4.44 0.88 0.86
squeezenet1_0 3.05 2.28 2.33 0.41 0.42
squeezenet1_1 3.03 2.28 2.31 0.31 0.31
vgg11 1.91 1.81 1.84 1.74 1.75
vgg11_bn 2.37 1.93 1.96 1.75 1.75
vgg13 2.26 2.31 2.27 2.16 2.15
vgg13_bn 2.62 2.45 2.43 2.14 2.17
vgg16 2.82 2.75 2.88 2.64 2.61
vgg16_bn 3.23 3.10 3.06 2.61 2.65
vgg19 3.29 3.40 3.40 3.17 3.14
vgg19_bn 3.72 3.64 3.64 3.07 3.13
wide_resnet101_2 15.50 10.89 10.55 5.58 5.45
wide_resnet50_2 7.88 5.52 5.35 2.83 2.95
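The tabular rendering itself is a column-alignment exercise: compute the maximum width of each column, left-align the model names, and right-align the numbers. A minimal sketch follows (an illustration only; the actual tab_perf.py may format its output differently):

```python
def tabulate(csv_text):
    # split semicolon-separated rows into cells
    rows = [line.split(";") for line in csv_text.strip().splitlines()]
    # width of each column: the longest cell it contains
    widths = [max(len(r[i]) for r in rows) for i in range(len(rows[0]))]
    out = []
    for n, r in enumerate(rows):
        # left-align the model name, right-align the remaining columns with padding
        cells = [r[0].ljust(widths[0])] + [c.rjust(w + 2) for c, w in zip(r[1:], widths[1:])]
        out.append("".join(cells))
        if n == 0:
            out.append("-" * len(out[0]))  # separator line under the header
    return "\n".join(out)

print(tabulate("Model;PyTorch;TensorRT (C++)\nresnet50;7.99;1.75"))
```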
Analysis of these performance data reveals that TensorRT provides a substantial performance increase compared to all previously considered deployment methods.
The differences between the TensorRT performance figures for Python and C++ are within experimental error; Python and C++ can be considered equally good for running TensorRT inference.
The following documentation on NVIDIA TensorRT 8.0.3 can be used for further reference:
At the time of writing, the detailed API information was available only for version 8.0.1:
The index of documents covering all TensorRT versions is available here.
All recommendations and examples described in Articles 1, 2, and 3 are also applicable to Genesis Cloud instances equipped with NVIDIA RTX 3090 GPUs. We have benchmarked inference for the image classification models on an RTX 3090 instance. Here are the results:
Model PyTorch TorchScript (Python) TorchScript (C++) TensorRT (Python) TensorRT (C++)
----------------------------------------------------------------------------------------------------------------------
alexnet 1.30 0.97 1.01 0.52 0.52
densenet121 19.91 13.76 13.80 3.65 3.58
densenet161 29.76 19.78 21.43 7.23 7.19
densenet169 28.93 19.06 19.67 7.03 6.91
densenet201 34.34 21.90 23.97 10.56 10.47
mnasnet0_5 5.55 3.44 3.78 0.63 0.61
mnasnet1_0 5.87 3.68 3.88 0.81 0.79
mobilenet_v2 6.21 3.90 4.21 0.73 0.71
mobilenet_v3_large 7.87 5.38 5.65 0.95 1.03
mobilenet_v3_small 6.49 4.38 4.43 0.70 0.79
resnet101 16.09 10.39 11.19 3.26 3.07
resnet152 24.34 15.38 17.10 4.54 4.55
resnet18 3.37 2.21 2.35 1.05 1.08
resnet34 6.11 4.03 4.25 1.84 1.75
resnet50 8.21 5.67 5.82 1.72 1.75
resnext101_32x8d 22.09 16.38 17.64 7.85 7.97
resnext50_32x4d 6.53 4.13 4.28 2.05 2.12
shufflenet_v2_x0_5 6.53 3.90 4.22 0.49 0.57
shufflenet_v2_x1_0 7.08 4.36 4.68 0.89 0.91
squeezenet1_0 3.16 2.21 2.40 0.40 0.38
squeezenet1_1 3.09 2.15 2.26 0.31 0.31
vgg11 1.92 1.55 1.58 1.50 1.51
vgg11_bn 2.32 1.78 1.84 1.49 1.49
vgg13 2.28 1.99 1.99 1.84 1.85
vgg13_bn 2.71 2.12 2.16 1.84 1.85
vgg16 2.68 2.46 2.50 2.27 2.25
vgg16_bn 3.27 2.72 3.55 2.27 2.28
vgg19 3.01 3.03 3.12 2.69 2.70
vgg19_bn 8.21 3.11 3.18 2.74 2.72
wide_resnet101_2 15.34 10.02 10.65 5.36 5.21
wide_resnet50_2 7.98 5.22 5.63 2.80 2.75
For all the considered models, with a batch size of 1, there is almost no performance improvement compared to the RTX 3080 results listed above.
Written on June 3rd, 2022 by Alexey Gokhberg