Deployment of Deep Learning models on Genesis Cloud - Deployment techniques for PyTorch models using TensorRT


Before you start, make sure that you have created an account with Genesis Cloud and finished the onboarding steps (phone verification, adding an SSH key, providing a credit card). You can create an account here and get $15 in free credits. Furthermore, ensure that you have access to an NVIDIA RTX 3080 GPU instance (if not, request quota here). Important: this is the second part of our ML inference article series. Please make sure to read our tutorial article first if you haven’t done so.


This article covers using TensorRT for deployment of PyTorch models.

NVIDIA TensorRT is an SDK for high-performance deep learning inference on NVIDIA GPU devices. It includes the inference engine and parsers for handling various input network specification formats. TensorRT provides application programming interfaces (APIs) for C++ and Python. This article will present example programs in both languages.

To deploy PyTorch models using TensorRT, we will export them in ONNX format. ONNX stands for Open Neural Network Exchange and is an open format built to represent deep learning models in a framework-agnostic way. TensorRT provides a specialized parser for importing ONNX models.

We assume that you will continue using the Genesis Cloud GPU-enabled instance that you created and configured while studying Article 1.

In particular, the following software must be installed and configured as described in that article:

- CUDA toolkit 11.3.1
- cuDNN 8.2.1
- Python 3 with the PyTorch and torchvision packages

Various assets (source code, shell scripts, and data files) used in this article can be found in the supporting GitHub repository.

To run the examples described in this article, we recommend cloning the entire repository on your Genesis Cloud instance and making the subdirectory art03 your current directory.
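For example (the repository URL is not reproduced here, so a placeholder is used; substitute the actual URL of the supporting repository):

git clone <repository_url>
cd <repository_name>/art03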

Step 1. Install TensorRT

The version of TensorRT must be compatible with the chosen versions of CUDA and cuDNN. For our choice of CUDA 11.3.1 and cuDNN 8.2.1 we will need TensorRT 8.0.3. (The current support matrix for TensorRT 8.x is available here.)
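To double-check the versions installed on your instance, you can query them directly (the location of the cuDNN header may vary with the installation method):

nvcc --version | grep release
grep -A 2 '#define CUDNN_MAJOR' /usr/include/cudnn_version.h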

To access TensorRT, you should register as a member of the NVIDIA Developer Program.

To download the TensorRT distribution, visit the official download site. Choose “TensorRT 8”, then agree to the “NVIDIA TensorRT License Agreement” and choose “TensorRT 8.0 GA Update 1” (“GA” stands for “General Availability”). Select and download “TensorRT 8.0.3 GA for Ubuntu 20.04 and CUDA 11.3 DEB local repo package”. You will get a DEB repo file; at the time of writing this article its name was:

nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831_1-1_amd64.deb

Place it in a scratch directory on your instance (we use ~/transit in this series of articles), then proceed with installation by entering these commands:

sudo dpkg -i nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831_1-1_amd64.deb
sudo apt-key add /var/nv-tensorrt-repo-ubuntu2004-cuda11.3-trt8.0.3.4-ga-20210831/7fa2af80.pub
sudo apt-get update
sudo apt-get install tensorrt

Then install Python bindings for TensorRT API:

python3 -m pip install numpy
sudo apt-get install python3-libnvinfer-dev

Verify the installation using the command:

dpkg -l | grep TensorRT
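If the installation succeeded, the output lists the installed TensorRT packages. It will look roughly like this (abridged; exact package names and versions may differ):

ii  libnvinfer-dev      8.0.3-1+cuda11.3   amd64  TensorRT development libraries and headers
ii  libnvinfer8         8.0.3-1+cuda11.3   amd64  TensorRT runtime libraries
ii  python3-libnvinfer  8.0.3-1+cuda11.3   amd64  Python 3 bindings for TensorRT
ii  tensorrt            8.0.3.4-1+cuda11.3 amd64  Meta package of TensorRT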

Detailed installation instructions can be found on the official “Installing TensorRT” page.

Step 2. Install PyCUDA package

PyCUDA is a Python package that provides access to the CUDA API. The Python programs described in this article require PyCUDA for basic CUDA functionality like managing CUDA device memory buffers.

Before starting the PyCUDA installation make sure that the NVIDIA CUDA compiler driver nvcc is accessible by entering the command:

nvcc --version

If this command fails, update the PATH environment variable:

export PATH=/usr/local/cuda/bin:$PATH

To install PyCUDA, enter the command:

python3 -m pip install pycuda
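To verify that PyCUDA is installed correctly and can see the GPU, you can run a short check like this:

python3 -c "import pycuda.driver as cuda; cuda.init(); print('CUDA devices:', cuda.Device.count()); print('Device 0:', cuda.Device(0).name())"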

Step 3. Convert a PyTorch model to ONNX

We will continue using the torchvision image classification models for our examples. As the first step, we will demonstrate conversion of the already familiar ResNet50 model to ONNX format.

The Python program generate_onnx_resnet50.py serves this purpose.

import torch
import torchvision.models as models

input = torch.rand(1, 3, 224, 224)

model = models.resnet50(pretrained=True)
model.eval()
output = model(input)
torch.onnx.export(model, input, "./onnx/resnet50.onnx", export_params=True)

This program:

- creates a random input tensor of shape (1, 3, 224, 224)
- constructs the pretrained ResNet50 model and switches it to evaluation mode
- runs the model once on the random input as a sanity check
- exports the model in ONNX format to ./onnx/resnet50.onnx

We store generated ONNX files in the subdirectory onnx, which must be created before running the program:

mkdir -p onnx

To run this program, use the command:

python3 generate_onnx_resnet50.py

The program will produce a file resnet50.onnx containing the ONNX model representation.
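Optionally, the generated file can be validated with the ONNX checker. This assumes the onnx Python package is available (it can be installed with python3 -m pip install onnx):

python3 -c "import onnx; onnx.checker.check_model(onnx.load('./onnx/resnet50.onnx')); print('OK')"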

The Python program generate_onnx_all.py can be used to produce ONNX descriptions for all considered torchvision image classification models.

import torch
import torchvision.models as models

MODELS = [
    ('alexnet', models.alexnet),

    ('densenet121', models.densenet121),
    ('densenet161', models.densenet161),
    ('densenet169', models.densenet169),
    ('densenet201', models.densenet201),

    ('mnasnet0_5', models.mnasnet0_5),
    ('mnasnet1_0', models.mnasnet1_0),

    ('mobilenet_v2', models.mobilenet_v2),
    ('mobilenet_v3_large', models.mobilenet_v3_large),
    ('mobilenet_v3_small', models.mobilenet_v3_small),

    ('resnet18', models.resnet18),
    ('resnet34', models.resnet34),
    ('resnet50', models.resnet50),
    ('resnet101', models.resnet101),
    ('resnet152', models.resnet152),

    ('resnext50_32x4d', models.resnext50_32x4d),
    ('resnext101_32x8d', models.resnext101_32x8d),

    ('shufflenet_v2_x0_5', models.shufflenet_v2_x0_5),
    ('shufflenet_v2_x1_0', models.shufflenet_v2_x1_0),

    ('squeezenet1_0', models.squeezenet1_0),
    ('squeezenet1_1', models.squeezenet1_1),

    ('vgg11', models.vgg11),
    ('vgg11_bn', models.vgg11_bn),
    ('vgg13', models.vgg13),
    ('vgg13_bn', models.vgg13_bn),
    ('vgg16', models.vgg16),
    ('vgg16_bn', models.vgg16_bn),
    ('vgg19', models.vgg19),
    ('vgg19_bn', models.vgg19_bn),

    ('wide_resnet50_2', models.wide_resnet50_2),
    ('wide_resnet101_2', models.wide_resnet101_2),
]

def generate_model(name, builder):
    print('Generate', name)
    input = torch.rand(1, 3, 224, 224)
    model = builder(pretrained=True)
    model.eval()
    output = model(input)
    onnx_path = './onnx/' + name + '.onnx'
    torch.onnx.export(model, input, onnx_path, export_params=True)

for name, model in MODELS:
    generate_model(name, model)

To run this program, enter the following commands:

mkdir -p onnx
python3 generate_onnx_all.py

Step 4. Convert ONNX format to TensorRT plan using Python

To perform inference of the ONNX model using TensorRT, it must be pre-processed using the TensorRT ONNX parser. We will start with conversion of the ONNX representation to the TensorRT plan. The TensorRT plan is a serialized form of a TensorRT engine. The TensorRT engine represents the model optimized for execution on a chosen CUDA device.
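As a side note, the TensorRT distribution also ships the trtexec command-line tool, which can perform the same conversion without any programming; with a Debian package installation it is typically found at /usr/src/tensorrt/bin/trtexec:

/usr/src/tensorrt/bin/trtexec --onnx=./onnx/resnet50.onnx --saveEngine=./plan/resnet50.plan

In this article, however, we implement the conversion explicitly using the TensorRT API.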

The Python program trt_onnx_parser.py serves this purpose.

import sys
import tensorrt as trt

def main():
    if len(sys.argv) != 3:
        sys.exit("Usage: python3 trt_onnx_parser.py <input_onnx_path> <output_plan_path>")

    onnx_path = sys.argv[1]
    plan_path = sys.argv[2]

    logger = trt.Logger()
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    config.max_workspace_size = 256 * 1024 * 1024
    config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)

    parser = trt.OnnxParser(network, logger)
    ok = parser.parse_from_file(onnx_path)
    if not ok:
        sys.exit("ONNX parse error")

    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as fp:
        fp.write(plan)

    print("DONE")

main()

The Python package tensorrt implements the TensorRT Python API and provides a collection of Python object classes used to handle various aspects of TensorRT inference and model parsing.

This program uses the following TensorRT API object classes:

- Logger: handles logging of errors, warnings, and informational messages
- Builder: builds a serialized engine from a network definition
- INetworkDefinition (returned by create_network): represents the model as a TensorRT network graph
- IBuilderConfig (returned by create_builder_config): holds the builder configuration
- OnnxParser: parses an ONNX model into a TensorRT network definition

The program performs the following steps:

- creates the logger, builder, network definition, and builder configuration objects
- sets the workspace memory limit and disables the timing cache
- parses the input ONNX file into the network definition
- builds the serialized network (the plan) and writes it to the output file

The program has two command line arguments: a path to the input ONNX file and a path to the output TensorRT plan file.

We store generated plan files in the subdirectory plan, which must be created before running the program:

mkdir -p plan

To run this program for conversion of ResNet50 ONNX representation, use the command:

python3 trt_onnx_parser.py ./onnx/resnet50.onnx ./plan/resnet50.plan
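To quickly check that a generated plan deserializes correctly, you can inspect its bindings with a short script along these lines (a minimal sketch using the TensorRT 8.0 Python API; the binding-related methods were deprecated in later TensorRT versions):

import tensorrt as trt

logger = trt.Logger()
runtime = trt.Runtime(logger)

# deserialize the plan into an engine
with open("./plan/resnet50.plan", "rb") as fp:
    engine = runtime.deserialize_cuda_engine(fp.read())

# print the name, direction, and shape of each binding
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(i, engine.get_binding_name(i), kind, engine.get_binding_shape(i))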

The Python program trt_onnx_parser_all.py can be used to produce TensorRT plans for all considered torchvision image classification models.

import sys
import tensorrt as trt

MODELS = [
    'alexnet',

    'densenet121',
    'densenet161',
    'densenet169',
    'densenet201',

    'mnasnet0_5',
    'mnasnet1_0',

    'mobilenet_v2',
    'mobilenet_v3_large',
    'mobilenet_v3_small',

    'resnet18',
    'resnet34',
    'resnet50',
    'resnet101',
    'resnet152',

    'resnext50_32x4d',
    'resnext101_32x8d',

    'shufflenet_v2_x0_5',
    'shufflenet_v2_x1_0',

    'squeezenet1_0',
    'squeezenet1_1',

    'vgg11',
    'vgg11_bn',
    'vgg13',
    'vgg13_bn',
    'vgg16',
    'vgg16_bn',
    'vgg19',
    'vgg19_bn',

    'wide_resnet50_2',
    'wide_resnet101_2',
]

def setup_builder():
    logger = trt.Logger()
    builder = trt.Builder(logger)
    return (logger, builder)

def generate_plan(logger, builder, name):
    print('Generate TensorRT plan for ' + name)

    onnx_path = './onnx/' + name + '.onnx'
    plan_path = './plan/' + name + '.plan'

    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    config = builder.create_builder_config()
    config.max_workspace_size = 256 * 1024 * 1024
    config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)

    parser = trt.OnnxParser(network, logger)
    ok = parser.parse_from_file(onnx_path)
    if not ok:
        sys.exit('ONNX parse error')

    plan = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as fp:
        fp.write(plan)

def main():
    logger, builder = setup_builder()
    for name in MODELS:
        generate_plan(logger, builder, name)
    print('DONE')

main()

To run this program, enter the following commands:

mkdir -p plan
python3 trt_onnx_parser_all.py

Step 5. Convert ONNX format to TensorRT plan using C++

Conversion of the ONNX representation to a TensorRT plan can also be implemented using the TensorRT C++ API.

The C++ program trt_onnx_parser.cpp serves this purpose.

#include <cstdio>
#include <cstdlib>
#include <cassert>

#include <NvInfer.h>
#include <NvOnnxParser.h>

#include "common.h"

// wrapper class for ONNX parser

class OnnxParser {
public:
    OnnxParser();
    ~OnnxParser();
public:
    void Init();
    void Parse(const char *onnxPath, const char *planPath);
private:
    bool m_active;
    Logger m_logger;
    UniquePtr<nvinfer1::IBuilder> m_builder;
    UniquePtr<nvinfer1::INetworkDefinition> m_network;
    UniquePtr<nvinfer1::IBuilderConfig> m_config;
    UniquePtr<nvonnxparser::IParser> m_parser;
};

OnnxParser::OnnxParser(): m_active(false) { }

OnnxParser::~OnnxParser() { }

void OnnxParser::Init() {
    assert(!m_active);
    m_builder.reset(nvinfer1::createInferBuilder(m_logger));
    if (m_builder == nullptr) {
        Error("Error creating infer builder");
    }
    auto networkFlags = 1 << int(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    m_network.reset(m_builder->createNetworkV2(networkFlags));
    if (m_network == nullptr) {
        Error("Error creating network");
    }
    m_config.reset(m_builder->createBuilderConfig());
    if (m_config == nullptr) {
        Error("Error creating builder config");
    }
    m_config->setMaxWorkspaceSize(256 * 1024 * 1024);
    m_config->setFlag(nvinfer1::BuilderFlag::kDISABLE_TIMING_CACHE);
    m_parser.reset(nvonnxparser::createParser(*m_network, m_logger));
    if (m_parser == nullptr) {
        Error("Error creating ONNX parser");
    }
}

void OnnxParser::Parse(const char *onnxPath, const char *planPath) {
    bool ok = m_parser->parseFromFile(onnxPath, static_cast<int>(m_logger.SeverityLevel()));
    if (!ok) {
        Error("ONNX parse error");
    }
    UniquePtr<nvinfer1::IHostMemory> plan(m_builder->buildSerializedNetwork(*m_network, *m_config));
    if (plan == nullptr) {
        Error("Network serialization error");
    }
    const void *data = plan->data();
    size_t size = plan->size();
    FILE *fp = fopen(planPath, "wb");
    if (fp == nullptr) {
        Error("Failed to create file %s", planPath);
    }
    fwrite(data, 1, size, fp);
    fclose(fp);
}

// main program

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: trt_onnx_parser <input_onnx_path> <output_plan_path>\n");
        return 1;
    }
    const char *onnxPath = argv[1];
    const char *planPath = argv[2];
    printf("Generate TensorRT plan for %s\n", onnxPath);
    OnnxParser parser;
    parser.Init();
    parser.Parse(onnxPath, planPath);
    return 0;
}

The program is functionally similar to the previously described Python program trt_onnx_parser.py. Plans generated by the Python and C++ versions are interchangeable; either can be used for subsequent inference with the Python and C++ programs described in this article.

The program uses the TensorRT C++ API specified in two header files:

- NvInfer.h: the core TensorRT inference API
- NvOnnxParser.h: the TensorRT ONNX parser API

This program uses the following TensorRT API object classes:

- nvinfer1::IBuilder: builds a serialized engine from a network definition
- nvinfer1::INetworkDefinition: represents the model as a TensorRT network graph
- nvinfer1::IBuilderConfig: holds the builder configuration
- nvonnxparser::IParser: parses an ONNX model into a TensorRT network definition

Class OnnxParser holds smart pointers to instances of these objects. It exposes two principal public methods: Init and Parse.

The Init method performs the following steps:

- creates the infer builder
- creates the network definition with the explicit batch flag
- creates the builder configuration, sets the workspace memory limit, and disables the timing cache
- creates the ONNX parser attached to the network definition

The Parse method performs the following steps:

- parses the input ONNX file into the network definition
- builds the serialized network (the plan)
- writes the plan data to the output file

The shell script build_trt_onnx_parser.sh must be used to compile and link this program:

#!/bin/bash

mkdir -p ./bin

g++ -o ./bin/trt_onnx_parser \
    -I /usr/local/cuda/include \
    trt_onnx_parser.cpp common.cpp \
    -L /usr/local/cuda/lib64 -lnvonnxparser -lnvinfer -lcudart

Running this script is straightforward:

./build_trt_onnx_parser.sh

The program has two command line arguments: a path to the input ONNX file and a path to the output TensorRT plan file.

To run this program for conversion of ResNet50 ONNX representation, use the command:

./bin/trt_onnx_parser ./onnx/resnet50.onnx ./plan/resnet50.plan

The shell script trt_onnx_parser_all.sh uses the C++ program to generate TensorRT plans for all considered torchvision image classification models:

#!/bin/bash

./bin/trt_onnx_parser ./onnx/alexnet.onnx ./plan/alexnet.plan
./bin/trt_onnx_parser ./onnx/densenet121.onnx ./plan/densenet121.plan
./bin/trt_onnx_parser ./onnx/densenet161.onnx ./plan/densenet161.plan
./bin/trt_onnx_parser ./onnx/densenet169.onnx ./plan/densenet169.plan
./bin/trt_onnx_parser ./onnx/densenet201.onnx ./plan/densenet201.plan
./bin/trt_onnx_parser ./onnx/mnasnet0_5.onnx ./plan/mnasnet0_5.plan
./bin/trt_onnx_parser ./onnx/mnasnet1_0.onnx ./plan/mnasnet1_0.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v2.onnx ./plan/mobilenet_v2.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v3_large.onnx ./plan/mobilenet_v3_large.plan
./bin/trt_onnx_parser ./onnx/mobilenet_v3_small.onnx ./plan/mobilenet_v3_small.plan
./bin/trt_onnx_parser ./onnx/resnet101.onnx ./plan/resnet101.plan
./bin/trt_onnx_parser ./onnx/resnet152.onnx ./plan/resnet152.plan
./bin/trt_onnx_parser ./onnx/resnet18.onnx ./plan/resnet18.plan
./bin/trt_onnx_parser ./onnx/resnet34.onnx ./plan/resnet34.plan
./bin/trt_onnx_parser ./onnx/resnet50.onnx ./plan/resnet50.plan
./bin/trt_onnx_parser ./onnx/resnext101_32x8d.onnx ./plan/resnext101_32x8d.plan
./bin/trt_onnx_parser ./onnx/resnext50_32x4d.onnx ./plan/resnext50_32x4d.plan
./bin/trt_onnx_parser ./onnx/shufflenet_v2_x0_5.onnx ./plan/shufflenet_v2_x0_5.plan
./bin/trt_onnx_parser ./onnx/shufflenet_v2_x1_0.onnx ./plan/shufflenet_v2_x1_0.plan
./bin/trt_onnx_parser ./onnx/squeezenet1_0.onnx ./plan/squeezenet1_0.plan
./bin/trt_onnx_parser ./onnx/squeezenet1_1.onnx ./plan/squeezenet1_1.plan
./bin/trt_onnx_parser ./onnx/vgg11.onnx ./plan/vgg11.plan
./bin/trt_onnx_parser ./onnx/vgg11_bn.onnx ./plan/vgg11_bn.plan
./bin/trt_onnx_parser ./onnx/vgg13.onnx ./plan/vgg13.plan
./bin/trt_onnx_parser ./onnx/vgg13_bn.onnx ./plan/vgg13_bn.plan
./bin/trt_onnx_parser ./onnx/vgg16.onnx ./plan/vgg16.plan
./bin/trt_onnx_parser ./onnx/vgg16_bn.onnx ./plan/vgg16_bn.plan
./bin/trt_onnx_parser ./onnx/vgg19.onnx ./plan/vgg19.plan
./bin/trt_onnx_parser ./onnx/vgg19_bn.onnx ./plan/vgg19_bn.plan
./bin/trt_onnx_parser ./onnx/wide_resnet101_2.onnx ./plan/wide_resnet101_2.plan
./bin/trt_onnx_parser ./onnx/wide_resnet50_2.onnx ./plan/wide_resnet50_2.plan

Running this script is straightforward:

mkdir -p ./plan
./trt_onnx_parser_all.sh

Step 6. Run TensorRT inference using Python

The inference programs in Python and C++ described in the rest of this article reuse several files introduced in Articles 1 and 2. These include:

- imagenet_classes.txt: the list of ImageNet class labels
- data/husky01.dat: the pre-processed input image
- common.h and common.cpp: shared C++ utility code (logging, error handling, CUDA buffers, timing, softmax, and Top-K helpers)

See the respective articles for details on obtaining these files.

The Python program trt_infer_plan.py implements TensorRT inference using the previously generated TensorRT plan and a pre-processed input image.

import sys
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

def softmax(x):
    y = np.exp(x)
    sum = np.sum(y)
    y /= sum
    return y

def topk(x, k):
    idx = np.argsort(x)
    idx = idx[::-1][:k]
    return (idx, x[idx])

def main():
    if len(sys.argv) != 3:
        sys.exit("Usage: python3 trt_infer_plan.py <plan_path> <input_path>")

    plan_path = sys.argv[1]
    input_path = sys.argv[2]

    print("Start " + plan_path)

    # read the plan
    with open(plan_path, "rb") as fp:
        plan = fp.read()

    # read the pre-processed image
    input = np.fromfile(input_path, np.float32)

    # read the categories
    with open("imagenet_classes.txt", "r") as f:
        categories = [s.strip() for s in f.readlines()]

    # initialize the TensorRT objects
    logger = trt.Logger()
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(plan)
    context = engine.create_execution_context()

    # create device buffers and TensorRT bindings
    output = np.zeros((1000), dtype=np.float32)
    d_input = cuda.mem_alloc(input.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    bindings = [int(d_input), int(d_output)]

    # copy input to device, run inference, copy output to host
    cuda.memcpy_htod(d_input, input)
    context.execute_v2(bindings=bindings)
    cuda.memcpy_dtoh(output, d_output)

    # apply softmax and get Top-5 results
    output = softmax(output)
    top5p, top5v = topk(output, 5)

    # print results
    print("Top-5 results")
    for ind, val in zip(top5p, top5v):
        print("  {0} {1:.2f}%".format(categories[ind], val * 100))

main()

This program uses the following TensorRT API object classes:

- Logger: handles logging of errors, warnings, and informational messages
- Runtime: deserializes the plan into an engine
- ICudaEngine (returned by deserialize_cuda_engine): represents the optimized model
- IExecutionContext (returned by create_execution_context): runs inference using the engine

The program performs the following steps:

- reads the plan file, the pre-processed input image, and the ImageNet class labels
- deserializes the plan into a CUDA engine and creates an execution context
- allocates CUDA device buffers for the input and output and sets up the bindings
- copies the input to the device, runs inference, and copies the output back to the host
- applies softmax to the output and prints the Top-5 classification results

The program has two command line arguments: a path to the TensorRT plan file and a path to the file containing the pre-processed input image.

To run this program for the previously created ResNet50 plan and husky image, use the command:

python3 trt_infer_plan.py ./plan/resnet50.plan ./data/husky01.dat

The program output will look like:

Start ./plan/resnet50.plan
Top-5 results
  Siberian husky 49.52%
  Eskimo dog 42.90%
  malamute 5.87%
  dogsled 1.22%
  Saint Bernard 0.32%

Step 7. Run TensorRT inference using C++

Inference with TensorRT models can also be implemented using the TensorRT C++ API.

The C++ program trt_infer_plan.cpp serves this purpose.

#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <string>
#include <vector>
#include <iostream>
#include <fstream>

#include <NvInfer.h>

#include "common.h"

// wrapper class for inference engine

class Engine {
public:
    Engine();
    ~Engine();
public:
    void Init(const std::vector<char> &plan);
    void Infer(const std::vector<float> &input, std::vector<float> &output);
    void DiagBindings();
private:
    bool m_active;
    Logger m_logger;
    UniquePtr<nvinfer1::IRuntime> m_runtime;
    UniquePtr<nvinfer1::ICudaEngine> m_engine;
};

Engine::Engine(): m_active(false) { }

Engine::~Engine() { }

void Engine::Init(const std::vector<char> &plan) {
    assert(!m_active);
    m_runtime.reset(nvinfer1::createInferRuntime(m_logger));
    if (m_runtime == nullptr) {
        Error("Error creating infer runtime");
    }
    m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));
    if (m_engine == nullptr) {
        Error("Error deserializing CUDA engine");
    }
    m_active = true;
}

void Engine::Infer(const std::vector<float> &input, std::vector<float> &output) {
    assert(m_active);
    UniquePtr<nvinfer1::IExecutionContext> context;
    context.reset(m_engine->createExecutionContext());
    if (context == nullptr) {
        Error("Error creating execution context");
    }
    CudaBuffer<float> inputBuffer;
    inputBuffer.Init(3 * 224 * 224);
    assert(inputBuffer.Size() == input.size());
    inputBuffer.Put(input.data());
    CudaBuffer<float> outputBuffer;
    outputBuffer.Init(1000);
    void *bindings[2];
    bindings[0] = inputBuffer.Data();
    bindings[1] = outputBuffer.Data();
    bool ok = context->executeV2(bindings);
    if (!ok) {
        Error("Error executing inference");
    }
    output.resize(outputBuffer.Size());
    outputBuffer.Get(output.data());
}

void Engine::DiagBindings() {
    int nbBindings = static_cast<int>(m_engine->getNbBindings());
    printf("Bindings: %d\n", nbBindings);
    for (int i = 0; i < nbBindings; i++) {
        const char *name = m_engine->getBindingName(i);
        bool isInput = m_engine->bindingIsInput(i);
        nvinfer1::Dims dims = m_engine->getBindingDimensions(i);
        std::string fmtDims = FormatDims(dims);
        printf("  [%d] \"%s\" %s [%s]\n", i, name, isInput ? "input" : "output", fmtDims.c_str());
    }
}

// I/O utilities

void ReadClasses(const char *path, std::vector<std::string> &classes) {
    std::string line;
    std::ifstream ifs(path, std::ios::in);
    if (!ifs.is_open()) {
        Error("Cannot open %s", path);
    }
    while (std::getline(ifs, line)) {
        classes.push_back(line);
    }
    ifs.close();
}

void ReadPlan(const char *path, std::vector<char> &plan) {
    std::ifstream ifs(path, std::ios::in | std::ios::binary);
    if (!ifs.is_open()) {
        Error("Cannot open %s", path);
    }
    ifs.seekg(0, ifs.end);
    size_t size = ifs.tellg();
    plan.resize(size);
    ifs.seekg(0, ifs.beg);
    ifs.read(plan.data(), size);
    ifs.close();
}

void ReadInput(const char *path, std::vector<float> &input) {
    std::ifstream ifs(path, std::ios::in | std::ios::binary);
    if (!ifs.is_open()) {
        Error("Cannot open %s", path);
    }
    size_t size = 3 * 224 * 224;
    input.resize(size);
    ifs.read(reinterpret_cast<char *>(input.data()), size * sizeof(float));
    ifs.close();
}

void PrintOutput(const std::vector<float> &output, const std::vector<std::string> &classes) {
    int top5p[5];
    float top5v[5];
    TopK(static_cast<int>(output.size()), output.data(), 5, top5p, top5v);
    printf("Top-5 results\n");
    for (int i = 0; i < 5; i++) {
        std::string label = classes[top5p[i]];
        float prob = 100.0f * top5v[i];
        printf("  [%d] %s %.2f%%\n", i, label.c_str(), prob);
    }
}

// main program

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: trt_infer_plan <plan_path> <input_path>\n");
        return 1;
    }
    const char *planPath = argv[1];
    const char *inputPath = argv[2];

    printf("Start %s\n", planPath);

    std::vector<std::string> classes;
    ReadClasses("imagenet_classes.txt", classes);

    std::vector<char> plan;
    ReadPlan(planPath, plan);

    std::vector<float> input;
    ReadInput(inputPath, input);

    std::vector<float> output;

    Engine engine;
    engine.Init(plan);
    engine.DiagBindings();
    engine.Infer(input, output);

    Softmax(static_cast<int>(output.size()), output.data());
    PrintOutput(output, classes);

    return 0;
}

This program uses the following TensorRT API object classes:

- nvinfer1::IRuntime: deserializes the plan into an engine
- nvinfer1::ICudaEngine: represents the optimized model
- nvinfer1::IExecutionContext: runs inference using the engine

Class Engine holds smart pointers to instances of these objects. It exposes two principal public methods, Init and Infer, plus the diagnostic method DiagBindings.

The Init method performs the following steps:

- creates the infer runtime
- deserializes the plan into a CUDA engine

The Infer method performs the following steps:

- creates an execution context
- allocates CUDA device buffers for the input and output and copies the input to the device
- sets up the bindings and executes inference
- copies the output back to the host

The program performs the following steps:

- reads the ImageNet class labels, the plan file, and the pre-processed input image
- initializes the engine from the plan and prints diagnostic information about its bindings
- runs inference
- applies softmax to the output and prints the Top-5 classification results

NOTE: In this program we intentionally use the deprecated version of the IRuntime::deserializeCudaEngine method that requires the trailing nullptr argument because, at the time of writing, using the new version without this argument sometimes caused unexpected program behavior on the considered GPU devices. The root cause of this problem has not yet been clarified; there might be an undocumented bug in the TensorRT inference library.
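For reference, the two call forms differ only in the trailing IPluginFactory pointer argument:

// deprecated form used in this article (the third argument is an unused IPluginFactory*)
m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));

// newer two-argument form that occasionally misbehaved in our tests
m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size()));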

The shell script build_trt_infer_plan.sh must be used to compile and link this program:

#!/bin/bash

mkdir -p ./bin

g++ -o ./bin/trt_infer_plan \
    -I /usr/local/cuda/include \
    trt_infer_plan.cpp common.cpp \
    -L /usr/local/cuda/lib64 -lnvinfer -lcudart

Running this script is straightforward:

./build_trt_infer_plan.sh

The program has two command line arguments: a path to the TensorRT plan file and a path to the file containing the pre-processed input image.

To run this program for the previously created ResNet50 plan and husky image, use the command:

./bin/trt_infer_plan ./plan/resnet50.plan ./data/husky01.dat

The program output will look like:

Bindings: 2
  [0] "input.1" input [1 3 224 224]
  [1] "495" output [1 1000]
Top-5 results
  [0] Siberian husky 49.53%
  [1] Eskimo dog 42.90%
  [2] malamute 5.87%
  [3] dogsled 1.22%
  [4] Saint Bernard 0.32%

Step 8. Run TensorRT benchmarking using Python

The Python program trt_bench_plan.py implements inference benchmarking using the previously generated TensorRT plan and a pre-processed input image.

import sys
from time import perf_counter
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

def softmax(x):
    y = np.exp(x)
    sum = np.sum(y)
    y /= sum
    return y

def topk(x, k):
    idx = np.argsort(x)
    idx = idx[::-1][:k]
    return (idx, x[idx])

def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python3 trt_bench_plan.py <plan_path>")

    plan_path = sys.argv[1]

    print("Start " + plan_path)

    # read the plan
    with open(plan_path, "rb") as fp:
        plan = fp.read()

    # generate random input
    np.random.seed(1234)
    input = np.random.random(3 * 224 * 224)
    input = input.astype(np.float32)

    # initialize the TensorRT objects
    logger = trt.Logger()
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(plan)
    context = engine.create_execution_context()

    # create device buffers and TensorRT bindings
    output = np.zeros((1000), dtype=np.float32)
    d_input = cuda.mem_alloc(input.nbytes)
    d_output = cuda.mem_alloc(output.nbytes)
    bindings = [int(d_input), int(d_output)]

    # copy input to device, run inference
    cuda.memcpy_htod(d_input, input)

    # warm up
    for i in range(10):
        context.execute_v2(bindings=bindings)

    # benchmark
    start = perf_counter()
    for i in range(100):
        context.execute_v2(bindings=bindings)
    end = perf_counter()
    elapsed = ((end - start) / 100) * 1000
    print('Model {0}: elapsed time {1:.2f} ms'.format(plan_path, elapsed))
    # record for automated extraction
    print('#{0};{1:f}'.format(plan_path, elapsed))

    # copy output to host
    cuda.memcpy_dtoh(output, d_output)

    # apply softmax and get Top-5 results
    output = softmax(output)
    top5p, top5v = topk(output, 5)

    # print results
    print("Top-5 results")
    for ind, val in zip(top5p, top5v):
        print("  {0} {1:.2f}%".format(ind, val * 100))

main()

This program uses the following TensorRT API object classes:

- Logger: handles logging of errors, warnings, and informational messages
- Runtime: deserializes the plan into an engine
- ICudaEngine: represents the optimized model
- IExecutionContext: runs inference using the engine

The program performs the following steps:

- reads the plan file and generates a random input image
- deserializes the plan into a CUDA engine and creates an execution context
- allocates CUDA device buffers and copies the input to the device
- runs several warm-up rounds, then measures the average time of 100 inference rounds
- copies the output back to the host, applies softmax, and prints the Top-5 results

The program prints a specially formatted line starting with "#" that will later be used for automated extraction of performance metrics.

The program uses a path to the TensorRT plan file as its single command line argument.

To run this program for the previously created ResNet50 plan, use the command:

python3 trt_bench_plan.py ./plan/resnet50.plan

The program output will look like:

Model ./plan/resnet50.plan: elapsed time 1.59 ms
Top-5 results
  610 6.29%
  549 5.21%
  446 5.00%
  783 3.20%
  892 2.93%

The shell script bench_plan_all_py.sh performs benchmarking of all supported torchvision models:

#!/bin/bash

echo "#head;TensorRT (Python)"

python3 trt_bench_plan.py ./plan/alexnet.plan
python3 trt_bench_plan.py ./plan/densenet121.plan
python3 trt_bench_plan.py ./plan/densenet161.plan
python3 trt_bench_plan.py ./plan/densenet169.plan
python3 trt_bench_plan.py ./plan/densenet201.plan
python3 trt_bench_plan.py ./plan/mnasnet0_5.plan
python3 trt_bench_plan.py ./plan/mnasnet1_0.plan
python3 trt_bench_plan.py ./plan/mobilenet_v2.plan
python3 trt_bench_plan.py ./plan/mobilenet_v3_large.plan
python3 trt_bench_plan.py ./plan/mobilenet_v3_small.plan
python3 trt_bench_plan.py ./plan/resnet101.plan
python3 trt_bench_plan.py ./plan/resnet152.plan
python3 trt_bench_plan.py ./plan/resnet18.plan
python3 trt_bench_plan.py ./plan/resnet34.plan
python3 trt_bench_plan.py ./plan/resnet50.plan
python3 trt_bench_plan.py ./plan/resnext101_32x8d.plan
python3 trt_bench_plan.py ./plan/resnext50_32x4d.plan
python3 trt_bench_plan.py ./plan/shufflenet_v2_x0_5.plan
python3 trt_bench_plan.py ./plan/shufflenet_v2_x1_0.plan
python3 trt_bench_plan.py ./plan/squeezenet1_0.plan
python3 trt_bench_plan.py ./plan/squeezenet1_1.plan
python3 trt_bench_plan.py ./plan/vgg11.plan
python3 trt_bench_plan.py ./plan/vgg11_bn.plan
python3 trt_bench_plan.py ./plan/vgg13.plan
python3 trt_bench_plan.py ./plan/vgg13_bn.plan
python3 trt_bench_plan.py ./plan/vgg16.plan
python3 trt_bench_plan.py ./plan/vgg16_bn.plan
python3 trt_bench_plan.py ./plan/vgg19.plan
python3 trt_bench_plan.py ./plan/vgg19_bn.plan
python3 trt_bench_plan.py ./plan/wide_resnet101_2.plan
python3 trt_bench_plan.py ./plan/wide_resnet50_2.plan

Running this script is straightforward:

./bench_plan_all_py.sh >bench_trt_py.log

The benchmarking log will be saved in bench_trt_py.log, which will later be used for performance comparison of various deployment methods.

Step 9. Run TensorRT benchmarking using C++

Benchmarking of TensorRT models can also be implemented using the TensorRT C++ API.

The C++ program trt_bench_plan.cpp serves this purpose.

#include <cstdio>
#include <cstdlib>
#include <cassert>
#include <vector>
#include <iostream>
#include <fstream>

#include <NvInfer.h>

#include "common.h"

// wrapper class for inference engine

class Engine {
public:
    Engine();
    ~Engine();
public:
    void Init(const std::vector<char> &plan);
    void StartInfer(const std::vector<float> &input);
    void RunInfer();
    void EndInfer(std::vector<float> &output);
private:
    bool m_active;
    Logger m_logger;
    UniquePtr<nvinfer1::IRuntime> m_runtime;
    UniquePtr<nvinfer1::ICudaEngine> m_engine;
    UniquePtr<nvinfer1::IExecutionContext> m_context;
    CudaBuffer<float> m_inputBuffer;
    CudaBuffer<float> m_outputBuffer;
};

Engine::Engine(): m_active(false) { }

Engine::~Engine() { }

void Engine::Init(const std::vector<char> &plan) {
    assert(!m_active);
    m_runtime.reset(nvinfer1::createInferRuntime(m_logger));
    if (m_runtime == nullptr) {
        Error("Error creating infer runtime");
    }
    m_engine.reset(m_runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr));
    if (m_engine == nullptr) {
        Error("Error deserializing CUDA engine");
    }
    m_active = true;
}

void Engine::StartInfer(const std::vector<float> &input) {
    assert(m_active);
    m_context.reset(m_engine->createExecutionContext());
    if (m_context == nullptr) {
        Error("Error creating execution context");
    }
    m_inputBuffer.Init(3 * 224 * 224);
    assert(m_inputBuffer.Size() == input.size());
    m_inputBuffer.Put(input.data());
    m_outputBuffer.Init(1000);
}

void Engine::RunInfer() {
    void *bindings[2];
    bindings[0] = m_inputBuffer.Data();
    bindings[1] = m_outputBuffer.Data();
    bool ok = m_context->executeV2(bindings);
    if (!ok) {
        Error("Error executing inference");
    }
}

void Engine::EndInfer(std::vector<float> &output) {
    output.resize(m_outputBuffer.Size());
    m_outputBuffer.Get(output.data());
}

// I/O utilities

void ReadPlan(const char *path, std::vector<char> &plan) {
    std::ifstream ifs(path, std::ios::in | std::ios::binary);
    if (!ifs.is_open()) {
        Error("Cannot open %s", path);
    }
    ifs.seekg(0, ifs.end);
    size_t size = ifs.tellg();
    plan.resize(size);
    ifs.seekg(0, ifs.beg);
    ifs.read(plan.data(), size);
    ifs.close();
}

void GenerateInput(std::vector<float> &input) {
    int size = 3 * 224 * 224;
    input.resize(size);
    float *p = input.data();
    std::srand(1234);
    for (int i = 0; i < size; i++) {
        p[i] = static_cast<float>(std::rand()) / RAND_MAX;
    }
}

void PrintOutput(const std::vector<float> &output) {
    int top5p[5];
    float top5v[5];
    TopK(static_cast<int>(output.size()), output.data(), 5, top5p, top5v);
    printf("Top-5 results\n");
    for (int i = 0; i < 5; i++) {
        int label = top5p[i];
        float prob = 100.0f * top5v[i];
        printf("  [%d] %d %.2f%%\n", i, label, prob);
    }
}

// main program

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: trt_bench_plan <plan_path>\n");
        return 1;
    }
    const char *planPath = argv[1];

    printf("Start %s\n", planPath);

    int repeat = 100;

    std::vector<char> plan;
    ReadPlan(planPath, plan);

    std::vector<float> input;
    GenerateInput(input);

    std::vector<float> output;

    Engine engine;
    engine.Init(plan);
    engine.StartInfer(input);

    for (int i = 0; i < 10; i++) {
        engine.RunInfer();
    }

    Timer timer;
    timer.Start();
    for (int i = 0; i < repeat; i++) {
        engine.RunInfer();
    }
    timer.Stop();
    float t = timer.Elapsed();
    printf("Model %s: elapsed time %f ms / %d = %f\n", planPath, t, repeat, t / float(repeat));
    // record for automated extraction
    printf("#%s;%f\n", planPath, t / float(repeat));

    engine.EndInfer(output);

    Softmax(static_cast<int>(output.size()), output.data());
    PrintOutput(output);

    return 0;
}

This program uses the following TensorRT API object classes:

- nvinfer1::IRuntime: deserializes the plan into an engine
- nvinfer1::ICudaEngine: represents the optimized model
- nvinfer1::IExecutionContext: runs inference using the engine

Class Engine holds smart pointers to instances of these objects. It exposes four principal public methods: Init, StartInfer, RunInfer, and EndInfer.

The Init method performs the following steps:

- creates the infer runtime
- deserializes the plan into a CUDA engine

The StartInfer method performs the following steps:

- creates an execution context
- allocates CUDA device buffers for the input and output
- copies the input to the device

The RunInfer method performs the following steps:

- sets up the bindings
- executes inference

The EndInfer method performs the following step:

- copies the output back to the host

The program performs the following steps:

- reads the plan file and generates a random input image
- initializes the engine and prepares inference
- runs several warm-up rounds, then measures the average time of 100 inference rounds
- fetches the output, applies softmax, and prints the Top-5 results

The program prints a specially formatted line starting with "#" that will later be used for automated extraction of performance metrics.

NOTE: In this program we intentionally use the deprecated version of the IRuntime::deserializeCudaEngine method that requires the trailing nullptr argument because, at the time of writing, using the new version without this argument sometimes caused unexpected program behavior on the considered GPU devices. The root cause of this problem has not yet been clarified; there might be an undocumented bug in the TensorRT inference library.

The shell script build_trt_bench_plan.sh must be used to compile and link this program:

#!/bin/bash

mkdir -p ./bin

g++ -o ./bin/trt_bench_plan \
    -I /usr/local/cuda/include \
    trt_bench_plan.cpp common.cpp \
    -L /usr/local/cuda/lib64 -lnvinfer -lcudart

Running this script is straightforward:

./build_trt_bench_plan.sh

The program has a single command line argument: a path to the TensorRT plan file.

To run this program for the previously created ResNet50 plan, use the command:

./bin/trt_bench_plan ./plan/resnet50.plan

The program output will look like:

Model ./plan/resnet50.plan: elapsed time 179.491653 ms / 100 = 1.794917
Top-5 results
  [0] 610 4.25%
  [1] 549 3.90%
  [2] 783 3.64%
  [3] 892 3.51%
  [4] 446 3.18%

The shell script bench_plan_all.sh performs benchmarking of all supported torchvision models:

#!/bin/bash

echo "#head;TensorRT (C++)"

./bin/trt_bench_plan ./plan/alexnet.plan
./bin/trt_bench_plan ./plan/densenet121.plan
./bin/trt_bench_plan ./plan/densenet161.plan
./bin/trt_bench_plan ./plan/densenet169.plan
./bin/trt_bench_plan ./plan/densenet201.plan
./bin/trt_bench_plan ./plan/mnasnet0_5.plan
./bin/trt_bench_plan ./plan/mnasnet1_0.plan
./bin/trt_bench_plan ./plan/mobilenet_v2.plan
./bin/trt_bench_plan ./plan/mobilenet_v3_large.plan
./bin/trt_bench_plan ./plan/mobilenet_v3_small.plan
./bin/trt_bench_plan ./plan/resnet101.plan
./bin/trt_bench_plan ./plan/resnet152.plan
./bin/trt_bench_plan ./plan/resnet18.plan
./bin/trt_bench_plan ./plan/resnet34.plan
./bin/trt_bench_plan ./plan/resnet50.plan
./bin/trt_bench_plan ./plan/resnext101_32x8d.plan
./bin/trt_bench_plan ./plan/resnext50_32x4d.plan
./bin/trt_bench_plan ./plan/shufflenet_v2_x0_5.plan
./bin/trt_bench_plan ./plan/shufflenet_v2_x1_0.plan
./bin/trt_bench_plan ./plan/squeezenet1_0.plan
./bin/trt_bench_plan ./plan/squeezenet1_1.plan
./bin/trt_bench_plan ./plan/vgg11.plan
./bin/trt_bench_plan ./plan/vgg11_bn.plan
./bin/trt_bench_plan ./plan/vgg13.plan
./bin/trt_bench_plan ./plan/vgg13_bn.plan
./bin/trt_bench_plan ./plan/vgg16.plan
./bin/trt_bench_plan ./plan/vgg16_bn.plan
./bin/trt_bench_plan ./plan/vgg19.plan
./bin/trt_bench_plan ./plan/vgg19_bn.plan
./bin/trt_bench_plan ./plan/wide_resnet101_2.plan
./bin/trt_bench_plan ./plan/wide_resnet50_2.plan

Running this script is straightforward:

./bench_plan_all.sh >bench_trt.log

The benchmarking log will be saved in bench_trt.log, which will later be used for performance comparison of various deployment methods.

Step 10. Extract performance metrics from benchmarking logs

The Python program merge_perf.py introduced in Article 2 extracts performance metrics from multiple benchmarking log files and merges them into a single CSV file in a form suitable for further analysis.

The program has two or more command line arguments, each specifying a path to a log file.

The program extracts special records starting with "#" from all input files, merges the extracted information, and saves it as a single CSV file. Each line of the output file corresponds to one model and each column corresponds to one deployment method.
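The actual merge_perf.py source is provided in the Article 2 repository; the following is only a minimal sketch of the extraction and merging logic, assuming the record formats shown above and that all logs cover the same set of models (model names are normalized by stripping the directory and file extension):

import os
import sys

def read_log(path):
    # extract "#" records: "#head;<column title>" or "#<model>;<time in ms>"
    header = path
    records = {}
    with open(path, "r") as fp:
        for line in fp:
            if not line.startswith("#"):
                continue
            key, value = line[1:].strip().split(";")
            if key == "head":
                header = value
            else:
                # normalize "./plan/resnet50.plan" to "resnet50"
                model = os.path.splitext(os.path.basename(key))[0]
                records[model] = "{0:.2f}".format(float(value))
    return header, records

headers = []
columns = []
for path in sys.argv[1:]:
    header, records = read_log(path)
    headers.append(header)
    columns.append(records)

print("Model;" + ";".join(headers))
for model in sorted(columns[0].keys()):
    print(model + ";" + ";".join(col.get(model, "") for col in columns))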

Assuming that the benchmarking described in Articles 1, 2, and 3 has been performed in the sibling directories art01, art02, and art03 respectively, and that the current directory is art03, the following command can be used to merge the log files considered so far:

python3 merge_perf.py ../art01/bench_torch.log ../art02/bench_ts_py.log ../art02/bench_ts.log bench_trt_py.log bench_trt.log >perf03.csv

The output file perf03.csv will look like:

Model;PyTorch;TorchScript (Python);TorchScript (C++);TensorRT (Python);TensorRT (C++)
alexnet;1.23;1.05;1.04;0.58;0.60
densenet121;19.79;13.65;13.34;3.73;3.67
densenet161;29.43;20.83;20.70;7.99;7.40
densenet169;28.47;19.33;20.11;8.17;7.32
densenet201;33.48;22.44;22.70;12.24;10.96
mnasnet0_5;5.45;3.63;3.67;0.64;0.61
mnasnet1_0;5.66;3.79;3.95;0.80;0.80
mobilenet_v2;6.19;4.12;4.02;0.77;0.76
mobilenet_v3_large;8.07;5.22;5.18;0.98;0.91
mobilenet_v3_small;6.37;4.20;4.19;0.74;0.67
resnet101;15.80;11.01;10.81;3.12;3.18
resnet152;23.66;16.65;16.37;4.57;4.57
resnet18;3.39;2.39;2.30;1.08;1.04
resnet34;6.11;4.22;4.11;1.84;1.79
resnet50;7.99;5.53;5.47;1.75;1.75
resnext101_32x8d;21.69;17.34;16.66;8.06;8.11
resnext50_32x4d;6.45;4.32;4.41;2.13;2.08
shufflenet_v2_x0_5;6.33;4.03;4.01;0.47;0.49
shufflenet_v2_x1_0;6.84;4.58;4.44;0.88;0.86
squeezenet1_0;3.05;2.28;2.33;0.41;0.42
squeezenet1_1;3.03;2.28;2.31;0.31;0.31
vgg11;1.91;1.81;1.84;1.74;1.75
vgg11_bn;2.37;1.93;1.96;1.75;1.75
vgg13;2.26;2.31;2.27;2.16;2.15
vgg13_bn;2.62;2.45;2.43;2.14;2.17
vgg16;2.82;2.75;2.88;2.64;2.61
vgg16_bn;3.23;3.10;3.06;2.61;2.65
vgg19;3.29;3.40;3.40;3.17;3.14
vgg19_bn;3.72;3.64;3.64;3.07;3.13
wide_resnet101_2;15.50;10.89;10.55;5.58;5.45
wide_resnet50_2;7.88;5.52;5.35;2.83;2.95

The Python program tab_perf.py introduced in Article 2 can be used to display the CSV data in tabular format.
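Again, the actual tab_perf.py comes from the Article 2 repository; a minimal sketch that renders such a ';'-separated CSV file as a fixed-width table could look like this (the real program may use a slightly different column alignment):

import sys

# read the ';'-separated CSV produced by merge_perf.py
with open(sys.argv[1], "r") as fp:
    rows = [line.strip().split(";") for line in fp if line.strip()]

# compute the width of each column over all rows
widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]

# print the header, a separator, and the data rows
header_line = "    ".join(h.ljust(w) for h, w in zip(rows[0], widths))
print(header_line)
print("-" * len(header_line))
for row in rows[1:]:
    print("    ".join(v.ljust(w) for v, w in zip(row, widths)))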

To run this program, use the following command line:

python3 tab_perf.py perf03.csv >perf03.txt

The output file perf03.txt will look like:

Model                    PyTorch      TorchScript (Python)    TorchScript (C++)    TensorRT (Python)    TensorRT (C++)
----------------------------------------------------------------------------------------------------------------------
alexnet                    1.23                1.05                  1.04                 0.58                0.60
densenet121               19.79               13.65                 13.34                 3.73                3.67
densenet161               29.43               20.83                 20.70                 7.99                7.40
densenet169               28.47               19.33                 20.11                 8.17                7.32
densenet201               33.48               22.44                 22.70                12.24               10.96
mnasnet0_5                 5.45                3.63                  3.67                 0.64                0.61
mnasnet1_0                 5.66                3.79                  3.95                 0.80                0.80
mobilenet_v2               6.19                4.12                  4.02                 0.77                0.76
mobilenet_v3_large         8.07                5.22                  5.18                 0.98                0.91
mobilenet_v3_small         6.37                4.20                  4.19                 0.74                0.67
resnet101                 15.80               11.01                 10.81                 3.12                3.18
resnet152                 23.66               16.65                 16.37                 4.57                4.57
resnet18                   3.39                2.39                  2.30                 1.08                1.04
resnet34                   6.11                4.22                  4.11                 1.84                1.79
resnet50                   7.99                5.53                  5.47                 1.75                1.75
resnext101_32x8d          21.69               17.34                 16.66                 8.06                8.11
resnext50_32x4d            6.45                4.32                  4.41                 2.13                2.08
shufflenet_v2_x0_5         6.33                4.03                  4.01                 0.47                0.49
shufflenet_v2_x1_0         6.84                4.58                  4.44                 0.88                0.86
squeezenet1_0              3.05                2.28                  2.33                 0.41                0.42
squeezenet1_1              3.03                2.28                  2.31                 0.31                0.31
vgg11                      1.91                1.81                  1.84                 1.74                1.75
vgg11_bn                   2.37                1.93                  1.96                 1.75                1.75
vgg13                      2.26                2.31                  2.27                 2.16                2.15
vgg13_bn                   2.62                2.45                  2.43                 2.14                2.17
vgg16                      2.82                2.75                  2.88                 2.64                2.61
vgg16_bn                   3.23                3.10                  3.06                 2.61                2.65
vgg19                      3.29                3.40                  3.40                 3.17                3.14
vgg19_bn                   3.72                3.64                  3.64                 3.07                3.13
wide_resnet101_2          15.50               10.89                 10.55                 5.58                5.45
wide_resnet50_2            7.88                5.52                  5.35                 2.83                2.95

Conclusion

Analysis of these performance data reveals that TensorRT provides a substantial performance increase over all previously considered deployment methods.

Differences between the Python and C++ TensorRT results are within experimental error; Python and C++ can be considered equally good for running TensorRT inference.

Further reading

The official documentation on NVIDIA TensorRT 8.0.3 can be used for further reference.

At the time of writing, the detailed API reference documentation was available only for version 8.0.1.

The index of documents covering all TensorRT versions is available here.

Appendix A. Benchmarking results on NVIDIA RTX 3090

All recommendations and examples described in Articles 1, 2, and 3 are also applicable to Genesis Cloud instances equipped with NVIDIA RTX 3090 GPUs. We have benchmarked inference for the image classification models on an RTX 3090 instance. Here are the results:

Model                    PyTorch      TorchScript (Python)    TorchScript (C++)    TensorRT (Python)    TensorRT (C++)
----------------------------------------------------------------------------------------------------------------------
alexnet                    1.30                0.97                  1.01                 0.52                0.52
densenet121               19.91               13.76                 13.80                 3.65                3.58
densenet161               29.76               19.78                 21.43                 7.23                7.19
densenet169               28.93               19.06                 19.67                 7.03                6.91
densenet201               34.34               21.90                 23.97                10.56               10.47
mnasnet0_5                 5.55                3.44                  3.78                 0.63                0.61
mnasnet1_0                 5.87                3.68                  3.88                 0.81                0.79
mobilenet_v2               6.21                3.90                  4.21                 0.73                0.71
mobilenet_v3_large         7.87                5.38                  5.65                 0.95                1.03
mobilenet_v3_small         6.49                4.38                  4.43                 0.70                0.79
resnet101                 16.09               10.39                 11.19                 3.26                3.07
resnet152                 24.34               15.38                 17.10                 4.54                4.55
resnet18                   3.37                2.21                  2.35                 1.05                1.08
resnet34                   6.11                4.03                  4.25                 1.84                1.75
resnet50                   8.21                5.67                  5.82                 1.72                1.75
resnext101_32x8d          22.09               16.38                 17.64                 7.85                7.97
resnext50_32x4d            6.53                4.13                  4.28                 2.05                2.12
shufflenet_v2_x0_5         6.53                3.90                  4.22                 0.49                0.57
shufflenet_v2_x1_0         7.08                4.36                  4.68                 0.89                0.91
squeezenet1_0              3.16                2.21                  2.40                 0.40                0.38
squeezenet1_1              3.09                2.15                  2.26                 0.31                0.31
vgg11                      1.92                1.55                  1.58                 1.50                1.51
vgg11_bn                   2.32                1.78                  1.84                 1.49                1.49
vgg13                      2.28                1.99                  1.99                 1.84                1.85
vgg13_bn                   2.71                2.12                  2.16                 1.84                1.85
vgg16                      2.68                2.46                  2.50                 2.27                2.25
vgg16_bn                   3.27                2.72                  3.55                 2.27                2.28
vgg19                      3.01                3.03                  3.12                 2.69                2.70
vgg19_bn                   8.21                3.11                  3.18                 2.74                2.72
wide_resnet101_2          15.34               10.02                 10.65                 5.36                5.21
wide_resnet50_2            7.98                5.22                  5.63                 2.80                2.75

For all considered models at batch size 1, there is almost no performance improvement compared to the RTX 3080 results listed above.