Deployment of Deep Learning models on Genesis Cloud - Installation and basic use of CUDA, PyTorch, and torchvision

Before you start, make sure that you have created an account with Genesis Cloud and finished the on-boarding steps (phone verification, adding SSH key, providing credit card). You can create an account here and get $15 in free credits. Furthermore, ensure that you have access to one NVIDIA RTX 3080 GPU instance (if not, request quota here).

This article will guide you through the basic steps required for installation and basic use of PyTorch and related software components on a Genesis Cloud GPU instance. The following topics are covered:

* creating a GPU instance on Genesis Cloud
* installing CUDA
* installing cuDNN
* building and running a simple CUDA / cuDNN program
* installing PyTorch
* running inference using torchvision
* benchmarking the torchvision models

In your Genesis Cloud account make sure that you have access to at least one NVIDIA RTX 3080 GPU. The examples in this article use the following software versions:

* Ubuntu 20.04
* CUDA 11.3.1
* cuDNN 8.2.1
* PyTorch 1.10.1
* torchvision 0.11.2

Various assets (source code, shell scripts, and data files) used in this article can be found in the supporting GitHub repository.

To run the examples described in this article, we recommend cloning the entire repository to your Genesis Cloud instance and making the subdirectory art01 your current directory.

Step 1. Creating a GPU instance on Genesis Cloud

We start by creating a new GPU instance via your Genesis Cloud dashboard. This instance will be used for running the examples described in this and the following articles of this series.

To create a new instance, visit the Create New Instance page. Complete the following steps:

* choose an instance type equipped with an NVIDIA RTX 3080 GPU
* select an Ubuntu 20.04 image
* assign your SSH key

Then click the “Create Instance” button.

Once your instance is ready, log in and proceed with the following steps. If you need additional help creating and running an instance, check the corresponding knowledge base article.

Step 2. Install CUDA

As the next step, we will install CUDA.

To install the desired version of CUDA, visit the CUDA Toolkit Archive page. Select the line for CUDA Toolkit 11.3.1. You will be redirected to the corresponding page. On this page, make the following selections:

* Operating System: Linux
* Architecture: x86_64
* Distribution: Ubuntu
* Version: 20.04
* Installer Type: deb (local)

The sequence of commands for installation of the selected version will be presented. At the time of writing this article, these commands were:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.3.1/local_installers/cuda-repo-ubuntu2004-11-3-local_11.3.1-465.19.01-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-3-local_11.3.1-465.19.01-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-3-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

As these commands might change in the future, we recommend using the commands that are actually presented on this page.

Enter these commands one by one (or build and execute the respective shell script). The last command will launch the CUDA installation process, which might take a while.

For this and similar installation steps, we advise creating a scratch directory (for example, ~/transit) and setting it as the current directory during the installation:

mkdir -p ~/transit
cd ~/transit

Upon successful installation, we recommend rebooting your instance by stopping and starting it from the Genesis Cloud web console.

We strongly advise you to take time and study the CUDA EULA, available by reference on this page.

To validate the CUDA installation, type the command:

nvidia-smi
You should get an output similar to this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:00:05.0 Off |                  N/A |
|  0%   21C    P8     6W / 320W |      5MiB / 10018MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       874      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

To use the NVIDIA CUDA compiler driver nvcc (which will be needed for more advanced examples), update the PATH environment variable:

export PATH=/usr/local/cuda/bin:$PATH

Then, to check the nvcc availability, type:

nvcc --version

Step 3. Install cuDNN

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. To install cuDNN, visit the distribution page. Select packages corresponding to the desired combination of CUDA and cuDNN versions. For each such combination there are two packages of interest representing the runtime and developer libraries.

At the time of writing this article, for CUDA 11.3 and cuDNN 8.2.1, these packages were:

* libcudnn8_8.2.1.32-1+cuda11.3_amd64.deb (runtime library)
* libcudnn8-dev_8.2.1.32-1+cuda11.3_amd64.deb (developer library)

Download these files by entering the respective wget commands with the download URLs shown on the distribution page, for example:

wget <download-url>/libcudnn8_8.2.1.32-1+cuda11.3_amd64.deb
wget <download-url>/libcudnn8-dev_8.2.1.32-1+cuda11.3_amd64.deb

(Here <download-url> is a placeholder for the package location displayed on the distribution page.)
As before, we recommend performing the installation from a separate scratch directory, e.g., ~/transit.

Then install the packages using the commands:

sudo dpkg -i libcudnn8_8.2.1.32-1+cuda11.3_amd64.deb
sudo dpkg -i libcudnn8-dev_8.2.1.32-1+cuda11.3_amd64.deb

Step 4. Simple CUDA / cuDNN program example

The C++ program cudnn_softmax implements a simple cuDNN example that initializes a CUDA tensor with random values, applies the cuDNN Softmax operation to it, and prints the top 5 values of the result:

#include <cstdio>
#include <cstdlib>
#include <cstdarg>
#include <cassert>
#include <vector>

#include <cuda_runtime.h>
#include <cudnn.h>

// error handling

void Error(const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);
    fputc('\n', stderr);
    exit(1);
}

// CUDA helpers

void CallCuda(cudaError_t stat) {
    if (stat != cudaSuccess) {
        Error("%s", cudaGetErrorString(stat));
    }
}

void *Malloc(int size) {
    void *ptr = nullptr;
    CallCuda(cudaMalloc(&ptr, size));
    return ptr;
}

void Free(void *ptr) {
    if (ptr != nullptr) {
        cudaFree(ptr);
    }
}

void Memget(void *dst, const void *src, int size) {
    CallCuda(cudaMemcpy(dst, src, size, cudaMemcpyDeviceToHost));
}

void Memput(void *dst, const void *src, int size) {
    CallCuda(cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice));
}

// simple RAII wrapper for a device memory buffer

template<typename T>
class CudaBuffer {
public:
    CudaBuffer():
        m_size(0), m_data(nullptr) { }
    ~CudaBuffer() {
        Done();
    }
public:
    void Init(int size) {
        assert(m_data == nullptr);
        m_size = size;
        m_data = static_cast<T *>(Malloc(size * sizeof(T)));
    }
    void Done() {
        if (m_data != nullptr) {
            Free(m_data);
            m_size = 0;
            m_data = nullptr;
        }
    }
    int Size() const {
        return m_size;
    }
    const T *Data() const {
        return m_data;
    }
    T *Data() {
        return m_data;
    }
    void Get(float *host) const {
        Memget(host, m_data, m_size * sizeof(T));
    }
    void Put(const float *host) {
        Memput(m_data, host, m_size * sizeof(T));
    }
private:
    int m_size;
    T *m_data;
};

// cuDNN helpers

void CallCudnn(cudnnStatus_t stat) {
    if (stat != CUDNN_STATUS_SUCCESS) {
        Error("%s", cudnnGetErrorString(stat));
    }
}

cudnnHandle_t g_cudnnHandle;

void CudnnInit() {
    CallCudnn(cudnnCreate(&g_cudnnHandle));
}

void CudnnDone() {
    CallCudnn(cudnnDestroy(g_cudnnHandle));
}

cudnnHandle_t CudnnHandle() {
    return g_cudnnHandle;
}

// wrapper class for softmax primitive

class Softmax {
public:
    Softmax();
    ~Softmax();
public:
    void Init(int n, int c, int h, int w);
    void Done();
    void Forward(const float *x, float *y);
private:
    bool m_active;
    cudnnHandle_t m_handle;
    cudnnTensorDescriptor_t m_desc;
};

Softmax::Softmax():
    m_active(false),
    m_handle(nullptr),
    m_desc(nullptr) { }

Softmax::~Softmax() {
    Done();
}

void Softmax::Init(int n, int c, int h, int w) {
    assert(!m_active);
    m_handle = CudnnHandle();
    CallCudnn(cudnnCreateTensorDescriptor(&m_desc));
    CallCudnn(cudnnSetTensor4dDescriptor(
        m_desc,
        CUDNN_TENSOR_NCHW,
        CUDNN_DATA_FLOAT,
        n, c, h, w));
    m_active = true;
}

void Softmax::Done() {
    if (!m_active) {
        return;
    }
    CallCudnn(cudnnDestroyTensorDescriptor(m_desc));
    m_desc = nullptr;
    m_active = false;
}

void Softmax::Forward(const float *x, float *y) {
    static float one = 1.0f;
    static float zero = 0.0f;
    CallCudnn(cudnnSoftmaxForward(
        m_handle,
        CUDNN_SOFTMAX_ACCURATE,
        CUDNN_SOFTMAX_MODE_CHANNEL,
        &one, m_desc, x,
        &zero, m_desc, y));
}

// Top-K helper: selects the k largest values and their positions

void TopK(int count, const float *data, int k, int *pos, float *val) {
    for (int i = 0; i < k; i++) {
        pos[i] = -1;
        val[i] = 0.0f;
    }
    for (int p = 0; p < count; p++) {
        float v = data[p];
        int j = -1;
        for (int i = 0; i < k; i++) {
            if (pos[i] < 0 || val[i] < v) {
                j = i;
                break;
            }
        }
        if (j >= 0) {
            for (int i = k - 1; i > j; i--) {
                pos[i] = pos[i-1];
                val[i] = val[i-1];
            }
            pos[j] = p;
            val[j] = v;
        }
    }
}

// handling input and output

void SetInput(CudaBuffer<float> &b) {
    int size = b.Size();
    std::vector<float> h(size);
    float *p = h.data();
    for (int i = 0; i < size; i++) {
        p[i] = static_cast<float>(std::rand()) / RAND_MAX;
    }
    b.Put(p);
}

void GetOutput(const CudaBuffer<float> &b) {
    int size = b.Size();
    std::vector<float> h(size);
    float *p = h.data();
    b.Get(p);
    int top5p[5];
    float top5v[5];
    TopK(size, p, 5, top5p, top5v);
    for (int i = 0; i < 5; i++) {
        printf("[%d] pos %d val %g\n", i, top5p[i], top5v[i]);
    }
}

// main program

int main() {
    CudnnInit();
    Softmax softmax;
    int n = 1;
    int c = 1000;
    int h = 1;
    int w = 1;
    softmax.Init(n, c, h, w);
    int size = n * c * h * w;
    CudaBuffer<float> x;
    CudaBuffer<float> y;
    x.Init(size);
    y.Init(size);
    SetInput(x);
    softmax.Forward(x.Data(), y.Data());
    GetOutput(y);
    softmax.Done();
    CudnnDone();
    return 0;
}

The following shell script compiles and links this program:

#!/bin/bash

mkdir -p ./bin

g++ -o ./bin/cudnn_softmax -I /usr/local/cuda/include \
    cudnn_softmax.cpp \
    -L /usr/local/cuda/lib64 -lcudnn -lcudart

Running this script is straightforward; it places the executable in the ./bin subdirectory. To run the program, use the command line:

./bin/cudnn_softmax
The program output will look like:

[0] pos 754 val 0.00159118
[1] pos 814 val 0.00159045
[2] pos 961 val 0.00159016
[3] pos 938 val 0.00158901
[4] pos 717 val 0.00158761
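
As a sanity check on these numbers: softmax of 1000 values drawn uniformly from [0, 1] yields outputs close to 1/1000 = 0.001. The denominator is approximately 1000 · E[e^x] = 1000 · (e − 1) ≈ 1718, while the largest numerator is at most e ≈ 2.718, so the top values are close to 2.718 / 1718 ≈ 0.0016, consistent with the values printed above.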

Step 5. Install PyTorch

To install and use PyTorch, the Python interpreter and package installer pip are required. When a new instance is created on Genesis Cloud, Python 3 is automatically preinstalled; however, pip must be installed explicitly. This can be done using the command:

sudo apt install python3-pip

To install PyTorch, visit the product site and select the desired configuration as follows:

* PyTorch Build: Stable
* Your OS: Linux
* Package: Pip
* Language: Python
* Compute Platform: CUDA 11.3

The command for installation of the selected configuration will be presented. At the time of writing this article, this command was:

pip3 install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

Replace pip3 with python3 -m pip and execute the resulting command:

python3 -m pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 torchaudio==0.10.1+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html

On completion of the installation, start Python in interactive mode and enter a few commands to validate the availability of the torch package, CUDA device, and cuDNN:

>>> import torch
>>> torch.__version__
'1.10.1+cu113'
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
1
>>> torch.cuda.get_device_name()
'NVIDIA GeForce RTX 3080'
>>> torch.backends.cudnn.version()
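
For convenience, the same checks can be collected into a small script (a sketch; the file name check_torch.py is our choice and is not part of the supporting repository):

import torch

# report PyTorch version and GPU / cuDNN availability
print('torch version: ', torch.__version__)
print('CUDA available:', torch.cuda.is_available())
print('device count:  ', torch.cuda.device_count())
if torch.cuda.is_available():
    print('device name:   ', torch.cuda.get_device_name())
print('cuDNN version: ', torch.backends.cudnn.version())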

Step 6. Inference using torchvision

The torchvision package is part of the PyTorch project. It includes various computational assets (model architectures, image transformations, and datasets) that facilitate using PyTorch for computer vision.

The torchvision package is installed automatically together with PyTorch. In this and the following articles we will use torchvision models to demonstrate various aspects of deep learning implementation on Genesis Cloud infrastructure.

We will start with an example of direct usage of torchvision models for image classification.
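
In a nutshell, a pre-trained torchvision classifier is an ordinary PyTorch module that maps an image batch to ImageNet class logits. The following minimal sketch illustrates this with a dummy input; the full program below adds real image preprocessing and Top-5 reporting:

import torch
from torchvision import models

model = models.resnet50(pretrained=True)  # weights are downloaded on first use
model.eval()
with torch.no_grad():
    output = model(torch.rand(1, 3, 224, 224))  # batch of one dummy image
print(output.shape)  # torch.Size([1, 1000]): one logit per ImageNet class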

As input, we will need an arbitrary image containing a single object to be classified. We will use this husky image in our experiments (it is in the public domain). Use the following commands to create a subdirectory data for holding input files and to download the image:

mkdir data
wget -O data/husky01.jpg

Torchvision image classification models have been trained on the ImageNet dataset and use 1000 image classes labeled by consecutive integer numbers. The text file imagenet_classes.txt contains class descriptions for all labels; it can be obtained using this command:

wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt
Torchvision provides an extensive set of image classification models. In the following example, we will use ResNet50.

The following Python program reads the image, performs classification, and outputs the top 5 results with the respective probabilities.

import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

IMG_PATH = "./data/husky01.jpg"

# load the pre-trained model and set inference mode
resnet50 = models.resnet50(pretrained=True)
resnet50.eval()

# specify image transformations: the standard ImageNet evaluation pipeline
transform = transforms.Compose([
    transforms.Resize(256),        # scale the shorter side to 256 pixels
    transforms.CenterCrop(224),    # crop the central 224 x 224 patch
    transforms.ToTensor(),         # convert to a CHW float tensor in [0, 1]
    transforms.Normalize(          # normalize with the ImageNet statistics
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225])
])

# import and transform image
img = Image.open(IMG_PATH)
img = transform(img)

# create a batch of one image
input = torch.unsqueeze(img, 0)

# move the input and model to GPU
if torch.cuda.is_available():
    input = input.to("cuda")
    resnet50.to("cuda")

# run inference
with torch.no_grad():
    output = resnet50(input)

# apply softmax and get Top-5 results
output = F.softmax(output, dim=1)
top5 = torch.topk(output[0], 5)

# read the categories
with open("imagenet_classes.txt", "r") as f:
    categories = [s.strip() for s in f.readlines()]

# print results
for ind, val in zip(top5.indices, top5.values):
    print("{0} {1:.2f}%".format(categories[ind], val * 100))

The program performs the following main actions:

* loads the pre-trained ResNet50 model
* specifies the image transformations
* reads and transforms the input image
* creates an input batch containing a single image
* moves the input and the model to the GPU, if available
* runs inference and applies softmax to obtain class probabilities
* reads the class descriptions and prints the Top-5 results

The program uses the same sequence of image transformations that is commonly applied during training of models on the ImageNet dataset.

The combination of model.eval() and torch.no_grad() calls is commonly used when running inference with PyTorch models.
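
As a reminder, the pattern looks as follows (model and input are assumed to be defined already):

model.eval()           # switch layers such as dropout and batch norm to inference behavior
with torch.no_grad():  # disable gradient tracking to save memory and computation
    output = model(input)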

To run this program, invoke it with python3.
The program output will look like:

Siberian husky 49.52%
Eskimo dog 42.90%
malamute 5.87%
dogsled 1.22%
Saint Bernard 0.32%

You can also experiment with the other classification models from the torchvision library as well as with other images.
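
For example, to classify with DenseNet-121 (one of the models benchmarked later in this article), only the model construction line needs to change:

# replaces: resnet50 = models.resnet50(pretrained=True)
resnet50 = models.densenet121(pretrained=True)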

Step 7. Benchmarking the torchvision models

We will perform simple inference benchmarking of torchvision models by running the inference multiple times and measuring the average wall clock time required to complete one run.

The following Python program implements benchmarking of ResNet50:

from time import perf_counter
import torch
import torch.nn.functional as F
import torchvision.models as models

name = 'resnet50'
print('Start ' + name)

# create model
resnet50 = models.resnet50(pretrained=True).cuda()
resnet50.eval()

# create dummy input
input = torch.rand(1, 3, 224, 224).cuda()

# benchmark model: 10 warmup runs, then 100 measured runs
with torch.no_grad():
    for i in range(10):
        resnet50(input)

start = perf_counter()
with torch.no_grad():
    for i in range(100):
        resnet50(input)
end = perf_counter()

elapsed = ((end - start) / 100) * 1000
print('Model {0}: elapsed time {1:.2f} ms'.format(name, elapsed))

# print Top-5 results
with torch.no_grad():
    output = resnet50(input)
top5 = F.softmax(output, dim=1).topk(5).indices
print('Top 5 results:\n {}'.format(top5))

The program performs the following steps:

* creates the ResNet50 model and moves it to the GPU
* creates a dummy input: a random tensor of shape (1, 3, 224, 224)
* benchmarks inference for this model and input
* prints the Top-5 result labels

The benchmarking includes 10 “warmup” inference runs followed by 100 runs for which the total wall clock time is measured. The measured time is divided by the number of runs and the average time for one run in milliseconds is displayed.
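
The same warmup-then-measure pattern can be factored into a reusable helper. This sketch is not part of the original program; it adds torch.cuda.synchronize() calls so that all queued GPU work is completed before each timestamp is taken:

from time import perf_counter
import torch

def benchmark(model, input, warmup=10, runs=100):
    # returns the average time of one inference run, in milliseconds
    with torch.no_grad():
        for _ in range(warmup):
            model(input)
        torch.cuda.synchronize()
        start = perf_counter()
        for _ in range(runs):
            model(input)
        torch.cuda.synchronize()
        return (perf_counter() - start) / runs * 1000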

To run this program, invoke it with python3.
The program output will look like:

Start resnet50
Model resnet50: elapsed time 8.01 ms
Top 5 results:
 tensor([[783, 610, 549, 892, 600]], device='cuda:0')

The Python program bench_model.py is more general; it implements benchmarking of almost any given torchvision image classification model:

import sys
from time import perf_counter
import torch
import torch.nn.functional as F
import torchvision.models as models

def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python3 bench_model <model_name>")

    name = sys.argv[1]
    print('Start ' + name)

    # create model

    builder = getattr(models, name)
    model = builder(pretrained=True).cuda()
    model.eval()  # set inference mode

    # create dummy input

    input = torch.rand(1, 3, 224, 224).cuda()

    # benchmark model

    with torch.no_grad():
        for i in range(10):
            model(input)

    start = perf_counter()
    with torch.no_grad():
        for i in range(100):
            model(input)
    end = perf_counter()

    elapsed = ((end - start) / 100) * 1000
    print('Model {0}: elapsed time {1:.2f} ms'.format(name, elapsed))
    # record for automated extraction
    print('#{0};{1:f}'.format(name, elapsed))

    # print Top-5 results

    output = model(input)
    top5 = F.softmax(output, dim=1).topk(5)
    top5p = top5.indices.detach().cpu().numpy()
    top5v = top5.values.detach().cpu().numpy()

    print("Top-5 results")
    for ind, val in zip(top5p[0], top5v[0]):
        print("  {0} {1:.2f}%".format(ind, val * 100))


The program uses a model name as its single command line argument.

NOTE: The model googlenet is not supported because of the specific format of the input tensors it uses.

The program performs the same steps as the previous one, except that the model is constructed from the name given on the command line (via getattr on the torchvision.models module).

The program prints a specially formatted line starting with "#" that will later be used for the automated extraction of performance metrics.
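
For illustration, such records could be pulled from a saved log with a few lines of Python (a hypothetical sketch, not part of the supporting repository; it skips the '#head' header record, which carries no timing value):

# extract (model, time) pairs from a saved benchmarking log
with open('bench_torch.log') as f:
    for line in f:
        if line.startswith('#') and not line.startswith('#head'):
            name, ms = line[1:].strip().split(';')
            print(name, float(ms))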

To run this program for ResNet50, use the command:

python3 bench_model.py resnet50

The program output will look something like this (the input is randomized, so the results differ between executions):

Start resnet50
Model resnet50: elapsed time 7.94 ms
Top-5 results
  610 5.20%
  549 4.50%
  783 3.84%
  892 3.42%
  446 3.23%

The following shell script performs benchmarking of all supported torchvision models:


echo "#head;PyTorch"

python3 bench_model.py alexnet
python3 bench_model.py densenet121
python3 bench_model.py densenet161
python3 bench_model.py densenet169
python3 bench_model.py densenet201
python3 bench_model.py mnasnet0_5
python3 bench_model.py mnasnet1_0
python3 bench_model.py mobilenet_v2
python3 bench_model.py mobilenet_v3_large
python3 bench_model.py mobilenet_v3_small
python3 bench_model.py resnet101
python3 bench_model.py resnet152
python3 bench_model.py resnet18
python3 bench_model.py resnet34
python3 bench_model.py resnet50
python3 bench_model.py resnext101_32x8d
python3 bench_model.py resnext50_32x4d
python3 bench_model.py shufflenet_v2_x0_5
python3 bench_model.py shufflenet_v2_x1_0
python3 bench_model.py squeezenet1_0
python3 bench_model.py squeezenet1_1
python3 bench_model.py vgg11
python3 bench_model.py vgg11_bn
python3 bench_model.py vgg13
python3 bench_model.py vgg13_bn
python3 bench_model.py vgg16
python3 bench_model.py vgg16_bn
python3 bench_model.py vgg19
python3 bench_model.py vgg19_bn
python3 bench_model.py wide_resnet101_2
python3 bench_model.py wide_resnet50_2

Running this script is straightforward:

./ >bench_torch.log

The benchmarking log will be saved in bench_torch.log, which is later used for performance comparison of various deployment methods.


Using PyTorch and Python directly is perhaps the simplest and most straightforward way to run inference; however, there are considerably more efficient methods for model deployment on GPU-enabled infrastructure. We will discuss these methods in the subsequent articles.

So stay tuned and keep accelerating with Genesis Cloud!