Matrix multiplication on GPU - TorchScript

Accelerating Nx with TorchScript


Mix.install(
  [
    {:nx, "~> 0.4.0"},
    {:scidata, "~> 0.1.9"},
    {:torchx, "~> 0.3"}
  ],
  system_env: %{"LIBTORCH_TARGET" => "cu116"}
)
:ok

Before running the notebook

This notebook depends on TorchScript. Torchx can use either your CPU or your GPU. If you have direct access to an Nvidia GPU, the notebook has a section on running matrix multiplication on a GPU. If you only have a CPU, you can comment out the final GPU section and run everything on the CPU, which is still quite fast for this simple notebook.

According to the documentation, https://github.com/elixir-nx/nx/tree/main/torchx#readme, Torchx needs to compile the TorchScript binding. Before you run the cell above, you will need make/nmake, cmake (3.12+), and a C++ compiler. The Windows binding to TorchScript is also supported; more information can be found in the Torchx README. At this time, the macOS binding doesn’t support access to a GPU.

Running the first cell downloads and compiles the binding to TorchScript. On our system, downloading TorchScript took about 9 minutes and compilation took about 1 minute. In the future, the downloaded TorchScript file will likely be cached locally; for now, each notebook that uses Torchx downloads the file again.

The notebook is currently set up for an Nvidia GPU on Linux:

system_env: %{"LIBTORCH_TARGET" => "cu116"}

Feel free to read the Torchx documentation and modify the install cell to fit your needs.
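
If you don’t have a supported GPU, a CPU-only install is possible as well. Here is a minimal sketch, assuming the "cpu" LIBTORCH_TARGET value documented in the Torchx README:

# CPU-only variant of the install cell; valid LIBTORCH_TARGET
# values are listed in the Torchx README
Mix.install(
  [
    {:nx, "~> 0.4.0"},
    {:scidata, "~> 0.1.9"},
    {:torchx, "~> 0.3"}
  ],
  system_env: %{"LIBTORCH_TARGET" => "cpu"}
)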

Context

This notebook is a transformation of a Python Jupyter notebook from Fast.ai’s From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. Specifically, it mimics the CUDA portion of https://github.com/fastai/course22p2/blob/master/nbs/01_matmul.ipynb.

The purpose of the transformation is to bring the Fast.ai concepts to Elixir-focused developers. The object-oriented Python/PyTorch implementation is transformed into a functional programming implementation using Nx and Axon.

Experimenting with backend control

In this notebook, we are going to experiment with swapping out backends within the same notebook. One of the strengths of Elixir’s numerical processing approach is the concept of a backend: the same Nx code can run on several different backends, which allows Nx to adapt to changes in numerical libraries and technology. Currently, Nx has support for TensorFlow’s XLA and PyTorch’s TorchScript. Theoretically, backends for SoC-type devices should also be possible.

We chose not to set the backend globally in this notebook (a sketch of how you could is shown below). At the beginning of the notebook, we’ll repeat the approach we used in 01a_matmul_using_CPU and begin with the Elixir BinaryBackend. You’ll see that it isn’t quick at multiplying 10,000 rows of MNIST data by some arbitrary weights.
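
For reference, picking one backend for the whole notebook would look something like the following sketch, which assumes the Nx.global_default_backend/1 and Nx.default_backend/1 functions from Nx 0.4:

# Set the default backend for all processes (we deliberately don't do this here)
Nx.global_default_backend({Torchx.Backend, device: :cpu})

# Or set it only for the calling process, the approach used later in this notebook
Nx.default_backend({Torchx.Backend, device: :cpu})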

We’ll then repeat the same multiplication using TorchScript on the CPU, followed by TorchScript on an Nvidia 1080 Ti GPU. The 1080 Ti is not the fastest GPU available, but on this “large” dataset it was dramatically faster than the BinaryBackend and noticeably faster than TorchScript on the CPU:

17,778 times faster than the Binary backend

111 times faster on the GPU vs. the CPU
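
After you run the timing cells below, you can compute the equivalent speedups for your own hardware. A small sketch, using the average times printed by the timing cells on our system (the exact ratios vary from run to run and machine to machine):

# Substitute the average times (in milliseconds) printed by your timing cells
binary_avg_ms = 31_846.6
torchx_cpu_avg_ms = 1.715
torchx_gpu_avg_ms = 0.0718

# Speedup of Torchx on the CPU over the BinaryBackend,
# and of the GPU over Torchx on the CPU
{Float.round(binary_avg_ms / torchx_cpu_avg_ms, 1),
 Float.round(torchx_cpu_avg_ms / torchx_gpu_avg_ms, 1)}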

Default - BinaryBackend

# Without choosing a backend, Nx defaults to Nx.BinaryBackend
Nx.default_backend()
{Nx.BinaryBackend, []}
# Just in case you rerun the notebook, let's make sure the default backend is BinaryBackend
# Setting to the Nx default backend
Nx.default_backend(Nx.BinaryBackend)
Nx.default_backend()
{Nx.BinaryBackend, []}

We’ll pull down the MNIST data.

{train_images, train_labels} = Scidata.MNIST.download()
{test_images, test_labels} = Scidata.MNIST.download_test()
{{<<0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, {:u, 8}, {10000, 1, 28, 28}},
 {<<7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1,
    3, 4, 7, 2, 7, 1, 2, 1, 1, 7, 4, 2, 3, 5, 1, ...>>, {:u, 8}, {10000}}}
{train_images_binary, train_tensor_type, train_shape} = train_images
{test_images_binary, test_tensor_type, test_shape} = test_images
{<<0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, {:u, 8}, {10000, 1, 28, 28}}
{train_tensor_type, test_tensor_type}
{{:u, 8}, {:u, 8}}

Convert the images into tensors and normalize the pixel values to between 0 and 1.

train_tensors =
  train_images_binary
  |> Nx.from_binary(train_tensor_type)
  |> Nx.reshape({60000, 28 * 28})
  |> Nx.divide(255)
#Nx.Tensor<
  f32[60000][784]
  [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...],
    ...
  ]
>

We’ll separate the data into 50,000 train images and 10,000 validation images.

x_train = train_tensors[0..49_999]
x_valid = train_tensors[50_000..59_999]
{x_train.shape, x_valid.shape}
{{50000, 784}, {10000, 784}}

Training is more stable when the random weights are initialized with a mean of 0.0 and a variance of 1.0. (Nx.random_normal/4 takes a standard deviation rather than a variance, but for a value of 1.0 they are identical.)

mean = 0.0
variance = 1.0
weights = Nx.random_normal({784, 10}, mean, variance, type: {:f, 32})
#Nx.Tensor<
  f32[784][10]
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>
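
As an aside, if you want the weights to be reproducible between runs, Nx 0.4 also provides a seeded random API. A minimal sketch, assuming the Nx.Random module from Nx 0.4:

# Seeded alternative to Nx.random_normal/4; the same key always yields the same weights
key = Nx.Random.key(42)
{seeded_weights, _new_key} = Nx.Random.normal(key, 0.0, 1.0, shape: {784, 10}, type: :f32)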

In order to simplify timing the performance of the Nx.dot/2 function, we’ll use a 0-arity anonymous function. Invoking the anonymous function will always use the same two tensors, x_valid and weights.

large_nx_mult_fn = fn -> Nx.dot(x_valid, weights) end
#Function<43.3316493/0 in :erl_eval.expr/6>

The following anonymous function takes a function and the number of times to call it.

repeat = fn timed_fn, times -> Enum.each(1..times, fn _x -> timed_fn.() end) end
#Function<41.3316493/2 in :erl_eval.expr/6>

We time the average duration of the dot-product function. The cell outputs the average and total elapsed time in milliseconds.

repeat_times = 5
{elapsed_time_micro, _} = :timer.tc(repeat, [large_nx_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, _device} = Nx.default_backend()

"#{backend} CPU avg time in #{avg_elapsed_time_ms} milliseconds  total_time #{elapsed_time_micro / 1000} milliseconds"
"Elixir.Nx.BinaryBackend CPU avg time in 31846.6328 milliseconds  total_time 159233.164 milliseconds"

TorchScript CPU only

We’ll switch to the TorchScript backend but we’ll stick with using the CPU.

Nx.default_backend({Torchx.Backend, device: :cpu})
Nx.default_backend()
{Torchx.Backend, [device: :cpu]}

In the following cell, we transfer the target data from the BinaryBackend to the Torchx CPU backend.

x_valid_torchx_cpu = Nx.backend_transfer(x_valid, {Torchx.Backend, device: :cpu})
weights_torchx_cpu = Nx.backend_transfer(weights, {Torchx.Backend, device: :cpu})
#Nx.Tensor<
  f32[784][10]
  Torchx.Backend(cpu)
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>
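
Note that Nx.backend_transfer/2 moves the tensor’s data and deallocates it from the source backend (deallocation is a no-op for the BinaryBackend, which is why we can reuse x_valid again later). If you need the source tensor to remain valid on a device-backed backend, Nx.backend_copy/2 copies instead:

# backend_copy leaves the original tensor alive on its current backend
x_valid_copy = Nx.backend_copy(x_valid, {Torchx.Backend, device: :cpu})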

An anonymous function that calls Nx.dot/2 with data on the Torchx CPU backend.

torchx_cpu_mult_fn = fn -> Nx.dot(x_valid_torchx_cpu, weights_torchx_cpu) end
#Function<43.3316493/0 in :erl_eval.expr/6>

We’ll time using Torchx on the CPU. Notice the significant performance improvement over BinaryBackend while still using just the CPU.

repeat_times = 5
{elapsed_time_micro, _} = :timer.tc(repeat, [torchx_cpu_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, [device: device]} = Nx.default_backend()

"#{backend} #{device} avg time in milliseconds #{avg_elapsed_time_ms} total_time #{elapsed_time_micro / 1000}"
"Elixir.Torchx.Backend cpu avg time in milliseconds 1.7149999999999999 total_time 8.575"

TorchScript using GPU

We’ll switch to using the cuda device. If you have a different device, replace all the :cuda specifications with your device.
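
Before switching, you may want to confirm that your TorchScript binding was actually built with CUDA support. A quick check, assuming the Torchx.device_available?/1 helper described in the Torchx docs:

# Should return true when the LibTorch build can see a CUDA device
Torchx.device_available?(:cuda)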

Nx.default_backend({Torchx.Backend, device: :cuda})
Nx.default_backend()
{Torchx.Backend, [device: :cuda]}

In the following cell, we transfer the target data onto the GPU.

x_valid_cuda = Nx.backend_transfer(x_valid, {Torchx.Backend, device: :cuda})
weights_cuda = Nx.backend_transfer(weights, {Torchx.Backend, device: :cuda})
#Nx.Tensor<
  f32[784][10]
  Torchx.Backend(cuda)
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>

An anonymous function that calls Nx.dot/2 with data on the GPU.

torchx_gpu_mult_fn = fn -> Nx.dot(x_valid_cuda, weights_cuda) end
#Function<43.3316493/0 in :erl_eval.expr/6>

We’ll warm up the GPU by looping through 5 function calls and then timing the next 5 function calls.

repeat_times = 5
# Warmup run: the first calls to the GPU include one-time initialization costs
{elapsed_time_micro, _} = :timer.tc(repeat, [torchx_gpu_mult_fn, repeat_times])
# Timed run
{elapsed_time_micro, _} = :timer.tc(repeat, [torchx_gpu_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, [device: device]} = Nx.default_backend()

"#{backend} #{device} avg time in milliseconds #{avg_elapsed_time_ms} total_time #{elapsed_time_micro / 1000}"
"Elixir.Torchx.Backend cuda avg time in milliseconds 0.0718 total_time 0.359"
Finally, we transfer the tensors back to the BinaryBackend, which removes the data from the GPU.

x_valid = Nx.backend_transfer(x_valid_cuda, Nx.BinaryBackend)
weights = Nx.backend_transfer(weights_cuda, Nx.BinaryBackend)
#Nx.Tensor<
  f32[784][10]
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>

fastai, livebook, axon, foundations, deep_learning