Along the Axon

Exploring Elixir Nx Backends

How do backends work for Nx? What is the impact of XLA and TorchScript backends on performance?

TL;DR Choose one of the accelerated backends over the default BinaryBackend.

Matrix Multiplication - shape => {10000, 784} * {784, 10}

32 seconds BinaryBackend
1.8 milliseconds TorchScript CPU
1.3 milliseconds XLA CPU
140 microseconds XLA GPU - 1080Ti
70 microseconds TorchScript GPU - 1080Ti

The metrics are for a very small problem so please don’t read too much into the relative XLA vs. TorchScript results. Some people were expecting XLA to have performed better than TorchScript.

Fast.ai Deep Learning For Coders, Part 2

I’ve started a project on GitHub, dl_foundations_in_elixir. In this project, I collect the Elixir Livebook notebooks that I create while taking the From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. The course is happening live for the past four weeks, so far.

Jeremy spent several lessons on matrix multiplication. He covered a variety of foundational concepts beyond simple matrix multiplication.

Retrieving a datasource
Viewing in-memory images
Iterating through a datasource
Matrix and tensors
Random numbers
Matrix multiplication you already know how to do
Optimizing CPU matrix multiplication
Complex looking math formulas that are really simple in code
Broadcasting
- with a scalar
- vector to a matrix
Math operations on tensors
Reshaping Tensors
Optimizing matrix multiplication with GPU

This blog post introduces three notebooks where I explore Nx’s ability to optimize CPU and GPU matrix multiplication.

In PyTorch, the ability to run on the CPU is pretty simple. Unless you specifically request GPU allocated memory, PyTorch code runs on the CPU.

To run a method call on the GPU, the data needs to be moved to the GPU. Calling the method uses the GPU. The resulting data needs to be moved back to the CPU.

m1c,m2c = x_train.cuda(),weights.cuda()
r=(m1c@m2c).cpu()

Nx Backends

One of the strengths of Elixir’s numerical processing approach is the concept of a backend. The same Nx code can run on several different backends. This allows Nx to adapt to changes in numerical libaries and technology. Currently, Nx has support for Tensorflow’s XLA and PyTorch’s TorchScript. Theoretically, backends for System on a Chip (SOC) devices should be possible.

TorchScript

To specify the use of TorchScript, we need to include torchx in the Mix.install of a Livebook cell. There are requirements identified in my notebook, but the detailed requirements are at the Torchx documentation site. One item to note is that downloading TorchScript can take a while and it happens each time Mix.install is changed. Key Nx team members are aware of this challenge. It is likely they will cache the download like they do for XLA.

Mix.install(
  [
    ...
    {:torchx, "~> 0.3"}
  ]
)

Just including torchx will result in using only the CPU. While not a fast as a GPU, it is far faster than using the BinaryBackend.
To use Torchx with your Nvidia GPU, add a LIBTORCH_TARGET environment variable like:

Mix.install(
  [
    ...
    {:torchx, "~> 0.3"}
  ],
  system_env: %{"LIBTORCH_TARGET" => "cu116"}
)

In my TorchScript notebook, I explored the ability to switch between BinaryBackend, TorchScript on CPU and TorchScript on GPU. Like the PyTorch code above, we need to move the data from the CPU to the GPU before we can run an Nx function on the GPU.

x_valid_torchx_cpu = Nx.backend_transfer(x_valid, {Torchx.Backend, device: :cpu})
weights_torchx_cpu = Nx.backend_transfer(weights, {Torchx.Backend, device: :cpu})

We don’t always want all numerical operations to run on the GPU. Sometimes, we want to perform work on the CPU when working in a GPU constrained environment. For example, using the GPU to augment a batch of training data competes with the training that is also happening on the GPU.

XLA

My XLA GPU notebook and CPU notebook demonstrate using XLA as a backend. Using XLA opens up more GPU devices. XLA supports NVidia, AMD’s ROCm and Google’s TPUs. The following Mix.install uses XLA on a CPU.

Mix.install(
  [
    ...
    {:exla, "~> 0.4"}
  ]

To run on an NVidia GPU, set the XLA_TARGET like:

    {:exla, "~> 0.4"}
  ],
  system_env: %{"XLA_TARGET" => "cuda111"}
)

The reason XLA has 2 example notebooks is worth discussing. I started with XLA first and created a notebook very similar to the TorchScript notebook. However, we found the Exla wasn’t honoring Nx.default_backend/1 changes. By the way, notice that the default_backend/1 has a side-effect. It changes the state of the Nx runtime to a different backend configuration.

Nx checks where the data is allocated as a part of Nx function calls. My notebook was failing in the XLA CPU section. I had requested an Exla CPU backend but the data was being allocated on the GPU. Jose pointed out that the backend was cuda when I was expecting host.

Nx.backend_transfer(....

f32[784][10]
  EXLA.Backend<cuda:0

So a lesson learned that may help you when encountering unexpected errors. Check the backend allocation of the data used in the function call. The fix for this issue has been checked into master. For now, I used two notebooks to demonstrate XLA GPU and XLA CPU.

Because of these results, I’ll save my notebooks as either Torchx or EXLA instead of using the default BinaryBackend. If you have questions or concerns about the notebooks, please create an issue on the GitHub project, https://github.com/meanderingstream/dl_foundations_in_elixir. Pull requests are always welcome.

2022-11-06

fastai, nx, axon, foundations, deep_learning