Along the Axon

Exploring Elixir Machine Learning

Exploring Elixir Nx Backends

How do backends work for Nx? What is the impact of XLA and TorchScript backends on performance?


TLDR; Choose one of the accelerated backends over the default BinaryBackend.

Matrix Multiplication - shape => {10000, 784} * {784, 10}

  • 32 seconds BinaryBackend
  • 1.8 milliseconds TorchScript CPU
  • 1.3 milliseconds XLA CPU
  • 140 microseconds XLA GPU - 1080Ti
  • 70 microseconds TorchScript GPU - 1080Ti

The metrics are for a very small problem so please don’t read too much into the relative XLA vs. TorchScript results. Some people were expecting XLA to have performed better than TorchScript.

Fast.ai Deep Learning For Coders, Part 2

I’ve started a project on GitHub, dl_foundations_in_elixir. In this project, I collect the Elixir Livebook notebooks that I create while taking From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. The course has been running live for the past four weeks, so far.

Jeremy spent several lessons on matrix multiplication. He covered a variety of foundational concepts beyond simple matrix multiplication.

  • Retrieving a datasource
  • Viewing in-memory images
  • Iterating through a datasource
  • Matrix and tensors
  • Random numbers
  • Matrix multiplication you already know how to do
  • Optimizing CPU matrix multiplication
  • Complex looking math formulas that are really simple in code
  • Broadcasting
    • with a scalar
    • vector to a matrix
  • Math operations on tensors
  • Reshaping Tensors
  • Optimizing matrix multiplication with GPU

This blog post introduces three notebooks where I explore Nx’s ability to optimize CPU and GPU matrix multiplication.

In PyTorch, running on the CPU is pretty simple. Unless you specifically request GPU-allocated memory, PyTorch code runs on the CPU.

To run a method call on the GPU, the data needs to be moved to the GPU. The method call then executes on the GPU, and the resulting data needs to be moved back to the CPU.

m1c,m2c = x_train.cuda(),weights.cuda()
r=(m1c@m2c).cpu()

Nx Backends

One of the strengths of Elixir’s numerical processing approach is the concept of a backend. The same Nx code can run on several different backends. This allows Nx to adapt to changes in numerical libraries and technology. Currently, Nx has support for Tensorflow’s XLA and PyTorch’s TorchScript. Theoretically, backends for System on a Chip (SOC) devices should be possible.
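
Here is a hedged sketch of the idea; it assumes torchx is installed, as shown in the next section. The same Nx expression runs wherever the tensor’s data lives.

# Tensors created without further options land on the default backend,
# which is the pure-Elixir Nx.BinaryBackend
a = Nx.tensor([[1.0, 2.0], [3.0, 4.0]])

# Move the same data onto an accelerated backend
# (EXLA.Backend works the same way as Torchx.Backend here)
b = Nx.backend_transfer(a, Torchx.Backend)

# Identical Nx code runs on either backend
Nx.dot(a, Nx.transpose(a))
Nx.dot(b, Nx.transpose(b))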

TorchScript

To specify the use of TorchScript, we need to include torchx in the Mix.install of a Livebook cell. There are requirements identified in my notebook, but the detailed requirements are at the Torchx documentation site. One item to note is that downloading TorchScript can take a while and it happens each time Mix.install is changed. Key Nx team members are aware of this challenge. It is likely they will cache the download like they do for XLA.

Mix.install(
  [
    ...
    {:torchx, "~> 0.3"}
  ]
)

Just including torchx will result in using only the CPU. While not as fast as a GPU, it is far faster than using the BinaryBackend.
To use Torchx with your Nvidia GPU, add a LIBTORCH_TARGET environment variable like:

Mix.install(
  [
    ...
    {:torchx, "~> 0.3"}
  ],
  system_env: %{"LIBTORCH_TARGET" => "cu116"}
)

In my TorchScript notebook, I explored the ability to switch between BinaryBackend, TorchScript on CPU and TorchScript on GPU. Like the PyTorch code above, we need to move the data from the CPU to the GPU before we can run an Nx function on the GPU.

x_valid_torchx_cpu = Nx.backend_transfer(x_valid, {Torchx.Backend, device: :cpu})
weights_torchx_cpu = Nx.backend_transfer(weights, {Torchx.Backend, device: :cpu})

We don’t always want all numerical operations to run on the GPU. Sometimes, we want to perform work on the CPU when working in a GPU constrained environment. For example, using the GPU to augment a batch of training data competes with the training that is also happening on the GPU.
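
Sketched in Elixir, the same round trip as the PyTorch snippet above looks roughly like the following. This is a hedged example that assumes torchx was compiled with CUDA support via LIBTORCH_TARGET.

# Move the operands onto the GPU
m1c = Nx.backend_transfer(x_train, {Torchx.Backend, device: :cuda})
m2c = Nx.backend_transfer(weights, {Torchx.Backend, device: :cuda})

# Nx.dot/2 runs on the GPU because that is where the data lives
r_gpu = Nx.dot(m1c, m2c)

# Bring the result back to the CPU-based BinaryBackend
r = Nx.backend_transfer(r_gpu, Nx.BinaryBackend)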

XLA

My XLA GPU notebook and CPU notebook demonstrate using XLA as a backend. Using XLA opens up more GPU devices. XLA supports NVidia, AMD’s ROCm and Google’s TPUs. The following Mix.install uses XLA on a CPU.

Mix.install(
  [
    ...
    {:exla, "~> 0.4"}
  ]
)

To run on an NVidia GPU, set the XLA_TARGET like:

Mix.install(
  [
    ...
    {:exla, "~> 0.4"}
  ],
  system_env: %{"XLA_TARGET" => "cuda111"}
)

The reason XLA has two example notebooks is worth discussing. I started with XLA first and created a notebook very similar to the TorchScript notebook. However, we found that EXLA wasn’t honoring Nx.default_backend/1 changes. By the way, notice that default_backend/1 has a side effect: it changes the state of the Nx runtime to a different backend configuration.

Nx checks where the data is allocated as a part of Nx function calls. My notebook was failing in the XLA CPU section. I had requested an EXLA CPU backend, but the data was being allocated on the GPU. José pointed out that the backend was cuda when I was expecting host.

Nx.backend_transfer(....
f32[784][10]
  EXLA.Backend<cuda:0

So, a lesson learned that may help you when encountering unexpected errors: check the backend allocation of the data used in the function call. The fix for this issue has been checked into master. For now, I used two notebooks to demonstrate XLA GPU and XLA CPU.
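
As a quick, hedged illustration of that check (x_valid here stands for any tensor involved in the call): inspecting a tensor prints the backend, and device, holding its data right under the type and shape.

# The inspect output includes a line such as
#   Torchx.Backend(cpu)  or  EXLA.Backend<cuda:0, ...>
IO.inspect(x_valid)

# The default backend only applies to tensors created after it is set
IO.inspect(Nx.default_backend())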

Because of these results, I’ll save my notebooks as either Torchx or EXLA instead of using the default BinaryBackend. If you have questions or concerns about the notebooks, please create an issue on the GitHub project, https://github.com/meanderingstream/dl_foundations_in_elixir. Pull requests are always welcome.

fastai, nx, axon, foundations, deep_learning

Matrix multiplication on GPU - TorchScript

Accelerating Nx with Torchscript

Matrix multiplication on GPU - TorchScript

Run in Livebook

Mix.install(
  [
    {:nx, "~> 0.4.0"},
    {:scidata, "~> 0.1.9"},
    {:torchx, "~> 0.3"}
  ],
  system_env: %{"LIBTORCH_TARGET" => "cu116"}
)
:ok

Before running notebook

This notebook has a dependency on TorchScript. Torchx can use your CPU or GPU. If you have direct access to an NVidia GPU, the notebook has a section on running matrix multiplication on a GPU. If you only have a CPU, you can comment out the last GPU section and just run on your CPU. CPU is still pretty fast for this simple notebook.

According to the documentation, https://github.com/elixir-nx/nx/tree/main/torchx#readme Torchx will need to compile the TorchScript binding. Before you run the above cell, you will need make/nmake, cmake (3.12+) and a C++ compiler. The Windows binding to TorchScript is also supported and more information can be found at the Torchx readme. At this time, the MacOS binding doesn’t support access to a GPU.

Running the first cell downloads and compiles the binding to TorchScript. The download of TorchScript took about 9 minutes and compilation took about 1 minute on our system. In the future, it is likely that the downloaded TorchScript file will be cached locally, however, right now each notebook that uses torchx will download the file.

The notebook is currently set up for an Nvidia GPU on Linux.

system_env: %{"LIBTORCH_TARGET" => "cu111"}

Feel free to read the Torchx documentation and modify to fit your needs.

Context

The notebook is a transformation of a Python Jupyter Notebook from Fast.ai’s From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. Specifically, it mimics the CUDA portion of https://github.com/fastai/course22p2/blob/master/nbs/01_matmul.ipynb

The purpose of the transformation is to bring the Fast.ai concepts to Elixir focused developers. The object-oriented Python/PyTorch implementation is transformed into a functional programming implementation using Nx and Axon.

Experimenting with backend control

In this notebook, we are going to experiment with swapping out backends in the same notebook. One of the strengths of Elixir’s numerical processing approach is the concept of a backend. The same Nx code can run on several different backends. This allows Nx to adapt to changes in numerical libraries and technology. Currently, Nx has support for Tensorflow’s XLA and PyTorch’s TorchScript. Theoretically, backends for SOC type devices should be possible.

We chose not to set the backend globally in this notebook. At the beginning of the notebook, we’ll repeat the approach we used in 01a_matmul_using_CPU. We begin with the Elixir BinaryBackend. You’ll see that it isn’t quick at multiplying 10,000 rows of MNIST data by some arbitrary weights.

We’ll then repeat the same multiplication using TorchScript on the CPU, followed by TorchScript using an NVidia 1080Ti GPU. The 1080 Ti is not the fastest GPU, but it is tremendously faster than the BinaryBackend on a “large” set of data, and only a little faster than TorchScript on just the CPU.

  • About 32 seconds using BinaryBackend with only a CPU.
  • 1.8 milliseconds using TorchScript with only a CPU

17,778 times faster than Binary backend

  • 70 microseconds using TorchScript with a warmed up, but old, GPU

Roughly 25 times faster on the GPU vs. TorchScript on the CPU

Default - BinaryBackend

# Without choosing a backend, Nx defaults to Nx.BinaryBackend
Nx.default_backend()
{Nx.BinaryBackend, []}
# Just in case you rerun the notebook, let's make sure the default backend is BinaryBackend
# Setting to the Nx default backend
Nx.default_backend(Nx.BinaryBackend)
Nx.default_backend()
{Nx.BinaryBackend, []}

We’ll pull down the MNIST data

{train_images, train_labels} = Scidata.MNIST.download()
{test_images, test_labels} = Scidata.MNIST.download_test()
{{<<0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, {:u, 8}, {10000, 1, 28, 28}},
 {<<7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4, 9, 6, 6, 5, 4, 0, 7, 4, 0, 1, 3, 1,
    3, 4, 7, 2, 7, 1, 2, 1, 1, 7, 4, 2, 3, 5, 1, ...>>, {:u, 8}, {10000}}}
{train_images_binary, train_tensor_type, train_shape} = train_images
{test_images_binary, test_tensor_type, test_shape} = test_images
{<<0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, {:u, 8}, {10000, 1, 28, 28}}
{train_tensor_type, test_tensor_type}
{{:u, 8}, {:u, 8}}

Convert into Tensors and normalize to between 0 and 1

train_tensors =
  train_images_binary
  |> Nx.from_binary(train_tensor_type)
  |> Nx.reshape({60000, 28 * 28})
  |> Nx.divide(255)
#Nx.Tensor<
  f32[60000][784]
  [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...],
    ...
  ]
>

We’ll separate the data into 50,000 train images and 10,000 validation images.

x_train = train_tensors[0..49_999]
x_valid = train_tensors[50_000..59_999]
{x_train.shape, x_valid.shape}
{{50000, 784}, {10000, 784}}

Training is more stable when random numbers are initialized with a mean of 0.0 and a variance of 1.0

mean = 0.0
variance = 1.0
weights = Nx.random_normal({784, 10}, mean, variance, type: {:f, 32})
#Nx.Tensor<
  f32[784][10]
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>

In order to simplify timing the performance of the Nx.dot/2 function, we’ll use a 0-parameter anonymous function. Invoking the anonymous function will always use the same two tensors, x_valid and weights.

large_nx_mult_fn = fn -> Nx.dot(x_valid, weights) end
#Function<43.3316493/0 in :erl_eval.expr/6>

The following anonymous function takes a function and the number of times to call that function.

repeat = fn timed_fn, times -> Enum.each(1..times, fn _x -> timed_fn.() end) end
#Function<41.3316493/2 in :erl_eval.expr/6>

Time the average duration of the dot multiply function. The cell will output the average and total elapsed time.

repeat_times = 5
{elapsed_time_micro, _} = :timer.tc(repeat, [large_nx_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, _device} = Nx.default_backend()

"#{backend} CPU avg time in #{avg_elapsed_time_ms} milliseconds  total_time #{elapsed_time_micro / 1000} milliseconds"
"Elixir.Nx.BinaryBackend CPU avg time in 31846.6328 milliseconds  total_time 159233.164 milliseconds"

TorchScript CPU only

We’ll switch to the TorchScript backend but we’ll stick with using the CPU.

Nx.default_backend({Torchx.Backend, device: :cpu})
Nx.default_backend()
{Torchx.Backend, [device: :cpu]}

In the following cell, we transfer the target data from the BinaryBackend to the Torchx CPU backend.

x_valid_torchx_cpu = Nx.backend_transfer(x_valid, {Torchx.Backend, device: :cpu})
weights_torchx_cpu = Nx.backend_transfer(weights, {Torchx.Backend, device: :cpu})
#Nx.Tensor<
  f32[784][10]
  Torchx.Backend(cpu)
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>

An anonymous function that calls Nx.dot/2 with data on the Torchx cpu backend.

torchx_cpu_mult_fn = fn -> Nx.dot(x_valid_torchx_cpu, weights_torchx_cpu) end
#Function<43.3316493/0 in :erl_eval.expr/6>

We’ll time using Torchx on the CPU. Notice the significant performance improvement over BinaryBackend while still using just the CPU.

repeat_times = 5
{elapsed_time_micro, _} = :timer.tc(repeat, [torchx_cpu_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, [device: device]} = Nx.default_backend()

"#{backend} #{device} avg time in milliseconds #{avg_elapsed_time_ms} total_time #{elapsed_time_micro / 1000}"
"Elixir.Torchx.Backend cpu avg time in milliseconds 1.7149999999999999 total_time 8.575"

TorchScript using GPU

We’ll switch to using the cuda device. If you have a different device, replace all the :cuda specifications with your device.

Nx.default_backend({Torchx.Backend, device: :cuda})
Nx.default_backend()
{Torchx.Backend, [device: :cuda]}

In the following cell, we transfer the target data onto the GPU.

x_valid_cuda = Nx.backend_transfer(x_valid, {Torchx.Backend, device: :cuda})
weights_cuda = Nx.backend_transfer(weights, {Torchx.Backend, device: :cuda})
#Nx.Tensor<
  f32[784][10]
  Torchx.Backend(cuda)
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>

An anonymous function that calls Nx.dot/2 with data on the GPU

torchx_gpu_mult_fn = fn -> Nx.dot(x_valid_cuda, weights_cuda) end
#Function<43.3316493/0 in :erl_eval.expr/6>

We’ll warm up the GPU by looping through 5 function calls and then timing the next 5 function calls.

repeat_times = 5
# Warmup
{elapsed_time_micro, _} = :timer.tc(repeat, [torchx_gpu_mult_fn, repeat_times])
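# The real timing starts here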
{elapsed_time_micro, _} = :timer.tc(repeat, [torchx_gpu_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, [device: device]} = Nx.default_backend()

"#{backend} #{device} avg time in milliseconds #{avg_elapsed_time_ms} total_time #{elapsed_time_micro / 1000}"
"Elixir.Torchx.Backend cuda avg time in milliseconds 0.0718 total_time 0.359"
x_valid = Nx.backend_transfer(x_valid_cuda, Nx.BinaryBackend)
weights = Nx.backend_transfer(weights_cuda, Nx.BinaryBackend)
#Nx.Tensor<
  f32[784][10]
  [
    [1.182692050933838, 1.6625404357910156, -0.598689079284668, -0.6435468196868896, 0.25204139947891235, -1.1432150602340698, -0.9701210260391235, 1.9566036462783813, -0.6923237442970276, -1.0753910541534424],
    [0.17891690135002136, 0.42717286944389343, -0.9910821914672852, -2.649228096008301, 0.13641099631786346, 0.48691749572753906, -1.0575640201568604, 0.40385302901268005, 0.5131964683532715, 0.41488444805145264],
    [2.100423574447632, -1.2787413597106934, -1.8883213996887207, -0.49423742294311523, 0.5708040595054626, -0.48230457305908203, -0.19617703557014465, 0.7797456979751587, 0.7876895070075989, -0.33916765451431274],
    [-0.4369395673274994, 0.4421914517879486, 0.18007169663906097, 0.7891340255737305, 0.28369951248168945, -1.2312926054000854, -0.17864377796649933, -1.2232452630996704, 0.6976354718208313, 1.300831913948059],
    [-1.9821809530258179, 1.426361083984375, -2.2645328044891357, 0.26135173439979553, -0.36276111006736755, 2.7461342811584473, 0.007044021971523762, -0.18955571949481964, 0.6062670946121216, -0.4373891055583954],
    ...
  ]
>

fastai, livebook, axon, foundations, deep_learning

Matrix multiplication on GPU - XLA

Accelerating Nx with XLA and the GPU

Matrix multiplication on GPU - XLA

Run in Livebook

Mix.install(
  [
    {:nx, "~> 0.4.0"},
    {:scidata, "~> 0.1.9"},
    {:axon, "~> 0.3.0"},
    {:exla, "~> 0.4"}
  ],
  system_env: %{"XLA_TARGET" => "cuda111"}
)

Before running notebook

This notebook has a dependency on EXLA. XLA supports systems with direct access to an NVidia GPU, AMD ROCm or a Google TPU. According to the documentation, https://github.com/elixir-nx/nx/tree/main/exla#readme EXLA will try to find a precompiled version that matches your system. If it doesn’t find a match, you will need to install CUDA and CuDNN for your system.

The notebook is currently configured for Nvidia GPU via

system_env: %{"XLA_TARGET" => "cuda111"}

Review the configuration documentation for more options. https://hexdocs.pm/exla/EXLA.html#module-configuration

We had to install CUDA and CuDNN but that was several months ago. Your experience may vary from ours.

Context

This Livebook is a transformation of a Python Jupyter Notebook from Fast.ai’s From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. Specifically, it mimics the CUDA portion of https://github.com/fastai/course22p2/blob/master/nbs/01_matmul.ipynb

The purpose of the transformation is to bring the Fast.ai concepts to Elixir focused developers. The object-oriented Python/PyTorch implementation is transformed into a functional programming implementation using Nx and Axon.

Experimenting with backend control

In this notebook, we are going to experiment with swapping out backends in the same notebook. One of the strengths of Elixir’s numerical processing approach is the concept of a backend. The same Nx code can run on several different backends. This allows Nx to adapt to changes in numerical libraries and technology. Currently, Nx has support for Tensorflow’s XLA and PyTorch’s TorchScript. Theoretically, backends for SOC type devices should be possible.

We chose not to set the backend globally throughout the notebook. At the beginning of the notebook, we’ll repeat the approach we used in 01a_matmul_using_CPU. We begin with the Elixir BinaryBackend. You’ll see that it isn’t quick at multiplying 10,000 rows of MNIST data by some arbitrary weights. We’ll then repeat the same multiplication using an NVidia 1080Ti GPU. The 1080 Ti is not the fastest GPU, but it is tremendously faster than the BinaryBackend on a “large” set of data.

  • 31649.26 milliseconds using BinaryBackend with a CPU only.
  • 0.14 milliseconds using XLA with a warmed up GPU

226,000 times faster on an old GPU

Default - BinaryBackend

# Without choosing a backend, Nx defaults to Nx.BinaryBackend
Nx.default_backend()
# Just in case you rerun the notebook, let's make sure the default backend is BinaryBackend
# Setting to the Nx default backend
Nx.default_backend(Nx.BinaryBackend)
Nx.default_backend()

We’ll pull down the MNIST data

{train_images, train_labels} = Scidata.MNIST.download()
{train_images_binary, train_tensor_type, train_shape} = train_images
train_tensor_type

Convert into Tensors and normalize to between 0 and 1

train_tensors =
  train_images_binary
  |> Nx.from_binary(train_tensor_type)
  |> Nx.reshape({60000, 28 * 28})
  |> Nx.divide(255)

We’ll separate the data into 50,000 train images and 10,000 validation images.

x_train_cpu = train_tensors[0..49_999]
x_valid_cpu = train_tensors[50_000..59_999]
{x_train_cpu.shape, x_valid_cpu.shape}

Training is more stable when random numbers are initialized with a mean of 0.0 and a variance of 1.0

mean = 0.0
variance = 1.0
weights_cpu = Nx.random_normal({784, 10}, mean, variance, type: {:f, 32})

In order to simplify timing the performance of the Nx.dot/2 function, we’ll use a 0-parameter anonymous function. Invoking the anonymous function will always use the same two tensors, x_valid_cpu and weights_cpu.

large_nx_mult_fn = fn -> Nx.dot(x_valid_cpu, weights_cpu) end

The following anonymous function takes a function and the number of times to call that function.

repeat = fn timed_fn, times -> Enum.each(1..times, fn _x -> timed_fn.() end) end

Time the average duration of the dot multiply function. The cell will output the average and total elapsed time.

repeat_times = 5
{elapsed_time_micro, _} = :timer.tc(repeat, [large_nx_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, _device} = Nx.default_backend()

"#{backend} CPU avg time in #{avg_elapsed_time_ms} milliseconds, total_time #{elapsed_time_micro / 1000} milliseconds"

XLA using GPU

We’ll switch to the XLA backend and use the cuda device. If you have a different device, replace all the :cuda specifications with your device.

Nx.default_backend({EXLA.Backend, device: :cuda})
Nx.default_backend()

In the following cell, we transfer the target data onto the GPU.

x_valid_cuda = Nx.backend_transfer(x_valid_cpu, {EXLA.Backend, client: :cuda})
weights_cuda = Nx.backend_transfer(weights_cpu, {EXLA.Backend, client: :cuda})

An anonymous function that calls Nx.dot/2 with data on the GPU

exla_gpu_mult_fn = fn -> Nx.dot(x_valid_cuda, weights_cuda) end

We’ll warm up the GPU by looping through 5 function calls and then timing the next 5 function calls.

repeat_times = 5
# Warm up one epoch
{elapsed_time_micro, _} = :timer.tc(repeat, [exla_gpu_mult_fn, repeat_times])
# The real timing starts here
{elapsed_time_micro, _} = :timer.tc(repeat, [exla_gpu_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, [device: device]} = Nx.default_backend()

"#{backend} #{device} avg time in #{avg_elapsed_time_ms} milliseconds total_time #{elapsed_time_micro / 1000} milliseconds"
x_valid_cpu = Nx.backend_transfer(x_valid_cuda, Nx.BinaryBackend)
weights_cpu = Nx.backend_transfer(weights_cuda, Nx.BinaryBackend)

fastai, livebook, axon, foundations, xla, deep_learning

Matrix multiplication on CPU - XLA

Accelerating Nx with XLA and just the CPU

Matrix multiplication on CPU - XLA

Run in Livebook

Mix.install(
  [
    {:nx, "~> 0.4.0"},
    {:scidata, "~> 0.1.9"},
    {:axon, "~> 0.3.0"},
    {:exla, "~> 0.4"}
  ]
)
Resolving Hex dependencies...
Dependency resolution completed:
New:
  axon 0.3.0
  castore 0.1.18
  complex 0.4.2
  elixir_make 0.6.3
  exla 0.4.0
  jason 1.4.0
  nimble_csv 1.2.0
  nx 0.4.0
  scidata 0.1.9
  xla 0.3.0
* Getting nx (Hex package)
* Getting scidata (Hex package)
* Getting axon (Hex package)
* Getting exla (Hex package)
* Getting elixir_make (Hex package)
* Getting xla (Hex package)
* Getting castore (Hex package)
* Getting jason (Hex package)
* Getting nimble_csv (Hex package)
* Getting complex (Hex package)
==> jason
Compiling 10 files (.ex)
Generated jason app
==> nimble_csv
Compiling 1 file (.ex)
Generated nimble_csv app
==> complex
Compiling 2 files (.ex)
Generated complex app
==> nx
Compiling 27 files (.ex)
Generated nx app
==> axon
Compiling 24 files (.ex)
Generated axon app
==> elixir_make
Compiling 1 file (.ex)
Generated elixir_make app
==> xla
Compiling 2 files (.ex)
Generated xla app
==> exla
Unpacking /home/ml3/.cache/xla/0.3.0/cache/download/xla_extension-x86_64-linux-cpu.tar.gz into /home/ml3/.cache/mix/installs/elixir-1.14.1-erts-13.1/45e4038ac8aacd103fe2688496702add/deps/exla/cache
g++ -fPIC -I/home/ml3/.asdf/installs/erlang/25.1/erts-13.1/include -Icache/xla_extension/include -O3 -Wall -Wno-sign-compare -Wno-unused-parameter -Wno-missing-field-initializers -Wno-comment -shared -std=c++14 c_src/exla/exla.cc c_src/exla/exla_nif_util.cc c_src/exla/exla_client.cc -o cache/libexla.so -Lcache/xla_extension/lib -lxla_extension -Wl,-rpath,'$ORIGIN/lib'
Compiling 21 files (.ex)
Generated exla app
==> castore
Compiling 1 file (.ex)
Generated castore app
==> scidata
Compiling 13 files (.ex)
Generated scidata app
:ok

Before running notebook

This notebook has a dependency on EXLA. XLA supports systems with direct access to an NVidia GPU, AMD ROCm or a Google TPU. According to the documentation, https://github.com/elixir-nx/nx/tree/main/exla#readme EXLA will try to find a precompiled version that matches your system. If it doesn’t find a match, you will need to install CUDA and CuDNN for your system.

Unlike the GPU notebook, this notebook does not set the XLA_TARGET environment variable in Mix.install, so EXLA uses its CPU (host) build. To target an Nvidia GPU instead, you would add

system_env: %{"XLA_TARGET" => "cuda111"}

Review the configuration documentation for more options. https://hexdocs.pm/exla/EXLA.html#module-configuration

We had to install CUDA and CuDNN but that was several months ago. Your experience may vary from ours.

Context

This Livebook is a transformation of a Python Jupyter Notebook from Fast.ai’s From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. Specifically, it mimics the CUDA portion of https://github.com/fastai/course22p2/blob/master/nbs/01_matmul.ipynb

The purpose of the transformation is to bring the Fast.ai concepts to Elixir focused developers. The object-oriented Python/PyTorch implementation is transformed into a functional programming implementation using Nx and Axon.

Experimenting with backend control

In this notebook, we are going to experiment with swapping out backends in the same notebook. One of the strengths of Elixir’s numerical processing approach is the concept of a backend. The same Nx code can run on several different backends. This allows Nx to adapt to changes in numerical libraries and technology. Currently, Nx has support for Tensorflow’s XLA and PyTorch’s TorchScript. Theoretically, backends for SOC type devices should be possible.

We chose not to set the backend globally throughout the notebook. At the beginning of the notebook, we’ll repeat the approach we used in 01a_matmul_using_CPU, except this time we switch from the default BinaryBackend to EXLA on the CPU (the host device). The BinaryBackend isn’t quick at multiplying 10,000 rows of MNIST data by some arbitrary weights. XLA on the CPU is nowhere near GPU speed, but it is tremendously faster than the BinaryBackend on a “large” set of data.

  • 31649.26 milliseconds using BinaryBackend with a CPU only.
  • About 1.28 milliseconds using XLA with only a CPU

Roughly 25,000 times faster than the BinaryBackend, even without a GPU

Backends

# Without choosing a backend, Nx defaults to Nx.BinaryBackend
Nx.default_backend()
{Nx.BinaryBackend, []}

Let’s change to EXLA on the CPU

Nx.default_backend({EXLA.Backend, device: :host})
Nx.default_backend()
{EXLA.Backend, [device: :host]}

We’ll pull down the MNIST data

{train_images, train_labels} = Scidata.MNIST.download()
{{<<0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, {:u, 8}, {60000, 1, 28, 28}},
 {<<5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9, 4, 0, 9, 1, 1, 2, 4, 3, 2, 7, 3, 8,
    6, 9, 0, 5, 6, 0, 7, 6, 1, 8, 7, 9, 3, 9, 8, ...>>, {:u, 8}, {60000}}}
{train_images_binary, train_tensor_type, train_shape} = train_images
{<<0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>, {:u, 8}, {60000, 1, 28, 28}}
train_tensor_type
{:u, 8}

Convert into Tensors and normalize to between 0 and 1

train_tensors =
  train_images_binary
  |> Nx.from_binary(train_tensor_type)
  |> Nx.reshape({60000, 28 * 28})
  |> Nx.divide(255)

18:50:30.293 [info] XLA service 0x7fe6d40e2330 initialized for platform Host (this does not guarantee that XLA will be used). Devices:

18:50:30.295 [info]   StreamExecutor device (0): Host, Default Version
#Nx.Tensor<
  f32[60000][784]
  EXLA.Backend<host:0, 0.2851900150.286654488.81191>
  [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...],
    ...
  ]
>

We’ll separate the data into 50,000 train images and 10,000 validation images.

x_train_cpu = train_tensors[0..49_999]
x_valid_cpu = train_tensors[50_000..59_999]
{x_train_cpu.shape, x_valid_cpu.shape}
{{50000, 784}, {10000, 784}}

Training is more stable when random numbers are initialized with a mean of 0.0 and a variance of 1.0

mean = 0.0
variance = 1.0
weights_cpu = Nx.random_normal({784, 10}, mean, variance, type: {:f, 32})
#Nx.Tensor<
  f32[784][10]
  EXLA.Backend<host:0, 0.2851900150.286654488.81194>
  [
    [-0.973583996295929, 1.3404284715652466, 0.5889155268669128, -0.06439179182052612, -2.2255215644836426, -0.3939111828804016, -1.5497547388076782, -1.1714494228363037, 1.0855729579925537, -0.4689534306526184],
    [-0.31778475642204285, 0.07520100474357605, 0.053238045424222946, 0.42360711097717285, -2.253004312515259, -0.3818463981151581, -0.5468025803565979, 1.3460612297058105, 1.509813904762268, 0.10178464651107788],
    [2.7212319374084473, -0.6341637969017029, 1.9983967542648315, 0.4862823486328125, 0.951216459274292, -0.8570582270622253, 1.7834625244140625, -0.1596108078956604, -0.369051992893219, 0.7038326263427734],
    [-1.321571946144104, -0.573075532913208, -0.5281657576560974, -1.528030276298523, 0.5641341209411621, -0.13296610116958618, -0.20917919278144836, -0.5405102372169495, 0.13647650182247162, 1.0692965984344482],
    [1.1940683126449585, -1.0889204740524292, 0.26889121532440186, -0.8505605459213257, 0.31284958124160767, 0.8289848566055298, 0.23549814522266388, 0.5921769738197327, 0.506867527961731, 0.6787563562393188],
    ...
  ]
>

In order to simplify timing the performance of the Nx.dot/2 function, we’ll use a 0-parameter anonymous function. Invoking the anonymous function will always use the same two tensors, x_valid_cpu and weights_cpu.

large_nx_mult_fn = fn -> Nx.dot(x_valid_cpu, weights_cpu) end
#Function<43.3316493/0 in :erl_eval.expr/6>

The following anonymous function takes a function and the number of times to call that function.

repeat = fn timed_fn, times -> Enum.each(1..times, fn _x -> timed_fn.() end) end
#Function<41.3316493/2 in :erl_eval.expr/6>

Time the average duration of the dot multiply function. The cell will output the average and total elapsed time.

repeat_times = 5
{elapsed_time_micro, _} = :timer.tc(repeat, [large_nx_mult_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

{backend, device} = Nx.default_backend()

"#{backend} CPU avg time in #{avg_elapsed_time_ms} milliseconds, total_time #{elapsed_time_micro / 1000} milliseconds"
"Elixir.EXLA.Backend CPU avg time in 1.2837999999999998 milliseconds, total_time 6.419 milliseconds"

fastai, livebook, axon, foundations, xla, deep_learning

Intro to Deep Learning from the Foundations, in Elixir

Developing a set of notebooks to follow the Fast.ai Deep Learning from the Foundations to Stable Diffusion Course

I’ve started a project on GitHub, dl_foundations_in_elixir. In this project, I collect the Elixir Livebook notebooks that I create while taking From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. The course has been running live for the past three weeks, so far.

The content of the course is in two parts. We are exploring Stable Diffusion as it exists today, and we are exploring the latest papers that improve upon or explore Stable Diffusion related concepts. We are learning the skills to read the math in the papers, which parts of a paper are most useful to explore in depth, and how to gain confidence in reading papers. So far, four hours of Jeremy’s main lessons have focused here. There are additional videos by community members that expand upon the main lessons.

For each of the past two weeks, Jeremy has been exploring the 01_matmul Python/Jupyter notebook. We aren’t quite done with the notebook just yet. We are just about to transition from CPU focused cells to running a matrix multiplication on the GPU. I’ve released an Elixir/Livebook version of the Fast.ai notebook called 01a_matmul_using_CPU.livemd. I’m focused on the CPU portions of the notebook. Where there is a direct mapping from the Python cell to my Elixir cell, I provide the Python and the Elixir in one cell. Elixir developers can see how Python machine learning methods get translated into Elixir modules. Please check out our progress so far. We are open to pull requests, and the discussion capabilities of GitHub are available.

Run in Livebook

Next, I’ve created a Livebook notebook where I try to help PyTorch/Jupyter developers that are exploring Elixir/Livebook with key aspects of Livebook. In a short notebook, I explain some of the key differences with Elixir and Livebook. Again, we are open to pull requests and discussion.

Run in Livebook

fastai, axon, foundations, deep_learning

Elixir/Livebook for Python/Jupyter Developers

Quick introduction to Elixir and Livebooks for Jupyter Developers

Elixir/Livebook for Python/Jupyter Developers

Run in Livebook

Quick Overview

Python/Jupyter focused developers that are casually looking at Elixir and Livebook notebooks will see concepts that kind of look the same but are different. Knowing the key differences can ease understanding of the Elixir concepts in a notebook. The goal of this guide is to help Python focused people look at a Livebook notebook and grasp what is happening in the notebook.

Installing Livebook

We’ve had good success with installing Livebook.dev, https://livebook.dev/#install. There are native applications for Windows and Mac. On a Linux system, we go to the GitHub site, https://github.com/livebook-dev/livebook, and install via Escript. Don’t forget to set the shims. Also, when running on a local Linux server, the firewall ports for Kino and other interactive cells are different from the Livebook server port. Be sure to pay attention to the environment variable options in the Readme.md.

Let’s discuss how Livebook/Elixir is a little different from Jupyter/Python

Functional vs Object Oriented

Elixir is a functional language. State exists outside of a function, with values passed into a function. Most Elixir functions transform their inputs and return an output. You could think of them as procedures that have everything passed in, don’t update any object state, and return the result.

However, there are a few situations where state is held after a function call. In Elixir, we think of these functions as having a side effect. Common side effects are storing data in a database, a file, or an operating system resource. The database “write” and file “write” functions result in changing a resource that can later be retrieved. There are several other examples of side effect situations. We’ll even see a few examples in Elixir machine learning libraries.

Elixir has modules that hold function definitions and may define a data structure. Python has class definitions that hold state and function definitions. Where a variable can have a method invocation in Python, i.e. list_a.sum(), Elixir values must be passed as arguments into a module’s function, e.g. Enum.sum(list_b).
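
A minimal illustration of the difference, where list_b is just an ordinary Elixir list:

list_b = [1, 2, 3]

# There is no list_b.sum() in Elixir; the value is passed into a
# function that lives in the Enum module
Enum.sum(list_b)
#=> 6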

Immutable state in Elixir

For Machine Learning notebooks, state is referenced in variable names specific to the notebook. This is very similar to how variable state is held in a Jupyter notebook. The variable values are held by the notebook until the Elixir notebook is closed.

One pretty big difference in Elixir is that all state is immutable. A function can receive state as an argument variable; however, the variable is immutable, so it can’t be changed. The function may transform the information, but any transformations must be returned back as a newly created value. One convenient approach in Elixir is to assign the result of the function call back to the same variable name. But there are better conventions that we’ll see below.

list_b = [1,2,3]
list_b = Enum.map(list_b, fn(value) -> value * value end)

Like Jupyter notebooks, shift-return will execute the current cell. The other keyboard shortcuts can be found in the keypad-like icon on the left bar. There is also a mouse approach with the > Execute that appears above the active cell. Clicking on the Execute button will also work. If you’ve already installed Livebook, try executing the following code cell.

list_b = [1, 2, 3]
list_b = Enum.map(list_b, fn value -> value * value end)

Chaining function calls

In many object-oriented languages, method calls can be chained sequentially, i.e. array_a.square().sum()

Elixir has a special notation for chaining function calls together.

list_b
  |> Enum.map(fn(value) -> value * value end)
  |> Enum.sum()

The |>, pipe operator, takes the result of the previous function and passes it as the first argument to the following function call. Note that the first argument for Enum.map and Enum.sum, a list or enumerable, isn’t shown because the pipe operator supplies the output from the previous line of code.

list_b
|> Enum.map(fn value -> value * value end)
|> Enum.sum()
# All in one line also works
list_b |> Enum.map(fn value -> value * value end) |> Enum.sum()

Elixir functions in modules

As long as the code in a Livebook is calling existing functions, variable assignment works pretty much like it does in Python/Jupyter. However, Python supports the definition of standalone functions in notebooks.

def chunks(x, sz):
    for i in range(0, len(x), sz): yield x[i:i+sz]

All Elixir function definitions must be inside a module definition.

defmodule ModA do
  def funct_a() do
  end
end

Elixir has an anonymous function capability. In the above Enum function call, fn(something) -> transform(something) end creates an anonymous function, like Python’s lambda. Anonymous functions can be assigned to variable names and called. Note the .(args) syntax when the named anonymous function is called.

sum_of_squares = fn(value) ->
  Enum.map(value, fn(v) -> v * v end) 
  |> Enum.sum() 
end

sum_of_squares.(list_b)
sum_of_squares = fn value ->
  Enum.map(value, fn v -> v * v end)
  |> Enum.sum()
end

sum_of_squares.(list_b)

Livebook module version management

One item to note about Livebook: modules are installed with a version. Rather than a requirements.txt for an entire folder of Jupyter notebooks, the module dependencies are defined within each notebook. The Livebook convention is to use the first cell to define any module dependencies. In this notebook, the basic Elixir language capabilities were sufficient, so no modules were Mix.installed. Watch the contents of the first cell to see the modules used in a notebook. Pinning specific module versions helps with the repeatability challenges of notebooks. However, you’ll note that the Elixir and Erlang versions are not defined in the notebook, and neither is the version of Livebook the notebook was run under. Operating system dependencies, like CUDA, CuDNN, cmake, make, etc., are not defined in notebooks either.
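
For example, a typical first cell pins each dependency to a version; the versions here are the same ones used in the notebooks above.

# Dependencies live in the notebook itself, pinned to versions,
# instead of in a folder-wide requirements.txt
Mix.install([
  {:nx, "~> 0.4.0"},
  {:kino, "~> 0.7.0"}
])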

Livebook file format

Livebook’s file format is a markdown file. The use of a well-defined standard format allows understandable Git pull requests against the .livemd file.

Left sidebar and hints

We’ve already mentioned the keyboard shortcuts. Other icons show the notebook’s sections and connected users. The lock holds secrets that you don’t want stored in your Livebook. Secrets can be things like a database login, etc. The runtime settings are an advanced topic; we suggest finding the documentation or blog posts on how to use them. These settings don’t have a strong mapping to Jupyter. Finally, a big hint: if you accidentally delete a cell, you can retrieve it from the bin/trash.

For the active cell, there are some icons above the cell to the right. The up and down arrows move the active cell up or down in your notebook. We just noticed, in Livebook 0.7.2, that there is an icon to insert an image into a markdown cell. We’ll need to try it out.

Another big hint: Livebook knows which cells are stale. If you go to the bottom of the notebook, or someplace in the middle, and execute a cell, any cells that are out of date with your edits are executed down to that cell. This is one technique for executing all of the cells in a notebook. However, it doesn’t force the re-execution of all cells; only stale cells are run. If you re-execute the first cell and then execute the last cell, all cells will be executed.

A Livebook notebook opened from a web resource will not be saved locally unless you instruct Livebook to save the notebook. Click on the floppy disk icon in the lower right and choose someplace you want to store the notebook.

Try out some Livebook notebooks

This hasn’t been a complete guide to Livebook, but hopefully it provides some context for your exploration of Elixir and Livebook. Have fun!

fastai, livebook, axon, foundations, deep_learning

Matrix multiplication from foundations

CPU focus introduction to matrix multiplication for deep learning

Matrix multiplication from foundations - CPU only

Mix.install([
  {:nx, "~> 0.4.0"},
  {:binary, "~> 0.0.5"},
  {:stb_image, "~> 0.5.2"},
  {:scidata, "~> 0.1.9"},
  {:kino, "~> 0.7.0"},
  {:axon, "~> 0.3.0"}
])
:ok

Run in Livebook

References

This Livebook is a transformation of a Python Jupyter Notebook from Fast.ai’s From Deep Learning Foundations to Stable Diffusion, Practical Deep Learning for Coders part 2, 2022. Specifically, it mimics https://github.com/fastai/course22p2/blob/master/nbs/01_matmul.ipynb

The purpose of the transformation is to bring the Fast.ai concepts to Elixir focused developers. The object-oriented Python/PyTorch implementation is transformed into a functional programming implementation using Nx and Axon.

About Fast.ai’s Teaching Philosophy

We’ll be leveraging the best available research on teaching methods to try to fix these problems with technical teaching, including:

  • Teaching “the whole game”–starting off by showing how to use a complete, working, very usable, state of the art deep learning network to solve real world problems, by using simple, expressive tools. And then gradually digging deeper and deeper into understanding how those tools are made, and how the tools that make those tools are made, and so on…
  • Always teaching through examples: ensuring that there is a context and a purpose that you can understand intuitively, rather than starting with algebraic symbol manipulation
  • Simplifying as much as possible: we’ve spent months building tools and teaching methods that make previously complex topics very simple
  • Removing barriers: deep learning has, until now, been a very exclusive game. We’re breaking it open, and ensuring that everyone can play

From: https://www.fast.ai/posts/2016-10-08-teaching-philosophy.html

In other words, focus on student success from the beginning. Help students become confident in their growing skills. Use a code-first approach to teaching Deep Learning. Provide plenty of examples of functioning neural networks that can be applied by the students.

This part 2 course is not exactly like the above description

The Fast.ai part 1 course fits the above description really well. When we went through each of the past 4 years of the part 1 course, we felt like the course spoke to our needs. In the best world, Elixir developers could learn from the part 1 course and come away with several kinds of near state-of-the-art neural net models running in Elixir. However, an Elixir version of the part 1 course doesn’t exist yet.

The part 2 course has a different focus. Jeremy Howard likes to call part 2 the (im)practical deep learning for coders. Part 2 goes under the hood and helps students understand the pieces of a neural network and a best-practice focused training library. The foundations are taught with examples that help students understand how the pieces really work. It’s impractical because the problems are simpler, already-solved examples that don’t translate directly to a real-world problem. The examples used in part 2 are smaller, well-known problems, but the focus is on understanding how the software skills you use daily are transformed into neural network concepts that can utilize the GPU for time-efficient training. The confidence gained from the part 2 course is knowing how to modify and change a model to fit your domain situation.

Foundation notebooks and previous videos

The 2022 Python/PyTorch “from the foundations” notebooks are in https://github.com/fastai/course22p2. This notebook is being written while the live course is happening. Fast.ai course work is restricted to paid participants until after the course is completed. The notebooks are available in GitHub, but the videos and forum conversations are restricted. After the course completes, the videos and forums are open to everyone in the form of a massive open online course. However, fundamentals don’t change that much. The 2019 course videos, https://course19.fast.ai/videos/?lesson=8, would be a fine video companion for these Elixir notebooks, for now.

The first two lesson videos from the 2022 course were released early. In the second lesson, Jeremy covers the first portion of this notebook.

To Stable Diffusion

The 2022 part 2 course is Foundations to Stable Diffusion. In 2022, Fast.ai is focusing on understanding the pieces of Stable Diffusion and discussions on the latest research papers that improve upon Stable Diffusion. As a taste of what is coming, Fast.ai has released the videos from the first 2 weeks. https://www.fast.ai/posts/part2-2022-preview.html

At the current time, Stable Diffusion doesn’t run in Elixir. Fast.ai part 2 is split into two types of notebooks. A set of notebooks focused on Stable Diffusion and another set of notebooks focused on the foundations. For now, we are focused only on the foundation notebooks.

Fast.ai’s book

There was a recent Twitter discussion expressing a desire to see Deep Learning for Coders with Fastai and PyTorch: AI Applications Without a PhD examples in Elixir/Livebook. The meanderingstream/dl_foundations_in_elixir notebooks correspond to chapters 17, 18 and 19 in the book. Further resources related to the book can be found on the Fast.ai book page.

Part 2 Foundations approach

We’ll start with standard Elixir examples of the fundamentals. An Elixir focused developer should recognize the standard Elixir code. The part 2 “Game” is:

  • Once a representative example is implemented with our code, we can then use the corresponding Nx and Axon code.

Because we are transforming Python/PyTorch into Elixir, some concepts don’t perfectly match back to the original Python code. There are some library differences, and some of the tooling for Elixir and Livebook doesn’t perfectly match. Nx, Axon and Livebook are very recent technologies, and their capabilities are growing each month.

Because we are mapping from Python/PyTorch to Elixir, and the vast majority of machine learning examples are written in Python, we are often going to show the original Python from the Fast.ai notebook on top of the Elixir code. Hopefully this will help Elixir developers transform other PyTorch code into Elixir code.

# Pytorch
# some python from a Jupyter notebook
# --> The result
#     from executing the cell goes here

some elixir code here

Brief Introduction to Elixir and Numerical Elixir

Elixir’s primary numerical datatypes and structures are not optimized for numerical programming. Nx is a library built to bridge that gap.

Elixir Nx is a numerical computing library to smoothly integrate to typed, multidimensional data implemented on other platforms (called tensors). This support extends to the compilers and libraries that support those tensors. Nx has three primary capabilities:

  • In Nx, tensors hold typed data in multiple, named dimensions.
  • Numerical definitions, known as defn, support custom code with tensor-aware operators and functions.
  • Automatic differentiation, also known as autograd or autodiff, supports common computational scenarios such as machine learning, simulations, curve fitting, and probabilistic models.

From https://hexdocs.pm/nx/intro-to-nx.html. Note that this url is really a Livebook notebook. When you click on the Run in Livebook button, it navigates to an intermediate page where you can choose the location of your Livebook application. It then opens the page in your Livebook application. A small sketch of these three capabilities follows.
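
Here is a minimal, hedged sketch of tensors, defn, and autograd, assuming only that nx is in the Mix.install of the notebook:

defmodule NxSketch do
  import Nx.Defn

  # A numerical definition: tensor-aware operators inside defn
  defn square_sum(t), do: Nx.sum(t * t)

  # Automatic differentiation of the definition above
  defn grad_square_sum(t), do: grad(t, &square_sum/1)
end

# A tensor holds typed data in multiple, optionally named, dimensions
t = Nx.tensor([1.0, 2.0, 3.0], names: [:x])

NxSketch.square_sum(t)
#=> the sum of squares, 14.0

NxSketch.grad_square_sum(t)
#=> the gradient 2 * t, [2.0, 4.0, 6.0]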

Course Start: From the foundations

Jeremy’s introduction: This part of the course will require some serious tenacity and a certain amount of patience. We think you are going to learn a lot. A lot of people have given Jeremy feedback that the previous iteration of this course is the best course they’ve ever done. This course will be dramatically better than any previous version. Hopefully you’ll find that the hard work and patience pays off.

Our goal in this course is to get to Stable Diffusion from the foundations. We have to define what the foundations are. Jeremy restricted the Python foundations to:

  • Python
  • The Python standard library
  • matplotlib
  • Jupyter notebooks and nbdev

In Elixir we’ll have our own foundation.

To be clear, we are allowed to use other libraries once we have reimplemented them correctly. If we reimplement something from NumPy or PyTorch, we are then allowed to use those libraries. Sometimes we are going to implement things that haven’t been created before. Those things will become part of our own library. We are going to be calling that library miniai. We are going to be building our own little framework as we go.

One challenge that we have: the models we use in Stable Diffusion were trained on millions of dollars of equipment for months. We don’t have the time or money for those compute resources. Another trick we are going to use is to create identical but smaller versions of them. Once we have them working, we’ll be allowed to use the big pre-trained versions.

So we are going to have to end up with our own variational auto-encoder, our own U-Net, our own CLIP encoder, and so forth.

To a certain extent, Jeremy assumes that you’ve gone through part 1. If you find something that doesn’t make sense to you, go back to the part 1 course or Google for what you don’t understand. For stuff that wasn’t covered in part 1, we’ll go over it thoroughly and carefully.

Reference: Jeremy’s discussion in the Lesson 10 video.

Elixir foundations

In our foundations version, we’ll make the following assumptions throughout these Elixir versions of Fast.ai’s notebooks:

The documentation for Nx and Axon is found at https://hexdocs.pm/nx/Nx.html and https://hexdocs.pm/axon/Axon.html

To run these notebooks, you will need to install a local version of Livebook or get access to a cloud server. Many of our foundation notebooks don’t need a GPU. Nx comes with an Elixir only BinaryBackend that runs on any CPU that supports Livebook. If EXLA or Torchx aren’t in the Mix.install at the top of a notebook, it can be run on any computer. Please give it a try.

We’ll follow roughly the same approach as the PyTorch version of the course. We’ll start with standard Elixir, with some additional libraries. Once we’ve implemented a capability, we’ll move on using Nx and Axon libraries. We’ll invent our own libraries as needed.

Getting the Data

We are going to need some input data. Fast.ai uses MNIST for this part of the course. Elixir has the SciData library that contains small standard datasets, including MNIST.

We are diverging from the cell by cell transformation of the 01_matmul.ipynb because SciData works differently from the .pth files used in Fast.ai

{train_images, train_labels} = Scidata.MNIST.download()
{test_images, test_labels} = Scidata.MNIST.download_test()
# Let's unpack the images like...
{train_images_binary, tensor_type, train_shape} = train_images
{test_images_binary, tensor_type, test_shape} = test_images

The Fast.ai source for MNIST training data returns normalized data with a shape of (50000, 784): 50,000 items, each 784 numbers long. The numbers are all between 0 and 1. We’ll need to change our binary into numbers and divide the numbers by 255 to normalize the values.

# Normalize the values first.
train_normalized_long_list =
  Binary.to_list(train_images_binary)
  |> Enum.map(fn value -> value / 255 end)

The data source Fast.ai used split the 60,000 image MNIST train data into 50,000 train images and 10,000 validation images. We’ll do a similar split after the first 50,000 images.

{train_list_784, valid_list_784} =
  Enum.chunk_every(train_normalized_long_list, 784)
  |> Enum.split(50_000)
Enum.count(train_list_784)
train_imgs_28_28 =
  Enum.map(
    train_list_784,
    fn img ->
      Enum.chunk_every(img, 28)
    end
  )

Let’s check that we still have 50,000 images, that the count of rows in the first image is 28, and that the count of columns in the first row of the first image is 28.

{Enum.count(train_imgs_28_28), Enum.count(Enum.at(train_imgs_28_28, 0)),
 Enum.count(Enum.at(Enum.at(train_imgs_28_28, 0), 0))}

Visualizing Normalized Data

We have a normalized image in memory, how do we check that it really represents an image?

first_img_28_28 = Enum.at(train_imgs_28_28, 0)

We don’t know of a convenient method to convert a normalized list of lists into an image. However, if we convert to a tensor, we can load the tensor into StbImage. We are going to cheat and look ahead at some concepts described below, but we’ll be able to show the image.

first_img =
  first_img_28_28
  |> Enum.map(fn row ->
    Enum.map(row, fn column ->
      round(column * 255)
    end)
  end)
  |> Nx.tensor(type: :u8)
  |> Nx.reshape({28, 28, 1})
  |> StbImage.from_nx()
  |> StbImage.to_binary(:png)
# Python
# mpl.rcParams['image.cmap'] = 'gray'
# plt.imshow(list(chunks(lst1, 28)));

# Kino currently assumes the image is larger than the box
image = Kino.Image.new(first_img, :png)
label = Kino.Markdown.new("**MNIST Image**")

images = [
  Kino.Layout.grid([image, label], boxed: true)
]

Kino.Layout.grid(images, columns: 3)

Matrix and tensor

Let’s pull an individual value from a list of lists

# Find a row with some non-zero values

# 8th row
first_non_zero_in_row =
  Enum.at(first_img_28_28, 8)
  |> Enum.find_index(fn x -> x != 0.0 end)
# Let's find a value somewhere in that list of lists

Enum.at(first_img_28_28, 8)
|> Enum.at(10)

A convenience module to make it easier to access an element in a list of lists:

defmodule Matrix do
  def at(matrix, row, column) do
    Enum.at(matrix, row) |> Enum.at(column)
  end
end
Matrix.at(first_img_28_28, 8, 10)

Now that we’ve demonstrated how to load SciData into normal Elixir lists of lists, access elements within them, and view the in-memory image data, let’s start using Nx tensors instead of lists of lists.

x_tensors =
  train_images_binary
  |> Nx.from_binary(tensor_type)
  |> Nx.reshape({60000, 28 * 28})
  |> Nx.divide(255)

Again, we’ll split the SciData training dataset into train and valid. We’ll use the same variable names used in the Fast.ai notebook.

x_train = x_tensors[0..49_999]
x_valid = x_tensors[50_000..59_999]
{x_train.shape, x_valid.shape}

CAUTION: Even though it kind of looks like we called a function on a data object, all we really did was access the shape field of a struct, just simple data field access. The human-readable representation of the struct is simplified to make it easier to read; tensors can have a lot of data in their struct fields. Type is another field. See how the type and shape are scrunched together in the printed view.

x_valid.type
x_valid
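
As a small illustration of the point above (this cell isn’t in the original notebook), we can pattern match on the %Nx.Tensor{} struct and pull those fields out directly:

# shape and type are ordinary struct fields on %Nx.Tensor{}
%Nx.Tensor{shape: shape, type: type} = x_valid
{shape, type}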

Let’s grab a normalized image as an Nx tensor and visualize it with Kino.Image.

img_tensor =
  x_train[0]
  |> Nx.reshape({28, 28})

Let’s visualize the image like we did earlier, except this time the source is an Nx.Tensor

first_img_from_tensor =
  img_tensor
  |> Nx.reshape({28, 28, 1})
  |> Nx.multiply(255)
  |> Nx.round()
  |> Nx.as_type({:u, 8})
  |> StbImage.from_nx()
  |> StbImage.to_binary(:png)
# Python
# plt.imshow(imgs[0]);

image = Kino.Image.new(first_img_from_tensor, :png)
label = Kino.Markdown.new("**MNIST Image from tensor**")

images = [
  Kino.Layout.grid([image, label], boxed: true)
]

Kino.Layout.grid(images, columns: 3)

Let’s parse out the classification labels of each image. Each element identifies the digit each handwritten image represents.

{train_y_binary, y_tensor_type, y_shape} = train_labels
y_tensors =
  train_y_binary
  |> Nx.from_binary(y_tensor_type)
  |> Nx.reshape(y_shape)

We’ll split into train and valid like the Fast.ai data source.

y_train = y_tensors[0..49_999]
y_valid = y_tensors[50_000..59_999]
{y_train.shape, y_valid.shape}

We didn’t find a min function in Nx that corresponds to PyTorch’s min and max functions on tensors. Instead, we’ll convert to a flat, normal Elixir list, use Enum to find the min or max, and then convert back to Nx tensor scalars.

# PyTorch
# y_train.min(), y_train.max()
# --> (tensor(0), tensor(9))

{Nx.tensor(Enum.min(Nx.to_flat_list(y_train))), Nx.tensor(Enum.max(Nx.to_flat_list(y_train)))}
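
Depending on your Nx version, Nx.reduce_min/2 and Nx.reduce_max/2 may also do this directly on the tensor; treat this alternative as an assumption to verify against the Nx docs for the version you have installed.

# Aggregate over the whole tensor and return scalar tensors
{Nx.reduce_min(y_train), Nx.reduce_max(y_train)}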

Random Numbers

For now, we are going to treat the random number section of the Fast.ai notebook as a problem specific to PyTorch. The problematic situation comes from using os.fork() to parallelize work that calls the rand() function. In Python, the fork creates a copy of the current process. The particular problem is that the fork includes the global random state of the parent process, so each forked process that calls rand() will receive the same sequence of pseudo-random numbers.

The discussion of pseudo-random numbers in the video is well worth watching.

TODO: How does Elixir handle pseudo-random number sequences in two Elixir processes? A first sketch follows.
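
Here is a minimal sketch of my own (not from the course) suggesting an answer: each BEAM process keeps its own :rand state, and an unseeded process seeds itself independently, so two spawned processes should not repeat each other's sequence the way forked Python processes do.

# Two processes each draw three pseudo-random numbers and report back.
parent = self()

for id <- 1..2 do
  spawn(fn ->
    send(parent, {id, Enum.map(1..3, fn _ -> :rand.uniform(100) end)})
  end)
end

for _ <- 1..2 do
  receive do
    {id, numbers} -> IO.inspect(numbers, label: "process #{id}")
  end
end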

Tensor rank

The rank of a tensor is the number of indices required to uniquely select each element of the tensor. Rank is also known as “order”, “degree”, or “ndims.”

from https://www.tensorflow.org/api_docs/python/tf

In Livebook, the rank of an Nx tensor can be observed from the number of nested square brackets following the type label, e.g. s64.

# Rank 1 tensor
Nx.tensor([1, 2, 3])
# Rank 2 tensor
Nx.tensor([[1, 2], [2, 3]])
# Rank 3 tensor
Nx.tensor([[[1, 2], [2, 3]], [[4, 5], [5, 6]]])
# Rank 0 tensor
Nx.tensor(3)

Online Matrix multiplication

We are working on the start of a forward pass of a very simple linear model, a multi-layer perceptron, for MNIST. We now need to multiply tensors together.

There are several websites that can provide visual examples of matrix multiplication.

Matrix multiplication is a fundamental capability of deep learning. We are going to look at how to do matrix multiplication in standard Elixir and then use Nx to perform the multiplication as a tensor.

Mutable data approaches vs Immutable data

Many software languages have mutable data; certainly Python does. Let’s go into detail about how immutable data in Elixir differs from working with mutable data.

In Python, Jeremy uses this approach to multiply two tensors.

for i in range(ar):         # 5
    for j in range(bc):     # 10
        for k in range(ac): # 784
            t1[i,j] += m1[i,k] * m2[k,j]

Turning it into a function would look like:

def py_multiply(m1, m2, t1):
    ar,ac = m1.shape # n_rows * n_cols
    br,bc = m2.shape
    for i in range(ar):         # 5
        for j in range(bc):     # 10
            for k in range(ac): # 784
                t1[i,j] += m1[i,k] * m2[k,j]

t1 is the resulting matrix. In the Python notebook, it is set to zeros via

t1 = torch.zeros(ar, bc)

t1 is mutable. The value at t1[i,j] is replaced with a new value via +=. When the function has completed, the variable passed in as the third argument holds the new values.

We’ve mentioned before that Elixir has immutable data. Let’s try the same kind of approach in Elixir and see what happens.

defmodule DoesntWork do
  def add(m1, m2, t) do
    t = m1 + m2
  end
end
t = 5
DoesntWork.add(3, 6, t)
"t is #{t} not 9"
"because Elixir's data is immutable"

The long-winded point above is that matrix operations in pure Elixir can’t use the for-loop approach of mutable-data languages.

Matrix multiplication in plain Elixir

We’ll diverge from Fast.ai’s notebook to dig into how we can do matrix multiplication without using for loops. We’ll also focus on Elixir lists of lists rather than Nx tensors, unlike the Python notebooks.

Let’s look at a matrix multiplication in traditional Elixir. This example was modified from https://rosettacode.org/wiki/Matrix_multiplication#Elixir

defmodule Matrix do
  def mult(m1, m2) do
    Enum.map(m1, fn x ->
      Enum.map(transpose(m2), fn y ->
        Enum.zip(x, y)
        |> Enum.map(fn {x, y} -> x * y end)
        |> Enum.sum()
      end)
    end)
  end

  # transpose
  def transpose(m) do
    List.zip(m) |> Enum.map(&Tuple.to_list(&1))
  end
end
# Let's set up an example multiplication using an example 
# from http://matrixmultiplication.xyz/
m_3x3 = [[1, 2, 1], [0, 1, 0], [2, 3, 4]]

m_3x2 = [[2, 5], [6, 7], [1, 8]]
# Let's check that matrix multiplication works.  We should get
# [[15,27],
#  [ 6, 7],
#  [26,63]]

Matrix.mult(m_3x3, m_3x2)

Go to http://matrixmultiplication.xyz/ and put in your own matrix values to try it out.

Let’s dig into the Elixir code in our module

As Elixir developers, we understand how Enum.map works, but not everyone may. Let’s explore Enum.map/2. It takes an Enumerable, such as a list, as its first argument and a function, usually an anonymous one, as its second. For each element in the Enumerable, it calls the function with that element and collects the result into a list. It then returns the resulting list.

some_list = [1, 2, 3]
another_list = [4, 5, 6]

Enum.map(some_list, fn x ->
  # The inner map() can see the outer x
  Enum.map(another_list, fn y ->
    IO.puts("x is #{x} y is #{y}")
    "x=#{x},y=#{y}"
  end)
end)

Next we’ll dig into the transpose function.

transpose = fn m ->
  List.zip(m)
  |> IO.inspect(label: "one of the rows in zip")

  # |> Enum.map(&Tuple.to_list(&1))
end

# In matrix multiplication, the other list (another_list) needs to be vertical.
another_list = [[4, 5, 6]]

transpose.(another_list)
transpose = fn m ->
  List.zip(m)
  # We know that List.zip returns
  # [{4}, {5}, {6}]
  # which means three items are in the Enumerable passed to Enum.map.
  # The first is {4}.
  # Each tuple, i.e. {something}, is then transformed into
  #   a list, [something]
  |> Enum.map(&Tuple.to_list(&1))
end

# In matrix multiplication, the other list (another_list) needs to be vertical.
another_list = [[4, 5, 6]]

transpose.(another_list)

There is a really funky bit of code above: &Tuple.to_list(&1). The & is Elixir’s capture operator. Here is a blog post explaining the capture operator: https://dockyard.com/blog/2016/08/05/understand-capture-operator-in-elixir

Personally, we are more comfortable with the slightly more verbose form of creating an anonymous function. Our minds grok this form more easily. Both result in the same answer.

transpose = fn m ->
  List.zip(m)
  # We know that List.zip returns
  # [{4}, {5}, {6}]
  # which means three items are in the Enumerable passed to Enum.map.
  # The first is {4}.
  # Each tuple, i.e. {something}, is then transformed into
  #   a list, [something]
  |> Enum.map(fn x -> Tuple.to_list(x) end)
end

# In matrix multiplication, the other list (another_list) needs to be vertical.
another_list = [[4, 5, 6]]

transpose.(another_list)
transpose.(m_3x2)

The next piece of code takes the first matrix and the transpose of the second matrix and zips them together. As before, we end up with lists of tuples.

Enum.map(m_3x3, fn x ->
  Enum.map(transpose.(m_3x2), fn y ->
    Enum.zip(x, y)
  end)
end)

The next step is to take those lists of tuples and run each one through Enum.map with a multiply function, returning a list.

Enum.map(m_3x3, fn x ->
  Enum.map(transpose.(m_3x2), fn y ->
    Enum.zip(x, y)
    |> Enum.map(fn {x, y} -> x * y end)
  end)
end)

Finally, we sum up the elements in the innermost list.

Enum.map(m_3x3, fn x ->
  Enum.map(transpose.(m_3x2), fn y ->
    Enum.zip(x, y)
    |> Enum.map(fn {x, y} -> x * y end)
    |> Enum.sum()
  end)
end)

And we get the same answer from calling the Matrix.mult function above.

Whew. I hope you could follow along and we didn’t lose you. We’ve now implemented matrix multiplication using standard Elixir. Thus, we’re now allowed to use Nx.dot/2.

t_3x3 = Nx.tensor(m_3x3)
t_3x2 = Nx.tensor(m_3x2)
{t_3x3.shape, t_3x2.shape}

Having implemented matrix multiplication in standard Elixir ourselves, we can now use Nx.dot/2. Still the same answer, just with tensors this time.

Nx.dot(t_3x3, t_3x2)

Let’s measure how fast, or really how slow, the BinaryBackend is. Remember, we aren’t using the GPU in this notebook, so don’t compare with the PyTorch timings where Jeremy uses a GPU.

Timing operations

In Elixir, the Erlang :timer.tc function can be used to time function calls. Here is a link to a discussion of :timer.tc: https://til.hashrocket.com/posts/9jxsfxysey-timing-a-function-in-elixir. So that we can call the same function multiple times, we’ll bind a repeat helper to a name. We’ll also create a function that represents our target call, with its arguments hard-coded.

repeat = fn timed_fn, times -> Enum.each(1..times, fn _x -> timed_fn.() end) end
matrix_mult_w_dot_fn = fn -> Nx.dot(t_3x3, t_3x2) end
repeat_times = 50
{elapsed_time_micro, _} = :timer.tc(repeat, [matrix_mult_w_dot_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

"avg time in milliseconds #{avg_elapsed_time_ms} total_time #{elapsed_time_micro / 1000} milliseconds"

Not too bad, but the tensors are small.

Matrix multiplication

Let’s create a tensor of random weights with a mean of about 0.0 and a variance of about 1.0.

# PyTorch
# weights = torch.randn(784,10)
# bias = torch.zeros(10)
# weights, weights.max(), weights.mean(), weights.var()

mean = 0.0
variance = 1.0
weights = Nx.random_normal({784, 10}, mean, variance, type: {:f, 32})

# Nx doesn't ship a dedicated zeros/ones constructor like torch.zeros, so we
# use Axon's initializers.
init_zeros = Axon.Initializers.zeros()
bias = init_zeros.({10}, {:f, 32})
{bias, weights}
{Nx.mean(weights), Nx.variance(weights)}

Let’s take the first 5 rows, m1, of the validation dataset: 5x784, images by pixels. For every one of the 784 pixels in each row, we need a weight. The weights, 784x10, map to each of the 10 potential digits. The first column of weights helps figure out whether the pixels represent a 0, the second column gives the probability the pixels represent a 1, and so on up to 9.

# PyTorch
# x_valid[:5]

m1 = x_valid[0..4]
m2 = weights
{m1.shape, m2.shape}
# PyTorch
# ar,ac = m1.shape # n_rows * n_cols
# br,bc = m2.shape
# (ar,ac),(br,bc)

{ar, ac} = m1.shape
{br, bc} = m2.shape
{{ar, ac}, {br, bc}}
# PyTorch
# t1 = torch.zeros(ar, bc)
# t1.shape

t1 = init_zeros.({ar, bc}, {:f, 32})
t1.shape

When we multiply matrices together, we take row 1 of the first matrix and column 1 of the second matrix. We multiply the row 1 and column 1 elements pairwise, r1[1] times c1[1], r1[2] times c1[2], and so on, and sum the products. That sum gives the value of the very first cell in the resulting 5x10 matrix.
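
As a sanity check (my own sketch, not a cell from the Fast.ai notebook), we can compute that first cell by hand with vectors and compare it with the corresponding entry of the full Nx.dot/2 result:

# First image as a {784} vector and first weight column as a {784} vector
row0 = m1[0]
col0 = Nx.transpose(m2)[0]

# Sum of elementwise products vs. the [0][0] entry of the full product
{Nx.dot(row0, col0), Nx.dot(m1, m2)[0][0]}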

Let’s compare the time to multiply two standard Elixir matrices with the time to multiply using Nx tensors with the BinaryBackend.

Nx.dot(m1, m2)

Let’s time our Nx matrix multiplication.

dot_m1_m2_fn = fn -> Nx.dot(m1, m2) end

repeat_times = 50
{elapsed_time_micro, _} = :timer.tc(repeat, [dot_m1_m2_fn, repeat_times])
avg_elapsed_time_ms = elapsed_time_micro / 1000 / repeat_times

"avg time in milliseconds #{avg_elapsed_time_ms} total_time #{elapsed_time_micro / 1000} milliseconds"

Let’s return to closely following the Fast.ai notebook.

Elementwise ops

The point of this section is to perform an operation on each element of a tensor. In plain Elixir we would have to map over the list-of-lists data loaded above; with Nx we can operate elementwise on the tensors directly.

# PyTorch
# a = tensor([10., 6, -4])
# b = tensor([2., 8, 7])
# a,b
# --> (tensor([10.,  6., -4.]), tensor([2., 8., 7.]))

a = Nx.tensor([10.0, 6, -4])
b = Nx.tensor([2.0, 8, 7])
{a, b}
# PyTorch
# a + b
# --> tensor([12., 14.,  3.])

Nx.add(a, b)
# PyTorch
# (a < b).float().mean()
# --> tensor(0.67)

Nx.less(a, b)
|> Nx.as_type({:f, 32})
|> Nx.mean()
# PyTorch
# m = tensor([[1., 2, 3], [4,5,6], [7,8,9]]); m
# --> 
# tensor([[1., 2., 3.],
#         [4., 5., 6.],
#         [7., 8., 9.]])

# In Livebook, the result of the last expression in a cell is displayed
# automatically, so we don't need the ;m at the end.
m = Nx.tensor([[1.0, 2, 3], [4, 5, 6], [7, 8, 9]])

Frobenius norm:

We’ll use the Frobenius norm from time to time as we do generative modeling

We take each element of the matrix and square it, sum over all of the rows and columns, and then take the square root.

$$\| A \|_F = \left( \sum_{i,j=1}^n | a_{ij} |^2 \right)^{1/2}$$

Hint: you don’t normally need to write equations in LaTeX (really KaTeX) yourself. Instead, you can click ‘edit’ in Wikipedia and copy the LaTeX from there (which is what Jeremy did for the above equation). Or on arxiv.org, click “Download: Other formats” in the top right, then “Download source”; rename the downloaded file to end in .tgz if it doesn’t already, and you should find the source there, including the equations to copy and paste. This is the source LaTeX that Jeremy pasted to render the equation above:

$$\| A \|_F = \left( \sum_{i,j=1}^n | a_{ij} |^2 \right)^{1/2}$$

In my case, I went to the Fast.ai notebook’s .ipynb file to copy the KaTeX from Jeremy’s code.

To implement the Frobenius norm in Elixir: multiply m by itself elementwise, sum, and take the square root.

# PyTorch
# (m*m).sum().sqrt()
# --> tensor(16.88)

Nx.multiply(m, m)
|> Nx.sum()
|> Nx.sqrt()

This looked like a complicated math function when you first saw it, a whole bunch of squiggly things. But when you look at the code, it’s just multiply by itself, sum, and then square root.

A lot of machine learning papers have complicated looking math notations for simple or relatively simple functions in code.

Broadcasting

The term broadcasting describes how arrays with different shapes are treated during arithmetic operations. The term broadcasting was first used by Numpy.

From the Numpy Documentation:

The term broadcasting describes how numpy treats arrays with 
different shapes during arithmetic operations. Subject to certain 
constraints, the smaller array is “broadcast” across the larger 
array so that they have compatible shapes. Broadcasting provides a 
means of vectorizing array operations so that looping occurs in C
instead of Python. It does this without making needless copies of 
data and usually leads to efficient algorithm implementations.

In addition to the efficiency of broadcasting, it allows developers to write less code, which typically leads to fewer errors.

This section was adapted from Chapter 4 of the fast.ai Computational Linear Algebra course.

In turn, it was copied from the Fast.ai 01_matmul.ipynb code

# PyTorch
# a
# --> tensor([10.,  6., -4.])

a
# PyTorch
# a > 0
# -> tensor([ True,  True, False])
Nx.greater(a, 0)

How are we able to do a > 0? 0 is being broadcast to have the same dimensions as a.

For instance, you can normalize our dataset by subtracting the mean (a scalar) from the entire dataset (a matrix) and dividing by the standard deviation (another scalar), using broadcasting, as sketched below.
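
A minimal sketch of that normalization, assuming the x_train tensor defined earlier (this cell is my addition, not part of the original notebook):

# Broadcast the scalar mean and standard deviation across the whole matrix
mean = Nx.mean(x_train)
std = Nx.sqrt(Nx.variance(x_train))

normalized = x_train |> Nx.subtract(mean) |> Nx.divide(std)
{Nx.mean(normalized), Nx.variance(normalized)}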

Other examples of broadcasting with a scalar:

# PyTorch
# a + 1
# --> tensor([11.,  7., -3.])

Nx.add(a, 1)
# The scalar can be in either position
Nx.add(1, a)
m = Nx.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
# PyTorch
# 2*m
# --> 
# tensor([[ 2.,  4.,  6.],
#         [ 8., 10., 12.],
#         [14., 16., 18.]])
Nx.multiply(m, 2)
Nx.multiply(2, m)

Broadcasting a vector to a matrix

Although broadcasting a scalar is an idea that dates back to APL, the more powerful idea of broadcasting across higher rank tensors comes from a little known language called Yorick.

We can also broadcast a vector to a matrix:

# PyTorch
# c = tensor([10.,20,30]); c
# --> tensor([10., 20., 30.])

# the vector
c = Nx.tensor([10.0, 20.0, 30.0])
# the matrix
m
# PyTorch
# m.shape,c.shape
# --> (torch.Size([3, 3]), torch.Size([3]))

{m.shape, c.shape}
# PyTorch
# m + c
# --> 
# tensor([[11., 22., 33.],
#         [14., 25., 36.],
#         [17., 28., 39.]])

# The vector is broadcast across the matrix shape and added
Nx.add(c, m)
# reverse the order and still the same answer
Nx.add(m, c)

Here is the trick that allows the matrix and vector to be added. The expand_as method expands the vector to be the same shape as m.

We don’t really copy the rows, but it looks as if we did. In fact, the rows are given a stride of 0.

Elixir: We aren’t sure whether Nx.broadcast actually copies the rows or just looks like it does, à la PyTorch.

# PyTorch
# t = c.expand_as(m); t
# --> 
# tensor([[10., 20., 30.],
#         [10., 20., 30.],
#         [10., 20., 30.]])
# 
# 

t = Nx.broadcast(c, m.shape)

I don’t believe the following tensor code has an Nx equivalent

# PyTorch
# t.storage()  # Not sure there is an Nx equivalent
# t.stride(), # Not sure there is an Nx equivalent

In PyTorch, you can index with the special value [None] or use unsqueeze() to convert a 1-dimensional array into a 2-dimensional array (although one of those dimensions has value 1).

The Nx equivalent is Nx.reshape/2.

# PyTorch
# c
# -->
# tensor([10., 20., 30.])
c

Both unsqueeze and c[something, something_else] map to Nx.reshape. We’ll just show the Nx.reshape once.

This is how we create a matrix with one row

# PyTorch
# c.unsqueeze(0), c[None, :]
# --> (tensor([[10., 20., 30.]]), tensor([[10., 20., 30.]]))

Nx.reshape(c, {1, :auto})
# PyTorch
# c.shape, c.unsqueeze(0).shape
# --> (torch.Size([3]), torch.Size([1, 3]))

{c.shape, Nx.reshape(c, {1, :auto}).shape}
# c.unsqueeze(1), c[:, None]
# --> (tensor([[10.],
#          [20.],
#          [30.]]),
#  tensor([[10.],
#          [20.],
#          [30.]]))

# This is how we create a matrix with one column.
Nx.reshape(c, {:auto, 1})
# PyTorch
# c.shape, c.unsqueeze(1).shape
# --> (torch.Size([3]), torch.Size([3, 1]))

{c.shape, Nx.reshape(c, {:auto, 1}).shape}

In PyTorch, you can skip trailing ‘:’s, and ‘…’ means ‘all preceding dimensions’.

# PyTorch
# c[None].shape,c[...,None].shape
# --> (torch.Size([1, 3]), torch.Size([3, 1]))

{Nx.reshape(c, {1, :auto}).shape, Nx.reshape(c, {:auto, 1}).shape}

Below, we take the vector, transform it into a matrix with one column, and then broadcast the result into a matrix with m’s shape.

# PyTorch
# c[:,None].expand_as(m)
# --> tensor([[10., 10., 10.],
# [20., 20., 20.],
# [30., 30., 30.]])

Nx.reshape(c, {:auto, 1})
|> Nx.broadcast(m.shape)

As a reminder, in this case we are adding the vector to each row.

# PyTorch
# m + c
# -->
# tensor([[11., 22., 33.],
#         [14., 25., 36.],
#         [17., 28., 39.]])

Nx.add(m, c)

Here we are transforming the vector c into a matrix with one column and then broadcasting into the shape of m. Then we add the two matrices together.

# PyTorch
# m + c[:,None]
# --> tensor([[11., 12., 13.],
# [24., 25., 26.],
# [37., 38., 39.]])

Nx.add(m, Nx.reshape(c, {:auto, 1}) |> Nx.broadcast(m.shape))

Here we are transforming the vector c into a matrix with one row and then broadcasting into the shape of m. Then we add the two matrices together.

# PyTorch
# m + c[None,:]
# tensor([[11., 22., 33.],
#         [14., 25., 36.],
#         [17., 28., 39.]])

# Nx.add(m, Nx.reshape(c, {1, :auto}) |> Nx.broadcast(m.shape) )
Nx.add(m, Nx.reshape(c, {1, :auto}) |> Nx.broadcast(m.shape))

Broadcasting Rules

# PyTorch
# c[None,:]
# --> tensor([[10., 20., 30.]])

Nx.reshape(c, {1, :auto})
# PyTorch
# c[None,:].shape
# --> torch.Size([1, 3])

Nx.reshape(c, {1, :auto}).shape
# PyTorch
# c[:,None]
# --> tensor([[10.],
#         [20.],
#         [30.]])

Nx.reshape(c, {:auto, 1})
# PyTorch
# c[:,None].shape
# --> torch.Size([3, 1])

Nx.reshape(c, {:auto, 1}).shape

Here we take the vector c and reshape it into a matrix with one column. Then we take c again and reshape it into a matrix with one row. The multiply function expands the one-column matrix into 3 columns with the same values; the same thing happens to the one-row matrix, which expands into 3 rows.

We end up with 3 rows of 10, 20, 30 and 3 columns of 10, 20, 30.

When we multiply them together, we get this answer. This is an outer product without any special function, just broadcasting. And it’s not just products; we can do outer boolean operations and more.

# PyTorch
# c[None,:] * c[:,None]
# --> tensor([[100., 200., 300.],
#         [200., 400., 600.],
#         [300., 600., 900.]])

Nx.multiply(Nx.reshape(c, {1, :auto}), Nx.reshape(c, {:auto, 1}))

Here is an example of an outer boolean operation.

# PyTorch
# c[None] > c[:,None]
# --> tensor([[False,  True,  True],
#         [False, False,  True],
#         [False, False, False]])

Nx.greater(Nx.reshape(c, {:auto}), Nx.reshape(c, {:auto, 1}))

When operating on two arrays/tensors, Numpy/PyTorch compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

  • they are equal, or
  • one of them is 1, in which case that dimension is broadcast to match the other

Arrays do not need to have the same number of dimensions. For example, if you have a 256*256*3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible (a small Nx sketch follows the diagram):

Image  (3d array): 256 x 256 x 3
Scale  (1d array):             3
Result (3d array): 256 x 256 x 3
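
Here is a minimal Nx sketch of the same rule, using a stand-in tensor rather than a real image:

# A fake {256, 256, 3} "image" and one scale factor per color channel
image = Nx.iota({256, 256, 3}, type: {:f, 32})
scale = Nx.tensor([0.5, 1.0, 2.0])

# {256, 256, 3} * {3} broadcasts across the trailing axis
scaled = Nx.multiply(image, scale)
scaled.shape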

The numpy documentation includes several examples of what dimensions can and can not be broadcast together.

Matmul using Nx

As a reminder, we defined these tensors further back in the notebook.

x_valid
weights
tr = Nx.dot(x_valid, weights)

Using the default BinaryBackend, the above dot() function returns in about 28 seconds on our Linux computer. Not nearly as quick as the broadcast example in the Fast.ai course.

Let’s explore how this same matrix multiplication works for different backends next. To keep things simple and focused, we’ll stop this notebook here and create different notebooks to focus on Nx on XLA using the CPU and XLA using the GPU.

TODOs:

  • I haven’t explored the difference between Nx.dot and Nx.multiply. In light of what we’ve learned so far, when would multiply be more appropriate? (A short sketch contrasting the two follows this list.)

  • Need to explore swapping out backends.

  • Demonstrate EXLA CPU backend, speed improvements vs BinaryBackend. Demonstrate EXLA GPU backend. Would like to demonstrate how to switch from CPU to GPU and back to CPU in the same notebook.

  • Would like to demonstrate the TorchScript backend. Torchx hasn’t had as much focus as the XLA backend. The dynamic U-Net from Fast.ai probably won’t work well in XLA; if so, Torchx might prove important.
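
For the first TODO, here is a short hedged sketch of my current understanding: Nx.multiply/2 is elementwise (with broadcasting), while Nx.dot/2 performs the matrix product we’ve been using.

a = Nx.tensor([[1, 2], [3, 4]])
b = Nx.tensor([[5, 6], [7, 8]])

# Elementwise: [[5, 12], [21, 32]]
Nx.multiply(a, b)

# Matrix product: [[19, 22], [43, 50]]
Nx.dot(a, b)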

fastai, livebook, axon, foundations, matrix_multiplication, deep_learning

Sharing Axon Models on Huggingface

Finding a place to share our models

The Axon (https://github.com/elixir-nx/axon) community was looking for ways to share models. What if we used Huggingface’s Hub as a model repository? I knew that the Hub is designed to be framework agnostic, but how agnostic is it really?

Huggingface Hub is kind of like GitHub, but focused on machine learning concerns like models and shared datasets. Their Hub model repository is built around the concept of a model card. The card contains documentation about the model, its intended uses, limitations, along with other descriptive information. The objective of the card is to help Hub users identify whether the model will fit their needs. The Hub might work for us.

Now we need to find a model to publish. There is a wealth of already developed models in a variety of frameworks. The axon_onnx package supports reading in ONNX models and converting them to Axon models. Axon_onnx is a young library and doesn’t handle all ONNX capabilities, but it does provide support for ResNet models.

Where can we find reusable ONNX ResNet models that we can convert to Axon? How about the ONNX model repository on GitHub, https://github.com/onnx/models. I focused on the ImageNet-trained classification models. ResNet ImageNet models are often used as the backbone for transfer training a model in a new business domain. Having access to Axon ResNet models can help the community grow toward supporting useful ML models.

We have a potential model repository and an ONNX model. I wrote some prototype code to convert ONNX models into Axon models. There were some challenges. I tried a ResNet34-v2 model first and found some shape-related errors. The small and medium ResNet-v1 models worked well, though. At the time, I couldn’t get a ResNet101 model to convert to Axon. So, we have 3 models that we can try publishing on Huggingface (a hedged conversion sketch follows the list below):

  1. ResNet18-v1
  2. ResNet34-v1
  3. ResNet50-v1
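
The conversion step looked roughly like the sketch below. Treat it as an assumption on my part: the AxonOnnx.import function name and the local file name are illustrative, so check the axon_onnx documentation for the exact API in your version.

# Convert a downloaded ONNX ResNet file into an Axon model plus its parameters
{model, params} = AxonOnnx.import("resnet34-v1-7.onnx")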

I worked with my contact at Huggingface to make sure that they were ok with the Axon community using their Hub as an Axon model repository. They do have some expectations. Each page holds only one model, so the three models I’ve converted each need their own model page, and the model card should be robust and useful. Beyond those expectations, the Axon community is welcome to use Huggingface Hub as a model repository. THANKS Huggingface!

Now let’s upload our first model. ResNet34 is my favorite. It can be pretty accurate while small enough to be relatively efficient in a fairly intense production environment. The instructions at https://huggingface.co/docs/hub/adding-a-model worked pretty well. If, by some chance, the upload connection times out, be sure to check your email for a notification that the upload completed.

Huggingface has some expectations for model sharing. As their documentation states, the model card “is arguably as important as the model”. We want to help Hub users find our Axon models. The Huggingface documentation for a model card is found at: Building a Model Card. For the ResNet models, I copied the model information from the ONNX Classification ResNet page. Be sure to add the metadata to the README.md:

---
license: apache-2.0
tags:
- Axon
- Elixir
datasets:
- ImageNet
---

We’ve updated our model card, and all is good. How do we share with other Axon users? There is a trick to finding Axon models. Axon isn’t big enough for Huggingface to add it to their list of libraries, so there are no buttons we can push on Huggingface’s Hub. The web page search finds models by owner name or model name, so we have to find Axon models by their tags. It’s possible, but a bit cryptic, to find Axon models by tag on the Huggingface website:

https://huggingface.co/models?other=Axon
or
https://huggingface.co/models?other=Elixir

Stefano Falsetto to the rescue! When I first mentioned the potential of Huggingface Hub on the Erlang Ecosystem Foundation’s Slack channel, there were some concerns about Axon users needing Python in order to interact with the Hub from the command line. Stefano took on the challenge. He quickly worked up a prototype conversion of https://github.com/huggingface/huggingface_hub.
He even got the attention of some folks at Huggingface’s Twitter account; I know my contact called me about his library. The initial version was just a shell, but around the time I was preparing this blog post, he had made significant progress with his library, https://github.com/stefkohub/Elixir_Huggingface_hub.
The big news is that you can search for Axon models on Huggingface directly.

# List all Axon models.
iex> HuggingfaceHub.list_axon_models()

There is also a way to download a model using his library. Check it out and consider contributing to and improving Elixir’s Huggingface Hub interface.

I’ve created an Axon Organization on Huggingface. My expectation for this organization is that we can store Axon models for the more famous modeling approaches in the industry. Essentially, someday the vast majority of the ONNX model repository could be mirrored in the Axon Huggingface organization. My hope is that every model in the Axon organization has an excellent model card.

If you want to contribute to axon_onnx and the Huggingface Axon organization, try seeing which models can be easily converted to Axon. When you are successful, request to be added to the Axon organization. The admins will hopefully ask which model you want to contribute and discuss model card expectations. Once granted access, you can add your model to the organization. When you find problematic conversions, consider finding the solution and contributing the capability back to axon_onnx.

Not every model belongs in the Axon organization. For example, I have plans for a pet breeds model based upon the Oxford pets dataset. Such a model wouldn’t be appropriate for the Axon organization. I’ll upload it to my Huggingface account. You can always add your model to your account.

Summary: We have Huggingface’s permission to store Axon models in their public repository.
We have Stefano’s HuggingfaceHub Elixir library to help access and interact with the Hub. We know a good source of ONNX models and a path to convert them to Axon. What will you share?

axon, huggingface, imagenet

Introducing Along the Axon

An Exploration of the Elixir Machine Learning Framework - Axon

Welcome to Along the Axon. My name is Scott Mueller. In this blog, I will be writing about using Axon for Machine Learning. I’ll be exploring the framework from an applied point of view. How can we use Axon to meet business needs?

My initial goals are to discuss these areas of the framework

  1. Realistic use of the framework
  2. Code focused on applying Axon from a business point of view
  3. Different kinds of business problems that can be solved with Axon
  4. How Axon compares to other frameworks
  5. Can we team with other frameworks to build realistic business capabilities?
  6. What are current weaknesses and how we can improve Axon?
  7. Incorporating best practices of other ML training frameworks
  8. Incrementally add capabilities that mimic the best parts of other ML frameworks
  9. What can we do to improve the training capability of Axon?
  10. A code focus rather than math focus
  11. I’ll leave it to other sources to focus on the math behind Axon

My point of view comes from trying to use Elixir and Machine Learning in an embedded system.
Admittedly a powerful computer, but the focus is inference on the edge rather than in the cloud.

blog