Introduction
Deep learning systems have produced incredible results in recent years. However, due to a fracturing ecosystem and the performance-demanding nature of the domain, infrastructure poses a substantial ongoing engineering challenge for developers looking to deploy this technology on their platform of choice – phones, laptops, embedded devices – all of which are capable of running models of differing sizes. We will examine some of the most prevalent technologies addressing the problem of machine learning infrastructure. At its core, we are talking about differentiable algorithms operating on a ‘Tensor’ data structure. Given the nature of such tensor computations, these algorithms are “embarrassingly parallel” and are best executed on many cores, typically with a GPU. First, we examine a paper surveying different directions in hardware accelerators for deep learning, in order to motivate the subsequent discussion of an article on current deep learning compilers. Then, we cover another article which surveys the ecosystem for deep learning on mobile and embedded platforms, specifically the optimizations meant to reduce size and latency and generally improve performance on resource-constrained platforms. Finally, we look at several newer frameworks in order to better understand how infrastructure contributes to the deployment of deep learning to different platforms, both high-performance and resource-constrained.
A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms
GPU
The GPU (graphics processing unit) stands alone as the most well-known and popular choice for massively parallel computing today, and it is not difficult to see why. The GPU is a logical extension of multi-core processor designs and has had nearly 30 years of widespread development in its modern form. A meaningful distinction between GPU and CPU cores is the GPU's limited support for branching, which makes GPU programs, taken on their own, generally not Turing-complete. Generally speaking, the hardware organization and programming model for GPUs is the same regardless of vendor or product, ignoring the (often proprietary) driver stack and supported programming languages. The GPU consists of a number of ‘streaming multiprocessors’ (SMs), which dispatch parallel computations as atomic vector operations on ‘warps’ (CUDA nomenclature). There are also ‘local’ and ‘shared’ memory pools: local memory is specific to a streaming multiprocessor, and shared memory is shared between a locality of multiprocessors. A single GPU program is called a ‘kernel’; its results are written back to main memory when it completes.
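To make the programming model concrete, the following is a minimal sketch (plain Python/NumPy, no real GPU) of the CUDA-style execution hierarchy described above: a grid of thread blocks, each block of threads running the same kernel body over different data. The names grid_dim and block_dim mirror CUDA nomenclature, but this is only an emulation of the idea, not vendor API code.

```python
import numpy as np

def vector_add_kernel(global_idx, a, b, out):
    """Body of the 'kernel': what a single GPU thread would do."""
    if global_idx < out.size:          # guard against threads past the end of the data
        out[global_idx] = a[global_idx] + b[global_idx]

def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a kernel launch: every (block, thread) pair runs the same code."""
    for block_idx in range(grid_dim):          # blocks are scheduled across SMs
        for thread_idx in range(block_dim):    # threads within a block are grouped into warps
            kernel(block_idx * block_dim + thread_idx, *args)

n = 1000
a, b = np.random.rand(n), np.random.rand(n)
out = np.empty(n)
block_dim = 128                                 # threads per block
grid_dim = (n + block_dim - 1) // block_dim     # enough blocks to cover all n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
assert np.allclose(out, a + b)
```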
TPU
TPUs (tensor processing units), developed internally at Google, are the next most popular deep learning solution for the datacenter. They leverage ‘tiling’, also known as ‘systolic arrays’, a matrix multiplication technique that better accommodates deep learning workloads. Systolic arrays are now widely used in specialized deep learning hardware, such as the NVIDIA “tensor cores” now found in the streaming multiprocessors of their GPUs.
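The tiling idea can be sketched in software: a fixed-size multiply-accumulate grid processes a large matrix multiplication block by block, accumulating partial products into the output. The NumPy sketch below uses an arbitrary illustrative tile size of 4; it demonstrates the blocking scheme only, not how a real systolic array streams data through its cells.

```python
import numpy as np

TILE = 4  # dimension of the (hypothetical) hardware multiply-accumulate array

def tiled_matmul(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % TILE == 0 and K % TILE == 0 and N % TILE == 0
    C = np.zeros((M, N))
    for i in range(0, M, TILE):
        for j in range(0, N, TILE):
            for k in range(0, K, TILE):
                # One pass of the fixed-size array: multiply a TILE x TILE block
                # of A by a TILE x TILE block of B and accumulate into C.
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]
    return C

A, B = np.random.rand(8, 12), np.random.rand(12, 16)
assert np.allclose(tiled_matmul(A, B), A @ B)
```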
FPGA
FPGAs (field-programmable gate arrays) lend themselves to high efficiency and low power consumption in deep learning thanks to their reconfigurable hardware design, which allows the silicon to closely mirror the model being deployed. An FPGA can be reconfigured by the programmer to take on any of the other designs described here, at the cost of space efficiency on the chip. Because an FPGA can be used to test and debug many different designs without the prohibitive cost of taping out new silicon, it has seen more widespread use in data centers in recent years.
ASIC
ASIC stands for ‘Application-Specific Integrated Circuit’; essentially, it is an extension of an existing design that offloads some application-specific compute task, such as multiplication, convolution, or matrix multiplication, to a special module. Often packaged as so-called “chiplets”, these devices can be incorporated into an existing design as a module on the same piece of silicon. The NPU (neural processing unit), for example, is designed to efficiently compute many fused multiply-accumulate operations, which are the main processing elements in most deep learning models (more on this later); such integrated circuits have been incorporated into several new mobile computing products, such as phones and laptops. These designs may include specialized memory stores and often DMA (direct memory access) engines. The author also touches on approaches which use a DAC (digital-to-analog converter) to compute the multiply-accumulate operations over an analog signal. Design trade-offs are often made along the axes of memory organization and specialized ISAs (instruction set architectures) as they relate to Flynn’s taxonomy. Some of these designs adopt a VLIW (very long instruction word) architecture rather than the traditional superscalar approach for expressing parallel instructions. Such design decisions have a significant effect on the compiler back-end.
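To show why the fused multiply-accumulate is the workhorse operation an NPU accelerates, the sketch below expresses a dot product and a small 1D convolution as chains of a = a + (w * x) steps. This is plain Python for clarity; a real NPU would execute many such MACs in parallel in hardware.

```python
import numpy as np

def mac_dot(w, x):
    acc = 0.0
    for wi, xi in zip(w, x):
        acc += wi * xi            # one fused multiply-accumulate per step
    return acc

def conv1d(signal, kernel):
    k = len(kernel)
    return np.array([mac_dot(kernel, signal[i:i+k])
                     for i in range(len(signal) - k + 1)])

x = np.random.rand(16)
w = np.array([0.25, 0.5, 0.25])
assert np.allclose(conv1d(x, w), np.correlate(x, w, mode="valid"))
```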
RISC-V
The RISC-V ISA (instruction set architecture) is an open architecture which aims to be more minimal, understandable, and extensible than existing architectures. Several extensions to the core RISC-V instruction set support instruction-level parallelism for scalar, vector, and matrix operations. This allows researchers to implement new accelerators as instruction set extensions rather than implementing a whole new bespoke core for their accelerator. For example, the NPUs discussed above can be implemented as a RISC-V extension, and quantization, discussed in more detail in a later section, can be supported on-device as an extension. This approach is particularly useful for resource-constrained, low-power, and embedded devices, a space in which RISC-V is already becoming dominant.
Emerging technologies
The author also discusses accelerators for sparse matrices, a compressed matrix format which stores only the non-zero terms; such accelerators require the data to be packed into a sparse format before processing. Another emerging technology discussed here is ‘in-memory computing’, which embeds processing elements directly in memory, a potentially very promising approach. Another, which has gotten a bit of buzz in the industry, is the ‘neuromorphic’ chip, which uses co-integrated memory and processing elements, aiming to replicate the bio-inspired design of ‘spiking neural networks’. Several other classes of designs were discussed in this paper, but I am leaving them out for brevity’s sake and to avoid speculating on designs which have not yet been taped out. In each of these designs, the main bottleneck is off-chip memory access. For this reason, it can be advantageous to place the accelerator on the same chip as main memory and the main data path, and this same concern motivates nearly all of the design decisions discussed above. Given optimal code generated for these accelerators, we should be able to perform all necessary computation in a minimal number of accesses to main memory.
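As a small illustration of the sparse-packing step these accelerators depend on, the sketch below stores a matrix in CSR (compressed sparse row) form, one common compressed layout, and performs a matrix-vector product that only touches the non-zero entries. This is an assumption-light toy example of the general idea, not any particular accelerator's format.

```python
import numpy as np

def to_csr(dense):
    """Pack a dense matrix into CSR: values, column indices, and row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):   # non-zeros of row i only
            y[i] += values[k] * x[col_idx[k]]
    return y

M = np.array([[0., 2., 0.], [1., 0., 0.], [0., 0., 3.]])
x = np.array([1., 2., 3.])
assert np.allclose(csr_matvec(*to_csr(M), x), M @ x)
```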
The Deep Learning Compiler: A Comprehensive Survey
Deep learning models are designed using a number of higher-level frameworks, predominantly in convenient scripting languages like Python or Lua, and implemented in a lower-level language like C/C++. The most popular of these frameworks are PyTorch, TensorFlow, and MXNet, among others. In light of the preceding paper, we have seen that the target hardware can be quite diverse, even when limited to what is typically available: the CPU and GPU. Thus, the goal of the technologies in the following discussion is to provide interoperability from the researcher’s implementation of a model in their framework of choice to deployment on diverse hardware. The compilers discussed in the following section are generally designed with first-class support for the GPU (or the hardware of the backing organization), but the long-term goal is ideally broad compatibility, and there is mutual incentive for organizations to reduce engineering costs by supporting open source. To quote the paper, “The DL compilers take the model definitions described in the DL frameworks as inputs, and generate efficient code implementations on various DL hardware as outputs”.
Common Design Architecture of Deep Learning compilers
A compiler is a translation program: source text is used to specify machine programs. In modern compilers, there are intermediate representations (IRs) in between, which allow program optimizations to be approached systematically. This concept was popularized by the design of LLVM (low-level virtual machine), where the original source code is translated into LLVM IR and then passed on to language-agnostic, hardware-specific back-ends. The newer generation of deep learning compilers takes advantage of this technology and extends the concept to ‘multi-level IR’. Typically this means there is a high-level IR, responsible for representing the computational graph, and a low-level IR, responsible for representing the computations as optimized parallel routines, analogous to GPU kernels. The front-end in this scenario must take a model from one of the frameworks described above and transform it into the graph of computations described by the programmer. The back-end must also differ in a number of ways from a typical LLVM back-end: in a heterogeneous compute scenario, it must take IR fine-grained enough to produce optimized instruction-level binaries for the different components, like the CPU, GPU, etc., yet be flexible enough to remain somewhat platform- and vendor-independent. The translation from high-level IR to low-level IR is also considered part of the back-end of the compiler. There are different optimizations taken at each stage of the compiler, which the paper discusses in detail.
Key components
High-Level IR
Previously we discussed high-level IR as a graph representation where nodes are computations and edges are tensors. The graph is acyclic, which differs from the dependency graph of the AST (abstract syntax tree) in a typical compiler; however, it does not capture the concept of ‘scope’ in the way the AST does. To address this drawback, many IRs introduce ‘let’-bindings, where certain functions may contain a ‘let’ expression which creates a node in the graph pointing to the operator and variable, which can then be accessed (read and written) at different points in the dependency graph. Nearly all deep learning compilers implement their own high-level IR, and these representations must be flexible and extensible enough to support diverse deep learning models. Choices made in the graph representation are quite important: whether or not to support control flow (non-deterministic branching) is a design decision with serious implications (recall that if branching is required, the CPU generally is as well), as are dynamic shapes and symbolic programming. In order to support training, high-level IRs must be able to support a derivative operator which differentiates other operators (either automatically or via pre-made derivatives).
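The sketch below is a toy version of such a graph-based high-level IR: each node names an operator and the producer nodes it consumes, so the edges carry tensors between operators. It mirrors only the general idea described above, not any particular compiler's IR; the tiny interpreter is included just to show that the representation is well-defined.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Node:
    op: str                                       # e.g. "input", "matmul", "add", "relu"
    inputs: list = field(default_factory=list)    # producer nodes (the incoming edges)

def evaluate(node, env):
    """Interpret the graph with NumPy to check the representation makes sense."""
    if node.op == "input":
        return env[id(node)]
    args = [evaluate(i, env) for i in node.inputs]
    if node.op == "matmul":
        return args[0] @ args[1]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "relu":
        return np.maximum(args[0], 0.0)
    raise ValueError(f"unknown op {node.op}")

# y = relu(x @ W + b): a one-layer model expressed as a graph of operator nodes
x, W, b = Node("input"), Node("input"), Node("input")
y = Node("relu", [Node("add", [Node("matmul", [x, W]), b])])
env = {id(x): np.random.rand(2, 3), id(W): np.random.rand(3, 4), id(b): np.zeros(4)}
print(evaluate(y, env).shape)                     # (2, 4)
```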
Low-Level IR
The paper breaks the discussion of low-level IRs into three categories: Halide-based IR, polyhedral-based IR, and others. Halide was originally built to parallelize image processing, and it separates the concepts of computation and schedule, trying multiple different schedules to find the one which makes the fewest memory accesses. Polyhedral-based IR uses mathematical methods to optimize loop-based code. Where the memory bounds in Halide are purely rectangular, polyhedral IR can represent any shape expressible in the polyhedral model, which makes some optimizations easier to apply and others harder. Other than these two types, there is also Glow’s (a Facebook DL compiler project) low-level IR, and MLIR. The goal of MLIR is to provide an extensible framework for creating IRs, enabling interoperability between different existing IR platforms and low-level APIs. Existing platforms can be implemented as MLIR “dialects”, and MLIR has the capability to transform between dialects. This is an exciting development in the compilers space, and it backs emerging technologies such as the Mojo programming language, a project led by the creator of LLVM and MLIR (note that Modular AI’s work will probably be closed source). This discussion of low-level IR leads into the concept of JIT (just-in-time) versus traditional AOT (ahead-of-time) compilation. JIT means that the executables are generated on the fly at runtime, which can be useful and convenient, but may have access to fewer optimizations due to real-time constraints. JIT is necessary when using an interpreter like Python.
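Halide's separation of computation and schedule can be sketched as follows: the definition of each output element stays fixed while the loop structure around it changes. Both schedules below produce identical results; a compiler searches among many such schedules for the one with the best memory behavior. This is a conceptual Python sketch, not Halide syntax.

```python
import numpy as np

def compute(i, j, A, B):
    return A[i, j] + B[i, j]        # what each output element is, independent of schedule

def schedule_rowmajor(A, B):
    out = np.empty_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            out[i, j] = compute(i, j, A, B)
    return out

def schedule_tiled(A, B, tile=4):
    out = np.empty_like(A)
    for ii in range(0, A.shape[0], tile):          # iterate tile by tile for locality
        for jj in range(0, A.shape[1], tile):
            for i in range(ii, min(ii + tile, A.shape[0])):
                for j in range(jj, min(jj + tile, A.shape[1])):
                    out[i, j] = compute(i, j, A, B)
    return out

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(schedule_rowmajor(A, B), schedule_tiled(A, B))
```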
Front-end optimizations
This may be the most important section of this review in terms of performance. There are many front-end optimizations which can be carried out at the graph stage, the most important and generic being ‘operator fusion’. We discussed in the previous paper how accesses to main memory are the primary bottleneck of massively parallel systems. Maximally fusing operators allows a single pipeline or GPU kernel to contain the largest amount of computation before writing back to main memory, and thus can greatly improve performance if done effectively. Theoretically, with well-designed and flexible IRs, the entire model can be made to fit into a single GPU kernel. This is done by passing over the computational graph several times, removing and fusing nodes based on known simplifications. These include simple algebraic simplifications and rearrangements which enable optimized loops. For example, the sequence of nodes ‘Multiply - Multiply - Multiply - Add - Add’ can be simplified into a single ‘Multiply-Add’ operation on an input tensor, as sketched below.
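The sketch below illustrates why this matters. In the unfused version, each elementwise operation materializes a full intermediate tensor (a round trip to main memory); after constant folding, the same Multiply-Multiply-Multiply-Add-Add chain reduces to what a compiler could emit as a single fused multiply-add kernel that touches each input element once. This is a NumPy illustration of the transformation, not any specific compiler's pass.

```python
import numpy as np

def unfused(x, a, b, c, d, e):
    t1 = x * a        # intermediate tensor written to memory
    t2 = t1 * b       # another intermediate
    t3 = t2 * c
    t4 = t3 + d
    return t4 + e

def fused(x, a, b, c, d, e):
    scale = a * b * c          # constants folded at compile time
    offset = d + e
    return x * scale + offset  # what the generated kernel would do: one multiply-add per element

x = np.random.rand(1_000_000)
assert np.allclose(unfused(x, 2.0, 3.0, 0.5, 1.0, -1.0),
                   fused(x, 2.0, 3.0, 0.5, 1.0, -1.0))
```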
Back-end optimizations
Depending on the actual underlying hardware, certain constructs already have optimized implementations in hardware; for example, the hardware engineers may have built a very fast 8x8 matrix multiply. Such hardware intrinsics are the basis of the PyTorch + NVIDIA cuDNN integration which currently dominates the space: NVIDIA engineers provide (at a fairly high level) implementations of common deep learning constructs, and these are directly targeted by PyTorch’s code generation. Wherever this type of operation occurs in the low-level IR, it should be replaced with the intrinsic. Operations that do not depend on one another can be executed in parallel (as in, on different CPU threads). Other optimizations include loop fusion, loop unrolling, memory optimizations, register allocation, strength reduction, and many others. In most cases, there is less to be gained here in theory than with the front-end optimizations.
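A toy sketch of intrinsic mapping, under the assumption of a hypothetical hardware primitive named fast_matmul_8x8 (a stand-in, not a real cuDNN or vendor API): when a matmul's shapes fit the primitive, the generic implementation is replaced by repeated calls to it, otherwise the back-end falls back to generic code.

```python
import numpy as np

def fast_matmul_8x8(a_blk, b_blk):
    return a_blk @ b_blk           # pretend this is a single hardware instruction

def lower_matmul(A, B):
    M, K = A.shape
    _, N = B.shape
    if M % 8 or K % 8 or N % 8:
        return A @ B               # shapes do not fit: fall back to the generic path
    C = np.zeros((M, N))
    for i in range(0, M, 8):
        for j in range(0, N, 8):
            for k in range(0, K, 8):
                C[i:i+8, j:j+8] += fast_matmul_8x8(A[i:i+8, k:k+8], B[k:k+8, j:j+8])
    return C

A, B = np.random.rand(16, 24), np.random.rand(24, 8)
assert np.allclose(lower_matmul(A, B), A @ B)
```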
Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications
In the interest of deploying deep learning models to the wide array of diverse mobile platforms and embedded devices found on the market, several techniques have been found to improve performance. Compared to the optimizing compilers covered in the previous work, the methods covered here operate at the level of the model itself, generally dealing with the representation of the data. We will also consider how these techniques interact with compilers and how compilers affect their ease of deployment. Where the compilers can be considered a form of ‘lossless’ compression, we look now at forms of ‘lossy’ compression that can be used to further shrink and speed up deep learning models.
Model Compression
Pruning
We know that neural networks are represented by a set of ‘layers’ – dense linear connections between vectors, represented by tensors. Pruning is a method whereby certain connections between layers are dropped because they do not contribute much to the result. Essentially, numbers close enough to zero in the weight matrices are converted to zeros, and the matrix can then be represented sparsely, making the model smaller and faster. The problem with this method is that the pruned matrices are not uniform, which makes efficiently representing the computations difficult for the programmer. The paper expands on some interesting methods for doing this well automatically.
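A minimal sketch of the basic idea, magnitude pruning: weights whose absolute value falls below a threshold are zeroed out, after which the layer can be stored and executed in a sparse format. Choosing the threshold to hit a fixed sparsity level is an illustrative assumption; the paper covers more sophisticated criteria.

```python
import numpy as np

def magnitude_prune(W, sparsity=0.9):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), sparsity)
    mask = np.abs(W) >= threshold
    return W * mask, mask

W = np.random.randn(256, 256)
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print(f"non-zero weights kept: {mask.mean():.1%}")   # roughly 10%
```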
Low-Rank Factorization
Low-rank factorization is a method that has gained popularity for quickly adapting models to new specializations and training sets. This method can also be used to reduce model size: by taking a low-rank factorization of each weight matrix, a significant bulk of the information in the matrix can be represented in a pair of much smaller matrices. This method is highly lossy.
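A sketch of the compression arithmetic, using a truncated SVD as the factorization: an m x n weight matrix W is approximated by two factors of rank r, so the layer stores r*(m+n) numbers instead of m*n. The rank of 32 below is an arbitrary illustrative choice.

```python
import numpy as np

def low_rank_factorize(W, rank):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]       # fold singular values into one factor
    V_r = Vt[:rank, :]
    return U_r, V_r

W = np.random.randn(512, 256)
U_r, V_r = low_rank_factorize(W, rank=32)
W_approx = U_r @ V_r                    # lossy reconstruction of W
compression = (U_r.size + V_r.size) / W.size
print(f"stored parameters: {compression:.1%} of original")
```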
Quantization
Quantization is the name for a family of techniques wherein the numerical representation of the model is converted to a smaller one. For example, most models are traditionally trained in 32-bit floating point, but it is possible to convert each number in the model to a 16-bit or even 8-bit representation, at the cost of accuracy. This is a simple concept with a lot of hidden complexity, and the paper covers many interesting works which seek to reduce the information loss associated with quantization. There has also been some very interesting work on developing new numerical representations tailored to deep learning.
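A minimal sketch of the core idea, symmetric per-tensor post-training quantization to int8: 32-bit weights are mapped to 8-bit integers with a single scale factor and dequantized back for comparison. Real schemes (per-channel scales, zero points, quantization-aware training) are more involved; this shows only the basic mapping and its error.

```python
import numpy as np

def quantize_int8(W):
    scale = np.abs(W).max() / 127.0              # symmetric per-tensor scale
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(W)
error = np.abs(W - dequantize(q, scale)).max()
print(f"4x smaller storage, max absolute error: {error:.4f}")
```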
Selection of Software projects
OpenXLA
OpenXLA is the deep learning compiler infrastructure developed within the Google ecosystem. XLA has a long history of co-development with the TPU and has recently moved into the open-source ecosystem. It defines over 2000 operations and is integrated with the MLIR ecosystem. It has its own high-level IR called StableHLO, robust graph fusions, support for both training and inference, and support for non-deterministic branching behavior. In order to accommodate these impressive features, it is quite a large package with more overhead, but it produces a highly optimized executable.
Tensorflow Lite Micro
TensorFlow Lite for Microcontrollers is an interesting project out of the TensorFlow ecosystem which originally aimed to target very resource-constrained platforms like microcontrollers, where there is potentially no operating system. It is not a compiler like XLA but an interpreter, an interesting choice given the performance cost. In order to accommodate its target platforms, the interpreter makes some interesting design decisions: it uses a reduced operation set and optimizes the TFLite graph-based IR into its own operators, and it has an optimizing memory planner and supports multithreading and multi-tenancy.
GGML
GGML is a newer project which has gained popularity in the open-source community for its speedy and efficient implementations of the newest models. GGML is not a compiler; it is simply a tensor library meant to be more accessible and hackable than the complexity of existing deployment technologies. It is very minimal, meaning developers have an easier time going in and hand-optimizing their model for inference on a given platform. In the long run, however, hand-tuning implementations in this way may prove questionable. Nonetheless, it is interesting to see the technology gain support and enable fast inference on mobile platforms.
Discussion
What we have observed in the past several years, and the past decade, is a massive proliferation of deep learning technologies. However, this exciting technology has been largely under the exclusive control of a single hardware manufacturer and the software companies that this monopolizing force chooses to work with. The AlexNet results from 2012 make evident that this deep learning revolution, and revolution in computing generally, is really a hardware revolution – a hardware utilization revolution. As is often the case with new technologies, tools that are useful to the common hobbyist or small business are at odds with the interests of big industry. Take, for example, John Deere tractors or Ford F-150 trucks compared to the basic and fundamental designs of a tractor or truck that the owner can easily understand and repair. In the case of deep learning, open-core software stacks are the documented, repairable tools and equipment, while the current fractured landscape is the mass-produced, patent-enforced product of the factory. Even setting aside the philosophy of the problem, we must recognize that this dichotomy exists in technics and civilization. Certainly there is something compelling about being able to run your favorite deep learning technologies right on the processor of your smartphone or laptop, deep in the wilderness, without any need for a network connection to a corporation’s servers, and with arguably less headache for the software developer.
Bibliography
Cristina Silvano, Daniele Ielmini, Fabrizio Ferrandi, Leandro Fiorin,
Serena Curzel, Luca Benini, Francesco Conti, Angelo Garofalo, Cristian
Zambelli, Enrico Calore, Sebastiano Fabio Schifano, Maurizio Palesi,
Giuseppe Ascia, Davide Patti, Stefania Perri, Nicola Petra, Davide De
Caro, Luciano Lavagno, Teodoro Urso, Valeria Cardellini, Gian Carlo
Cardarilli, and Robert Birke. 2023. A Survey on Deep Learning Hardware
Accelerators for Heterogeneous HPC Platforms. Retrieved October 30, 2023
from http://arxiv.org/abs/2306.15552
Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang,
Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. 2021. The Deep
Learning Compiler: A Comprehensive Survey. IEEE Trans. Parallel Distrib.
Syst. 32, 3 (March 2021), 708–727.
DOI:https://doi.org/10.1109/TPDS.2020.3030548
Han Cai, Ji Lin, Yujun Lin, Zhijian Liu, Haotian Tang, Hanrui Wang,
Ligeng Zhu, and Song Han. 2022. Enable Deep Learning on Mobile Devices:
Methods, Systems, and Applications. ACM Trans. Des. Autom. Electron.
Syst. 27, 3 (May 2022), 1–50. DOI:https://doi.org/10.1145/3486618
Robert David, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries,
Jian Li, Nick Kreeger, Ian Nappier, Meghna Natraj, Shlomi Regev, Rocky
Rhodes, Tiezhen Wang, and Pete Warden. 2021. TensorFlow Lite Micro:
Embedded Machine Learning on TinyML Systems. Retrieved October 30, 2023
from http://arxiv.org/abs/2010.08678
Daniel Snider and Ruofan Liang. 2023. Operator Fusion in XLA: Analysis
and Evaluation. Retrieved October 30, 2023 from
http://arxiv.org/abs/2301.13062
ggml.ai. Retrieved October 30, 2023 from http://ggml.ai/