As machine learning is deployed in increasingly diverse applications, ranging from autonomous drones and IoT edge devices to self-driving vehicles, specialized computing architectures and platforms are emerging as alternatives to CPUs and GPUs to meet the energy, cost, and performance (throughput/latency) requirements these applications impose.
This tutorial starts with an overview of the compute and data complexity of deep neural networks (DNNs), the underlying operations, and how these can be realized on Application-Specific Integrated Circuits (ASICs) or Field-Programmable Gate Arrays (FPGAs). Exploiting the underlying parallelism in DNNs requires large computational arrays and high-bandwidth memory accesses for weights, feature maps, and inter-layer communication. These arrays, consisting of adders, multipliers, and square-root and division circuits, consume expensive chip real estate. Memory accesses, necessary to store network parameters and processed data, impose high bandwidth requirements, necessitating both on-chip memory and high-bandwidth off-chip memory interconnects. The tutorial discusses algorithm-hardware co-design, starting with benchmarking metrics and energy-driven DNN models, and covers a number of hardware optimizations, including reduction of parameter counts and floating-point operations, network pruning and compression, and data-size reduction. The power and latency costs of memory accesses have prompted new near-memory and in-memory computing architectures, which reduce energy cost by embedding computation in memory structures.
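To make these optimizations concrete, the following is a minimal NumPy sketch (not drawn from the tutorial materials) of two of the techniques named above: magnitude-based weight pruning and uniform int8 quantization, alongside a multiply-accumulate (MAC) count for a convolutional layer. The layer shape, the 50% sparsity target, and all function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_macs(h_out, w_out, c_in, c_out, k):
    """MAC count of one conv layer: each of the h_out*w_out*c_out
    outputs needs k*k*c_in multiply-accumulates."""
    return h_out * w_out * c_out * k * k * c_in

def magnitude_prune(w, sparsity=0.5):
    """Zero out the smallest-magnitude weights until `sparsity` of them are zero."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0, w)

def quantize_int8(w):
    """Symmetric uniform quantization of float32 weights to int8."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Illustrative example: a 3x3 conv layer, 64 -> 128 channels, 56x56 output map.
w = rng.standard_normal((128, 64, 3, 3)).astype(np.float32)
print("MACs:  ", conv_macs(56, 56, 64, 128, 3))   # ~231M MACs
print("params:", w.size)                          # 73,728 weights

pruned = magnitude_prune(w, sparsity=0.5)
print("nonzero after pruning:", np.count_nonzero(pruned))  # ~50% of original

q, scale = quantize_int8(pruned)
print("size fp32 -> int8 (bytes):", w.nbytes, "->", q.nbytes)  # 4x smaller
recovered = q.astype(np.float32) * scale  # dequantize for use in computation
```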
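The dominance of memory-access energy can likewise be illustrated with a back-of-the-envelope estimate. The sketch below uses per-operation energy figures commonly cited from M. Horowitz, "Computing's Energy Problem," ISSCC 2014 (45 nm); the layer (the ~231M-MAC convolution from the previous sketch) and the worst-case assumption that every operand is fetched from off-chip DRAM rather than on-chip SRAM are illustrative, not the tutorial's own numbers.

```python
# Rough per-operation energies (pJ), 45 nm, as cited from Horowitz, ISSCC 2014.
E_MAC_FP32 = 4.6     # 32-bit float multiply (3.7 pJ) + add (0.9 pJ)
E_DRAM_32B = 640.0   # 32-bit off-chip DRAM access
E_SRAM_32B = 5.0     # 32-bit access to a small on-chip SRAM

macs = 231_211_008   # 3x3 conv, 64 -> 128 channels, 56x56 output (see above)

# Worst case with no data reuse: 3 operand reads + 1 partial-sum write per MAC.
e_compute = macs * E_MAC_FP32
e_dram = macs * 4 * E_DRAM_32B   # all traffic served by DRAM
e_sram = macs * 4 * E_SRAM_32B   # same traffic kept in on-chip SRAM

print(f"compute:           {e_compute / 1e9:7.2f} mJ")
print(f"memory (all DRAM): {e_dram / 1e9:7.2f} mJ  ({e_dram / e_compute:.0f}x compute)")
print(f"memory (all SRAM): {e_sram / 1e9:7.2f} mJ  ({e_sram / e_compute:.1f}x compute)")
```

Under these assumptions, unbuffered DRAM traffic costs hundreds of times more energy than the arithmetic itself, which is the motivation for on-chip buffering, data reuse, and the near-memory and in-memory architectures discussed above.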
This tutorial is targeted at both system designers and machine learning practitioners who want to understand the underlying architectures and hardware implementations of machine learning systems, as well as the performance, power, and model-size tradeoffs involved in developing machine-learning-based hardware systems and algorithms.