ICSA Faculty Talk - 23/02/2023

Title: Designing Scalable Systems for Machine Learning

Abstract: In the era of AI, managing the vast amounts of data and model checkpoints required by machine learning systems is a significant challenge. To address it, we build on the distinctive data access characteristics of machine learning workloads, such as data sparsity, irregularity, and locality, as well as the heterogeneous nature of AI servers with GPU-NUMA architectures. In this talk, we will introduce two machine learning systems, Ekko and Quiver, which leverage these characteristics to achieve scalability. Both Ekko and Quiver have been adopted by leading AI practitioners and benefit billions of users each day. Moreover, we will discuss how these design principles can be applied to support emerging machine learning models, including mixture-of-experts and large language models.