Chip and software breakthrough makes AI ten times faster

A research team led by Dr Luo Mai from the School of Informatics has developed WaferLLM, a breakthrough software system that enables large language models to run up to ten times faster on wafer-scale chips. Tested at EPCC – the UK’s National Supercomputing Centre – the system dramatically improves inference speed and energy efficiency, paving the way for real-time AI in healthcare, finance, and scientific discovery.

A system has been developed that enables large language models (LLMs) to process information up to ten times faster than current AI systems, according to new research. 

The speed-up comes from new software that lets trained LLMs draw conclusions from new data – a process called inference – far more efficiently.
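
For readers new to the distinction, the sketch below shows what inference looks like in practice: a model that has already been trained generates output from a fresh prompt. It is a minimal illustration using the open-source Hugging Face transformers library with a hypothetical model name; it is not the WaferLLM system itself.

```python
# Minimal sketch of LLM inference: a trained model drawing conclusions
# from new input. Uses the open-source Hugging Face "transformers" library;
# the model name is a hypothetical choice and none of this is WaferLLM code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # hypothetical model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Inference: the trained model completes a prompt it has never seen before.
inputs = tokenizer("Wafer-scale chips are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```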

The breakthrough was made on the world's largest computer chip, known as a wafer-scale chip, which is roughly the size of a dinner plate.

The accelerated performance could have a major impact on industries that need LLMs to generate fresh insights in real time – in under a millisecond – such as chatbots, finance, healthcare, and scientific discovery, experts say.

After an AI model has been trained on vast amounts of data, most day-to-day inference is currently carried out by chips called graphics processing units (GPUs).

In recent years, there has been growing interest in how wafer-scale chips could be used for AI inference tasks, which require many simultaneous calculations and heavy memory use.

Wafer-scale chips differ from typical AI chips not only in size but also in how they operate. The larger chips are designed to carry out many computation tasks simultaneously within a single chip, aided by massive on-chip memory.

With all the computation taking place on the same piece of silicon, data can move between different parts of the chip much faster than if it had to travel between separate groups of chips and memory via a network.  

A wafer-scale chip can integrate hundreds of thousands of computation cores all working in parallel, making it exceptionally good at the mathematical operations that power neural networks – the backbone of LLMs.   
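
To make that concrete, the sketch below splits a single matrix multiplication – the core operation of a neural network layer – across several parallel workers, loosely mimicking how a wafer-scale chip spreads work over many cores. It is a highly simplified illustration, not how the chip or WaferLLM actually schedules computation.

```python
# Illustrative sketch: one matrix multiplication split across parallel
# "cores" (threads here), each computing its own block of rows.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 1024))  # e.g. a layer's weight matrix
x = rng.standard_normal((1024, 256))   # e.g. a batch of activations

def partial_matmul(rows):
    # Each worker multiplies only its own slice of the rows of A.
    return A[rows] @ x

blocks = np.array_split(np.arange(1024), 8)  # 8 stand-in "cores"
with ThreadPoolExecutor(max_workers=8) as pool:
    result = np.vstack(list(pool.map(partial_matmul, blocks)))

assert np.allclose(result, A @ x)  # identical to the single-worker product
```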

Yet the hardware's promise comes with software challenges. Wafer-scale chips need system software entirely different from that of today's AI systems – software that can intelligently run an AI model, coordinating enormous parallel computations and data movement across a huge number of processing cores.

Researchers at the University of Edinburgh have developed a software system called WaferLLM, designed specifically for wafer-scale chips. It improves their performance by getting the most out of the chips' parallel processing and on-chip memory, cutting latency – the time taken to respond to queries.

The team evaluated the software at EPCC, the UK’s National Supercomputing Centre based at the University of Edinburgh, which operates Europe's largest cluster of Cerebras' third-generation Wafer Scale Engine processors as part of the Edinburgh International Data Facility.  

A series of tests measured how the wafer-scale chips performed when running several LLMs, including LLaMA and Qwen. Running WaferLLM, the chips responded to queries about ten times faster – a measure known as latency – than a cluster of 16 GPUs.
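
Latency here simply means the wall-clock time from submitting a query to receiving the answer. A hedged sketch of how such a comparison might be timed is below; the generate function is a hypothetical stand-in for any inference backend, and no WaferLLM API is shown.

```python
# Illustrative latency measurement: average seconds from query to answer.
# `generate` is a hypothetical stand-in for any inference backend
# (a GPU cluster or a wafer-scale system); this is not the WaferLLM API.
import time

def mean_latency(generate, prompt, runs=10):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)  # one complete inference request
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# speedup = mean_latency(gpu_generate, p) / mean_latency(wafer_generate, p)
```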

The wafer-scale chips were also found to provide energy efficiency benefits. When operating at scale, the chips could be up to two times more energy-efficient in running LLMs compared with GPUs.   
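
In such comparisons, "more energy-efficient" typically means fewer joules spent per generated token. The arithmetic is simple, as the sketch below shows; every number in it is an invented placeholder, not a measured figure from the study.

```python
# Hypothetical arithmetic only: energy per generated token.
# Energy (joules) = average power (watts) * run time (seconds).
def joules_per_token(avg_power_watts, run_seconds, tokens):
    return avg_power_watts * run_seconds / tokens

gpu = joules_per_token(1000.0, 2.0, 500)    # invented placeholder figures
wafer = joules_per_token(1000.0, 1.0, 500)  # invented: same power, half time
print(f"efficiency gain: {gpu / wafer:.1f}x")  # 2.0x fewer joules per token
```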

WaferLLM acts as an exemplar of how to effectively design software to unlock the performance of wafer-scale chips, experts say. The team has published it as open-source software so others can design their own applications for use on wafer-scale chips.   

The findings from this work were peer reviewed and presented at the 2025 USENIX Symposium on Operating Systems Design and Implementation (OSDI).

“Wafer-scale computing has shown remarkable potential, but software has been the key barrier to putting it to work. With WaferLLM, we show that the right software design can unlock that potential, delivering real gains in speed and energy efficiency for large language models. This is a step toward a new generation of AI infrastructure – one that can support real-time intelligence in science, healthcare, education, and everyday life,” said Dr Luo Mai, who led the research.

“The Cerebras CS-3 systems are a unique resource at Edinburgh, allowing researchers to explore novel approaches to AI. Dr Mai’s work is truly ground-breaking and shows how the cost of inference can be massively reduced.”
