Researchers at ETH Zurich have developed a technique that can significantly boost the inference speed of neural networks. They demonstrated that altering the inference process can drastically cut a network's computational requirements: in experiments on BERT, a transformer model used in a variety of language tasks, they reduced computations by more than 99%. The technique can also be applied to the transformer models behind large language models (LLMs) such as GPT-3, opening the door to faster, more efficient language processing.

Transformers, the neural networks underpinning LLMs, are composed of several kinds of layers, including attention layers and feedforward layers. The feedforward layers account for a substantial share of the model's parameters and are computationally demanding because every neuron must be multiplied with every input dimension.

The researchers' paper shows, however, that not all neurons in the feedforward layers need to be active during inference for every input. They propose "fast feedforward" (FFF) layers as a replacement for traditional feedforward layers. FFF relies on a mathematical operation called conditional matrix multiplication (CMM), which replaces the dense matrix multiplication (DMM) used by conventional feedforward networks.
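To make the DMM-versus-CMM contrast concrete, here is a minimal NumPy sketch of the idea: a dense feedforward layer evaluates every hidden neuron for every input, while a tree-routed "fast feedforward" layer uses a handful of decision neurons to pick a single leaf neuron per input. The layer sizes, the sign-based routing rule, and the names (`dense_ff`, `fast_ff`, `W_node`, `W_leaf`) are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 64, 64
depth = 6                  # tree depth (illustrative); 2**depth - 1 decision neurons
n_hidden = 2 ** depth      # hidden width of the dense baseline, = number of leaves

# Dense feedforward layer: every hidden neuron fires for every input (DMM).
W_dense = rng.standard_normal((n_hidden, d_in))
V_dense = rng.standard_normal((n_hidden, d_out))

def dense_ff(x):
    h = np.maximum(W_dense @ x, 0.0)   # ReLU over all n_hidden neurons
    return h @ V_dense

# Fast feedforward layer (sketch): decision neurons route the input down a
# binary tree, so only `depth` decisions plus one leaf neuron run per input.
W_node = rng.standard_normal((2 ** depth - 1, d_in))  # one neuron per internal node
W_leaf = rng.standard_normal((n_hidden, d_in))
V_leaf = rng.standard_normal((n_hidden, d_out))

def fast_ff(x):
    node = 0
    for _ in range(depth):
        # Sign of the decision neuron's output picks the left or right child
        # (heap indexing: children of node i are 2i+1 and 2i+2).
        go_right = (W_node[node] @ x) > 0.0
        node = 2 * node + (2 if go_right else 1)
    leaf = node - (2 ** depth - 1)     # index among the leaves
    h = max(W_leaf[leaf] @ x, 0.0)     # only this one leaf neuron is evaluated
    return h * V_leaf[leaf]

x = rng.standard_normal(d_in)
y_dense, y_fast = dense_ff(x), fast_ff(x)
```

In this toy setup the dense layer evaluates 64 hidden neurons per input, while the conditional version evaluates only 7 (6 decisions plus 1 leaf); the gap widens exponentially with tree depth, which is what makes the reported savings plausible at scale.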
Full research: New technique can accelerate language models by 300x.