FFN Fusion, a technique that parallelizes sequences of Feed Forward Network (FFN) layers in Large Language Models (LLMs), drastically reduces inference latency and processing cost
By deriving Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base) from Llama-3.1-405B-Instruct, the authors demonstrate the efficacy of FFN Fusion
The Feed Forward Network (FFN) Fusion concept targets sequences of consecutive FFN layers, which are frequently exposed when the Puzzle neural architecture search framework removes attention layers
By merging these FFN layers into a single wider layer, the methodology permits parallel execution across several GPUs while preserving model functionality
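As a concrete illustration, here is a minimal sketch of the fusion step, assuming PyTorch and SwiGLU-style FFN blocks as used in Llama; the names LlamaStyleFFN and fuse_ffns are illustrative, not taken from the paper's code. The fused layer computes the sum of the original FFN outputs in one pass, which approximates their sequential application when the layers depend only weakly on one another:

```python
# Minimal sketch of FFN Fusion for SwiGLU-style FFN blocks (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LlamaStyleFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

def fuse_ffns(ffns):
    """Fuse n consecutive FFNs into one wider FFN whose output is the sum of the originals."""
    d_model = ffns[0].gate_proj.in_features
    d_ff_total = sum(f.gate_proj.out_features for f in ffns)
    fused = LlamaStyleFFN(d_model, d_ff_total)
    with torch.no_grad():
        # Stack gate/up weights along the intermediate dimension ...
        fused.gate_proj.weight.copy_(torch.cat([f.gate_proj.weight for f in ffns], dim=0))
        fused.up_proj.weight.copy_(torch.cat([f.up_proj.weight for f in ffns], dim=0))
        # ... and down-projection weights along their input dimension,
        # so that fused(x) == sum_i FFN_i(x).
        fused.down_proj.weight.copy_(torch.cat([f.down_proj.weight for f in ffns], dim=1))
    return fused

# Sanity check: one wide parallel FFN reproduces the sum of the individual FFN outputs.
ffns = [LlamaStyleFFN(64, 256) for _ in range(4)]
x = torch.randn(2, 64)
fused = fuse_ffns(ffns)
assert torch.allclose(fused(x), sum(f(x) for f in ffns), atol=1e-4)
```

Because the fused block is a single pair of large matrix multiplications rather than several small sequential ones, it removes inter-layer synchronization points and keeps GPU hardware better utilized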
FFN Fusion can reduce memory footprint, dramatically lower inference latency, and cut per-token cost
With Ultra-253B-Base, a highly capable model derived from Llama-405B, the study demonstrates the practical usefulness of FFN Fusion
FFN Fusion in Ultra-253B-Base: 49 of the 50 consecutive FFN layers were fused into four larger FFN layers
Inference latency improves by 1.71x and per-token cost drops by 35x compared with Llama-405B on a single NVIDIA H100 node at batch size 32
FFN Fusion worked well on a 49B derivative of Llama-70B, with incremental fusion stages exhibiting a trade-off between accuracy and latency reduction
FFN Fusion is orthogonal to quantization and pruning methods, indicating that combining these techniques could yield multiplicative efficiency gains
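A rough illustration of that composability (not the paper's setup): the fused block from the sketch above is an ordinary stack of linear layers, so standard post-training quantization such as PyTorch's dynamic int8 quantization applies to it unchanged:

```python
# Composing FFN Fusion with quantization (illustrative, not the paper's setup).
# Reuses LlamaStyleFFN and fuse_ffns from the earlier sketch.
import torch
from torch.ao.quantization import quantize_dynamic

fused = fuse_ffns([LlamaStyleFFN(64, 256) for _ in range(4)])
quantized = quantize_dynamic(
    fused, {torch.nn.Linear}, dtype=torch.qint8  # int8 weights, dynamic activation scales
)
print(quantized(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```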
To effectively benefit from parallel execution, fused FFN layers must be implemented efficiently with the right hardware and software support
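One reason the hardware/software pairing matters: a single wide fused FFN shards naturally under Megatron-style tensor parallelism, with the intermediate dimension split across GPUs and a single reduction combining the partial outputs. Below is a minimal CPU-only simulation of that sharding; shard_ffn and parallel_ffn are hypothetical names, not the paper's implementation:

```python
# Why a fused, wide FFN maps well onto tensor parallelism (CPU simulation of the sharding).
import torch
import torch.nn.functional as F

def shard_ffn(gate_w, up_w, down_w, num_gpus):
    """Split the (wide) intermediate dimension across devices."""
    return list(zip(gate_w.chunk(num_gpus, dim=0),
                    up_w.chunk(num_gpus, dim=0),
                    down_w.chunk(num_gpus, dim=1)))

def parallel_ffn(x, shards):
    # Each "GPU" computes its slice of the intermediate activations independently;
    # a single all-reduce (here: a plain sum) combines the partial outputs.
    partial_outputs = [(F.silu(x @ g.T) * (x @ u.T)) @ d.T for g, u, d in shards]
    return sum(partial_outputs)

d_model, d_ff = 64, 1024  # d_ff is the fused (wide) intermediate size
gate_w, up_w = torch.randn(d_ff, d_model), torch.randn(d_ff, d_model)
down_w = torch.randn(d_model, d_ff)
x = torch.randn(2, d_model)

reference = (F.silu(x @ gate_w.T) * (x @ up_w.T)) @ down_w.T
assert torch.allclose(parallel_ffn(x, shard_ffn(gate_w, up_w, down_w, 4)), reference, atol=1e-3)
```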
FFN Fusion offers a promising method to greatly increase the inference efficiency of large language models
The development of Ultra-253B-Base provides a powerful illustration of the practical application of this method