Feed Forward Network (FFN) Fusion

FFN Fusion, an architectural optimization that parallelizes sequences of Feed Forward Network (FFN) layers in Large Language Models (LLMs), drastically reduces inference latency and processing cost

By deriving Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base) from Llama-3.1-405B-Instruct, the authors show the efficacy of FFN Fusion

The Feed Forward Network (FFN) Fusion concept targets sequences of consecutive FFN layers, opportunities that frequently arise after the Puzzle neural architecture search framework removes attention layers

By fusing consecutive FFN layers into a single wider layer, the methodology permits parallel execution across several GPUs while maintaining model functionality
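As a minimal sketch of the core idea (assuming a SwiGLU-style FFN as used in Llama models; the class and function names below are illustrative, not the authors' code), two consecutive FFN layers can be fused by concatenating their projection matrices, so that the fused layer computes the sum of both FFNs applied to the same input:

```python
# Minimal sketch of FFN Fusion for two SwiGLU FFN layers (illustrative,
# not the authors' implementation). Each FFN computes
#   FFN(x) = W_down (SiLU(W_gate x) * (W_up x))
# and the fused layer computes FFN1(x) + FFN2(x) as a single, wider FFN.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwigluFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

def fuse_ffns(ffn_a: SwigluFFN, ffn_b: SwigluFFN) -> SwigluFFN:
    """Concatenate gate/up projections along the hidden dimension and down
    projections along the input dimension, so the fused FFN's output equals
    ffn_a(x) + ffn_b(x)."""
    d_model = ffn_a.gate.in_features
    d_ff = ffn_a.gate.out_features + ffn_b.gate.out_features
    fused = SwigluFFN(d_model, d_ff)
    with torch.no_grad():
        fused.gate.weight.copy_(torch.cat([ffn_a.gate.weight, ffn_b.gate.weight], dim=0))
        fused.up.weight.copy_(torch.cat([ffn_a.up.weight, ffn_b.up.weight], dim=0))
        fused.down.weight.copy_(torch.cat([ffn_a.down.weight, ffn_b.down.weight], dim=1))
    return fused

# Sanity check: the fused layer matches the sum of the two parallel FFNs.
x = torch.randn(4, 512)
ffn1, ffn2 = SwigluFFN(512, 2048), SwigluFFN(512, 2048)
fused = fuse_ffns(ffn1, ffn2)
assert torch.allclose(fused(x), ffn1(x) + ffn2(x), atol=1e-4)
```

Because the fused layer is just one wider FFN, a single kernel launch (or a single sharded matmul) replaces what used to be several strictly sequential layer computations.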

FFN Fusion can reduce memory footprint, dramatically reduce inference latency, and cut per-token cost

Ultra-253B-Base, a highly capable model developed from Llama-3.1-405B-Instruct, serves as the study's main demonstration of FFN Fusion at scale

FFN Fusion: 49 of the 50 consecutive FFN layers were fused into four larger FFN layers
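As an illustrative sketch of that replacement (stand-in modules and arbitrary group sizes; the paper's actual grouping was chosen empirically, and one of the 50 layers was left unfused), a long run of consecutive residual FFN blocks of the form x <- x + FFN_i(x) can be collapsed into a handful of fused blocks, each applying its member FFNs to the same input and summing the results:

```python
# Illustrative sketch (not the authors' code): replacing a long run of
# consecutive FFN blocks x <- x + FFN_i(x) with a few fused blocks. The
# parallel form approximates the original sequential computation; the
# study reports that model functionality is largely preserved.
from typing import Callable, Sequence
import torch

def fused_block(ffns: Sequence[Callable[[torch.Tensor], torch.Tensor]]):
    """Return a block that runs the given FFNs on the same input in parallel."""
    def block(x: torch.Tensor) -> torch.Tensor:
        return x + sum(ffn(x) for ffn in ffns)
    return block

# e.g. 48 stand-in FFNs partitioned into 4 fused blocks of 12 (group sizes
# here are arbitrary and purely for illustration).
ffns = [torch.nn.Linear(64, 64) for _ in range(48)]
groups = [ffns[i:i + 12] for i in range(0, 48, 12)]
blocks = [fused_block(g) for g in groups]

x = torch.randn(2, 64)
for blk in blocks:
    x = blk(x)
```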

Inference is 1.71x faster and per-token cost is 35x lower than Llama-3.1-405B-Instruct on a single NVIDIA H100 node at batch size 32

FFN Fusion also worked well on a 49B derivative of Llama-70B, where incremental fusion stages exhibited a trade-off between accuracy and latency reduction

FFN Fusion can be used independently of quantization and pruning methods, indicating that combining them could result in multiplicative efficiency increases
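As a hedged illustration that the techniques compose (hand-rolled symmetric int8 weight-only quantization for demonstration; not tied to any particular quantization library or to the paper's setup), a weight matrix produced by FFN Fusion can be quantized like any other weight matrix:

```python
# Sketch of combining FFN Fusion with weight-only int8 quantization
# (hand-rolled for illustration; names and shapes are assumptions).
import torch

def quantize_per_channel(weight: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a weight matrix."""
    scale = weight.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """Dequantize on the fly and apply the linear projection."""
    return x @ (q.float() * scale).t()

# A fused down-projection (as produced by FFN Fusion) is quantized exactly
# like any other weight -- the two techniques are orthogonal.
w_fused_down = torch.randn(512, 4096) * 0.02   # stand-in fused weight
q, scale = quantize_per_channel(w_fused_down)
x = torch.randn(4, 4096)
print((int8_linear(x, q, scale) - x @ w_fused_down.t()).abs().max())
```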

To effectively benefit from parallel execution, fused FFN layers must be implemented efficiently with the right hardware and software support
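A rough sketch of why fused FFNs map naturally onto tensor parallelism (simulated on a single device; shapes and names are illustrative assumptions): because the sub-FFNs inside a fused layer are independent, the wider hidden dimension can be sharded across GPUs with a single reduction at the end, instead of a synchronization after every original layer:

```python
# Sharding a fused SwiGLU FFN across "devices" (simulated on CPU; in
# practice each shard would live on its own GPU and the final sum would
# be an all-reduce). Shapes are illustrative.
import torch
import torch.nn.functional as F

d_model, d_ff_fused, n_shards = 512, 8192, 4

gate = torch.randn(d_ff_fused, d_model) * 0.02
up = torch.randn(d_ff_fused, d_model) * 0.02
down = torch.randn(d_model, d_ff_fused) * 0.02

# Shard gate/up along the hidden dimension and down along its input dimension.
gate_shards = gate.chunk(n_shards, dim=0)
up_shards = up.chunk(n_shards, dim=0)
down_shards = down.chunk(n_shards, dim=1)

x = torch.randn(4, d_model)

# Each shard computes its partial output independently; one reduction at the end.
partials = [
    (F.silu(x @ g.t()) * (x @ u.t())) @ d.t()
    for g, u, d in zip(gate_shards, up_shards, down_shards)
]
y_parallel = sum(partials)

# Reference: the unsharded fused FFN.
y_ref = (F.silu(x @ gate.t()) * (x @ up.t())) @ down.t()
assert torch.allclose(y_parallel, y_ref, atol=1e-4)
```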

FFN Fusion offers a promising method to greatly increase the inference efficiency of large language models

The development of Ultra-253B-Base provides a powerful illustration of the practical application of this method