The Latest DirectML Accelerates AWQ-Based Language Models on AMD GPUs

Minimize memory usage and enhance performance while running LLMs on AMD Ryzen AI and Radeon platforms.

Overview of 4-bit quantization

Over the past year, AMD and Microsoft have collaborated to accelerate generative AI workloads on AMD systems using ONNX Runtime with DirectML.

The parameter counts of modern LLMs (7B, 13B, 70B, etc.) drive system memory consumption up sharply, making these workloads difficult to run on consumer hardware.

Microsoft and AMD are thrilled to offer AWQ-based language model acceleration on AMD GPU architectures in the newest DirectML and AMD driver preview.

Activation-aware Weight Quantization (AWQ) reduces weights to 4-bit precision with minimal impact on accuracy, significantly decreasing an LLM's memory footprint and boosting inference speed.
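To make the storage scheme concrete, here is a minimal sketch of group-wise 4-bit quantization and dequantization. This is a generic asymmetric uint4 scheme for illustration only, not the exact AWQ algorithm (AWQ additionally protects salient channels using activation statistics); the group size of 128 is an assumption, chosen as a common choice in practice.

```python
# Sketch: group-wise asymmetric 4-bit quantization (illustrative, not AWQ itself).
import numpy as np

GROUP = 128  # weights per quantization group (assumed; a common choice)

def quantize_4bit(w: np.ndarray):
    """Quantize a 1-D float array to uint4 codes with per-group scale/offset."""
    w = w.reshape(-1, GROUP)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                      # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1.0, scale)      # guard constant groups
    q = np.clip(np.round((w - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo):
    """Recover approximate floats from codes, scales, and offsets."""
    return (q.astype(np.float32) * scale + lo).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale, lo = quantize_4bit(w)
err = np.abs(dequantize_4bit(q, scale, lo) - w).max()
print(f"max abs reconstruction error: {err:.4f}")
```

Each group stores 4-bit codes plus one scale and one offset, so the per-weight cost is close to 4 bits, and the worst-case rounding error within a group is half a quantization step.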

Machine-learning layers resident in the AMD driver dequantize the parameters at runtime and accelerate them on ML hardware, increasing performance on AMD Radeon GPUs.

This 4-bit AWQ quantization is carried out using the Microsoft Olive toolchain for DirectML.
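For orientation, an Olive workflow is typically described by a JSON config that names an input model, a sequence of passes, and an output location. The fragment below is a hypothetical sketch only: the pass names, field names, and model path are illustrative assumptions and should be checked against the Olive documentation before use.

```json
{
  "input_model": {
    "type": "HfModel",
    "model_path": "meta-llama/Llama-2-7b-hf"
  },
  "passes": {
    "awq": { "type": "AutoAwqQuantizer" },
    "conversion": { "type": "OnnxConversion" }
  },
  "engine": {
    "output_dir": "models/llama2-7b-awq-dml"
  }
}
```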

This method makes it possible to run language models (LMs) on devices with limited memory.