The latest DirectML accelerates AWQ-based language models on AMD GPUs
Minimize Memory Usage and Enhance Performance While Running LLMs on AMD Ryzen AI and Radeon Platforms
Overview of 4-bit quantization
Over the past year, AMD and Microsoft have collaborated to accelerate generative AI workloads on AMD systems using ONNX Runtime with DirectML.
The growing number of LLM parameters (7B, 13B, 70B, and beyond) greatly increases system memory consumption, making these workloads difficult to manage.
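To make the scale concrete, here is a small back-of-the-envelope sketch. It only counts the weights themselves, not the KV cache or activations, and the parameter counts are the ones mentioned above:

```python
# Back-of-the-envelope weight memory, in GiB: parameter count x bytes per weight.
def weight_footprint_gib(params_billion, bits_per_weight):
    return params_billion * 1e9 * bits_per_weight / 8 / (1024 ** 3)

for size in (7, 13, 70):
    print(f"{size}B at fp16: {weight_footprint_gib(size, 16):.1f} GiB")
# 7B ~= 13.0 GiB, 13B ~= 24.2 GiB, 70B ~= 130.4 GiB -- weights alone,
# before the KV cache and activations are counted.
```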
Microsoft and AMD are thrilled to offer AWQ-based language model (LM) acceleration on AMD GPU architectures in the newest DirectML and AMD driver preview.
Where possible, activation-aware weight quantization (AWQ) reduces weights to 4 bits without impacting accuracy, which significantly decreases an LLM's memory footprint and boosts speed.
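As a rough illustration of the storage side of this, the sketch below applies plain group-wise 4-bit quantization to a flat weight vector. Real AWQ additionally rescales the most activation-sensitive channels before this step, which is how it preserves accuracy; the group size of 128 is an assumption, not a figure from the article:

```python
import numpy as np

def quantize_4bit_groups(w, group_size=128):
    """Group-wise asymmetric 4-bit quantization of a flat fp32 weight vector.

    Assumes len(w) is a multiple of group_size. Real AWQ also rescales the
    most activation-sensitive channels beforehand to protect accuracy.
    """
    w = np.asarray(w, dtype=np.float32).reshape(-1, group_size)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / 15.0                  # 16 levels: 0..15
    zero = np.clip(np.rint(-lo / scale), 0, 15).astype(np.uint8)
    q = np.clip(np.rint(w / scale + zero), 0, 15).astype(np.uint8)
    packed = ((q[:, 0::2] << 4) | q[:, 1::2]).astype(np.uint8)  # two codes per byte
    return packed, scale.astype(np.float16), zero                # per-group metadata
```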
At runtime, AMD driver-resident ML layers dequantize the parameters and accelerate them on ML hardware to increase AMD Radeon GPU performance.
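Conceptually, this runtime step is the inverse of the packing above: unpack the 4-bit codes and rebuild higher-precision weights from the stored per-group scale and zero point. The NumPy sketch below shows that logic; it is only an illustration, not the driver's actual implementation:

```python
import numpy as np

def dequantize_4bit_groups(packed, scale, zero):
    """Rebuild weights from packed 4-bit codes: w ~= (q - zero) * scale."""
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.float32)
    q[:, 0::2] = (packed >> 4) & 0x0F     # high nibble: even-indexed codes
    q[:, 1::2] = packed & 0x0F            # low nibble: odd-indexed codes
    return ((q - zero) * scale).astype(np.float16).reshape(-1)
```

Round-tripping a weight vector through these two sketches reproduces it to within the 4-bit step size of each group, which is the error budget AWQ's activation-aware scaling spends on the least important channels.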
This 4-bit AWQ quantization is carried out using the Microsoft Olive toolchain for DirectML.
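The article does not show the workflow itself. As a very rough sketch of how an Olive-driven quantization run is typically wired up, here is a minimal Python example; the model name, the "AwqQuantizer" pass type, and its options are placeholders rather than confirmed Olive identifiers, so the real names and settings should be taken from the Olive documentation:

```python
# Illustrative sketch only: the pass name and options below are placeholders,
# not verified Olive identifiers; consult the Olive docs for the real ones.
from olive.workflows import run as olive_run

workflow = {
    "input_model": {
        # Hypothetical Hugging Face model, used purely as an example.
        "type": "PyTorchModel",
        "config": {"hf_config": {"model_name": "meta-llama/Llama-2-7b-hf"}},
    },
    "passes": {
        # Placeholder AWQ pass that would emit 4-bit weights for DirectML.
        "awq_4bit": {"type": "AwqQuantizer", "config": {"bits": 4, "group_size": 128}},
    },
    "engine": {
        # ONNX Runtime's DirectML execution provider.
        "execution_providers": ["DmlExecutionProvider"],
    },
}

olive_run(workflow)
```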
This method makes it possible to run language models (LMs) on devices with limited memory.
Language models whose 16-bit weights had a memory footprint suited only to machines with an AMD Radeon RX 7900 XTX can now run on AMD Ryzen AI platforms with the integrated AMD Radeon 780M.
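A hedged comparison for a hypothetical 7B-parameter model (the model size, group size of 128, and fp16 per-group scales are assumptions, so treat the numbers as rough orders of magnitude) shows why the 4-bit footprint fits an integrated GPU while the 16-bit footprint effectively calls for a large discrete GPU:

```python
# Hypothetical 7B-parameter model; numbers are rough orders of magnitude.
params = 7e9
gib = 1024 ** 3
fp16_weights = params * 2 / gib                   # ~13.0 GiB of fp16 weights
awq4_weights = params * 0.5 / gib                 # ~3.3 GiB of packed 4-bit codes
awq4_weights += (params / 128) * 2 / gib          # + one fp16 scale per 128-weight group
print(f"fp16: {fp16_weights:.1f} GiB, AWQ 4-bit: {awq4_weights:.1f} GiB")
# ~13 GiB of fp16 weights effectively calls for a large discrete GPU such as
# the Radeon RX 7900 XTX, while ~3.4 GiB of 4-bit weights fits in the shared
# system memory an integrated Radeon 780M can address.
```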
For more details, visit Govindhtech.com.