HEYINTELLIGENCE used an AMD Radeon RX 7900XTX GPU to optimize ChatGLM2-6B inference
HEYINTELLIGENCE provides high-performance AI solutions in both hardware and software
Optimized kernels were used in ChatGLM2-6B: RMSNorm, MatMul fused with Rotary Embedding, MatMul fused with SwiGLU, and Decoding Attention
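As a rough illustration of one of these kernels, here is a minimal reference implementation of RMSNorm in NumPy. This is a sketch of the operation's math only, not HEYINTELLIGENCE's actual GPU kernel; the function name and epsilon value are assumptions.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-5):
    # RMSNorm normalizes by the root-mean-square of the features,
    # without the mean subtraction used in LayerNorm
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
w = np.ones(4)
y = rms_norm(x, w)
```

A fused GPU kernel would compute the reduction and the scaling in one pass over the data to avoid extra round trips to memory.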
ChatGLM2-6B is an open-source large language model (LLM) focused on bilingual conversations in Chinese and English
Quantization can reduce GPU memory consumption and allow more concurrent users in LLM applications
GCT computes the per-channel 8-bit weights using the SmoothQuant method and saves the FP16 scale values to a file, because performing the scaling multiplication at load time consumes significant CPU
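A minimal sketch of per-channel symmetric int8 weight quantization with FP16 scales saved to disk. Note that the full SmoothQuant method additionally migrates activation outliers into the weights before quantizing; that step, and all names and file layouts below, are illustrative assumptions rather than GCT's actual implementation.

```python
import os
import tempfile
import numpy as np

def quantize_per_channel(w_fp16):
    # Symmetric per-channel quantization: one FP16 scale per output channel,
    # scale = max(|w|) / 127, so weights map into the int8 range [-127, 127]
    scales = np.abs(w_fp16).max(axis=1, keepdims=True).astype(np.float32) / 127.0
    scales = np.maximum(scales, 1e-8)  # guard against all-zero channels
    q = np.clip(np.round(w_fp16.astype(np.float32) / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float16)

w = np.random.randn(4, 8).astype(np.float16)
q, scales = quantize_per_channel(w)

# Precomputing and saving the scales offline avoids repeating the
# scaling multiplication every time the model is loaded
out_dir = tempfile.mkdtemp()
np.save(os.path.join(out_dir, "weights_int8.npy"), q)
np.save(os.path.join(out_dir, "scales_fp16.npy"), scales)

# Dequantization at inference time is a single per-channel multiply
w_dequant = q.astype(np.float16) * scales
```

Storing int8 weights with FP16 scales cuts weight memory roughly in half compared with FP16, at the cost of a small, bounded quantization error per channel.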
HEYINTELLIGENCE has accumulated extensive expertise in the practical deployment of AI models and hardware platforms