Weight Only Quantization (WOQ)

WOQ reduces the size of neural network models, notably deep learning models, without compromising their functionality.

Quantization converts model weights and activations from float32 to lower-precision data types such as float16, INT8, or INT4.
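For intuition, here is a minimal NumPy sketch of symmetric per-tensor INT8 quantization; the weight tensor is illustrative, and production libraries typically use more sophisticated per-channel or group-wise schemes:

```python
import numpy as np

# Illustrative float32 weight tensor.
weights_fp32 = np.random.randn(4, 4).astype(np.float32)

# Symmetric INT8 quantization: map the float range onto [-127, 127]
# with a single per-tensor scale factor.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to recover an approximation of the original values.
weights_dequant = weights_int8.astype(np.float32) * scale

print("max absolute error:", np.abs(weights_fp32 - weights_dequant).max())
```

The INT8 tensor occupies a quarter of the memory of the float32 original, at the cost of a small, bounded rounding error.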

If your hardware supports these lower-precision data types, the performance gains can be substantial.

WOQ quantizes only the model weights while keeping the activations at their original precision, which helps preserve accuracy.

The Weight Only Quantization (WOQ) methods in Intel Extension for Transformers are used to quantize the upgraded Mistral-7B model, as sketched below.
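A minimal sketch, assuming the WeightOnlyQuantConfig API documented in the intel_extension_for_transformers README (class names have changed across releases); the model id below is a placeholder for the actual checkpoint, not taken from the original text:

```python
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import (
    AutoModelForCausalLM,
    WeightOnlyQuantConfig,
)

# Placeholder model id; substitute the upgraded Mistral-7B checkpoint.
model_id = "mistralai/Mistral-7B-v0.1"

# INT4 weights with INT8 compute: only the weights are quantized,
# activations keep their original precision.
woq_config = WeightOnlyQuantConfig(weight_dtype="int4", compute_dtype="int8")

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=woq_config)

inputs = tokenizer("What is weight-only quantization?", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```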

GGUF is a binary file format created expressly to store deep learning models such as LLMs, with a particular focus on CPU inference.

To execute models in GGUF format, you generally need an additional library such as llama.cpp or its Python bindings, llama-cpp-python.
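For example, loading and querying a GGUF model with llama-cpp-python looks roughly like this; the model file path is a placeholder for whatever quantized GGUF file you have downloaded locally:

```python
from llama_cpp import Llama

# Placeholder path to a locally downloaded GGUF file.
llm = Llama(model_path="./mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

# Run a completion entirely on CPU.
output = llm("Q: What is the GGUF format? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```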