vLLM V1 Engine Strengthens LLM Serving On Intel GPUs

vLLM is a fast and easy-to-use library for LLM inference and serving. It has grown into a community-driven project that incorporates contributions from both industry and academia.
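To give a feel for that ease of use, here is a minimal offline-inference sketch using vLLM's Python API. The model name and sampling settings are placeholders, not values from this article.

```python
from vllm import LLM, SamplingParams

# Load a model and generate text in a few lines (placeholder model name).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

for output in llm.generate(["What is vLLM?"], params):
    print(output.outputs[0].text)
```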

The V1 engine optimises inter-token latency (ITL) and GPU utilisation by batching compute-bound (prefill) and memory-bound (decode) requests together and prioritising decode. On Intel GPUs, the vLLM V1 engine executes models through Intel Extension for PyTorch kernels.
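The scheduling idea can be pictured with a small sketch. The code below is purely illustrative and is not vLLM's actual scheduler: in each engine step, running (decode) requests are served first, and waiting (prefill) requests are admitted into whatever token budget remains, so both kinds of work share a batch. The 8192-token budget is an arbitrary example value.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    prompt_tokens: int   # tokens still to be prefilled
    generated: int = 0   # tokens decoded so far


def schedule_step(running: list, waiting: deque, token_budget: int = 8192):
    """Pick work for one engine step: decodes first, prefills into the leftover budget."""
    batch = []
    # 1. Decode-bound requests: one new token each, scheduled first to keep ITL low.
    for req in running:
        if token_budget == 0:
            break
        batch.append((req, 1))
        token_budget -= 1
    # 2. Prefill-bound requests: admit waiting prompts while the budget allows.
    while waiting and waiting[0].prompt_tokens <= token_budget:
        req = waiting.popleft()
        batch.append((req, req.prompt_tokens))
        token_budget -= req.prompt_tokens
        running.append(req)
    return batch
```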

Speculative decoding in vLLM is a technique for reducing inter-token latency during LLM inference: a small, fast draft model predicts several future tokens, which the target model then verifies.
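Enabling it looks roughly like the sketch below. The argument names are assumptions that vary by vLLM release (older versions take speculative_model and num_speculative_tokens directly; newer V1 releases take a speculative_config dict), and both model names are placeholders.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # small, fast draft model (placeholder)
        "num_speculative_tokens": 4,                  # draft tokens proposed per step
    },
)

out = llm.generate(["Explain speculative decoding in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```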

This additional memory capacity improves throughput, since the extra space lets vLLM process longer context lengths for individual requests or handle larger batches of concurrent requests.
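In vLLM these trade-offs are exposed through standard engine arguments; the values below are illustrative, not tuned numbers from this benchmark.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    gpu_memory_utilization=0.90,  # share of device memory vLLM may claim (weights + KV cache)
    max_model_len=8192,           # longer per-request context consumes more KV-cache blocks
    max_num_seqs=64,              # upper bound on concurrently scheduled requests
)
```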

The ranchlai/chatglm3-6B-gptq-4bit model has known issues: its Transformers implementation is not compatible with vLLM, and ChatGLMForConditionalGeneration has no vLLM implementation.

Accuracy testing with the run-lm-eval-gsm-vllm-baseline.sh script is not supported in the docker image referenced in this blog.

Memory usage for AWQ models is higher than the model size: the casperhansen/llama-3-8b-instruct-awq model, whose weights total 5.74 GB, used 8.6 GB of memory.
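A hedged sketch of loading that checkpoint is shown below. The gap between weight size and runtime usage comes from vLLM pre-allocating KV-cache blocks and working buffers up to its memory-utilisation target; the 0.80 value here is only an example.

```python
from vllm import LLM

llm = LLM(
    model="casperhansen/llama-3-8b-instruct-awq",
    quantization="awq",            # usually auto-detected from the checkpoint config
    gpu_memory_utilization=0.80,   # lower this to cap total device-memory usage
)
```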

Intel tested vLLM V1's performance utilising docker container environments and commands on a machine with an Intel Core Ultra 5 245KF CPU and an Intel Arc B580 discrete graphics card.

In this benchmarking setup, sending more concurrent prompts to the vLLM server increased throughput, which peaked at 16 concurrent requests and then levelled off as hardware resources became saturated.
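A sweep of this kind can be reproduced with a simple client against the server's OpenAI-compatible /v1/completions endpoint. The sketch below assumes a server already running at localhost:8000 with a placeholder model name; a real benchmark would also record latency percentiles.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/v1/completions"
PAYLOAD = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder; must match the served model
    "prompt": "Write a short poem about GPUs.",
    "max_tokens": 128,
}


def one_request() -> int:
    """Send one completion request and return the number of generated tokens."""
    resp = requests.post(URL, json=PAYLOAD, timeout=300)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]


for concurrency in (1, 2, 4, 8, 16, 32):
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(lambda _: one_request(), range(concurrency * 4)))
    elapsed = time.time() - start
    print(f"{concurrency:>2} concurrent requests: {total_tokens / elapsed:.1f} output tokens/s")
```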