Using GPU Utilization To Scale Inference Servers Efficiently
GPU utilization is a measure that can be compared across GPUs: it represents the GPU duty cycle, that is, the amount of time the GPU spends actively processing work
GPU usage is the percentage of a graphics processing unit's (GPU) processing power that is in use at any given moment
GPUs are specialized hardware components that handle the complex mathematical computations required for parallel computing and graphics rendering
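As a concrete illustration of the duty-cycle idea, utilization can be read directly from the NVIDIA driver. The sketch below assumes the `pynvml` package (NVIDIA's NVML bindings) is installed and a driver is present; the device index and one-second sampling interval are arbitrary choices for illustration.

```python
# Sketch: sample GPU utilization (duty cycle) via NVML.
# Assumes the pynvml package is installed and an NVIDIA driver is available.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU; adjust as needed

try:
    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # util.gpu is the percentage of time the GPU was busy over the last
        # sampling window; util.memory covers memory controller activity.
        print(f"GPU busy: {util.gpu}%  memory controller: {util.memory}%")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```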
By default, the autoscaling metrics are CPU or memory usage, which works well for CPU-bound workloads
However, because inference servers rely heavily on GPUs, those metrics alone are no longer a reliable way to measure a job's resource consumption
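One way to make the autoscaler GPU-aware on GKE is to target the GPU duty-cycle metric that the custom metrics Stackdriver adapter exposes as an external metric. The sketch below uses the official `kubernetes` Python client; the metric name `kubernetes.io|container|accelerator|duty_cycle`, the deployment name `tgi-server`, and the 85% target are illustrative assumptions, not a prescribed configuration.

```python
# Sketch: create an HPA that scales on GPU duty cycle instead of CPU/memory.
# Assumes the custom metrics Stackdriver adapter is installed in the cluster
# and that a kubeconfig is available locally. Names and targets are illustrative.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="tgi-gpu-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="tgi-server"),
        min_replicas=1,
        max_replicas=4,
        metrics=[client.V2MetricSpec(
            type="External",
            external=client.V2ExternalMetricSource(
                metric=client.V2MetricIdentifier(
                    # GPU duty cycle as surfaced by the Stackdriver adapter (assumed name).
                    name="kubernetes.io|container|accelerator|duty_cycle"),
                target=client.V2MetricTarget(type="AverageValue",
                                             average_value="85")))],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```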
Keep in mind that the mean-time-per-token graph shows TGI's metric for the total time spent on prefill and decode combined
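TGI publishes its metrics in Prometheus format, so a per-token latency figure can be derived from a histogram's sum and count. The sketch below is an assumption-laden illustration: the endpoint URL and the metric name `tgi_request_mean_time_per_token_duration` should be checked against the `/metrics` output of the TGI version in use.

```python
# Sketch: derive mean time per token from TGI's Prometheus metrics endpoint.
# The URL and the metric name are assumptions; verify them against your TGI version.
import urllib.request

METRICS_URL = "http://localhost:8080/metrics"  # hypothetical TGI address

def mean_time_per_token(url: str = METRICS_URL) -> float:
    """Average seconds per token (prefill + decode) from histogram sum / count."""
    text = urllib.request.urlopen(url).read().decode()
    total = count = 0.0
    for line in text.splitlines():
        if line.startswith("tgi_request_mean_time_per_token_duration_sum"):
            total = float(line.split()[-1])
        elif line.startswith("tgi_request_mean_time_per_token_duration_count"):
            count = float(line.split()[-1])
    return total / count if count else 0.0

print(f"mean time per token: {mean_time_per_token():.4f} s")
```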
Google ran TGI with Llama 2 7B on a single g2-standard-16 machine with one L4 GPU, using the HPA custom metrics Stackdriver adapter
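For context, a deployment along those lines might look like the sketch below, again using the `kubernetes` Python client. The image tag, model id, port, and resource values are illustrative assumptions and not the exact configuration Google used.

```python
# Sketch: a TGI Deployment requesting one GPU, roughly mirroring the setup above.
# The image, model id, and resource values are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()

container = client.V1Container(
    name="tgi",
    image="ghcr.io/huggingface/text-generation-inference:latest",
    args=["--model-id", "meta-llama/Llama-2-7b-chat-hf"],
    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
    ports=[client.V1ContainerPort(container_port=80)],
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="tgi-server"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "tgi-server"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "tgi-server"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```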