Hex-LLM: High-Efficiency LLM Serving for TPUs on Vertex AI

Google Cloud aims to deliver efficient, cost-optimized ML workflow recipes through Vertex AI Model Garden.

In Vertex AI Model Garden, Google first introduced vLLM, the popular open-source LLM serving stack, running on GPUs.

Hex-LLM, Vertex AI's LLM serving framework, was developed specifically for Google Cloud TPU hardware, which is available as part of AI Hypercomputer.

Google is committed to keeping Hex-LLM current with the latest foundation models and advanced serving techniques as the LLM field evolves.

Hex-LLM is benchmarked on a sample of the ShareGPT dataset, a widely used dataset containing prompts and outputs of varying lengths.
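A serving benchmark over such a sample typically reports request-length statistics and token throughput. The sketch below is illustrative only, using hypothetical field shapes rather than the actual harness used to benchmark Hex-LLM:

```python
import statistics

def summarize_lengths(samples):
    """Summarize token lengths for a ShareGPT-style benchmark sample.

    `samples` is a list of (prompt_tokens, output_tokens) pairs; this
    shape is an assumption for illustration, not the real harness input.
    """
    prompts = [p for p, _ in samples]
    outputs = [o for _, o in samples]
    return {
        "num_requests": len(samples),
        "mean_prompt_len": statistics.mean(prompts),
        "mean_output_len": statistics.mean(outputs),
    }

def output_throughput(samples, wall_clock_seconds):
    """Output tokens generated per second across the whole run."""
    total_output = sum(o for _, o in samples)
    return total_output / wall_clock_seconds

# A tiny made-up sample: short and long prompts mixed, as in ShareGPT.
sample = [(32, 128), (512, 64), (128, 256), (64, 512)]
stats = summarize_lengths(sample)
tput = output_throughput(sample, 10.0)  # 960 output tokens / 10 s = 96.0
```

Reporting throughput in output tokens per second (alongside latency percentiles) is the common way to compare serving stacks on mixed-length workloads.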

Benchmarks measure the performance of Llama 2 70B (int8 weight-quantized) and Gemma 7B on eight TPU v5e chips.
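Int8 weight quantization shrinks a model's memory footprint by storing weights as 8-bit integers plus a scale factor. The sketch below shows the basic idea with symmetric per-tensor quantization; it is a minimal illustration, and the post does not detail Hex-LLM's actual quantization scheme:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization (illustrative only).

    Maps floats into [-128, 127] using a single scale derived from the
    largest absolute weight.
    """
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]

w = [0.5, -1.27, 0.02, 1.0]
q, s = quantize_int8(w)      # e.g. [50, -127, 2, 100] with scale 0.01
approx = dequantize(q, s)    # close to the original weights
```

For a 70B-parameter model, halving each weight from 16 to 8 bits is what makes it feasible to serve on a single eight-chip TPU v5e slice.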

The Vertex AI Model Garden playground is a pre-deployed Vertex AI Prediction endpoint integrated directly into the user interface.
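Behind the playground, a Vertex AI Prediction endpoint accepts a JSON body with an "instances" list. The sketch below builds such a request body; the per-instance fields ("prompt", "max_tokens", "temperature") are assumptions about the serving container and may differ between models:

```python
import json

def build_predict_request(prompt, max_tokens=128, temperature=0.7):
    """Build a Vertex AI Prediction request body as a JSON string.

    The top-level "instances" list is the standard Vertex AI request
    shape; the per-instance field names here are illustrative and
    depend on the deployed serving container.
    """
    return json.dumps({
        "instances": [{
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }]
    })

body = build_predict_request("What is a TPU?")
```

Such a body would be POSTed to the endpoint's `predict` URL with an OAuth bearer token; the playground handles that plumbing for you in the UI.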