Vertex AI Prediction Dedicated Endpoints


Vertex AI Prediction Dedicated Endpoints are designed to meet the demands of generative AI and large-scale models, ensuring reliable performance and resource isolation

Supports real-time, interactive applications like chatbots and content creation with APIs such as StreamRawPredict for bidirectional streaming
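A minimal sketch of what a streamRawPredict call looks like against a dedicated endpoint. The project number, region, and endpoint ID are hypothetical placeholders, and the request body shape is model-specific; this only builds the request, it does not send it.

```python
# Hypothetical identifiers -- substitute your own project, region, and endpoint ID.
PROJECT_NUMBER = "123456789"
REGION = "us-central1"
ENDPOINT_ID = "456789"

# Dedicated endpoints are reached through a dedicated DNS name rather than
# the shared regional aiplatform.googleapis.com host.
DEDICATED_HOST = f"{ENDPOINT_ID}.{REGION}-{PROJECT_NUMBER}.prediction.vertexai.goog"

def stream_raw_predict_url() -> str:
    """Build the streamRawPredict URL for a dedicated endpoint."""
    return (
        f"https://{DEDICATED_HOST}/v1/projects/{PROJECT_NUMBER}"
        f"/locations/{REGION}/endpoints/{ENDPOINT_ID}:streamRawPredict"
    )

# The request body is model-specific; this shape assumes a chat-style model.
payload = {
    "inputs": "Tell me a short story.",
    "parameters": {"max_tokens": 256, "stream": True},
}
```

A streaming client would POST `payload` to this URL with an OAuth bearer token and consume the response chunk by chunk.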

Offers an interface compatible with the OpenAI Chat Completions API, simplifying migration and promoting interoperability
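The point of the OpenAI-compatible surface is that existing Chat Completions request bodies work unchanged. The sketch below builds such a request; the URL path shape and all identifiers are assumptions for illustration, so check the current Vertex AI docs for the exact route your endpoint exposes.

```python
# Hypothetical project/endpoint identifiers for illustration only.
PROJECT = "my-project"
REGION = "us-central1"
ENDPOINT_ID = "456789"

# Assumed path shape for the OpenAI-compatible surface on a dedicated endpoint.
chat_url = (
    f"https://{ENDPOINT_ID}.{REGION}-{PROJECT}.prediction.vertexai.goog"
    f"/v1/projects/{PROJECT}/locations/{REGION}/endpoints/{ENDPOINT_ID}"
    "/chat/completions"
)

# A standard OpenAI Chat Completions body -- existing client code can often be
# repointed at this URL, using a Google OAuth token in place of an API key.
chat_request = {
    "model": "my-deployed-model",  # hypothetical deployed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
```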

Provides native support for gRPC, enabling low-latency, high-throughput communication for demanding AI workloads
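Since dedicated endpoints speak gRPC on the same dedicated host, a client can open a secure channel directly. This is a sketch with a hypothetical target address; creating the channel performs no network I/O until the first RPC, and the message-size option shown is an illustrative tuning choice, not a required setting.

```python
import grpc

# Hypothetical dedicated-endpoint DNS name; gRPC uses the same host as REST,
# on port 443 with TLS.
target = "456789.us-central1-123456789.prediction.vertexai.goog:443"

# No connection is established until an RPC is issued, so this is safe to
# construct offline.
channel = grpc.secure_channel(
    target,
    grpc.ssl_channel_credentials(),
    # Example tuning: raise the receive limit for large prediction responses.
    options=[("grpc.max_receive_message_length", 32 * 1024 * 1024)],
)
```

Generated Vertex AI prediction stubs (or a raw generic stub) would then be bound to this channel.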

Allows custom timeouts on prediction requests, accommodating the longer inference times of large models

Enhances stability and performance by efficiently managing CPU/GPU, memory, and network bandwidth demands

Dedicated Endpoints are now the default serving mechanism for self-deployed models in Vertex AI Model Garden

Dedicated Endpoints Private use Private Service Connect (PSC) for secure and efficient networking, routing traffic exclusively through Google Cloud’s network

Private Endpoints with PSC ensure requests originate from within your Virtual Private Cloud (VPC), avoiding public internet exposure
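Enabling PSC is a configuration choice made when the endpoint is created. The sketch below shows an assumed REST-style request body with the PSC block; field names follow the Vertex AI Endpoint resource as I understand it, but verify them against the current API reference, and the project names are hypothetical.

```python
# Sketch of an endpoint-creation body enabling Private Service Connect.
# Field names are assumptions based on the Vertex AI REST Endpoint resource;
# project and display names are hypothetical.
endpoint_body = {
    "displayName": "my-dedicated-psc-endpoint",
    "dedicatedEndpointEnabled": True,
    "privateServiceConnectConfig": {
        "enablePrivateServiceConnect": True,
        # Only allowlisted consumer projects may create PSC forwarding
        # rules that reach this endpoint from their VPCs.
        "projectAllowlist": ["my-consumer-project"],
    },
}
```

With this configuration, prediction traffic enters through a PSC forwarding rule inside the consumer VPC rather than over the public internet.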

PSC reduces latency variability by bypassing the public internet, ensuring predictable performance for intensive workloads

Reduced Noisy Neighbor Impact: PSC improves network traffic isolation, minimizing performance interference from other users

Public Endpoint Option: Dedicated Endpoints Public remains available for models accessible via the public internet

Private Endpoints with PSC are advised for workloads requiring stringent security and predictable latency