#Experiencing Inference Service Timeouts with MLServer Runtime

#Problem Description

When using the inference service experience feature with the MLServer runtime, timeout errors may occur for the following two reasons:

  1. Insufficient computing power or excessively long inference output:

    • Symptom: The inference service returns a 502 Bad Gateway error with the message "Http failure response for [inference service URL]: 502 OK".
    • Detailed error information: Often includes an HTML-formatted error page indicating "502 Bad Gateway".
    • Response time: Significantly exceeds the expected response time, potentially lasting several minutes.

  2. MLServer runtime does not stream results: The current MLServer implementation waits for the entire inference process to complete before returning the result, so a long-running inference keeps the user waiting and may eventually trigger a timeout.

#Root Cause Analysis

  • Insufficient computing power: The computing resources required for model inference exceed the server's capacity. This may be due to a large model, complex input data, or an under-provisioned server.
  • Excessively long inference output: The model generates so many tokens that the inference time exceeds the server's processing capacity or the preset timeout limit.

#Solutions

To address the above problems, the following solutions can be adopted:

  1. Increase computing resources:

    • Upgrade server configuration: Consider higher-performance CPUs or GPUs, or more memory.
  2. Limit the length of inference output tokens:

    • Adjust model parameters: When calling the inference service, set parameters such as max_new_tokens to limit the maximum number of tokens the model generates (see the request sketch after this list).
  3. Optimize model and input data:

    • Model quantization or pruning: Reduce the model size and computational complexity, thereby reducing the inference time.
    • Data preprocessing: Remove redundant information and simplify the data structure before submitting the input, to reduce the amount of data the model must process.
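
As a reference for solution 2, below is a minimal sketch of a request against the Open Inference Protocol (V2) endpoint that MLServer exposes. The host, model name, input tensor name, and the max_new_tokens parameter are illustrative assumptions; which generation parameters are honored, and under which names, depends on your model and runtime configuration.

```python
import requests

# Placeholder values -- replace with your own inference service URL and model name.
BASE_URL = "http://<inference-service-host>"
MODEL_NAME = "my-model"

# Open Inference Protocol (V2) request body. "max_new_tokens" is an assumed
# parameter name; adjust it to whatever your model/runtime actually accepts.
payload = {
    "inputs": [
        {
            "name": "prompt",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Summarize the latest release notes in one sentence."],
        }
    ],
    # Cap the number of generated tokens to keep inference time bounded.
    "parameters": {"max_new_tokens": 128},
}

# MLServer returns the full result only after the whole inference finishes
# (non-streaming), so allow a generous client-side timeout.
response = requests.post(
    f"{BASE_URL}/v2/models/{MODEL_NAME}/infer",
    json=payload,
    timeout=120,
)
response.raise_for_status()
print(response.json())
```

Lowering max_new_tokens (or the equivalent parameter for your model) is usually the quickest way to bring response times back under the gateway's timeout without changing hardware.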

#Summary

MLServer timeout errors are usually caused by insufficient computing resources, excessively long inference output, or the non-streaming behavior of the MLServer runtime. Resolving them requires weighing hardware resources, model characteristics, and runtime configuration, and choosing the solution that fits the actual situation.