Set Up an On-Premises Reranker Service for Hyperflux
Follow the steps below in Alauda AI to set up a rerank model using vLLM:
- Upload the desired rerank model to the model repository, for example `Alibaba-NLP/gte-reranker-modernbert-base`.
- Click the "Publish Inference Service" button, configure appropriate resources, and select the vLLM runtime (>=vllm-0.9.2-cuda-12.6-x86).
- Do not click "Publish" yet. Click the YAML button in the upper right corner to switch to YAML editing mode.
- Modify the `spec.model.command` section in the YAML file as follows (note that you only need to delete the original `python3` startup part and replace it with the `vllm serve` startup command below; the preceding script does not need to be modified).
- After the rerank model starts, ensure that the model's API address is reachable from the global cluster (the cluster where Hyperflux is deployed). If access is across clusters, expose the service via NodePort, Ingress, AI Gateway, or similar.
- Modify the Hyperflux configuration items: set Cohere Reranker BaseUrl to the access address of the inference service above, set Cohere Reranker Model to the model name (usually the name of the created `InferenceService`), and set Cohere Reranker API Key to any value (vLLM does not enforce one by default).
- After the smart-doc container restarts successfully, the setup is complete.
Sample vLLM startup command:
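A minimal sketch of what the replacement startup command could look like. The model path, served model name, and port below are assumptions for illustration; adjust them to match your model repository layout and resource configuration.

```shell
# Sketch only: model path, served name, and port are placeholders.
# --task score tells vLLM to serve the cross-encoder as a scoring/rerank model.
vllm serve /mnt/models/gte-reranker-modernbert-base \
  --served-model-name gte-reranker-modernbert-base \
  --task score \
  --port 8080
```

The `--served-model-name` value is what you later enter as Cohere Reranker Model in the Hyperflux configuration.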
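Once the service is running, you can check that the rerank endpoint is reachable from the global cluster before wiring it into Hyperflux. vLLM exposes a Cohere-compatible rerank API at `/v1/rerank`; the host, port, and model name below are placeholders, not values from this guide.

```shell
# Illustrative connectivity check; replace the address and model name
# with your actual inference service endpoint and served model name.
curl -s http://<inference-service-address>:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gte-reranker-modernbert-base",
        "query": "What is Hyperflux?",
        "documents": ["Hyperflux is an AI product.", "Unrelated text."]
      }'
```

A successful response contains a `results` array with a `relevance_score` per document; if the call fails across clusters, revisit the NodePort/Ingress/AI Gateway exposure step above.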