{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# RAG evaluation with the Ragas Python SDK (modern metrics API)\n", "\n", "This notebook shows a minimal end-to-end flow using Ragas modern metrics: build a small evaluation table (questions, retrieved contexts, and model answers), run metrics from `ragas.metrics.collections`, and inspect per-row and aggregate scores.\n", "\n", "**Requirements**\n", "\n", "- Python 3.10+ recommended.\n", "- OpenAI-compatible API access: set `EVAL_LLM_API_KEY` and, when needed, `EVAL_LLM_BASE_URL` (the notebook also accepts `OPENAI_API_KEY` / `OPENAI_BASE_URL` as fallback).\n", "- Optional separate embedding endpoint credentials: `EVAL_EMBED_API_KEY` and `EVAL_EMBED_BASE_URL`.\n", "- Explicit model configuration in the notebook (`EVAL_LLM_MODEL` and optional `EVAL_EMBED_MODEL`; default is `text-embedding-3-small`).\n", "- For retrieval-related metrics, use the same embedding model as the production RAG retriever whenever possible.\n", "- Ragas calls the LLM (and embeddings where needed) multiple times; expect latency and cost proportional to dataset size and metrics.\n", "- A dedicated virtual environment or workbench image reduces dependency conflicts with other projects.\n", "\n", "**References**\n", "\n", "- [Ragas documentation](https://docs.ragas.io/)\n", "- Alauda AI docs: *Evaluating RAG with Ragas* — grouped metric overview and prerequisites (when browsing the documentation site)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install dependencies\n", "\n", "Run this once per environment (for example a new workbench or virtualenv)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use current kernel's Python so PATH does not point to another env\n", "# If download is slow, add: -i https://pypi.tuna.tsinghua.edu.cn/simple\n", "import sys\n", "!{sys.executable} -m pip install \"ragas\" \"datasets\" \"openai\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Configure API credentials\n", "\n", "Set `EVAL_LLM_API_KEY` (recommended) or `OPENAI_API_KEY` before running evaluation. If the endpoint is not the provider default, set `EVAL_LLM_BASE_URL` (or `OPENAI_BASE_URL`) as well.\n", "\n", "Do not commit secrets into version control; use platform secret injection or notebook environment variables instead.\n", "\n", "Optional: disable Ragas analytics (`RAGAS_DO_NOT_TRACK=true`) if required by policy." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "# Config LLM API\n", "# os.environ[\"EVAL_LLM_API_KEY\"] = \"sk-...\"\n", "# os.environ[\"EVAL_LLM_BASE_URL\"] = \"https://your-openai-compatible-endpoint/v1\" # optional\n", "# os.environ[\"EVAL_LLM_MODEL\"] = \"...\"\n", "\n", "# Config Embeddings API\n", "# os.environ[\"EVAL_EMBED_API_KEY\"] = \"sk-...\"\n", "# os.environ[\"EVAL_EMBED_BASE_URL\"] = \"https://your-embedding-endpoint/v1\"\n", "# os.environ[\"EVAL_EMBED_MODEL\"] = \"...\"\n", "\n", "\n", "# Optional: disable Ragas analytics\n", "# os.environ[\"RAGAS_DO_NOT_TRACK\"] = \"true\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from openai import AsyncOpenAI\n", "from ragas.embeddings import OpenAIEmbeddings\n", "from ragas.llms import llm_factory\n", "\n", "EVAL_LLM_API_KEY = os.getenv(\"EVAL_LLM_API_KEY\", os.getenv(\"OPENAI_API_KEY\", \"\"))\n", "EVAL_LLM_BASE_URL = os.getenv(\"EVAL_LLM_BASE_URL\", os.getenv(\"OPENAI_BASE_URL\", \"\"))\n", "EVAL_LLM_MODEL = os.getenv(\"EVAL_LLM_MODEL\", \"\")\n", "\n", "EVAL_EMBED_API_KEY = os.getenv(\"EVAL_EMBED_API_KEY\", EVAL_LLM_API_KEY)\n", "EVAL_EMBED_BASE_URL = os.getenv(\"EVAL_EMBED_BASE_URL\", EVAL_LLM_BASE_URL)\n", "EVAL_EMBED_MODEL = os.getenv(\"EVAL_EMBED_MODEL\", \"\")\n", "\n", "if not EVAL_LLM_MODEL:\n", " raise RuntimeError(\"Set EVAL_LLM_MODEL to an available model ID from your endpoint.\")\n", "\n", "if not EVAL_EMBED_MODEL:\n", " raise RuntimeError(\"Set EVAL_EMBED_MODEL to an available model ID from your endpoint.\")\n", "\n", "llm_client = AsyncOpenAI(\n", " api_key=EVAL_LLM_API_KEY,\n", " base_url=EVAL_LLM_BASE_URL or None,\n", ")\n", "embed_client = AsyncOpenAI(\n", " api_key=EVAL_EMBED_API_KEY,\n", " base_url=EVAL_EMBED_BASE_URL or None,\n", ")\n", "\n", "llm = llm_factory(EVAL_LLM_MODEL, client=llm_client)\n", "embeddings = OpenAIEmbeddings(\n", " model=EVAL_EMBED_MODEL,\n", " client=embed_client,\n", ")\n", "\n", "print(f\"llm_base_url={EVAL_LLM_BASE_URL or '(provider default)'}\")\n", "print(f\"llm={EVAL_LLM_MODEL}\")\n", "print(f\"embed_base_url={EVAL_EMBED_BASE_URL or '(provider default)'}\")\n", "print(f\"embeddings={EVAL_EMBED_MODEL}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Build an evaluation dataset\n", "\n", "For the modern metrics API in this notebook, organize data as row samples (one dictionary per sample).\n", "\n", "Each row uses argument-aligned names:\n", "\n", "- `user_input`: user query\n", "- `retrieved_contexts`: list of retrieved passages for that row\n", "- `response`: model response to score\n", "- `reference`: reference answer or expected facts (needed by retrieval/reference-based metrics)\n", "\n", "This row-first structure matches `ascore()` usage and avoids extra mapping." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasets import Dataset\n", "\n", "samples = [\n", " {\n", " \"user_input\": \"What is the capital of France?\",\n", " \"retrieved_contexts\": [\n", " \"Paris is the capital and most populous city of France.\"\n", " ],\n", " \"response\": \"The capital of France is Paris.\",\n", " \"reference\": \"Paris\",\n", " },\n", " {\n", " \"user_input\": \"Who patented an early practical telephone?\",\n", " \"retrieved_contexts\": [\n", " \"Alexander Graham Bell was a Scottish-born inventor who patented the first practical telephone.\"\n", " ],\n", " \"response\": \"Alexander Graham Bell patented an early practical telephone.\",\n", " \"reference\": \"Alexander Graham Bell\",\n", " },\n", " {\n", " \"user_input\": \"What is photosynthesis?\",\n", " \"retrieved_contexts\": [\n", " \"Photosynthesis is the process by which plants convert light energy into chemical energy.\"\n", " ],\n", " \"response\": \"Photosynthesis is how plants turn sunlight into chemical energy.\",\n", " \"reference\": \"Plants convert light energy into chemical energy during photosynthesis.\",\n", " },\n", "]\n", "\n", "dataset = Dataset.from_list(samples)\n", "dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run evaluation (modern metrics)\n", "\n", "- **Faithfulness**: whether the answer is supported by the retrieved contexts.\n", "- **Answer relevancy**: whether the answer addresses the question.\n", "\n", "This section uses metrics from `ragas.metrics.collections` with the modern embeddings/LLM interfaces." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ragas.metrics.collections import AnswerRelevancy, Faithfulness\n", "\n", "faithfulness_metric = Faithfulness(llm=llm)\n", "answer_relevancy_metric = AnswerRelevancy(llm=llm, embeddings=embeddings)\n", "\n", "\n", "async def score_baseline_rows(ds):\n", " rows = ds.to_list()\n", " scored = []\n", " for row in rows:\n", " faithfulness_result = await faithfulness_metric.ascore(\n", " user_input=row[\"user_input\"],\n", " response=row[\"response\"],\n", " retrieved_contexts=row[\"retrieved_contexts\"],\n", " )\n", " answer_relevancy_result = await answer_relevancy_metric.ascore(\n", " user_input=row[\"user_input\"],\n", " response=row[\"response\"],\n", " )\n", " scored.append(\n", " {\n", " \"user_input\": row[\"user_input\"],\n", " \"faithfulness\": faithfulness_result.value,\n", " \"answer_relevancy\": answer_relevancy_result.value,\n", " }\n", " )\n", " return scored\n", "\n", "\n", "baseline_scores = await score_baseline_rows(dataset)\n", "faithfulness_avg = sum(item[\"faithfulness\"] for item in baseline_scores) / len(baseline_scores)\n", "answer_relevancy_avg = sum(item[\"answer_relevancy\"] for item in baseline_scores) / len(baseline_scores)\n", "\n", "print(\"Aggregate means:\")\n", "print(f\"faithfulness={faithfulness_avg:.4f}\")\n", "print(f\"answer_relevancy={answer_relevancy_avg:.4f}\")\n", "print(\"\\nPer-row scores:\")\n", "for idx, item in enumerate(baseline_scores, start=1):\n", " print(\n", " f\"{idx}. 
user_input={item['user_input']} | \"\n", " f\"faithfulness={item['faithfulness']:.4f} | \"\n", " f\"answer_relevancy={item['answer_relevancy']:.4f}\"\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Add retrieval-focused metrics (modern metrics)\n", "\n", "- **Context precision**: whether retrieved chunks are useful for answering the question.\n", "- **Context recall**: whether retrieved contexts cover what the reference (`ground_truth`) states.\n", "\n", "This pass issues additional LLM calls. If validation errors mention missing columns, adjust the dataset or choose another metric variant per the [Ragas metrics documentation](https://docs.ragas.io/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from ragas.metrics.collections import ContextPrecision, ContextRecall\n", "\n", "context_precision_metric = ContextPrecision(llm=llm)\n", "context_recall_metric = ContextRecall(llm=llm)\n", "\n", "\n", "async def score_retrieval_rows(ds):\n", " rows = ds.to_list()\n", " scored = []\n", " for row in rows:\n", " context_precision_result = await context_precision_metric.ascore(\n", " user_input=row[\"user_input\"],\n", " reference=row[\"reference\"],\n", " retrieved_contexts=row[\"retrieved_contexts\"],\n", " )\n", " context_recall_result = await context_recall_metric.ascore(\n", " user_input=row[\"user_input\"],\n", " retrieved_contexts=row[\"retrieved_contexts\"],\n", " reference=row[\"reference\"],\n", " )\n", " scored.append(\n", " {\n", " \"user_input\": row[\"user_input\"],\n", " \"context_precision\": context_precision_result.value,\n", " \"context_recall\": context_recall_result.value,\n", " }\n", " )\n", " return scored\n", "\n", "\n", "retrieval_scores = await score_retrieval_rows(dataset)\n", "context_precision_avg = sum(item[\"context_precision\"] for item in retrieval_scores) / len(retrieval_scores)\n", "context_recall_avg = sum(item[\"context_recall\"] for item in retrieval_scores) / len(retrieval_scores)\n", "\n", "print(\"Aggregate means:\")\n", "print(f\"context_precision={context_precision_avg:.4f}\")\n", "print(f\"context_recall={context_recall_avg:.4f}\")\n", "print(\"\\nPer-row scores:\")\n", "for idx, item in enumerate(retrieval_scores, start=1):\n", " print(\n", " f\"{idx}. user_input={item['user_input']} | \"\n", " f\"context_precision={item['context_precision']:.4f} | \"\n", " f\"context_recall={item['context_recall']:.4f}\"\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Troubleshooting\n", "\n", "- **Model not found (`Model Not Exist`)**: set `EVAL_LLM_MODEL` (and, when overridden, `EVAL_EMBED_MODEL`) to an available model ID from the endpoint (for example via `/models`).\n", "- **Credentials or endpoint setup**: set `EVAL_LLM_API_KEY` / `EVAL_LLM_BASE_URL` (fallback: `OPENAI_API_KEY` / `OPENAI_BASE_URL`). If embeddings use a separate endpoint, also set `EVAL_EMBED_API_KEY` / `EVAL_EMBED_BASE_URL`.\n", "- **Notebook async execution**: this notebook uses `await metric.ascore(...)` in cells. If running outside notebook contexts, use `asyncio.run(...)` or metric `.score(...)` in synchronous scripts.\n", "- **Version-related warnings**: metric classes and signatures can change across Ragas releases. Pin package versions for reproducible runs and confirm behavior against [docs.ragas.io](https://docs.ragas.io/)." 
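] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Optional: list the endpoint's model IDs\n", "\n", "The next cell is an optional sanity check for `Model Not Exist` errors: it queries the configured LLM endpoint through the same `llm_client` used for evaluation and prints the model IDs it serves. This is a sketch that assumes the OpenAI-compatible endpoint implements the `/models` route; skip it if yours does not." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: list the model IDs served by the LLM endpoint.\n", "# Assumes the OpenAI-compatible endpoint implements the /models route.\n", "models_page = await llm_client.models.list()\n", "print(sorted(model.id for model in models_page.data))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running the evaluation outside a notebook\n", "\n", "Cells in this notebook use top-level `await`, which works because Jupyter provides a running event loop. In a plain Python script there is no such loop, so wrap the scoring coroutine with `asyncio.run(...)`. The following is a minimal sketch, assuming the client, metric, and `score_baseline_rows` definitions from this notebook are copied into the script along with `dataset`:\n", "\n", "```python\n", "import asyncio\n", "\n", "async def main() -> None:\n", "    # Reuses score_baseline_rows and dataset as defined earlier in this notebook.\n", "    scores = await score_baseline_rows(dataset)\n", "    for item in scores:\n", "        print(item)\n", "\n", "if __name__ == \"__main__\":\n", "    asyncio.run(main())\n", "```"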
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 4 }