Design and Implementation of an Evaluation Methodology for Retrieval-Augmented Generation in Large Language Models

María Muñoz de Luna Eusebio, C. (2025). Design and Implementation of an Evaluation Methodology for Retrieval-Augmented Generation in Large Language Models. Final Degree Project (TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación.

Abstract:
The rapid evolution of Large Language Models (LLMs) has revolutionized the way information is processed and accessed, enabling new possibilities in domains such as question answering, summarization, and decision support. However, these advances also introduce significant challenges: traditional LLMs, trained on static data, often generate responses that are outdated, inaccurate, or unsupported, a phenomenon known as hallucination. This limitation becomes especially critical in sensitive fields where accuracy and reliability are essential. To address these issues, Retrieval-Augmented Generation (RAG) systems combine the generative power of LLMs with real-time access to external knowledge, grounding responses in verifiable, up-to-date information. Yet evaluating RAG quality remains challenging, as classical metrics do not guarantee factual support and human evaluation is expensive and slow. This work proposes and implements an automated, human-centered evaluation methodology specifically designed for RAG systems. The approach begins with the generation of synthetic evaluation datasets using advanced LLMs and prompt engineering, covering a wide variety of realistic and challenging scenarios. The evaluation framework combines hybrid metrics: LLM-based measures (such as context recall and faithfulness) are used alongside classical similarity metrics, providing a comprehensive assessment of both factual accuracy and the quality of the retrieval process. Additionally, this work introduces a novel plain language metric, enabling measurement of the clarity and accessibility of AI-generated responses, an essential factor for effective user communication. The methodology is validated through a real-world case study, demonstrating its capacity to identify common failure modes and to interpret evaluation results systematically. This approach contributes to the development of more transparent, reliable, and user-oriented RAG systems, supporting their continuous improvement in practical applications.
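To make the first step of the methodology concrete, the sketch below shows one common way to synthesize an evaluation set with an LLM and a prompt template: each source passage is turned into a (question, reference answer, context) triple. The prompt wording, the JSON schema, and the `complete(prompt) -> str` wrapper around whatever LLM API is used are all illustrative assumptions, not the thesis's actual prompts or tooling.

```python
import json

# Hypothetical prompt template; the thesis's real prompts may differ.
QA_PROMPT = """You are building an evaluation set for a RAG system.
Given the passage below, write one realistic user question that the
passage can answer, and the answer grounded in the passage.
Respond as JSON with keys "question" and "answer".

Passage:
{passage}
"""

def synthesize_qa(passages, complete):
    """Turn source passages into (question, reference, context) records.

    `complete` is an assumed callable wrapping any LLM completion API.
    """
    dataset = []
    for passage in passages:
        raw = complete(QA_PROMPT.format(passage=passage))
        try:
            qa = json.loads(raw)
            dataset.append({"question": qa["question"],
                            "reference": qa["answer"],
                            "context": passage})
        except (json.JSONDecodeError, KeyError):
            continue  # skip malformed generations rather than failing
    return dataset
```

In practice, varying the template (multi-hop questions, ambiguous phrasing, unanswerable queries) is what produces the "realistic and challenging scenarios" the abstract refers to.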
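The hybrid-metric idea can likewise be sketched minimally: a classical token-overlap F1 compares the generated answer against a reference, while an LLM judge estimates faithfulness, i.e., whether the answer is supported by the retrieved context. The judge prompt and the binary verdict below are simplifying assumptions; the metric definitions used in the thesis may differ.

```python
import re

def token_f1(prediction: str, reference: str) -> float:
    """Classical similarity: token-level F1 between prediction and reference."""
    pred = re.findall(r"\w+", prediction.lower())
    ref = re.findall(r"\w+", reference.lower())
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred) & set(ref))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical judge prompt; real faithfulness metrics often score
# individual claims rather than the whole answer at once.
JUDGE_PROMPT = """Context:
{context}

Answer:
{answer}

Is every claim in the answer supported by the context?
Reply with a single word: YES or NO."""

def faithfulness(answer: str, context: str, complete) -> float:
    """LLM-based faithfulness: 1.0 if the judge deems the answer grounded."""
    verdict = complete(JUDGE_PROMPT.format(context=context, answer=answer))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0
```

Combining both signals is what lets the framework distinguish answers that merely resemble the reference from answers that are actually grounded in the retrieved evidence.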
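The abstract does not define the novel plain language metric, so the following is only a readability proxy under that caveat: the classic Flesch Reading Ease score, computed with a rough vowel-group syllable heuristic, illustrates how clarity and accessibility can be quantified automatically.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Readability proxy: higher scores indicate plainer language."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0

    def syllables(word: str) -> int:
        # Crude heuristic: count contiguous vowel groups, minimum one.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    total_syllables = sum(syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (total_syllables / len(words)))

print(flesch_reading_ease("RAG grounds answers in retrieved evidence."))
```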