Understanding Ragas and Its Applications
Summary
The RAG model combines retrieval and generative models for better information retrieval and response quality. Open-source Ragas simplifies synthetic data generation for RAG testing, saving time. It offers performance metrics, custom prompts, language adaptation, and CI integration for quality control.
Key insights:
Defining Ragas: Ragas is an open-source tool designed to support continuous improvement of RAG applications; it synthesizes test datasets and uses them to evaluate an application's performance.
Synthetic Test Dataset Generation: Ragas generates diverse synthetic test datasets, including complex scenarios for thorough evaluation, reducing manual effort and time by up to 90%.
Metrics-Driven Development: It follows a Metrics-Driven Development methodology to ensure that LLM applications are reliable, data-driven, and reproducible.
Specialized Evaluation Metrics: It provides specialized metrics for each component of the RAG pipeline. For example, it includes faithfulness, context precision, and answer relevancy.
Production Monitoring: Monitoring RAG applications in production helps identify issues such as poor retrieval, evasive responses, or formatting errors that may consistently affect the quality of an application.
Custom Prompt Creation: Users can create and apply their own evaluation prompts, tailoring instructions and examples to their needs for more precise testing.
Language Adaptation: Ragas automatically adapts evaluation metrics to different languages by translating prompts and instructions where needed, making it suitable for multilingual applications.
Introduction
Retrieval-Augmented Generation (RAG) is an advanced natural language processing (NLP) technique that combines the strengths of retrieval and generative models to produce more accurate, context-rich responses. This insight explores how the open-source tool Ragas improves RAG system development and assessment. It covers a number of topics, such as generating synthetic test datasets, metrics-driven development, and creating custom prompts. It also shows how Ragas enables continuous integration (CI) pipeline integration, automated language adaptation, and thorough production monitoring to help guarantee reliable performance. In-depth discussions of each of these topics explain how Ragas can expedite the development and assessment of RAG systems.
What is RAG and Ragas?
Retrieval-Augmented Generation is a sophisticated approach to natural language processing that integrates two key building blocks: retrieval and generation. The retrieval component of a RAG system searches a large corpus of data for information relevant to a given query or context. The generative component, often a large language model (LLM), then uses this retrieved information to produce coherent responses suited to the situation. By combining the two, RAG improves the quality and specificity of the generated output, which makes it a preferred choice when systems must answer questions requiring detailed, context-rich responses, for instance in question answering and complex information retrieval.
Ragas is an open-source tool created to support continuous learning in RAG applications. Among other things, Ragas enables the creation of diversified test datasets for assessing applications, and it uses LLMs to estimate evaluation metrics so that an application's performance can be judged objectively. Smaller, cheaper models provide actionable insights for monitoring application quality in production, and these insights can be leveraged in further iterations to improve the application.
Ragas takes a metrics-driven approach, in which decisions are guided by data, to support a two-step process. First, evaluation assesses a RAG application and its experiments in a metric-informed manner, making the results dependable and reproducible. Second, monitoring extracts insightful, actionable signals from production data points, allowing continuous improvement of application quality.
Dataset Generation
1. Why Do We Need Synthetic Data Test Generation?
Creating hundreds of QA samples from documents by hand is both labor-intensive and time-consuming. Moreover, human-generated questions might not always reach the level of complexity required for a proper evaluation, which affects the overall quality of the assessment. Synthetic data generation, by contrast, can cut the time a developer spends on data aggregation by up to 90%.
2. Introducing Ragas: How it Helps
Ragas can help in the creation of such a synthetic dataset for testing your RAG pipeline.
By default, Ragas uses OpenAI models, so generating a synthetic test set for a RAG pipeline requires an OpenAI API key. Ragas generates the dataset from a set of documents. These documents should be enriched with metadata, in particular a `filename` attribute, so that chunks from the same document can be recognized. Once the documents are prepared, the TestsetGenerator from Ragas takes them as input to create synthetic samples. The generator also requires an LLM and an embedding model, which can either default to OpenAI models or be supplied by the user. The generation process distributes a specified number of samples across different question types, such as `simple`, `reasoning`, and `multi-context`. The synthetic data can then be exported into a Pandas DataFrame for further analysis or testing.
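As a concrete illustration, here is a minimal sketch of synthetic test set generation using the Ragas 0.1-style API; the module paths, the document directory, and the sample counts are assumptions and may differ in newer Ragas releases.

```python
# Minimal sketch of synthetic test set generation (Ragas 0.1-style API).
# Assumes OPENAI_API_KEY is set; the document path and sizes are placeholders.
from langchain_community.document_loaders import DirectoryLoader
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

# Load documents; Ragas relies on metadata such as `filename`
# to recognize chunks that come from the same document.
documents = DirectoryLoader("path/to/docs").load()
for doc in documents:
    doc.metadata["filename"] = doc.metadata.get("source", "unknown")

# Build the generator with default OpenAI LLM and embedding models.
generator = TestsetGenerator.with_openai()

# Generate 10 samples distributed across question types.
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=10,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)

# Export the synthetic samples to a Pandas DataFrame for inspection.
df = testset.to_pandas()
print(df.head())
```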
3. What Sets Ragas Apart
What sets Ragas apart from simply asking an LLM to produce test cases is the generation of samples with varying levels of complexity. Questions of different natures are created, including reasoning, conditioning, and multi-context questions. Directly prompting an LLM rarely achieves this result, because models tend to follow common patterns. The variety of difficulty levels within the dataset makes Ragas a solid tool for evaluation.
Evaluation
Since each component of an LLM and RAG pipeline strongly influences how the whole system turns out, each component's performance must be considered when assessing a Retrieval-Augmented Generation application with the synthetic test set. For this, Ragas provides specialized metrics that can be used to assess each component of your RAG pipeline in isolation.
The synthetic dataset is then tested by generating responses from a pre-trained language model using the questions and contexts provided in the dataset. The model's generated responses are compared against the ground truth answers. The evaluation involves:
Questions: A set of predefined questions.
Contexts: The retrieved contexts related to each question.
Generated Answers: Responses produced by the model for each question.
Ground Truths: The expected answers for each question.
The following metrics are used to assess the performance of the RAG system:
Faithfulness: Assesses the factual accuracy of the answers in relation to the provided context.
Context Precision: Measures how relevant the retrieved context is to the question, reflecting the quality of the retrieval process.
Answer Relevancy: Determines the relevance of the answers to the questions.
Context Recall: Evaluates the retriever's ability to retrieve all the necessary information to accurately answer the question.
Context Utilization: Evaluates whether the answer-relevant items in the retrieved contexts are ranked highly; ideally, all relevant chunks appear at the top ranks.
Context Entity Recall: Measures recall of the retrieved context in terms of entities, calculated as the fraction of entities in the ground truth that are also present in the retrieved contexts.
Noise Sensitivity: Ranges from 0 to 1 and measures how often a system makes errors by providing incorrect responses when using relevant or irrelevant retrieved documents.
Summarization Score: Measures how well a generated summary captures the important information from the contexts.
In short, the retriever is evaluated with context_precision and context_recall, which reflect the quality of the retrieved context, while the generator is evaluated with faithfulness, which measures hallucinations, and answer_relevancy, which measures how relevant the answers are to the questions.
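As a minimal sketch of how these pieces fit together, the snippet below evaluates a tiny illustrative dataset with four of the metrics above; the sample data and column names follow the Ragas 0.1-style API, and an OpenAI API key is assumed for the default judge LLM.

```python
# Sketch of a Ragas evaluation run over a tiny illustrative dataset.
# Assumes OPENAI_API_KEY is set for the default judge LLM and embeddings.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Illustrative data only; in practice the questions, contexts, and ground
# truths come from the synthetic test set, and the answers from your pipeline.
data = {
    "question": ["Where is the Eiffel Tower located?"],
    "contexts": [["The Eiffel Tower is a landmark in Paris, France."]],
    "answer": ["The Eiffel Tower is located in Paris."],
    "ground_truth": ["The Eiffel Tower is located in Paris, France."],
}
dataset = Dataset.from_dict(data)

result = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)              # aggregate score per metric
print(result.to_pandas())  # per-sample scores for closer inspection
```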
Production Monitoring
Monitoring a RAG application in production is a crucial activity for ensuring the quality and performance of the system. Ragas provides the basic building blocks for monitoring, alongside active research into more sophisticated solutions. Some of the most critical items to monitor are the following:
Faithfulness: It assists in identifying and quantifying instances of hallucination.
Bad Retrieval: Ragas identifies and quantifies poor context retrievals.
Bad Response: This feature assists in recognizing and quantifying evasive, harmful, or toxic responses.
Bad Format: This feature enables the detection and quantification of responses with incorrect formatting.
Use Cases
1. Understand Costs and Usage of Operations
Although Ragas does not calculate token usage by default, it is possible to supply a TokenUsageParser so that token usage is tracked during evaluation and can be converted into cost using per-input-token and per-output-token prices.
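The sketch below illustrates this pattern; the names get_token_usage_for_openai, token_usage_parser, total_tokens, and total_cost are taken from recent Ragas documentation and may be absent or named differently in other versions, and the per-token prices are placeholders.

```python
# Hedged sketch: track token usage during evaluation and convert it to cost.
# The parser and result methods below follow recent Ragas docs and may
# differ across versions; the prices are placeholders for your actual model.
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai
from ragas.metrics import faithfulness

result = evaluate(
    dataset,  # the evaluation Dataset built earlier
    metrics=[faithfulness],
    token_usage_parser=get_token_usage_for_openai,
)

print(result.total_tokens())  # aggregated input/output token counts
print(result.total_cost(
    cost_per_input_token=5 / 1e6,    # placeholder: $5 per million input tokens
    cost_per_output_token=15 / 1e6,  # placeholder: $15 per million output tokens
))
```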
2. Compare Embeddings for Retriever
The quality of the embeddings is one of the most important factors that directly determine the performance of a RAG system, especially the retriever. In other words, better embeddings translate into better retrieval of relevant content. For this reason, it is important to compare embedding models and select the most suitable one for your specific data. This can be achieved by generating synthetic test data and evaluating different embedding models on metrics such as context precision and context recall; the embeddings with the highest scores are the ones to keep, as sketched below.
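One way to set up such a comparison is sketched below; build_rag_pipeline and pipeline.query are hypothetical helpers standing in for your own application code, and the embedding model names and test set columns are assumptions.

```python
# Sketch of comparing embedding models for the retriever.
# build_rag_pipeline and pipeline.query are hypothetical stand-ins for your
# own RAG application; column names assume the synthetic test set DataFrame.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

def score_embeddings(embedding_name, testset_df):
    pipeline = build_rag_pipeline(embedding_model=embedding_name)  # hypothetical
    rows = {"question": [], "contexts": [], "answer": [], "ground_truth": []}
    for _, row in testset_df.iterrows():
        answer, contexts = pipeline.query(row["question"])  # hypothetical
        rows["question"].append(row["question"])
        rows["contexts"].append(contexts)
        rows["answer"].append(answer)
        rows["ground_truth"].append(row["ground_truth"])
    return evaluate(Dataset.from_dict(rows),
                    metrics=[context_precision, context_recall])

scores_small = score_embeddings("text-embedding-3-small", testset_df)
scores_large = score_embeddings("text-embedding-3-large", testset_df)
print(scores_small, scores_large)  # keep the embeddings with higher scores
```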
3. Compare LLMs using Ragas evaluation
The language model used in a RAG system plays a crucial role in determining the quality of the generated content. To establish the best-performing LLM for a given task, compare candidates using evaluation metrics such as faithfulness, answer relevancy, and answer correctness. This involves generating synthetic test data specific to the retrieval and generation tasks under consideration and scoring each model's answers against it.
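A compact sketch of such a comparison follows; questions, contexts, ground_truths, answers_llm_a, and answers_llm_b are assumed to be lists you have already collected by running the same test questions through two different generator LLMs.

```python
# Sketch: compare two generator LLMs by scoring their answers on the same
# questions and contexts. The input lists are assumed to exist already.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, answer_correctness

def score_answers(answers):
    ds = Dataset.from_dict({
        "question": questions,          # shared test questions
        "contexts": contexts,           # shared retrieved contexts
        "answer": answers,              # answers produced by one LLM
        "ground_truth": ground_truths,  # reference answers from the test set
    })
    return evaluate(ds, metrics=[faithfulness, answer_relevancy, answer_correctness])

print("LLM A:", score_answers(answers_llm_a))
print("LLM B:", score_answers(answers_llm_b))
```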
4. Write Custom Prompts with Ragas
Creating and applying custom prompts brings the evaluation process closer to your specific needs or use cases. Writing a custom prompt in Ragas means defining the instructions and examples the model should follow during response generation when conducting evaluations. Here is how it works:
Dataset Preparation: First, you load a dataset that contains questions, ground truth answers, and contexts. This dataset is what the model will be tested against.
Creating a Custom Prompt Object: Using Ragas, you create a new Prompt object. This object defines the structure and content of the prompt that will be used during evaluations. For example, you can provide examples that guide the model by showing what the input (question and answer) and output (statements) should look like; a sketch of creating such a Prompt object for an evaluation metric appears after this list.
Using the Custom Prompt in Evaluations: Once the custom prompt is defined, you can use it in the evaluation process. The evaluation then reflects how closely the model's output follows the specific instructions given through the custom prompt. For instance, the faithfulness metric relies by default on the long_form_answer_prompt and nli_statements_message prompts, which can be replaced with custom versions.
Evaluating the Dataset: Finally, you assess your dataset against the chosen metrics using the customized prompt. The resulting scores indicate how relevant and accurate the generated answers are and give you a sense of how good the model is.
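As referenced above, here is a sketch of creating a custom Prompt object and attaching it to the faithfulness metric, following the Ragas 0.1-style API described in the custom-prompts guide; attribute and module names may differ in newer releases.

```python
# Sketch: define a custom Prompt and plug it into the faithfulness metric
# (Ragas 0.1-style API; attribute and module names may differ in newer versions).
from ragas.llms.prompt import Prompt
from ragas.metrics import faithfulness

long_form_answer_prompt_new = Prompt(
    name="long_form_answer_new",
    instruction="Create one or more statements from each sentence in the given answer.",
    examples=[
        {
            "question": "Which is the only planet in the solar system that has life on it?",
            "answer": "Earth",
            "statements": {
                "statements": [
                    "Earth is the only planet in the solar system that has life on it."
                ]
            },
        }
    ],
    input_keys=["question", "answer"],
    output_key="statements",
    output_type="json",
)

# Override the metric's default prompt, then evaluate the dataset as usual
# with evaluate(dataset, metrics=[faithfulness]).
faithfulness.long_form_answer_prompt = long_form_answer_prompt_new
```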
5. Automatic Language Adaptation
Language adaptation in Ragas involves customizing evaluation metrics to work with datasets in any language, not just English. Although prompts in Ragas are originally written in English, automatic language adaptation translates key parts of the instructions and examples into the target language using an LLM. This helps the LLM better understand and generate responses in the new language. For instance, a prompt designed to extract nouns from a sentence in English can be automatically adapted to Hindi while retaining its functionality.
Consider the original prompt, whose task is to extract the nouns from a given sentence. For the sentence “The sun sets over the mountains,” it returns an array containing “sun” and “mountains”.
Because Ragas can translate key parts of the instructions and examples into the target language using an LLM, the same prompt can handle input in another language. For example, the prompt “Extract the nouns from the given sentence” can be adapted to Hindi and used to extract nouns from a Hindi sentence, even though the prompt was originally written in English.
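A sketch of this adaptation with the Ragas 0.1-style Prompt API is shown below; Prompt.adapt and the LangchainLLMWrapper usage follow the 0.1-era documentation and may have moved in newer releases, and the model name is a placeholder. Metrics expose a similar adaptation mechanism for translating their internal prompts.

```python
# Sketch: adapt an English noun-extraction prompt to Hindi (Ragas 0.1-style API).
# Prompt.adapt translates the instruction and examples using the supplied LLM.
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.llms.prompt import Prompt

openai_model = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))  # placeholder model

noun_extractor = Prompt(
    name="noun_extractor",
    instruction="Extract the nouns from the given sentence",
    examples=[{
        "sentence": "The sun sets over the mountains.",
        "output": {"nouns": ["sun", "mountains"]},
    }],
    input_keys=["sentence"],
    output_key="output",
    output_type="json",
)

# Translate the prompt's instruction and examples into Hindi; the adapted
# prompt can then be applied to Hindi sentences.
hindi_noun_extractor = noun_extractor.adapt("hindi", openai_model)
```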
6. Explainability through Logging and Tracing
In the context of large language models, logging and tracing are crucial for monitoring and understanding model behavior.
Machine learning (ML) logs store information across a model's entire life cycle, spanning the training, testing, and deployment phases. These logs are rich data sources that offer insight into how the model performs and behaves and how effective it is at its tasks, providing visibility into the operation of the ML system as a whole.
Tracing in ML involves observing exactly how input data moves through each processing step toward the model's final prediction. It provides exact tracking of the data flow and the steps taken during execution, offering end-to-end transparency into the model's operations.
One way to trace model performance in Ragas is through callbacks, which allow fine-grained tracking using tools such as LangSmith. Doing so captures data at each step of the process, making it easier to understand and further improve model performance. In practice, tracers such as LangChainTracer can be plugged into the evaluation pipeline to log and monitor metrics such as context precision for model responses. This data can then be used to fine-tune and optimize the model so that the desired behavior is achieved in different scenarios. Custom callbacks can also be written to suit specific needs, providing flexibility in monitoring and analyzing an ML model.
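As a sketch of the callback approach, assuming a LangSmith account with LANGCHAIN_API_KEY and tracing configured in the environment, a LangChainTracer can be passed to the evaluation so that each step is logged to a LangSmith project.

```python
# Sketch: trace a Ragas evaluation run in LangSmith via a LangChain callback.
# Assumes LANGCHAIN_API_KEY and LANGCHAIN_TRACING_V2 are configured.
from langchain.callbacks.tracers import LangChainTracer
from ragas import evaluate
from ragas.metrics import context_precision, answer_relevancy

tracer = LangChainTracer(project_name="ragas-evals")  # placeholder project name

result = evaluate(
    dataset,  # the evaluation Dataset built earlier
    metrics=[context_precision, answer_relevancy],
    callbacks=[tracer],  # each metric call is logged and can be inspected later
)
print(result)
```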
7. Adding to your CI pipeline with Pytest
You can also include Ragas evaluations in your Continuous Integration (CI) pipeline so that the performance of your RAG system is tracked before major updates or releases. Passing in_ci=True to the evaluate() function runs the metrics in a special mode that produces more reproducible results. Pytest can manage this process through tests that verify metrics such as answer relevancy and context precision, and Pytest markers such as ragas_ci can tag these tests so they run only when needed.
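Here is a sketch of such a test; the threshold values and the load_eval_dataset helper are placeholders, and the ragas_ci marker must be registered in your pytest configuration.

```python
# test_rag_ci.py -- sketch of Ragas metrics as a pytest-based CI gate.
# Register the "ragas_ci" marker in pytest.ini or pyproject.toml and run
# only these tests with: pytest -m ragas_ci
import pytest
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision

@pytest.mark.ragas_ci
def test_rag_quality_gate():
    dataset = load_eval_dataset()  # hypothetical helper returning the eval Dataset
    result = evaluate(
        dataset,
        metrics=[answer_relevancy, context_precision],
        in_ci=True,  # run metrics in the more reproducible CI mode
    )
    # Placeholder thresholds; tune them to your application's baseline.
    assert result["answer_relevancy"] >= 0.90
    assert result["context_precision"] >= 0.85
```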
Conclusion
Retrieval-Augmented Generation is a technique that combines retrieval-based techniques with generative models to enhance the retrieval of information and the quality of responses. Ragas is an open-source tool designed for RAG applications that provides robust solutions for evaluating and improving RAG systems.
Synthetic data creation by Ragas reduces manual dataset creation time and guarantees varied levels of complexity. It uses a Metrics-Driven Development approach and offers both evaluation and monitoring capabilities. Evaluation checks the performance of RAG components by applying quantitative metrics to obtain reliable and reproducible results. Monitoring in production helps maintain the quality of RAG applications by identifying issues such as poor retrieval, evasive responses, or incorrect formatting.
Unique to Ragas is the generation of diverse datasets, which enables a far more thorough evaluation than prompting standard LLMs directly. It supports custom prompt creation, language adaptation, and detailed logging and tracing, all of which significantly enhance evaluation. In addition, integrating Ragas into Continuous Integration pipelines with Pytest provides continuous quality assurance and performance tracking. Overall, Ragas stands out for its capability to facilitate continuous improvement in RAG systems.
Streamline RAG System Development with Walturn
Are you leveraging Retrieval-Augmented Generation (RAG) systems for superior NLP performance? Walturn can help optimize your RAG applications with tailored support for tools like Ragas. From synthetic data generation to metrics-driven development and seamless CI integration, our services enhance the reliability and efficiency of your RAG solutions. Let us guide you through the complexities and help you elevate your NLP capabilities.
References
Cobus Greyling. “Combining Ragas (RAG Assessment Tool) with LangSmith.” Medium, 5 Sept. 2023, cobusgreyling.medium.com/combining-ragas-rag-assessment-tool-with-langsmith-e46078001f95.
“Introduction | Ragas.” Docs.ragas.io, docs.ragas.io/en/stable/index.html.
“Metrics | Ragas.” Docs.ragas.io, docs.ragas.io/en/stable/concepts/metrics/index.html.
DhanushKumar. “Evaluation with Ragas.” Medium, 2024, medium.com/@danushidk507/evaluation-with-ragas-873a574b86a9.
“Write Custom Prompts with Ragas | Ragas.” Docs.ragas.io, 2023, docs.ragas.io/en/stable/howtos/applications/custom_prompts.html.