
This lecture introduces the fundamentals of evaluating Large Language Model (LLM) applications. You'll learn why evaluation is critical.
This lecture covers the Retrieval-Augmented Generation (RAG) architecture, why RAG is important, and why evaluating RAG systems matters.
In this lecture, you'll explore the RAGAs (Retrieval-Augmented Generation Assessment) framework, a structured, community-driven approach to evaluating RAG pipelines. You'll learn how RAGAs breaks down RAG systems into measurable components such as retrieval precision and faithfulness.
In this session, you'll dive into one of the key evaluation metrics in the RAGAs framework—Context Precision. You'll learn how this metric measures the proportion of retrieved context that is actually relevant to the question being asked. We'll walk through a hands-on implementation using a sample RAG pipeline and evaluate it using RAGAs.
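To make the idea of "proportion of retrieved context that is relevant" concrete, here is a minimal sketch in plain Python. It is not the RAGAs implementation: the function name `context_precision` and the hand-labeled relevance flags are assumptions for illustration only, and the library's own computation relies on LLM judgments and may differ in detail, for example by taking chunk rank into account.

```python
def context_precision(relevance_flags):
    """Return the fraction of retrieved chunks judged relevant to the question."""
    if not relevance_flags:
        # No retrieved context: treat precision as 0.0 instead of dividing by zero.
        return 0.0
    relevant = sum(1 for flag in relevance_flags if flag)
    return relevant / len(relevance_flags)


# Hypothetical labels for four retrieved chunks (True = relevant to the question).
flags = [True, False, True, True]
print(context_precision(flags))  # 0.75
```

A score near 1.0 suggests the retriever returns mostly useful chunks, while a low score suggests the generator is being handed a lot of noise.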
This hands-on session focuses on two crucial metrics from the RAGAs framework—Context Recall and Faithfulness. These metrics help assess the quality of your RAG application's retrieval and generation stages:
Context Recall measures how well the retrieved context covers all the information required to answer the question.
Faithfulness checks if the generated answer is grounded in the retrieved context or if it hallucinates.
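Both metrics above can be read as "what share of a set of statements is supported by the retrieved context". The sketch below illustrates that reading with hand-labeled data. The helper `supported_fraction` and all example statements are hypothetical; in practice RAGAs uses an LLM to extract and verify statements, so treat this only as an illustration of the arithmetic.

```python
def supported_fraction(statements, supported):
    """Return the share of statements that appear in the supported set."""
    if not statements:
        return 0.0
    hits = sum(1 for statement in statements if statement in supported)
    return hits / len(statements)


# Hypothetical statements the retrieved context is judged to support.
context_supports = {
    "Paris is the capital of France",
    "Paris lies on the Seine",
}

# Context recall: statements from the ground-truth answer covered by the context.
reference_statements = [
    "Paris is the capital of France",
    "Paris lies on the Seine",
]

# Faithfulness: claims in the generated answer grounded in the context.
answer_claims = [
    "Paris is the capital of France",
    "Paris has 10 million residents",  # not supported by the context
]

context_recall = supported_fraction(reference_statements, context_supports)
faithfulness = supported_fraction(answer_claims, context_supports)
print(context_recall)  # 1.0
print(faithfulness)    # 0.5
```

In this example the retriever found everything needed (recall 1.0), but the generator added an unsupported claim, which pulls faithfulness down to 0.5.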
In this hands-on session, you'll learn how to evaluate a complete Retrieval-Augmented Generation (RAG) pipeline using the RAGAs framework.
In this session, you'll explore how a RAG-based application is exposed as an API, and how to evaluate its behavior through automated API testing. You’ll learn how to design inputs, interpret JSON responses, and connect evaluation tools like Pytest or custom scripts to validate retrieval quality and generated answers.
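As a rough sketch of what such an API test might look like, the example below checks the shape and basic grounding of a JSON response. Everything specific here is an assumption: `fake_rag_api` stands in for a real HTTP call, and the field names `answer` and `contexts` are hypothetical, so adapt them to whatever your own endpoint returns.

```python
import json


def fake_rag_api(question):
    """Stand-in for an HTTP request; returns a JSON string like an endpoint might."""
    payload = {
        "question": question,
        "answer": "Refunds are accepted within 30 days.",
        "contexts": ["Our policy allows refunds within 30 days of purchase."],
    }
    return json.dumps(payload)


def test_rag_response_shape():
    """Pytest-style check: the response parses and has a usable answer and contexts."""
    data = json.loads(fake_rag_api("What is the refund policy?"))

    # The generated answer should be a non-empty string.
    assert isinstance(data["answer"], str) and data["answer"].strip()

    # At least one retrieved context chunk should be returned.
    assert isinstance(data["contexts"], list) and len(data["contexts"]) > 0

    # Crude retrieval sanity check: some chunk mentions the topic of the question.
    assert any("refund" in chunk.lower() for chunk in data["contexts"])
```

Saved in a file whose name starts with `test_`, this function would be collected by Pytest automatically. Against a real service, you would replace `fake_rag_api` with a call to your HTTP client of choice.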
DeepEval is an open-source Python framework for evaluating generative AI applications. This module provides an overview of DeepEval and includes a hands-on coding exercise using the framework.
Evaluating Large Language Model (LLM) applications is critical to ensuring reliability, accuracy, and user trust—especially as these systems are integrated into real-world solutions. This hands-on course guides you through the complete evaluation lifecycle of LLM-based applications, with a special focus on Retrieval-Augmented Generation (RAG) and Agentic AI workflows.
You'll begin by understanding the core evaluation process, exploring how to measure quality across different stages of a RAG pipeline. Dive deep into RAGAs—the community-driven evaluation framework—and learn to compute key metrics like context relevancy, faithfulness, and hallucination rate using open-source tools.
Through practical labs, you'll create and automate tests with Pytest, evaluate multi-agent systems, and implement tests using DeepEval. You'll also trace and debug your LLM workflows with LangSmith, gaining visibility into each component of your RAG or Agentic AI system.
By the end of the course, you’ll know how to create custom evaluation datasets and validate LLM outputs against ground truth responses. Whether you're a developer, quality engineer, or AI enthusiast, this course will equip you with the practical tools and techniques needed to build trustworthy, production-ready LLM applications.
No prior experience in evaluation frameworks is required—just basic Python knowledge and a curiosity to explore.
Enroll and learn how to evaluate and test Gen AI applications.