
LLM AI Agent Evaluations and Observability with Galileo AI

Build Robust AI Agents | Monitor Production AI Agents | Build Custom Evals | Master Galileo AI | For Engineers

$10.99 (91% OFF)
Get Course Now

About This Course

Important note: Please click the video for more information. This course is hands-on and practical, designed for developers, AI engineers, founders, and teams building real LLM systems and AI agents. It's also ideal for anyone interested in LLM observability and AI evaluations who wants to apply these skills to future agentic apps. You should have some knowledge of AI agents and how they are built.

Note: this is the complete guide to AI observability and evaluations. We cover both theory and practice, using Galileo AI as the AI agent / LLM monitoring platform. Learners also get access to all resources and the GitHub code / notebooks used in the course.

Why do LLM Observability and Evaluations Matter?

LLMs are powerful, but they are unpredictable. They hallucinate, they fail silently, and they behave differently across prompts and versions. There is a big difference between building an AI agent / LLM system and actually "productionizing" it. What if the LLM starts producing offensive content? What if tools embedded within agents fail silently? How do you measure model quality degradation?

Traditional monitoring and development methods don't work. You need to run experiments, build custom evaluations, and set up alerts that assess subjective measures. Dashboards built to track classification accuracy are not designed for open-ended text generation. Log pipelines created for predictable APIs cannot capture reasoning steps, tool usage, or why an agent failed.

As a result, most teams fall back on manual spot checks, gut feel, and endless prompt tweaking. That approach might work in the beginning, but it does not scale.

What we need instead is a systematic way to measure, monitor, evaluate, and continuously improve LLM and agent systems. That is where observability and structured evaluation come in.

What is this course?

This course will make you more confident when you build and deploy AI agents or other LLM-based systems. It teaches you the tools and techniques needed to build robust AI agents with structured, personalized evaluations and experiments, and to monitor your agents in production with observability and logging. We start with the basics: the theory around what makes AI agents / LLM systems particularly difficult to build and track. Then we get practical, building our own evaluations and instrumenting our own apps with Galileo AI.

What is Galileo AI?

Galileo is a platform designed specifically for evaluating and monitoring LLM and agent systems, and it includes the following features:

  • Observability: Log LLM interactions, track spans and metadata, visualize agent flows, monitor safety and compliance signals
  • Evaluations: Design experiments, create evaluation datasets, define and register metrics, use LLMs-as-judges, version and compare results

In short, it gives you a structured way to understand how your AI systems behave and helps you build them. In this course, we do a masterclass in Galileo AI and how to use it to monitor and evaluate your AI app. (A minimal sketch of the kind of trace record such a platform captures follows the course overview below.)

Course Overview:

  • Introduction - We start by explaining why LLM evaluations and observability matter, covering the risks of deploying generative AI without structured monitoring, setting expectations, and reviewing the course roadmap.
  • Theory: LLM / Agent Observability - This section introduces traditional monitoring concepts, explains why they fall short for generative systems, and outlines the key components of LLM observability.
  • Theory: LLM / Agent Evaluations - You'll explore evaluation theory, understand why evaluations are critical for production AI, learn the main evaluation approaches, and see the common challenges teams face with LLMs.
  • Theory: Observability and Evaluations for LLMs vs Traditional ML - We contrast generative AI with classical machine learning, highlighting the unique risks, costs, and iteration loops.
  • Theory: Tools and Approaches for LLM Observability and Evaluations - This section surveys the landscape of observability and evaluation tools available for LLM systems and explains why dedicated platforms are necessary.
  • Practice: Galileo Platform Deep-Dive Overview and Setup - This section walks you through Galileo's architecture, integrations, pricing, account creation, repository cloning, and local development setup to prepare you for instrumentation.
  • Practice: Logging LLM Interactions with Galileo - You'll learn practical logging with Galileo, including terminology, manual and SDK-based methods, simulating LLM applications, inspecting agent graphs, detecting errors, and setting up alerts and signals.
  • Practice: Evaluating LLM Performance with Galileo - We shift from observation to evaluation, showing how to design experiments, manage datasets and metadata, implement evaluation code, define metrics, and perform agent-specific and LLM-as-judge assessments.
  • Conclusion: Earn your certificate
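To ground the observability side before the hands-on sections, here is a minimal, self-contained sketch of the kind of trace/span record a platform like Galileo captures for a single agent step. All class, field, and model names below are illustrative assumptions, not the Galileo SDK; the course notebooks show the real instrumentation.

```python
# Illustrative only: a toy span record showing the kind of data an
# observability platform captures per agent step. Not the Galileo SDK.
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class Span:
    name: str                                     # e.g. "llm_call", "tool:web_search"
    trace_id: str                                 # groups all spans from one request
    input_text: str
    output_text: str = ""
    metadata: dict = field(default_factory=dict)  # model, app version, user id, ...
    started_at: float = field(default_factory=time.time)
    duration_ms: float = 0.0
    error: str | None = None


def run_llm_step(trace_id: str, prompt: str) -> Span:
    """Wrap one LLM call so its inputs, outputs, and failures are recorded."""
    span = Span(name="llm_call", trace_id=trace_id, input_text=prompt,
                metadata={"model": "gpt-4o-mini", "app_version": "v3"})
    try:
        # A real model call would go here; this placeholder stands in for it.
        span.output_text = "(completion text)"
    except Exception as exc:          # errors become part of the trace, not lost logs
        span.error = str(exc)
    finally:
        span.duration_ms = (time.time() - span.started_at) * 1000
    return span


trace_id = str(uuid.uuid4())
print(asdict(run_llm_step(trace_id, "Summarize today's support tickets")))
```

The point is that every call carries its prompt, output, metadata, timing, and any error in one structured record that dashboards, alerts, and evaluations can later query.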

What you'll learn:

  • Design an LLM observability plan: what to log, how to structure traces, and how to make failures diagnosable
  • Build evaluation datasets with realistic inputs, expected behavior, metadata, and slices for edge cases and regressions
  • Run repeatable Galileo AI experiments to compare models, prompts, and agent versions on consistent test sets
  • Implement custom eval metrics for generation quality, groundedness, safety, and tool correctness (beyond accuracy); a minimal sketch follows this list
  • Apply LLM-as-judge scoring with rubrics, constraints, and spot checks to reduce evaluator bias and drift
  • Debug agent failures using traces to pinpoint breakdowns in retrieval, planning, tool use, or response synthesis
  • Set up production monitoring in Galileo with signals, dashboards, and alerts for regressions and silent failures
  • Use eval results to prioritize fixes, validate improvements, and prevent quality or safety regressions over time
  • Choose observability and eval methods for single-call LLM apps vs. multi-step agents, and explain tradeoffs
  • Instrument LLM apps and agents in Galileo to capture traces, spans, prompts, tool calls, and metadata for debugging
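To give a taste of the evaluation skills listed above, here is a hedged sketch of one programmatic metric and one LLM-as-judge metric with a rubric. The function names, rubric wording, dataset fields, and model name are assumptions for illustration only; they are not taken from the course materials or from Galileo's API. The judge uses the openai Python client, which you would swap for your own provider.

```python
# Illustrative sketch: one programmatic metric and one LLM-as-judge metric.
# Function names, the rubric, and the model name are assumptions, not course code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def tool_call_correct(expected_tool: str, trace_tools: list[str]) -> float:
    """Programmatic metric: did the agent call the tool this test case expects?"""
    return 1.0 if expected_tool in trace_tools else 0.0


RUBRIC = (
    "Score the ANSWER for groundedness in the CONTEXT on a 1-5 scale.\n"
    "5 = every claim is supported by the context; 1 = mostly unsupported.\n"
    "Reply with the number only."
)


def judge_groundedness(context: str, answer: str) -> int:
    """LLM-as-judge metric: a second model scores the answer against a rubric."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"},
        ],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())


# A tiny evaluation dataset row: input, expected behavior, metadata for slicing.
case = {
    "input": "What is our refund window?",
    "expected_tool": "policy_lookup",
    "metadata": {"slice": "billing", "difficulty": "easy"},
}
```

In Galileo, the equivalent workflow is registering metrics like these and running them as an experiment over a versioned dataset, which is what the "Evaluating LLM Performance with Galileo" section walks through.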