
Welcome to the Become an AI PM Course!
This is a 5 week long course by Latitude.
Content of the course
Track 1 - AI PM Mindset: Why AI products behave differently, why reliability matters, and how AI PMs think in systems.
Track 2 - Designing AI Systems: How to design prompts, agents, tools, and structured outputs that behave predictably.
Track 3 - Experimentation & Feedback: How to test your AI, read traces, and design proper experiments.
Track 4 - Diagnosing & Automating Quality: How to annotate outputs, discover failure patterns, and build evals.
Track 5 - Iteration & Improvement: How to fix prompts using data, compare versions, and optimize quality/latency.
Track 6 - Becoming an AI PM: How to communicate reliability, work with Eng/Data teams, and scale the workflow across your org.
Join our private AI PM Slack community to ask questions, share progress, and get support during the course (link below)
Resources for the course
You have lifetime access to all lessons, come back anytime.
You can join our private AI PM Slack community to ask questions, share progress, and get support.
Each video includes written notes summarizing the key ideas.
Some lessons include downloadable files: glossaries, frameworks, and templates.
You’ll also find links to extra resources and Latitude documentation when relevant.
Let's do this!
AI PM Mindset: Understanding Probabilistic Systems
The Core Difference
Traditional Software is Deterministic: Same input, same output, following fixed rules (Predictable, Stable).
AI Systems are Probabilistic: Output is based on probabilities, leading to variability. The same prompt can produce different, valid answers.
Variability is by Design: This is the source of LLMs' creativity and flexibility, but it makes them hard to manage.
Product Management Changes
Shift from Features to Systems: AI PMs manage the entire system: data, prompts, models, and feedback loops.
New Success Metric: Success is reliability: achieving consistent, expected behavior across real-world users, not perfection.
Join our private AI PM Slack community to ask questions, share progress, and get support during the course.
AI PM Mindset: The AI Reliability Loop
The Framework for Reliable AI
AI is non-deterministic; the output is never exactly the same.
To manage this, reliable AI products must work within a continuous feedback mechanism: The AI Reliability Loop.
This loop is a refined process for managing the non-deterministic nature of AI.
The AI Reliability Loop
See How Your AI Behaves (Observe):
Observe real outputs by running the prompt and analyzing logs using real user inputs.
Annotate Responses:
A human reviews a sample of logs one-by-one, providing judgment (e.g., thumbs up/down) and qualitative feedback.
Crucial Point: Annotation requires a Domain Expert (usually the AI PM) who understands what a good output looks like, not necessarily the most technical person. This human judgment is never automated.
Discover Failure Patterns (Identify Issues):
Review annotated logs to spot recurring issues, such as tone being off, intent misunderstanding, or hallucinations.
These are called failure patterns or issues and provide the first indication of problems to solve.
Build Evals:
Turn the discovered failure patterns into automated tests (Evals).
Evals allow you to measure the real dimension of an issue across a large volume of logs (e.g., 10k logs), moving beyond the small annotated sample.
Evals become the KPIs for reliability, providing clear data to guide iteration (ending "vibe prompting").
Iterate:
Improve one thing at a time (e.g., prompt change), then re-run the Evals to measure actual improvement.
The Loop Continues: Because AI is probabilistic, improvements can introduce new failures. This process must run regularly (e.g., allocate 30 minutes weekly for log annotation) to maintain and improve the product.
Join our private AI PM Slack community to ask questions, share progress, and get support during the course.
Key Definitions for working with AI
LLM (Large Language Model)
A type of artificial intelligence model trained on vast amounts of text data (e.g., the entirety of the internet) that functions by predicting the most statistically probable next word to generate a coherent and contextually relevant response. These models form the core technology behind modern generative AI applications.
Examples: GPT, Claude, Grok, Llama.
Provider
A configured connection point that grants access to a specific AI company or entity's LLM models and APIs. It serves as the gateway to the underlying AI service.
Example: Using the OpenAI provider to access the GPT-5 model.
Prompt
The input text or instructions provided by a user (or system) to the LLM that dictates the desired task or output.
Prompt Engineering
The systematic practice of designing, testing, and refining prompts to reliably elicit a specific, desired, and consistent output behavior from a probabilistic LLM.
Reliability
The quantitative measure of how frequently an AI feature delivers an output that functions exactly as intended across a wide range of real-world user inputs and scenarios.
Example: A summarization feature that accurately produces a 3-bullet summary in 90 out of 100 uses is considered 90% reliable for that specific task.
Hallucination
An instance where the model generates information or facts that are confidently presented as true but are either factually incorrect, nonsensical, or unverifiable against its training data or the provided context.
Example: The model asserts, "The Eiffel Tower was built in 1992 by Apple."
Determinism vs. Probabilism
Deterministic: A system where the same input will always produce the same, predictable output (characteristic of traditional software).
Probabilistic: A system where the same input can produce slightly different, varied outputs based on probability distributions (characteristic of most LLMs). This behavior is controlled by the Temperature setting.
Temperature
A numeric hyperparameter, typically ranging from 0.0 to 2.0, that governs the randomness and creativity of the model's output.
Low Temperature (closer to 0.0): Results in more conservative, focused, and deterministic outputs, favoring the most statistically probable tokens.
High Temperature (closer to 1.0 or 2.0): Results in more varied, creative, and less predictable outputs.
Tokens
The small, foundational chunks of text or data—roughly equivalent to 3–4 characters or sub-words—that LLMs use to process, measure, and generate text. Cost, speed, and input limits are measured in tokens.
Example: The phrase "This is a tokeniser!" is tokenized into multiple chunks for processing.
Context Window
The maximum total quantity of tokens (both input prompt and generated output) that a model can process and retain in its short-term memory during a single interaction.
Relevance: Defines the complexity and volume of information the model can analyze at one time.
LLM Agent
A complex architectural layer built on top of an LLM that enables the model to perform iterative tasks, engage in multi-step reasoning (like Chain-of-Thought), or utilize Tools by calling itself in loops.
Tools
External functions, APIs, or capabilities that an LLM agent is deliberately given access to. These allow the agent to perform actions outside of its base text generation ability.
Example: An agent uses a Tool to search the internet for current weather data before responding to a user query.
Bias
A systematic skew or deviation in the model's outputs, often resulting in unfair or prejudiced responses, which originates from imbalanced, non-representative, or problematic data within its training set or design.
Latency
The duration of time, measured from the moment a prompt is sent to the moment the model begins generating the first word of the response (Time to First Token). Latency is critical for real-time user experience.
RAG (Retrieval-Augmented Generation)
A specialized process where the LLM is first instructed to retrieve specific, external, and up-to-date information from a private database, documents, or knowledge base, and then uses that retrieved data to formulate its final, accurate answer.
Purpose: Mitigates hallucination and ensures answers are grounded in current or proprietary knowledge.
Join our private AI PM Slack community to ask questions, share progress, and get support during the course.
Register in latitude.so
You can use Latitude for free with the open source version.
For the course, you can have 2 months for free on Latitude Cloud using the promo code AI-PM-COURSE
Latitude Workspace Setup
Go to Latitude: Navigate to the platform: latitude.so
Sign Up: Click "Get started" and sign up using Google or email.
Access Workspace: You will land on your Workspace (your "AI lab"), where you will manage all your prompts, projects, and experiments.
Create Workspace: Name your workspace (e.g., "AI PM Course").
Create Project: Click "New Project" to create the container where you will add your first prompt and begin the process.
Join our private AI PM Slack community to ask questions, share progress, and get support during the course.
Designing AI Systems: Prompts vs. Agents
The Core Building Blocks
Understanding when to use a simple prompt versus a complex agent is the first step in designing reliable AI systems.
Prompts (Single Instruction $\to$ Single Output)
Prompts are the simplest building block of AI systems.
They involve one instruction, one LLM call, and one output.
Prompts have no memory and no planning capability.
Agents (Multi-Step, Tool-Using Systems)
Agents are multi-step systems, not just a longer prompt.
They are capable of Planning, Deciding Next Actions, and Looping.
Agents follow the loop: plan -> act -> observe -> repeat.
They can call Tools and observe the results until a goal is achieved.
Best Use Cases
Prompts Are Best For:
Summarization
Classification
Rewriting / Transforming text
Extraction
Translation
Agents Are Best For:
Complex workflows
Research tasks
Chatbots that need memory
Tasks requiring external tools/APIs
Multi-step reasoning problems
Reliability and Performance Trade-Offs
Prompts Offer:
Simplicity and speed
Lower cost and greater interpretability
Higher reliability
Agents Offer:
More power and capability
Higher cost
Slower execution and less predictability
Harder debugging
Start with a prompt. Only upgrade to an agent when the task requires external tools, memory, or complex, dependent, multi-step reasoning that a single prompt cannot handle.
This is a hands-on video; you will find all the prompts used in the video at the end of these notes, allowing you to follow along with the instructor step-by-step.
Designing AI Systems: The Art of Prompt Engineering
What Makes a Prompt 'Good'
The prompt is the core instruction given to the LLM. It is the new "specification" for AI systems, meaning every word matters. A good prompt reduces ambiguity and guides the model's behavior.
Clear Intent: The model must know precisely what is being asked. Treat the AI like a role-playing game where you specify every detail of its "personality" and task.
Relevant Context: Provide necessary background information or relevant examples when consistency is crucial. AI is a great rule-follower, so the more rules you give, the better.
Proper Constraints: Define the required format, tone, or specific rules. Use standard formatting methods like Markdown or XML tags to clearly separate parameters. Consistency is key.
Prompt Structure & Variables
Effective prompts follow a clear pattern that enables testing and evaluation.
Pattern: Instruction (What to do) $\to$ Input (The content) $\to$ Output Format
Variables: Use placeholders (e.g., {{email_text}}, {{user_message}}) to swap inputs dynamically, allowing the prompt to be used with different data points and enabling systematic testing.
Roles: Prompts often involve defining System (the instructions/rules), User (the input), and Assistant (the desired output).
Zero-Shot vs. Few-Shot
Zero-Shot Prompting
Description: The model figures out the task based on the instruction alone.
When to Use: For flexibility or when the task is simple and clear.
Few-Shot Prompting
Description: You provide one or more input/output examples so the model can learn the desired pattern.
When to Use: When consistency matters most, as examples drastically improve adherence to a specific format or style.
Prompt design is your first layer of quality control. You cannot make your AI reliable without first making your instructions reliable.
Prompts for Use
Here are the prompts used in the video:
Simple Good Prompt
---
Summarize the following email in one clear sentence.
Email:
{{email_text}}
---
Small Change -> Big Impact Pair
The subtle change in constraints and role-setting leads to dramatically different focus and results.
A) Broad Summary
---
Summarize this email.
Email:
{{email_text}}
---
B) Targeted Summary (Constraint)
---
Summarize this email in one sentence for a Customer Success Manager, focusing on the user’s intent and required actions.
Email:
{{email_text}}
---
Zero and Few Shot prompts
Input for testing the variables
"Project deadlines are slipping, the team is struggling with communication, and our budget review is coming up soon."
Zero-Shot Prompt
This relies entirely on the instruction to guide the model's output format.
---
Extract the three main topics from the following message.
Message:
{{user_message}}
---
Few-Shot Prompt (with Examples)
This prompt uses examples (Example 1, Example 2) to demonstrate the exact desired structure and style, ensuring high consistency.
---
You are an assistant that extracts key topics from user messages.
Follow the format: ["topic1", "topic2", "topic3"]
Examples:
User: "I can't log in and my password reset isn't working."
Topics: ["login issue", "password reset", "account access"]
User: "The app keeps crashing on startup after the new update."
Topics: ["app crash", "startup failure", "post-update bug"]
Message:
{{user_message}}
---
Designing AI Systems: Defining Roles for Quality
The Role-Based Prompt Structure
When designing prompts, it is crucial to separate the instruction based on who says what within the interaction. This separation is handled by three distinct roles: System, User, and Assistant.
1. The System Role
The System role defines the model's identity, rules, and fundamental behavior. It acts as the model's operating manual.
Purpose: Sets tone, behavior, and mandatory constraints.
Examples:
"You are a customer support agent." (Identity)
"Always respond in JSON." (Constraint)
"Never guess; ask for missing information." (Rule)
2. The User Role
The User role is the actual request or input that the model must act on. It represents the problem you want solved.
Purpose: Contains the specific user input or task content.
Examples:
"Write a reply to this message." (Action)
"Summarize this text." (Action)
"Extract key topics from this review." (Input Content)
3. The Assistant Role
While not explicitly detailed, the Assistant role is where the final expected output is often structured or where few-shot examples (the model's response part of the example) are placed.
Why Roles Matter for Reliability
Clear role separation is essential for moving from unpredictable to reliable AI behavior.
Reduces Hallucinations: Clear separation gives the model less room to improvise or make up content.
Increases Predictability: Behaviors become more consistent across multiple runs and different inputs.
Simplifies Debugging: If an output is flawed, you immediately know which role (System, User, or Assistant) to adjust for correction.
Common Mistakes to Avoid
Mixing constraints (rules/format) into the User message.
Putting identity instructions (who the model is) outside of the System role.
Overloading one role while leaving others empty.
Remember, clean prompts lead to clean behavior.
Example Structure
-----
<System>
You are a customer support assistant who writes short, friendly replies.
Always follow company policy and keep responses under 100 words.
</System>
<User>
Write a reply to this customer message:
{{customer_message}}
</User>
<Assistant>
Respond in this format:
{
"reply": "",
"next_action": ""
}
</Assistant>
----
Prompts for Use
Here are the prompts used in the video:
Role-Based Prompt Structure (Customer Support Agent)
-----
<system>
You are a customer support agent.
Always respond in JSON.
Never guess; ask for missing information.
</system>
<user>
Write a reply to this message: {{message}}
</user>
----
Assistant tag
----
<assistant>
```json
{
"tone": "Positive",
"response": "Thank you for your positive feedback! We're thrilled to hear that you love the product. If you have any questions or need assistance, feel free to reach out!"
}
</assistant>
-----
Final prompt
----
<system>
You are a customer support agent.
Always respond in JSON.
Never guess; ask for missing information.
</system>
<user>
Write a reply to this message: I love this product!
</user>
<assistant>
```json
{
"tone": "Positive",
"response": "Thank you for your positive feedback! We're thrilled to hear that you love the product. If you have any questions or need assistance, feel free to reach out!"
}
</assistant>
<user>
Write a reply to this message: {{message}}
</user>
-----
This is a hands-on video; you will find all the prompts used in the video at the end of these notes, allowing you to follow along with the instructor step-by-step.
Designing AI Systems: Structured Outputs for Reliability
What is Structured Output?
Structured output means the model returns data in a predictable, standardized format, most commonly JSON (JavaScript Object Notation).
JSON is the industry standard format for applications to exchange data efficiently.
Structured output was a major advancement that allowed AI to work with computers more predictably.
Contrast: Instead of generating free-form text like "I am good!", the model generates a structured response like {"message": "I am good!"}.
Why Structured Output Matters
Consistency is Key: Unstructured text is suitable for natural conversation (chatbots) but causes issues in complex systems that need consistent information exchange.
Deterministic Goal: Structured output is an attempt to make a non-deterministic system (the LLM) behave deterministically for specific, data-focused tasks.
Enables Automation: If you need to evaluate, log, trigger workflows, or integrate with downstream tools, you absolutely require consistency.
How to Instruct Structured Output
There are two primary ways to achieve structured output:
Normal Prompting: Explicitly instruct the model to use a specific format and provide the template in the prompt itself.
Enforcing Structured Output: Use the model's native feature (if available) to strictly enforce an output Schema (a blueprint defining required fields and data types).
Example Instruction (Schema Snippet):
---
schema:
type: object
properties:
stock_name:
type: string
current_price:
type: number
market_value:
type: number
required:
- stock_name
- current_price
- market_value
---
Benefits for Reliability
Structured outputs are non-negotiable for reliable AI features because they provide clear benefits:
Easier Evals: You can compare outputs field by field instead of interpreting entire paragraphs, making automated testing (Evals) more effective.
Fewer Hallucinations: The constraints imposed by the structure reduce the model's opportunity to ramble or generate irrelevant text.
Better Integrations: Downstream systems can reliably route actions based on the consistent, structured fields.
Easier Debugging: If the system breaks, you know exactly where to look for missing or incorrect fields.
Common Pitfalls
When working with structured outputs, be aware of common errors that can break your systems:
Missing required fields.
Wrong data types (e.g., model returns a string when a number is expected).
Model adding extra, non-JSON text around the structured output.
Inconsistent key naming (e.g., sometimes user_id and sometimes userId).
Prompts for Use
Here are the prompts used in the video:
Base prompt:
(remember to change the temperature to 0.3!)
---
<system>
You are a helpful assistant that provides the capital of a country for a given
country name and interesting information about that capital.
</system>
<user>
Country Name: {{country_name}}
</user>
---
Schema
(add it on the formatter)
---
schema:
type: object
properties:
capital:
type: string
interesting_facts:
type: string
required:
- capital
---
Designing AI Systems: Common Pitfalls and Design Risks
Design Pitfalls: The Root of Reliability Issues
The most common mistakes made during the design phase are the root cause of most reliability problems later discovered in evaluations (Evals). Good design prevents these predictable problems.
Vague or Incomplete Instructions: If the intent, context, or constraints are not explicit, the model is forced to guess. Guessing is the enemy of reliability.
Overusing Few-Shot Examples: Too many examples lead to overfitting. The model mimics the provided patterns instead of generalizing, causing it to fail on real-world inputs.
Unclear Role Definition: If the model doesn't know its identity (e.g., support agent, analyst), it behaves inconsistently. Role clarity equals behavior clarity.
Major Failure Modes
Hallucinations & Made-Up Details: Hallucinations are a consequence of a prompt that leaves too much room for invention. Models fill gaps with invented facts unless anchored by constraints, structured output, or tools.
Tool Misuse: Tools can become another failure mode if the logic is unclear. This includes incorrect parameters, unnecessary tool calls, or ignoring the tools entirely.
Lack of Guardrails: Guardrails are simple constraints (e.g., tone rules, format rules, safety constraints) that prevent bad outputs. Without them, you will experience drift (tone drift, style drift, content drift).
Every single one of these design issues will eventually surface in the later steps of the Reliability Loop:
They show up in your annotations (human judgment).
They are categorized during issue discovery.
They are measured during evals.
This is a hands-on video; you will find all the prompts used in the video at the end of these notes, allowing you to follow along with the instructor step-by-step.
Designing AI Systems: Tool Use and Function Calling
Definitions and Mechanism
Tool Use means the LLM can call an external API, database, or function instead of answering only from its internal parameters or training data.
Plain Analogy: Instead of guessing, the model checks a calculator, calendar, or CRM.
Function Calling is the structured implementation of tool use. The model outputs structured JSON that specifies the exact action (function) it wants to take and the parameters required to execute it.
Simple Example: {"action": "get_weather", "city": "Barcelona"}
Tools vs. Triggers:
Tools are the external functions or actions themselves (e.g., calling a billing API, running a refund workflow).
Triggers are the conditions or rules that launch the agent or the tool (e.g., "If it's a billing question -> call billing API").
Why Tools are Required in Production
Without tools, the LLM is limited to its training data and current context. With tools, the LLM can become an active part of your product:
Retrieve Up-to-Date Information: Tools allow access to current data like weather, stock prices, or real-time user data.
Perform Write Actions: Tools enable the LLM to create, update, or delete records in databases, send emails, trigger payments, or schedule meetings.
Execute Multi-Step Flows: Tools facilitate combining reads and writes (e.g., check inventory -> reserve item -> write order -> send confirmation).
Result: Hallucinations drop dramatically, data stays fresh, and reliability jumps because the model stops guessing.
Trade-Offs: Power vs. Complexity
Tool use significantly improves accuracy but introduces new complexity and failure points.
Pros (Accuracy):
Fewer hallucinations due to real facts and data.
Consistent logic (API always returns the same result).
Predictable user experience.
Cons (Complexity/Testing):
More moving pieces lead to more failure points.
Testing must now cover tool availability, latency, error handling, and bad parameters sent by the model.
Observability
When tools fail, you must be able to see what happened: what tool was called, with what parameters, and how it responded. This is addressed later in the course using observability features (e.g., Latitude's Observability) to inspect tool calls and traces.
Prompts for Use
Here are the prompts used in the video:
Client tool
(remember to add it on the formatter, and also set type:agent)
A Client Tool is a capability or external function (like an API or database query) defined within the user's application code that the LLM is instructed to call to perform actions or retrieve live data.
---
tools:
- log_priority_decision:
description: logs the feature, priority, reasoning and impact score
parameters:
type: object
properties:
feature_request:
type: string
description: name of the requested feature
priority:
type: integer
description: the assigned priority level
reasoning:
type: string
description: a brief explanation of your decision
required:
- feature_request
- priority
- reasoning
---
Final Prompt
-----
---
provider: Latitude
model: gpt-4o-mini
temperature: 1
type: agent
tools:
- log_priority_decision:
description: logs the feature, priority, reasoning and impact score
parameters:
type: object
properties:
feature_request:
type: string
description: name of the requested feature
priority:
type: integer
description: the assigned priority level
reasoning:
type: string
description: a brief explanation of your decision
required:
- feature_request
- priority
- reasoning
- Notion/*
---
<system>
You are an agent that logs priority decisions for feature requests. Log the feature requested, the priority of that feature, and your reasoning for doing so.
After logging this with the log priority decision tool, post a summary of what you did to Notion with the Update Page tool.
<tools>
To carry out your tasks you have access to the following tools:
- log_priority_decision
</tools>
</system>
<user>
Feature Request: {{feature_request}}
</user>
-----
This is a hands-on video; you will find all the prompts used in the video at the end of these notes, allowing you to follow along with the instructor step-by-step.
Experimentation & Feedback: Manual Exploration
From this point forward, the course shifts from theory to application. To develop and improve a real AI feature, you must have one concrete prompt to iterate on using the Reliability Loop. You can use your own feature or follow along using the WristBand starter prompt provided below.
The WristBand Feature
The example feature used throughout this track is a customer support assistant for a festival-management app called WristBand.
The Workflow: It reads incoming messages, classifies the request, retrieves company policies, and generates a response.
Mechanism: It uses a tool (latitude/extract) to read a Notion document containing real support policies, making the example realistic by combining classification, policy lookup, and final response generation.
Why Manual Exploration Matters
Before building structured experiments, you must first observe the raw behavior of the AI through manual testing in the playground.
Understanding Variability: Running the same input multiple times helps you spot inconsistent categories or shifts in interpretation.
Spotting Edge Cases: Manual testing helps identify where the model might suffer from tone drift, misclassifications, or missing context.
Identifying Issues: The more familiar you are with the outputs, the easier it becomes to identify specific patterns of unreliability.
Practical Exploration Steps
Run the Starter Prompt: Execute the prompt multiple times with the same input email to see how the response varies.
Test with Different Inputs: Use various festival-related scenarios (e.g., technical issues vs. billing questions).
Adjust Parameters: Experiment with the temperature setting or provide ambiguous inputs to see how the model's confidence and accuracy change.
Take Notes: Document what remains consistent and what appears unreliable to prepare for structured experiments.
Prompts for Use
Here are the prompts used in the video:
WristBand Starter Prompt
This prompt is designed for a customer support assistant that classifies requests and generates a response based on festival policies.
-----
---
provider: OpenAI
model: gpt-4o-mini
temperature: 1
type: agent
tools:
- latitude/extract
---
<system>
You are a customer support assistant for a festival management app, WristBand.
Your job is to read customer support messages, classify the request, and respond.
Read the following customer email and do the following:
1) Classify the request into one of these categories:
- Ticketing
- Billing Issue
- Refund Request
- Technical Issue (App / QR / Wristband)
- Access Issue (Entry / Pass / Age Verification)
- On-Site Experience Feedback
- General Question
- Other
2) use ‘latitude/extract’ tool to consult policies on our website: “https://latitude.so/festival-policy”
3) Respond to the user
Output in JSON
</system>
<user>
{{username}}
{{customer_query}}
</user>
-----
Testing Inputs (Variables)
Use these examples for the {{customer_email}} variable during your manual exploration:
Technical Issue: "Hey! My wristband stopped scanning after I spilled a tiny amount of margarita on it."
Billing/Ticketing: "Hi, I upgraded my ticket yesterday but the app still shows the basic pass."
Access Issue: "The app is not letting me sign in."
Ambiguous/Refund: "Is there a way that I can change one ticket for another? If not I'll just take a refund."
This is a hands-on video; you will find all the prompts used in the video at the end of these notes, allowing you to follow along with the instructor step-by-step.
Experimentation & Feedback: Observability
You just ran the prompt a few times and likely saw how unpredictable the outputs can be. To improve your AI, you need to look inside the "black box." This is where Observability comes in: features that allow you to see the exact steps carried out during a run, such as calls to LLM providers, tool calls, and subagent calls.
The 3 Core Concepts
Run (Sessions/Conversations): A run is one complete execution of your prompt or agent from start to finish. It records the entire conversation history between the user and the assistant. It can include many user and assistant messages.
Trace: A trace captures the specific trajectory the assistant took to generate the final output in a single interaction. A Run can have several Traces if there are multiple follow-up messages. It includes 1 user message and 1 assistant message
Span: The minimum unit of information within a Trace. A span captures an intermediate thought, a specific tool call, or a generated asset. A Trace is composed of multiple Spans.
Why Observability Matters for AI PMs
You don’t need to understand the technical internals of the model, but you do need to pinpoint exactly what is happening. Observability is your primary tool for:
Debugging: Identifying exactly where a logic chain broke or where a tool received the wrong parameters.
Efficiency: Finding parts of the process that are redundant or could be streamlined.
Root Cause Analysis: If a response is poor, you can "untangle" the observability data to find out if the issue was the initial prompt, a specific tool's data, or the model's reasoning.
Prompts for Use
Here are the prompts used in the video:
Observability Test Prompt (WristBand)
This version includes the tool call structure so you can observe Spans in the trace when the model looks up policies.
-----
---
provider: OpenAI
model: gpt-4o-mini
temperature: 1
type: agent
tools:
- latitude/extract
---
<system>
You are a customer support assistant for a festival management app, WristBand.
Your job is to read customer support messages, classify the request, and respond.
Read the following customer email and do the following:
1) Classify the request into one of these categories:
- Ticketing
- Billing Issue
- Refund Request
- Technical Issue (App / QR / Wristband)
- Access Issue (Entry / Pass / Age Verification)
- On-Site Experience Feedback
- General Question
- Other
2) use 'latitude/extract' tool to consult policies on our website: "https://latitude.so/festival-policy"
3) Respond to the user
Output in JSON
</system>
<user>
{{username}}
{{customer_query}}
</user>
-----
Experimentation & Feedback: Running Experiments
In AI PM work, an experiment is simply running your prompt many times, in a controlled way, to answer a specific question or test a hypothesis.
Experiments are used to answer critical product questions such as:
A/B Testing: Is Version A or Version B more reliable?
Generalization: How does the prompt behave across different inputs?
Impact Analysis: Does a tone change introduce new failures?
Release Readiness: Is this model consistent enough to ship?
Why We Run Experiments Right Now
At this stage of the loop, we are not testing variations yet. The primary goal is to generate traces (logs). These traces are the essential raw material needed for the next step: annotation.
Real-world teams typically run experiments on two types of data:
Production Logs: Real user behavior captured from a live product.
Generated Datasets: Simulated user inputs used when you are early in development and don't have production data yet.
In this course, we will create a small dataset of simulated user inputs, run our prompt against them, and capture the outputs. These captured outputs become the logs we will annotate to kick off the Reliability Loop.
The Foundational Role of Traces
Everything in the Reliability Loop is interconnected:
No Traces → No Annotation.
No Annotation → No Issue Discovery.
No Issue Discovery → No Evals.
No Evals → No Reliable Iteration.
While you will eventually run advanced experiments (like A/B prompt tests, model comparisons, and regression tests) everything starts with this simple step of observing behavior.
Note: If you already have production logs, or if you integrate Latitude directly into your product, these logs will appear automatically, and you can skip manual generation.
Diagnosing & Automating Quality: Annotating Your AI Outputs
What is Annotation?
Annotation is the process of assessing the quality of AI outputs by reviewing logs and explicitly labeling what works and what doesn't. This is how you turn subjective judgment into structured, usable data.
Why Humans are Required:
Contextual Nuance: Models are good at generating text but cannot reliably judge tone or nuance in the way a real product requires.
Explicit Expectations: Models don't know your business context or users; they don't know what "acceptable" means for your specific use case until you teach them.
The Expert View: Only a human (ideally an AI PM or domain expert) can decide if an output is "good enough to ship" or explain why it fails.
When "Correct" is Actually a Failure
A response can follow a prompt perfectly on paper but still fail in practice:
The "Helpful" Failure: The model cites the right policy but replies with boilerplate that doesn't actually help the user.
The Routing Failure: The information is accurate, but the model routes the issue to the wrong category, leading to incomplete guidance.
The Tone Failure: The policy is followed word-for-word, but the tone feels dismissive to a user in a high-stress situation.
Annotation as the Foundation of Evals
Every label, comment, or "thumbs up" you provide is an explicit signal about what quality means in your system.
Data over Automation: You cannot automate quality before you define it.
Regular Cadence: Annotation is not a one-time task. You should ideally annotate a small sample of traces weekly to stay grounded in how the system behaves as new failure modes appear.
Hands-on: The Annotation Workflow
To unlock Issue Discovery, you must first annotate at least 15 runs.
Navigate to Annotations: Use the sidebar to enter the annotation view.
Select Your Prompt: Ensure your "Main Prompt" is set to the correct feature (e.g., Festival Chatbot) so the relevant traces appear.
Filter Traces: You can choose between Production Traces (live user data) or Playground Traces (from your experiments).
The Feedback Loop:
Thumbs Up: Everything is exactly as expected.
Thumbs Down: This triggers "Issue Detection," allowing you to explain the failure.
Writing "Good" vs. "Bad" Annotations
The goal is to be as specific as possible so the system can cluster these errors into broader issues.
Examples
Bad Annotation: "JSON wrong"
Good Annotation: "The response field should be a string, not a nested JSON object."
Bad Annotation: "Tool failed"
Good Annotation: "The tool call failed because the model hallucinated a URL not provided in the prompt."
Prompts for Use
Here are the examples of failures and successful outputs observed during the video annotation session:
Example 1: JSON Structure Issue (Failure)
Issue: The model added extra fields (greeting, body, signature) that weren't requested.
Annotation: "The response field of the JSON output should only contain the response as a string. It should not be another JSON object."
Example 2: Hallucination Issue (Failure)
Issue: The model made up a URL instead of using the one provided.
Annotation: "The tool call failed because the model hallucinated an example URL and did not use the URL provided in the prompt."
Example 3: Successful Output
Scenario: User giving positive feedback about staff.
Result: Singular tool call, correct JSON categories, and accurate response.
Action: Thumbs Up.
Diagnosing & Automating Quality: Evals Theory
What is an Eval?
An evaluation (or "eval") is a systematic way of measuring how well a model's output matches the specific behavior you are seeking (whether an answer is good, useful, appropriate, and aligned with your goals).
Most AI failures are not technical crashes; they are subtle quality failures. The output might be valid code or text but is wrong in context (perhaps it is misleading, off-tone, incomplete, or unsafe). Evals make these failures visible and measurable over time.
Why Evaluation Must Be Externalized
Language models cannot reliably judge their own outputs because they lack your specific context and tend to agree with themselves even when they are wrong. You must define the standard through annotations and then measure the model against that standard externally.
The Three Main Types of Evals
1. Human-in-the-Loop
Definition: A manual review where a person decides if an output meets expectations.
Best For: Subjective criteria like tone, usefulness, and creativity.
Pros: Acts as the ultimate source of truth for the system.
Cons: It is slow, expensive, and can be inconsistent at scale.
2. Programmatic Rule
Definition: A code-based check where the output is verified against a specific condition you provide.
Best For: Tasks like extraction, classification, and deterministic formats.
Pros: Extremely fast and 100% consistent (deterministic).
Cons: Very rigid; it only works for things that are programmatically describable.
3. LLM as Judge
Definition: A separate model is given a rubric and asked to score or grade the response.
Best For: Criteria that are hard to encode in rules, like clarity or coherence.
Pros: Highly scalable compared to human review.
Cons: Risk of bias or "judge" failure modes; requires human oversight and periodic audits.
When to Use Each Type
Use Human-in-the-Loop when:
You are early on and still defining what "good" actually means.
Rating whether a customer support reply sounds empathetic.
Deciding if an answer is truly helpful, not just technically correct.
Checking if a response follows the "spirit" of an internal policy.
Use Programmatic Rule when:
There is a clearly defined, single correct answer.
Consistency and regression detection are the top priorities.
You need to extract specific data like an Order ID or email address.
You need to ensure a JSON output includes all required fields.
Use LLM as Judge when:
You need to scale subjective evaluation beyond what humans can manually review.
The criteria are too complex for regex or simple code rules.
Scoring whether a response is clear and well-structured.
Evaluating correctness against a multi-point rubric.
Evals are not just "tests", they are the foundation of reliable AI. They allow you to scientifically track the development of issues and ensure that your iterations actually move the needle on quality.
This is a hands-on video; you will find the dataset used for this video in the resources section.
Experimentation & Feedback: Running Experiments with Evaluations
Now that we have established our evaluations, it is time to put them to use. In this step, we move from manual testing to a scientific exploration of how our prompt behaves at scale across a large dataset.
How to Run an Experiment with Evals
To measure the reliability of your prompt scientifically, follow these steps in the Latitude dashboard:
Navigate to the Experiments Tab: Go to your prompt (e.g., Festival Chatbot) and select the Experiments tab.
Configure the Run: Give your experiment a clear name, such as "First Experiment with Evaluations".
Select Evaluations: Manually select the evaluations you want to run alongside the experiment. In this example, we use Duplicate Tool Call and Extraction Error.
Load Your Dataset: Select your prepared dataset (e.g., the 100-row Festival Experiment Dataset).
Note: If you have strict API limits, you can choose to run only 10–20 rows for a quicker, less intensive test.
Map Parameters: Associate your prompt's variables (like {{customer_query}}) with the corresponding columns in your dataset.
Analyzing the Results: The Investigator Phase
Once the experiment finishes, you will receive a high-level summary of its performance. This data represents the "Observation" phase of the Reliability Loop at scale.
Understanding the Metrics:
Average Score: A low average (e.g., 2.5) indicates that the prompt requires significant iteration to meet quality standards.
Failures vs. Successes: The total number of evaluation results equals the number of logs multiplied by the number of evaluations. For 100 logs and 2 evals, you will see 200 total results.
Pass Rate: Finding very few successes (e.g., 5 out of 200) highlights exactly how severe your identified issues are in a production-like setting.
Deep-Dive Investigation
By clicking on "See Logs," you can investigate specific failures to understand the relationship between your evaluations:
Spotting Trends: You can see which logs failed both evaluations and which managed to pass at least one.
Verifying Eval Accuracy: Reviewing "passing" logs helps you confirm your evaluations are working correctly (e.g., verifying that a log passed the "Extraction Error" eval but correctly failed the "Duplicate Tool Call" eval).
Conclusion of the Evaluation Phase
Running experiments with evaluations provides a scientific, data-driven understanding of issue severity. You are no longer guessing if a prompt "seems" better; you have a measurable baseline to track. This completes the evaluation section of the Reliability Loop, leading into the final stage: Iteration.
From Isolation to Patterns
When you first review AI outputs, manual feedback is useful, but it doesn't show you what is happening with your prompt at scale. To truly improve a system, you need actionable patterns rather than isolated pieces of feedback. Issue Discovery is the process of aggregating individual annotations and grouping them into recurring "failure modes" or clusters based on their similarity.
Understanding Your Failure Modes
By categorizing every annotation into "baskets," you can identify which problems are the most pressing based on the number of events tied to them.
Common Issues Identified in the Festival Chatbot:
Technical Failures: Tool calls failing to extract information from external websites.
Structural Failures: The JSON response field containing a nested object instead of a plain string, violating the expected output structure.
Efficiency Failures: Duplicate tool invocations causing redundant operations and unnecessary repetition.
The Need for Scientific Measurement
Simply iterating on a prompt and checking if the issues "seem" to disappear is not enough for building reliable products. You need a scientific way to measure if a fix actually worked.
The Problem with "Seeming": Without measurement, you don't know the gravity of an issue or the priority it should take in your engineering efforts.
The Solution: Every identified issue must become an Evaluation.
The Symbiotic Relationship: Issues and Evaluations
Evaluations are attached to your prompt and run automatically whenever a new log comes in.
Automatic Monitoring: If a new output fails an evaluation, it is automatically funneled back into the corresponding issue category.
Trend Tracking: This allows you to watch trends over time and understand the priority of each issue based on real-time data rather than small sample sizes.
Actionable Data: This step gives you the "Why" and "How Much" before you spend energy on prompt engineering to fix a specific failure mode.
Experimentation & Feedback: Building Evaluations
Types of evaluations
1. Human-in-the-Loop
Automatic Setup: When you create a prompt in Latitude, a "Human Annotation" evaluation is created automatically. You can create more Human-in-the-Loop evals in the "Evaluation" section.
The Process: Every time you manually thumbs-up or thumbs-down a log, you are performing a human-in-the-loop evaluation.
Current Status: In the early stages of development, it is normal to have a low pass rate (e.g., 4%) as you are still uncovering failures.
2. LLM-as-Judge: Automating Failure Detection
LLM-as-judge evaluations are highly automatable because they use a separate model to check the outputs of your primary model.
Automated Generation: You can generate these directly from your "Issues" section. By selecting an issue (e.g., "Tool failure to extract information"), the system can automatically create a judge prompt to detect that specific error.
Configuration: You simply select a provider and the system handles the alignment between your previous annotations and the new evaluation logic.
Monitoring: Once added, this judge model will review every incoming log to see if it matches the defined failure mode (e.g., checking if the tool call returned an error message).
3. Programmatic Rule: Technical Precision
These evaluations are used for deterministic issues, such as structural errors in code or redundant tool calls.
The Goal: To capture specific, technical failure modes that code can detect more reliably than a human or another LLM.
Creating the Rule: You must manually configure the metric. For identifying duplicate tool calls, we use Regular Expressions (Regex) to match specific patterns in the text.
Linking to Issues: After setting the pattern, you link the evaluation to a specific issue (e.g., "Duplicate tool invocations") so that every time the rule fails, it feeds data back into that issue basket.
Programmatic Rule Configuration
Name: Duplicate tool call
Metric Type: Regular Expression (Regex)
Description: Detects when the system makes redundant calls to the same tool multiple times instead of a single invocation.
Regex Pattern:
Code snippet
---
\{"type":"tool-call"[\s\S]*?\{"type":"tool-call"
---
Iteration & Optimization: Fixing the Prompt
Targeted Changes over Intuition
Once you have evaluation results, the next step is to make targeted prompt changes. These updates should always be based on observed failures and measured outcomes, rather than just intuition.
Manual Iteration vs. Automatic Optimization (important distinction!)
Manual Iteration: Directly editing the prompt yourself. This is most useful early on for obvious fixes or when adding new capabilities that cannot be inferred automatically.
Automatic Optimization: Using data-driven tools to continuously improve reliability once a prompt is handling many real-world use cases in production (this is what Latitude automates).
Prioritizing What to Fix
An AI PM's core responsibility is deciding where to focus engineering energy. For every issue discovered, ask:
Frequency: How often does it occur?
Severity: How serious is the impact when it occurs?
Issues that are both frequent and high-impact must be prioritized.
The Iteration Loop
To avoid "throwing every solution at the wall," follow a consistent 5-step process for every change:
Identify: Pinpoint one concrete, repeated failure mode based on evaluation evidence.
Hypothesize: Form explicit, testable reasoning (e.g., "If I change X, then Y will improve because Z").
Modify: Change only one thing at a time (single variable) to accurately measure the cause of improvement or regression.
Test: Rerun evaluations under identical conditions (same dataset, logic, and metrics) for a valid comparison.
Evaluate: Review results to see if the targeted metric improved and if new issues appeared elsewhere.
Diagnosing Issue Origins
Factual Errors: Usually indicate weak grounding context or retrieval/tool issues.
Stylistic/Tone Issues: Typically point to unclear instructions.
Off-Topic Responses: Indicate an unclear task definition.
Tool Call Failures: Often related to structured output constraints, temperature, or model selection.
The Updated Prompt Hypothesis
In this video, we observed that tool-calling failures and extraction errors were occurring simultaneously.
Hypothesis: Switching to a more modern, efficient model will improve tool-calling reliability and extraction success.
Model Update
(you can directly change it on the formatter, or update the provider settings)
Previous Model: GPT-4o-mini
New Model: GPT-4.1-mini
Verification Step
After making the model change, use the Preview feature with the Festival Experiment Dataset to verify:
Metric 1: Does it produce a single tool call (solving the redundancy issue)?
Metric 2: Does it produce the correct JSON structure (Category and Response fields)?
Metric 3: Do both the "Extraction Error" and "Duplicate Tool Call" evaluations pass?
Iteration & Optimization: Comparative Experiments
Measuring Improvement at Scale
After making a targeted change to your prompt, the next step is to verify if that improvement holds up across your entire dataset. Manual previews are a great first signal, but running a full experiment allows you to compare the new version against the previous baseline scientifically.
The Comparison Workflow
To validate your hypothesis, you must rerun the exact same experiment configuration as before:
Consistency is Key: Use the same evaluations (Duplicate Tool Call and Extraction Error) and the same Festival Experiment Dataset.
Speed of Iteration: Modern platforms can process large batches (e.g., 100 logs) in under a minute, providing near-instant feedback on your changes.
Direct Comparison: By selecting your previous experiment results, you can see a side-by-side breakdown of how your metrics have shifted.
Analyzing the Results
In this iteration, switching to a more modern model version (GPT-4.1-mini) yielded dramatic results:
Eliminating Redundancy: The "Duplicate Tool Call" matches dropped from 99% down to 0%, completely resolving the redundant operation issue for this experiment.
Increasing Success Rates: The "Extraction Error" pass rate jumped from a mere 4% to 89%.
Setting a Benchmark: While an 89% pass rate is a massive improvement and generally strong for an LLM feature, you can now set a new goal (such as 95%) for future iterations.
What’s Next?
With a prompt that is now significantly more reliable, the focus shifts to maintaining that quality. In the next video, we will use these successful logs to create a Golden Dataset, which will act as your "ground truth" to prevent regressions as you continue to optimize.
Iteration & Optimization: Golden Datasets and Regression Testing
The Challenge of Frequent Changes
As you begin to regularly update and improve an AI system, a critical challenge arises: ensuring that new improvements do not break behaviors that already worked. This phenomenon, where reliability drifts despite new updates, is a common issue in real-world AI systems. The solution lies in dataset curation and regression testing.
Key Definitions
Golden Dataset: A curated set of high-signal examples (usually dozens) that represent the behavior you care about preserving. It includes critical user flows, edge cases, and scenarios that must continue to work over time. As an AI PM, you own and curate this "cornerstone for correctness".
Regression Testing: The process of running a new version of a prompt against your Golden Dataset and comparing the results to a recorded baseline. The goal is to detect if any previously "known good" behavior has broken (regressed).
The Regression Testing Workflow
Establish a Baseline: Run your current prompt against the Golden Dataset and record the evaluation results.
Make Changes: Update the prompt, switch models, or adjust tool logic.
Run Regression Test: Before shipping, run the Golden Dataset again against the new version.
Compare: Check if any previously successful cases now fail.
Important Nuance: Because LLMs are probabilistic, regression testing is not about exact output matches. Instead, look for statistical degradation, such as drops in accuracy, increased hallucinations, or higher latency/cost.
Hands-on: Building Your Golden Dataset
Building a Golden Dataset is a process of curation based on exceptionally good logs.
Identify Good Logs: Review your traces for successful outcomes, such as correct tool calls and helpful responses.
Curation Criteria: Look for logs that handle specific issues well (e.g., technical errors, billing inquiries) and use desirable formatting (like specific JSON keys).
Technique: In the Latitude dashboard, select the successful spans from your traces and use the "Add to dataset" feature to move them into your Golden Dataset.
Running Your First Regression Test
To perform the test, navigate to Experiments and follow the standard setup:
Name: Set a clear name like "Regression Test #1".
Evaluations: Select your standard evaluations (e.g., Extraction Error, Duplicate Tool Call).
Dataset: Select your newly created Golden Dataset.
Analyze: After running, you will receive a percentage score (e.g., 91%). This becomes the benchmark for all future versions of your prompt.
The idea is that you curate your own golden dataset, but we'll add one for testing purposes very soon.
Rule of Thumb for AI PMs
If something breaks in production, add it to the Golden Dataset. Over time, this becomes one of your most valuable product assets.
Iteration & Optimization: Prompt Optimization
What is Optimization?
Optimization is not manual prompt editing. It is an automatic process that uses data to continuously improve a prompt's reliability. This stage is best reached once you have a prompt working in production that handles many real-world use cases.
The GEPA Algorithm
We use a system called GEPA, an automatic prompt optimization process that has been shown to outperform fine-tuning in certain cases.
How it works: It takes a dataset and your provided evaluations, producing prompt iterations that attempt to reduce error as much as possible.
Why it works: LLMs are sensitive to changes at the word or character level; GEPA proposes edits, tests them, and keeps only those that improve the evaluation score.
Benefits: The internal mechanics are abstracted away. You don't need to reason about every specific word change; you only need to ensure your evaluation reflects your goals.
Composite Evaluations
Optimization relies heavily on the quality of your evaluations. There are two primary setups:
Single Evaluation: Measures one dominant failure mode. Best for narrow, well-scoped tasks.
Composite Evaluation: Combines multiple evaluations and failure modes. This is the recommended default for real product tasks because it prevents overfitting (when a model becomes too specific to one case and loses its general intelligence).
When Optimization Works Best
When outputs are structured (e.g., JSON).
When success is easy to score.
When failure modes are repeatable.
When you have a stable Golden Dataset or production logs.
Hands-on: The Optimization Workflow
1. Creating a Composite Evaluation
Before optimizing, you must combine your signals to direct the optimization process.
Automatic Creation: Composite evaluations are created automatically when you generate automatic evals from issues.
Manual Management: You can navigate to the Evaluations tab in your chatbot settings to view your Composite Score.
Refining Signals: Within the composite evaluation settings (e.g., "Performance"), you can add or delete specific sub-evaluations, such as "Extraction Error" or "Duplicate Tool Call".
Weighting: You can adjust the weights of these signals to determine which failure modes are more critical for the optimizer to address.
2. Starting the Optimization
Navigate to the Optimizations page and click Start Optimization.
Choose Depth: Options include Quick, Balanced, or Deep. A Quick optimization is often sufficient for simpler prompts and is more token-efficient.
Select Target: Choose your composite "Performance" evaluation.
Select Dataset: Use your Golden Dataset as the training ground.
Set Scope: Decide if the optimizer should modify only instructions or also configuration settings.
3. Analyzing Results
After the process completes (usually within 5 minutes or until the token budget is met), you can review the results:
Performance Gain: You can compare the baseline performance against the new version (Draft) to see the percentage improvement.
The Draft: The optimizer generates a new prompt version. These are often more "wordy" and include detailed explanations of processes and strict output format constraints.
Iteration & Optimization: Managing Prompt Versions
The Importance of Versioning
Once a prompt is ready for production, version management becomes critical. Versioning allows you to solidify your progress, providing a safety net that lets you revert to a stable state if new changes accidentally reduce reliability.
Types of Versions
Drafts: Unfinished or experimental versions. You can create an unlimited number of drafts to test new ideas without affecting the live environment.
Active/Live Version: The specific version of the prompt that is currently being pulled by SDKs or APIs in your production environment.
Archived Versions: Previously published versions that have been replaced but are kept for record-keeping or potential restoration.
The Deployment Workflow
To move a prompt from a draft to your active production version, you use the Deploy process:
Select Your Draft: Navigate to the draft you wish to solidify.
Review Changes (Diffing): Latitude provides a visual comparison where removed text is highlighted in red and new additions are in green.
Evaluate Evaluation Changes: You can also see which evaluations were added (blue) or removed (red) in this version.
Name and Describe: Give the version a clear, non-generic name (e.g., "Festival Production") to make it easy to identify in the history.
Deploy: Once deployed, this version becomes the "Live" version.
Managing Production Logs
When a version is marked as Live, any data generated through the SDK will be specifically labeled as Production Logs. This allows you to separate real-world user data from your playground or experiment traces, which is vital for accurate monitoring.
Reverting and History
History Section: You can view the entire timeline of your prompt (V1, V2, V3, etc.) to see how the instructions have evolved over time.
Reverting: If a new deployment causes issues, you can navigate back to an "Active" or "Archived" version and restore it to live status immediately.
Key Takeaways for AI PMs
Solidify Progress: Publish a version whenever you reach a milestone of improved reliability.
Safety First: Never make major changes directly to a live production version; always create a new Draft first.
Traceability: Use the history and diffing tools to understand exactly why performance shifted between version updates.
Iteration & Optimization: Monitoring in Production
Deploying Your Prompt
Once you have established a Live Version through versioning, you are ready to integrate the prompt into your actual product. Latitude provides three primary ways to access your prompt programmatically:
JavaScript SDK: For web-based applications.
Python SDK: Ideal for backend services and data science workflows.
HTTP API: A flexible option for any environment that supports standard web requests.
To get started, you must install the Latitude Data SDK. Detailed implementation guides and connection steps can be found in the official documentation linked within the platform's "Deploy" section.
Interfacing with the Feature
In a production environment (such as a simple chat web app), the system will use your live prompt configuration to handle real user queries. For example, a user reporting a wristband issue will trigger the prompt, which then categorizes the request (e.g., "Technical Issue") and generates the assistant's response in real-time.
Monitoring Production Logs
A critical shift occurs once your feature is live: you move from testing in a "Playground" to observing real-world behavior.
Production Section: In the Annotations or Traces tab, you can toggle from "Playground" to "Production" to see actual user requests.
Ground Truth: Production logs are your most valuable data. They represent exactly how users interact with your product, making them the highest priority for annotations.
Closing the Loop: You can continue the Reliability Loop by annotating these real logs, discovering new production-specific failure modes, and updating your evaluations based on real-world "ground truth."
Become a certified AI Product Manager in less than 5 weeks. Move beyond basic prompting to master the systematic engineering of reliable AI systems. This course transitions you from a builder to a strategic leader by teaching you the exact frameworks used to ship production-ready agents at scale.
The Path to Becoming an AI PM
Week 1 - Master the AI PM Mindset & Fundamentals: Understand why AI products are probabilistic rather than deterministic and identify the core "Reliability Gap" that traditional teams fail to solve. You will set up your workspace and learn the fundamental scientific frameworks used by top-tier AI teams to fix reliability issues.
Week 2 - Architect Complex AI Systems: Stop guessing with simple chat boxes and learn the science of prompt engineering. You will design sophisticated AI agents that utilize Prompting Roles, Tool Use, and Function Calling while architecting Structured JSON Outputs for machine-readable, professional results.
Week 3 - Establish Data-Driven Experimentation & Feedback: Move beyond "vibes-based" testing to structured environments where you learn to measure what actually matters. Gain the technical "eyes" to see inside the machine using Traces and Spans, allowing you to create rigorous experiments with datasets to validate performance through data.
Week 4 - Diagnosing & Automating Quality: Unlock the ability to scale by using human feedback and automated grading to ensure quality. You will learn Evaluations (Evals) Theory to choose the right eval type and build your own LLM-as-judge systems that grade thousands of production logs simultaneously.
Week 5 - Implement the Continuous Reliability Loop: Learn that the job isn't done at launch by mastering the full cycle of observation, annotation, and improvement. You will curate a Golden Dataset for Regression Testing, protecting your product from breaking as you update prompts, and perform Cost & Latency Optimization to ensure your production features stay fast and affordable.