Stop Texting Your AI
How to Enforce JSON Output with Pydantic
Week 2: Moving from “Vibes” to “Validation” with the Agentic AI Architect
In Week 1, we established that LLMs are not magic brains; they are Stochastic CPUs. They are non-deterministic engines that predict the next token.
If you treat them like a chat buddy, sending polite texts like “Please analyze this email and give me a summary, thanks!”, you are building fragile software. You are hoping the Stochastic CPU is in a good mood.
But we are Architects. We are in the business of engineering certainty.
To build production-grade agents, we need to stop “texting” our AI and start treating it like a function call. We need inputs that are typed and outputs that are guaranteed.
Welcome to Week 2: Structured Output & Prompt Engineering as Code.
The “Hello World” of Enterprise AI: Classification
Before we look at the code, let’s look at the landscape.
While LinkedIn is busy debating the “Promise vs. Reality” of AI or doom-scrolling about “AI Slop,” most engineers are stuck in analysis paralysis, waiting to see who “wins the race” between OpenAI and DeepSeek.
We aren’t waiting. We are building.
The backbone of Enterprise AI isn’t image generation; it’s Classification.
Routing customer support tickets (Billing vs. Technical).
Scoring sales leads (High Intent vs. Tire-Kicker).
Detecting severity (P1 vs. P3).
The Traditional ML Path: If you have a dedicated Data Science team, a year of lead time, and 100,000 rows of clean, labeled data, by all means, train a BERT model or an XGBoost classifier. It is efficient, specialized, and cheap at scale.
The Architect Way: But what if you need a production-ready classifier today and you don’t have labeled data? You write a schema. You use a zero-shot LLM. You have a working system by lunch.
This week, we are building exactly that: A SaaS Email Router that takes messy, angry customer emails and converts them into rigid, actionable JSON.
(Note: This pattern is the foundational building block for the Data Contract Validator we will build in our Capstone project.)
The Architecture: Chaos In, Structure Out
The goal is to turn unstructured noise into structured signal. To do this, we are using LangChain and Pydantic.
Here is the difference between a “Script” and an “Architecture”:
1. The Contract (Pydantic)
We don’t ask the LLM for “a JSON.” We define exactly what that JSON looks like using a Pydantic model. This is our interface contract.
```python
from pydantic import BaseModel, Field
from typing import Literal

class EmailAnalysis(BaseModel):
    category: Literal["billing", "technical_support", "sales", "complaint", "spam"] = Field(
        description="The primary intent of the email."
    )
    priority: Literal["high", "medium", "low"]
    summary: str = Field(description="A concise 1-sentence summary.")
    confidence: float = Field(ge=0, le=1, description="Confidence score 0.0 to 1.0.")
```
Why this is powerful:
Enforced Enums: We use
Literalto restrict the category. The LLM cannot invent a category like “Unsure” or “Maybe Billing.” It must pick from our list.Constraints: We enforce that
confidencemust be a float between 0.0 and 1.0 usingge(greater or equal) andle(less or equal).
If the LLM misses a field or returns a string instead of a float, Pydantic throws a validation error. In our architecture, the PydanticOutputParser catches this error and automatically asks the LLM to fix it. Making the code Self-healing.
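Here is the rejection behavior in action, as a minimal standalone sketch. The `EmailAnalysis` model mirrors the schema above; the sample JSON strings are illustrative, and the automatic retry loop (handled by the parser in the repo) is not shown here.

```python
from pydantic import BaseModel, Field, ValidationError
from typing import Literal

class EmailAnalysis(BaseModel):
    category: Literal["billing", "technical_support", "sales", "complaint", "spam"]
    priority: Literal["high", "medium", "low"]
    summary: str
    confidence: float = Field(ge=0, le=1)

# A well-formed LLM response parses cleanly into a typed object.
good = EmailAnalysis.model_validate_json(
    '{"category": "billing", "priority": "high", '
    '"summary": "Customer was double-charged.", "confidence": 0.95}'
)

# An invented category and an out-of-range confidence are both rejected.
n_errors = 0
try:
    EmailAnalysis.model_validate_json(
        '{"category": "maybe_billing", "priority": "high", '
        '"summary": "Refund request.", "confidence": 1.7}'
    )
except ValidationError as e:
    n_errors = e.error_count()  # one error per violated constraint

print(f"Rejected with {n_errors} validation errors")
```

In a plain script, that `ValidationError` is where you would trigger the retry; in the architecture described here, the parser does it for you.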
2. The Logic Layer (Why we need “Confidence”)
You noticed the confidence field in the schema above. This isn’t just for show.
In a production system, getting the data structure right is only half the battle. You also need to know how much to trust it. By forcing the LLM to self-evaluate (0.0 to 1.0), we can build a Logic Layer on top of the prediction:
High Confidence (> 0.9): Auto-process the refund.
Medium Confidence (0.5 – 0.9): Route to a human queue with a “suggested” tag.
Low Confidence (< 0.5): Trigger a retry loop or flag for manual audit.
Without this field, your application is flying blind.
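The three tiers above reduce to a few lines of routing logic. This is a sketch; the action names (`auto_process`, `human_queue`, `retry_or_audit`) are hypothetical, not taken from the repo.

```python
def route(analysis: dict) -> str:
    """Map a classified email to an action based on model confidence."""
    c = analysis["confidence"]
    if c > 0.9:
        return "auto_process"    # High confidence: act automatically
    if c >= 0.5:
        return "human_queue"     # Medium: human review with a suggested tag
    return "retry_or_audit"      # Low: retry loop or manual audit

print(route({"category": "billing", "confidence": 0.95}))  # auto_process
```

The point is that the business logic lives in plain, testable Python, while the uncertainty lives in one numeric field the schema guarantees is present.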
The Architect’s Note: Temperature vs. Confidence
A common point of confusion: “If I set Temperature=0, isn’t the model already confident?”
No. Think of it like a Weather Forecaster.
Temperature (0.0) is telling the forecaster: “Always give me the most statistically likely prediction. No creativity. No ‘maybe it will snow in July’.”
Confidence is asking the forecaster: “Okay, you predicted rain. On a scale of 0 to 100, how sure are you?”
You use Temperature to make the forecast consistent (deterministic), but you use the Confidence Score to decide if you should actually carry an umbrella (business logic).
3. Prompts as Code (The YAML Strategy)
Finally, stop hardcoding your prompt strings inside your Python functions. It is messy, hard to read, and harder to version control.
In this repo, we treat Prompts as Code. We extract the logic into classifier_prompt.yaml.
Why this matters:
Version Control: You can see diffs in your prompt logic over time.
Testability: You can swap out prompts without touching the deployment code.
Determinism: It forces you to think of the prompt as a config file, not a magic spell.
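To make this concrete, here is what a `classifier_prompt.yaml` might look like. The field names and wording are illustrative assumptions, not the repo's exact schema:

```yaml
# classifier_prompt.yaml — illustrative sketch, not the repo's exact file
name: email_classifier
version: "1.0"
system_prompt: |
  You are a Tier 3 Support Manager at a SaaS company.
  Classify the incoming email into exactly one category:
  billing, technical_support, sales, complaint, or spam.
  Respond only with JSON matching the provided schema.
user_template: |
  Email subject: {subject}
  Email body: {body}
```

Change the persona or the category definitions, and the diff shows up in code review like any other change.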
Trust, but Verify (LLM-as-Judge)
You have your JSON. Great. But is it correct?
Most tutorials stop here. We don’t. In the repo, main.py includes a Golden Dataset, a list of tricky emails with known “correct” answers. We use GPT-4o as a Judge to grade our cloud/local model’s output.
We don’t just “eyeball” it. We run a test suite.
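The grading loop itself is simple. The repo uses GPT-4o as the judge; this sketch substitutes exact-match grading so the idea runs offline, and the sample emails, labels, and the stand-in keyword classifier are all illustrative.

```python
# A hypothetical Golden Dataset: tricky emails with known correct answers.
GOLDEN_DATASET = [
    {"email": "I was charged twice this month!", "expected": "billing"},
    {"email": "The API returns 500 errors since Friday.", "expected": "technical_support"},
    {"email": "Do you offer an enterprise plan?", "expected": "sales"},
]

def grade(predict) -> float:
    """Return accuracy of `predict` (a callable: email text -> category)."""
    correct = sum(
        1 for case in GOLDEN_DATASET if predict(case["email"]) == case["expected"]
    )
    return correct / len(GOLDEN_DATASET)

# A deliberately naive keyword classifier standing in for the LLM call.
def keyword_classifier(text: str) -> str:
    if "charged" in text or "invoice" in text:
        return "billing"
    if "error" in text or "API" in text:
        return "technical_support"
    return "sales"

print(f"Accuracy: {grade(keyword_classifier):.0%}")
```

Swap `keyword_classifier` for your real chain and `grade` becomes a regression test you run on every prompt change.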
Prompt Engineering Isn’t Dead, It Just Graduated
Some people claim “Prompt Engineering is dead.” They are wrong. It just evolved from “guessing magic words” to System Design.
When building your YAML prompts, keep these four pillars in mind:
Context (Persona): Don’t just say “Classify this.” Say “You are a Tier 3 Support Manager.”
Clarity: Be verbose. Ambiguity in the prompt leads to hallucinations in the output.
Structure: Explicitly define the output format (like we did with Pydantic).
Iteration: Don’t change the prompt based on one example. Run it against your Golden Dataset to ensure you didn’t fix one edge case only to break three others.
When you lock these principles into a YAML file, your prompt becomes code: version-controlled, testable, and deterministic.
Conclusion: The Architect’s Checklist
If you are building an Agentic System, ensure you can check these boxes:
No “Vibes”: Are your inputs and outputs strongly typed (Pydantic/JSON schemas)?
Config, not Code: Is your prompt logic separated from your application logic (YAML)?
Deterministic Base: Is your temperature set to 0 for logic tasks?
Self-Awareness: Does your agent return a “Confidence” score for logic flow?
Graded: Do you have a “Golden Dataset” to verify the agent’s accuracy?
The Architect’s Quiz (Homework)
The Assignment: Clone the repo: https://github.com/snudurupati/agentic-ai-architect
Run `python 02_structured_output/main.py`. Watch as the system processes the Golden Dataset and grades itself.
The Twist: What happens to your system when the definition of “Priority: High” changes? Do you update the Python code, or just the YAML prompt?
The Edge Case: If the LLM is 90% confident but technically wrong, how do you catch that in production?
The Road Ahead: Why Structure Matters for RAG
You might be asking, “Why do I need JSON if I just want to build a Chatbot?”
Next week, we tackle RAG (Retrieval Augmented Generation). To build a RAG system that doesn’t hallucinate, you cannot just shovel documents into a context window. You need to structure your queries and your retrieved data.
If you can’t control the output shape of a simple email classifier, you have no chance of controlling a multi-document retrieval agent.
Stop guessing. Start Architecting.
See you in the repo.
The Robot Brain Diaries