You Don’t Have an AI Agent Problem, You Have an Evaluation Problem

Views (3)

RichAI

Like(0)

Report

A house isn’t valuable just because it exists. Its value is maintained through regular evaluation. The same is true for AI agents. An agent that “works today” can silently degrade tomorrow; not because the model changed, but because: inputs evolve, edge cases creep in and orchestration becomes more complex. Without evaluation, you’re guessing.

Why Copilot Studio Evaluations Are Necessary but Not Sufficient

Copilot Studio provides an Evaluations tab where you can upload CSVs and test: conversation flow, expected phrasing and whether the agent responds the way it was built. That’s valuable. But those evaluations answer a UX question, not an intelligence question. That distinction matters a lot once agents:

approve things
deny access
make recommendations
trigger downstream actions

The Real Problem: No AgentOps Mindset

AgentOps isn’t about shipping an agent faster. It’s about ensuring the agent remains:

correct
safe
reliable
auditable

…as it evolves.

My AgentOps Evaluation Stack

Once I accepted that evaluating conversations wasn’t enough, I needed a way to evaluate agent intelligence — deterministically, repeatedly, and before anything reached production. That meant working bottom‑up.

To move from agent demos to AgentOps, I needed an architecture that made responsibilities explicit:

what decides
what evaluates
what orchestrates
what enforces correctness

Here’s the structure I ended up with:

1. agents/ — Where Decisions Live

The agents/expense_agent.py file represents the decision-making core.

This agent does one thing only:

take user input
extract structured details
apply policy
return a structured decision

{
"decision": "Approve | Reject",
"reason": "…",
"amount": 5000,
"expense_type": "travel"
}

You cannot evaluate intelligence if the agent output itself isn’t deterministic and structured. By isolating the agent here:

it can be reused by Copilot, Foundry agents, or backend services
it can be evaluated independently
it can’t “hide” behind conversational phrasing

2. tools/ — Policies and Formatting, Explicitly Separated

The tools/ folder holds logic that supports the agent but should never be mixed with orchestration.

policy_checker.py

This file contains pure policy logic: amount thresholds, luxury constraints and deterministic rules, This allowed me to answer a key evaluation question later - Did the agent fail because the policy was wrong or because reasoning was wrong? Without this separation, failures are impossible to diagnose.

formatter.py

Formatting was isolated deliberately. By treating formatting as a tool:

prevented evaluation noise from phrasing changes
ensured LLM judges saw consistent input
avoided “false regressions”

3. evals/ — The Heart of AgentOps

This folder is where the AgentOps mindset lives.

test_cases.json

It’s a dataset which contains:

standard cases
edge cases
ambiguous phrasing
adversarial input
noisy real‑world language

This dataset evolved over time and whenever I added new scenarios, regressions surfaced immediately.

native_evaluator.py

This file introduces LLM‑as‑a‑Judge, powered by Azure AI Foundry models. Important design choice:

the judge evaluates reasoning quality
it returns structured JSON
it never overrides deterministic correctness

run_evals.py

This file ties everything together. For each test case, it:

Runs the agent
Applies deterministic correctness checks
Invokes the Foundry‑backed LLM judge
Classifies failure types
Aggregates accuracy
Generates an HTML report
Exits with a CI‑friendly status code

This is where I stopped trusting outputs and started trusting signals.

4. client/ — Azure AI Foundry as an Evaluation Dependency

The foundry_client.py file abstracts model access. Using Foundry models for evaluation means:

consistent judge behavior across environments
alignment with enterprise model governance
parity between development, CI, and production

5. .github/workflows/ — When Evaluation Became Enforcement

The agent-evals.yml workflow is where this stopped being an experiment. Every change:

runs run_evals.py
uploads the HTML report
fails the build if accuracy drops

This was the moment I crossed into AgentOps. If evaluations can’t block a deployment, they’re suggestions — not safeguards.

6. api.py — Making the Agent Reusable

Only after correctness and evaluation were locked down did I expose the agent via a thin FastAPI wrapper. This turned the agent into a capability, not an app. Now it could be:

called by Microsoft Copilot Studio
reused by Foundry agents
evaluated without changing orchestration
trusted across multiple systems

Github Link RichaPandit/agent-eval-demo

Final Thoughts

Most teams are building agents. Very few are building AgentOps. If your evaluation strategy only asks “Did the agent respond?” You’re missing the more important question: “Can I trust this system as it evolves?”

That’s where real AgentOps begins.

Community site session details

You Don’t Have an AI Agent Problem, You Have an Evaluation Problem

The Real Problem: No AgentOps Mindset

My AgentOps Evaluation Stack

Github Link RichaPandit/agent-eval-demo

Final Thoughts

Comments

Introducing the 2026 Season 1 community Super Users

Kudos to our 2025 Community Spotlight Honorees

Congratulations to the March Top 10 Community Leaders!