Fabian G. Williams

Principal Product Manager, Microsoft

How Do You Trust an Autonomous AI Agent? Evals Are the Answer.

I run an autonomous AI agent at home — 16 cron jobs daily. It says 'done' but did it actually do anything? I built an eval framework to find out. Here's what broke, what I learned, and why agent evals are fundamentally different from LLM evals.

March 28, 2026

Fabian Williams

10-Minute Read

OpenClaw Eval Dashboard showing mixed results across 9 dimensions — the honest picture after adding freshness, failure rate, and delivery gap scoring

I run an autonomous AI agent on a Mac Mini in my house. She handles 16 daily cron jobs — finances, email triage, outreach campaigns, device monitoring, morning briefings. The agent says “done.” But did it actually do anything? I built a 9-dimension eval rubric to find out. Along the way I discovered that my evals were broken, my agent was better than I thought, and the most important metric isn’t pass/fail — it’s whether a failure is your fault or the agent’s fault.
