🚀 Introduction
If you’ve ever wanted a single repo that lets you flip a switch between a locally‑hosted LLM, a hosted OpenAI model, and a Python‑based evaluation suite—while keeping every call visible in a unified observability stack—this is it.
The Agent + Local Model + Evals starter repo does exactly that. It bundles three independent but comparable pathways:
| Path | Tech Stack | Primary Tracing Destination |
|---|---|---|
| A | .NET 8 + Semantic Kernel + Ollama | Azure Monitor (or any OTEL backend) |
| B | OpenAI Agents SDK (TypeScript/Node) | OpenAI Logs → optional OTLP → Azure Monitor |
| C | OpenAI Evals (Python) | None (pure scoring) |
The goal is simple: run the same prompt across all three implementations, watch the telemetry line up, and let the Evals tell you which one performed best.
🛠️ What This Delivers
- Side‑by‑side harness – One command line, three runtimes, identical prompt flow.
- End‑to‑end tracing – OpenAI’s built‑in agent logs, OTLP export to Aspire, and Azure Monitor correlation all work out‑of‑the‑box.
- Clean patterns – Consistent agent naming, reusable tool definitions, and environment‑first configuration.
- Immediate feedback – Dashboards surface latency, token usage, and tool‑call outcomes in real time.
🧠 Architecture At A Glance
```
┌───────────────────────┐      ┌───────────────────────┐
│ TypeScript (Node)     │      │ .NET 8 (C#)           │
│ @openai/agents SDK    │      │ Semantic Kernel +     │
│ → OpenAI Logs         │      │ Ollama (local LLM)    │
│ → optional OTLP → 🏁  │      │ → Azure Monitor (OTEL)│
└─────────┬─────────────┘      └─────────┬─────────────┘
          │                              │
          ▼                              ▼
┌───────────────────────┐      ┌───────────────────────┐
│ Python Evals (C)      │      │ Aspire / OTLP         │
│ – compare prompts     │      │ – collector → AI      │
│ – score responses     │      │ – Azure Monitor KQL   │
└───────────────────────┘      └───────────────────────┘
```
Dashboards you’ll love
| Dashboard | What you see |
|---|---|
| OpenAI Logs | Agent workflow visualisation, tool‑call timestamps, token counts. |
| Aspire (OTLP) | Span‑level breakdown across services, latency heat‑maps. |
| Azure Application Insights | KQL‑driven queries, item‑ingestion trends, cross‑service correlation. |
Tip: If you prefer a different OTEL backend (Jaeger, Honeycomb, etc.), just swap the exporter in `otel-collector-config.yaml`. The rest of the pipeline stays untouched.
📋 Quick Start (Happy Path)
Below is the “it‑just‑works” path I use on a fresh macOS dev box. Adjust paths if you’re on Linux or Windows.
1️⃣ .NET + Ollama
```bash
# 1️⃣ Pull a tool-capable model (once)
ollama pull gpt-oss:120b   # or any model that supports function calls

# 2️⃣ Export the Azure Monitor connection string (or any OTEL endpoint)
export AZURE_MONITOR_CONNECTION_STRING="InstrumentationKey=YOUR_KEY;IngestionEndpoint=..."

# 3️⃣ Run the .NET sample
cd sk-ollama
dotnet run
```
What to look for: In your telemetry backend you should see a service named `SkOllamaAgent` with a trace that includes a `tool_call` span and a downstream HTTP export to Azure Monitor.
2️⃣ OpenAI Agents SDK (Node/TypeScript)
```bash
# 1️⃣ Install the SDK + Zod (for schema validation)
npm i @openai/agents zod@3.25.67

# 2️⃣ Set your OpenAI key
export OPENAI_API_KEY="sk-..."

# 3️⃣ Run the sample
npm run start   # typically `node src/main.ts`
```
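If you're curious what the sample roughly looks like, here's a minimal sketch using `@openai/agents` and `zod`. The `TimeAgent` name and `get_time` tool mirror what shows up later in the dashboards, but treat the file layout, instructions, and tool parameters as illustrative rather than a copy of the repo's source.

```ts
import { Agent, run, tool } from '@openai/agents';
import { z } from 'zod';

// Illustrative tool: the repo's actual tool set may differ.
const getTime = tool({
  name: 'get_time',
  description: 'Return the current time for an IANA timezone.',
  parameters: z.object({
    timezone: z.string().describe('e.g. "Europe/London"'),
  }),
  execute: async ({ timezone }) =>
    new Date().toLocaleString('en-GB', { timeZone: timezone }),
});

const agent = new Agent({
  name: 'TimeAgent',
  instructions: 'Answer time questions by calling the get_time tool.',
  tools: [getTime],
});

async function main() {
  const result = await run(agent, 'What time is it in London right now?');
  console.log(result.finalOutput);
}

main().catch(console.error);
```

Each `run()` call is what produces the agent workflow traces discussed next.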
The SDK automatically emits OpenAI Logs. If you want those logs pushed to Aspire, enable OTLP in `otel-config.json` and set `OTEL_EXPORTER_OTLP_ENDPOINT`.
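For context, this is roughly what a bare-bones OTLP bootstrap looks like in a Node process using the standard OpenTelemetry packages (`@opentelemetry/sdk-node` and `@opentelemetry/exporter-trace-otlp-http`). The service name and file name are assumptions; the repo's own `otel-config.json` wiring may differ.

```ts
// tracing.ts – import this before the rest of the app so instrumentation is active early.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  // Hypothetical service name; use whatever you want to see in Aspire / Azure Monitor.
  serviceName: 'openai-agents-node',
  // The exporter honours OTEL_EXPORTER_OTLP_ENDPOINT and falls back to
  // http://localhost:4318/v1/traces when the variable is unset.
  traceExporter: new OTLPTraceExporter(),
});

sdk.start();

// Flush pending spans on shutdown so short-lived CLI runs don't drop data.
process.on('beforeExit', () => {
  sdk.shutdown().catch(console.error);
});
```

Note that the Agents SDK's built-in agent traces still land in OpenAI Logs; the OTLP pipeline carries whatever OpenTelemetry spans your process emits alongside them.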
Pro tip: Keep the `zod` version pinned. A recent minor bump broke the generated TypeScript types for a handful of agents.
3️⃣ Python Evals
```bash
# 1️⃣ Create a virtual env
python -m venv .venv && source .venv/bin/activate

# 2️⃣ Install OpenAI evals
pip install openai-evals

# 3️⃣ Run the comparative eval
openai-evals run \
  --eval-name agent-comparison \
  --prompts-file prompts.json \
  --providers sk-ollama,openai-agents
```
The eval script pulls the same prompt list you used in the .NET and TypeScript runs, scores each response (e.g., relevance, tool‑call correctness), and spits out a CSV you can drop into a Power BI or Looker dashboard.
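To make the "same prompt list" idea concrete, here's a hedged sketch of how the Node side might dump its answers for the eval to score. The `prompts.json` shape (`{ id, prompt }` objects), the agent definition, and the output file name are assumptions for illustration, not the repo's canonical format.

```ts
import { readFileSync, writeFileSync } from 'node:fs';
import { Agent, run } from '@openai/agents';

// Assumed prompts.json shape: [{ "id": "time-1", "prompt": "What time is it in London?" }, ...]
type PromptCase = { id: string; prompt: string };

const agent = new Agent({
  name: 'TimeAgent',
  instructions: 'Answer as precisely as you can.',
});

async function main() {
  const cases: PromptCase[] = JSON.parse(readFileSync('prompts.json', 'utf8'));
  const rows = ['id,provider,response'];

  for (const c of cases) {
    const result = await run(agent, c.prompt);
    // Escape quotes so multi-clause answers survive the CSV round-trip.
    const answer = (result.finalOutput ?? '').replace(/"/g, '""');
    rows.push(`${c.id},openai-agents,"${answer}"`);
  }

  writeFileSync('results-openai-agents.csv', rows.join('\n'));
}

main().catch(console.error);
```

The .NET path can emit the same columns, which keeps the downstream scoring and dashboarding identical across providers.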
🎯 Observability in Action
OpenAI Dashboard – Agent Workflows
Agent workflows showing successful TimeAgent executions with proper tracing.
Comprehensive Testing Results
OpenAI Dashboard displaying test outcomes, complete with tool‑call traces.
Azure Application Insights – KQL Query Results
Items Ingested Overview
The Azure Application Insights view shows request counts per agent, latency percentiles, and correlation IDs that match the OpenAI Logs.
🔧 Practical Advice & Gotchas
- Model choice matters – Ollama's `gpt-oss:120b` is great for function‑calling demos, but if you hit memory limits on a dev machine, switch to `llama3.1:8b`. The code path stays identical.
- OTLP latency – When exporting to Aspire over the default localhost collector, you may see an extra 30–50 ms per span. In production you'll want a dedicated collector node or an agent‑side buffer.
- Environment‑first configuration – All three runtimes read from environment variables (`AZURE_MONITOR_CONNECTION_STRING`, `OPENAI_API_KEY`, `OTEL_EXPORTER_OTLP_ENDPOINT`). Keep them in a `.env` file and load via `direnv` or `dotenv-cli` to avoid “works on my machine” surprises.
- Tool definition drift – The TypeScript SDK generates TypeScript types at build time; the .NET Semantic Kernel reads a JSON schema. If you add a new tool, update the schema in both repos to keep the tracing payloads aligned (see the sketch after this list).
- Evals reproducibility – Store your `prompts.json` and any seed data under version control. The eval scores are deterministic only when the underlying model seeds are locked (use `--seed 42` on the CLI).
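On the tool‑drift point, one way to keep the two definitions honest is to generate the JSON schema from the zod definition instead of maintaining it by hand. The starter repo doesn't do this out of the box; the sketch below uses the `zod-to-json-schema` package, and the output path is hypothetical.

```ts
import { writeFileSync } from 'node:fs';
import { z } from 'zod';
import { zodToJsonSchema } from 'zod-to-json-schema';

// Single source of truth for the tool's parameters (the same schema the TS agent uses).
const getTimeParameters = z.object({
  timezone: z.string().describe('IANA timezone, e.g. "Europe/London"'),
});

// Emit a JSON schema the .NET / Semantic Kernel side can load.
// The output path is hypothetical; point it wherever the SK project reads its tool schemas.
const schema = zodToJsonSchema(getTimeParameters, 'get_time');
writeFileSync('shared/tools/get_time.schema.json', JSON.stringify(schema, null, 2));
```

Running this as a build step means a schema change in one repo can't silently diverge from the other, and the traced tool‑call payloads stay comparable.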
🏁 Wrap‑Up
The Agent + Local Model + Evals starter repo is more than a demo; it’s a reproducible experiment platform. By the time you finish the quick‑start, you’ll have:
- A local LLM answering the same questions as a hosted OpenAI model.
- End‑to‑end tracing that stitches together OpenAI Logs, OTLP spans, and Azure Monitor KQL.
- A Python‑driven scoring pipeline that tells you which implementation wins on relevance, latency, or cost.
All of this is wrapped in a single GitHub repo, ready to be forked, extended, and—most importantly—observed.
Happy hacking, and may your traces be ever‑green!