MCP Server Design: 3 Principles We Learned in Production



Exposing a tool to an agent over MCP takes ten minutes. Building an MCP server that survives a model you don’t control, on a tight token budget with limited thinking time, is the part nobody warns you about.

We learned the difference shipping our own, consumed by third-party agents whose models we don’t pick. Three principles came out of it, each one we only fully believed after it broke in production:

TL;DR — three MCP server best practices from our trenches:

Fewer tools, narrower surface. Consolidate around the workflow, not the underlying API.

Consistent verbiage everywhere. Same name for the same concept across every input, output, and value on the server.

Validate against the protocol, not just your tests. The schema is the contract; everything else is a hint.

Background

We’ve been iterating on Trent’s MCP server: one public-facing surface for the product, consumed by third-party agents whose models we don’t control. Each iteration taught us something we’d half-believed going in but only fully internalized after it broke. These three principles crystallized from that work, and they cut against the grain of how it feels to build a server when you’re moving fast. None of them is subtle in hindsight.

1. Fewer Tools, Narrower Surface

The instinct from regular software design (small composable units, single responsibility) doesn’t transfer cleanly to MCP. The consumer of the surface is an LLM with a finite attention budget, not another piece of software. The right size for a tool is the workflow the agent is actually performing, not the smallest atomic operation in the underlying API.

Two reasons we’ve been aggressive about consolidation:

Overlap confuses tool selection. The trap usually isn’t tools that look identical; it’s tools that look distinct from the outside, with different names and different framings, but expose largely the same data with minor variations between them. The model has to decide which one is the “right” call for the workflow, and the decision is often arbitrary. On harder tasks it’s wrong in ways that are hard to debug. Consolidating those into a single tool, with the relevant slice exposed as a parameter, removes a degree of freedom the model didn’t need.

Every tool consumes context. If you’re exposing ~20 tools, the name, description, and schema of every one of them ride in the prompt on every turn once the tool list has been fetched. That’s a substantial chunk of context burned before the agent has done anything. Those tokens compound across a long loop and compete directly with the work the agent is actually trying to do.
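As a rough illustration of both reasons, the sketch below compares three overlapping tool definitions with one consolidated tool that takes a `view` parameter. Every tool name, description, and field here is hypothetical (none of them is one of Trent’s actual tools), and the character count of the serialized definitions is only a crude stand-in for tokens.

```python
import json

def tool(name, description, extra_props=None):
    """Build a tool definition dict with an MCP-style inputSchema."""
    props = {"assessment_id": {"type": "string"}}
    props.update(extra_props or {})
    return {
        "name": name,
        "description": description,
        "inputSchema": {
            "type": "object",
            "properties": props,
            "required": sorted(props),
        },
    }

# Hypothetical overlap: three names, largely the same underlying data.
overlapping = [
    tool("get_threats", "List threats found in an assessment."),
    tool("get_risk_summary", "Summarize threats in an assessment by risk."),
    tool("get_open_findings", "List unresolved threats in an assessment."),
]

# Consolidated: one tool, with the slice selected by a parameter.
VIEWS = ["all", "by_risk", "open"]
consolidated = [
    tool(
        "get_threats",
        "Return threats for an assessment; `view` selects the slice.",
        {"view": {"type": "string", "enum": VIEWS}},
    )
]

def definition_chars(tools):
    """Crude proxy for the context cost of advertising these tools."""
    return len(json.dumps(tools))

def get_threats(threats, view="all"):
    """Dispatch on `view` instead of exposing one tool per slice.

    `threats` is an in-memory list here; a real server would load it
    from the assessment identified by `assessment_id`.
    """
    if view == "all":
        return list(threats)
    if view == "open":
        return [t for t in threats if not t["resolved"]]
    if view == "by_risk":
        return sorted(threats, key=lambda t: t["risk"], reverse=True)
    raise ValueError(f"unknown view: {view!r}; expected one of {VIEWS}")
```

Comparing `definition_chars(overlapping)` with `definition_chars(consolidated)` gives a quick before-and-after number for a consolidation, and the `enum` on `view` turns the old "which tool?" decision into a constrained parameter.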

Consolidating also tightens the loop for us as engineers. Fewer tools means a smaller surface to test, a smaller set of failure modes to observe, and a more direct path from a customer issue to the tool that caused it. The product gets simpler for the user, the workflow gets simpler for the model, and the codebase gets simpler for us. That alignment is rare; when you can find it, take it.

Concretely: we took our own MCP server from 17 tools down to 11, and the result was visibly better tool usage across the workflows that had been giving us trouble. The model spent fewer cycles on tool selection, and the failure modes we had been seeing under tighter constraints largely cleared up. The current published version is trentai-mcp on PyPI.

The push to make this cut came from a pre-launch integration where Trent was exposed to end users through a third party’s chat interface. During testing we kept hitting cases where the chat couldn’t follow our instructions reliably, and tool overlap turned out to be a major contributor.

2. Consistency Across the Surface is a Correctness Property

MCP tool wording across the input schema, output schema, and the output values of every tool on a server needs to be consistent. If one tool calls a field user_id and another calls the same thing customer_id and a third returns accountId, the model has to reconcile that on every call. It mostly does, but reconciliation costs tokens, introduces ambiguity, and shows up as flaky tool calls in unpredictable conditions.

This matters more than it sounds because you don’t always control the model on the other side of the wire. When the MCP server is consumed by a third party, the agent could be running on a small model with a tight token budget and limited thinking time. Inconsistent naming that a frontier model would reason past, a smaller model just fails on. The same surface that looks fine in development collapses in a deployment you can’t see.

We ran into this during the same third-party pre-launch integration mentioned above. We exposed an update_tasks tool that let the chat write progress into a Trent security assessment, but the underlying API used control_id for the response field name and task_id for the input field name. The chat got confused between the two, the tool call failed repeatedly, and it couldn’t debug its way out. We didn’t catch this right away either; the 422s we kept seeing looked like a service-side bug, and we’d been debugging on the service end for a while before realizing the failure was upstream of the API, in the chat’s tool call. Making the naming consistent across input, output, and value cleared it up.
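A check along these lines could have flagged the mismatch before a consumer hit it. This is a minimal sketch, not Trent’s tooling: it assumes tool definitions are available as plain dicts with `inputSchema` and `outputSchema`, it inspects only top-level properties, and the alias groups are a hand-maintained assumption. The `status` field is hypothetical; `task_id` and `control_id` are the two names from the incident above.

```python
from collections import defaultdict

def schema_fields(tool):
    """Yield (location, field_name) for top-level schema properties."""
    for key in ("inputSchema", "outputSchema"):
        for name in tool.get(key, {}).get("properties", {}):
            yield f"{tool['name']}.{key}", name

def find_alias_conflicts(tools, alias_groups):
    """Report concepts that appear under more than one field name.

    alias_groups maps a concept to the set of names known to mean it.
    Returns {concept: {field_name: [locations]}} for conflicting concepts.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for t in tools:
        for location, name in schema_fields(t):
            for concept, aliases in alias_groups.items():
                if name in aliases:
                    seen[concept][name].append(location)
    return {
        concept: dict(names)
        for concept, names in seen.items()
        if len(names) > 1
    }

tools = [{
    "name": "update_tasks",
    "inputSchema": {
        "type": "object",
        "properties": {"task_id": {"type": "string"},
                       "status": {"type": "string"}},
    },
    "outputSchema": {
        "type": "object",
        "properties": {"control_id": {"type": "string"},
                       "status": {"type": "string"}},
    },
}]

conflicts = find_alias_conflicts(tools, {"task": {"task_id", "control_id"}})
```

Here `conflicts` reports that the "task" concept is called `task_id` on the way in and `control_id` on the way out. Run in CI, a non-empty result can fail the build.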

The frame I’ve started landing on is simple: the model on the other side of the wire is a variable you don’t get to pick. So design the surface for the least capable consumer that matters. Capable models reason past inconsistent naming; smaller ones fail on it. Consistency costs you one round of cleanup before you ship; inconsistency gets paid by every consumer, every call, forever.

3. Don’t Trust the Implementation Just Because it Works

This is the principle I’d most like to have learned sooner.

We built the MCP server with an agent. It worked. The tests the agent wrote alongside the implementation passed, our engineer-driven dogfooding ran cleanly, and the manual testing we did in the workflows we cared about all came back green. But beyond the tool selection and naming problems we covered earlier, we kept hitting a different class of failure that we couldn’t reproduce locally: the consuming agent getting the input shape wrong, invoking the tool in ways that didn’t match what we’d documented at all.

When we looked under the hood, the implementation hadn’t actually defined input and output schemas in the JSON properties the MCP protocol specifies. The agent that wrote the server had instead stuffed the entire contract (input shape, output shape, examples) into the description string of the tool, as a long comment-like blob. Frontier models read that and inferred the right structure. Smaller models, with less budget for inference, couldn’t. The fix is structural. MCP inputSchema and outputSchema are contracts, not hints. Stuffing them into the description string opts you out of every guarantee the protocol gives you.

Two lessons from that, both worth saying out loud:

Use the structure the protocol gives you. MCP defines inputSchema and outputSchema as discrete, structured fields for a reason: well-built clients use them to validate inputs, constrain agent behavior, and surface errors early. A description is a hint. A schema is a contract.

Agents get you to “working” faster than to “correct.” That gap is widest in unfamiliar territory, and a young protocol counts as unfamiliar territory, however many examples you’ve worked through. The agent picked a path that satisfied the tests it had written itself, evaluated by the same class of model that wrote them. It didn’t pick the path the protocol intended. We caught it because a stricter consumer broke; if we’d never had that consumer, we’d still be carrying the bug.
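To make the first lesson concrete, here is a minimal sketch of the same tool defined both ways. The field names and `status` values are illustrative assumptions, and `check_arguments` is a deliberately tiny stand-in for a real JSON Schema validator (it handles only `required`, `type`, and `enum`), not any particular client’s behavior.

```python
# Anti-pattern: the contract lives only in free text.
description_only = {
    "name": "update_tasks",
    "description": (
        "Update a task. Input: {task_id: str, status: 'open'|'done'}. "
        "Output: {task_id: str, status: str}."
    ),
    "inputSchema": {"type": "object"},
}

# The contract expressed in the structured field the protocol defines.
with_schema = {
    "name": "update_tasks",
    "description": "Record progress on a task in an assessment.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "task_id": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "done"]},
        },
        "required": ["task_id", "status"],
    },
}

_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}

def check_arguments(tool, arguments):
    """Return a list of problems with `arguments` for `tool`."""
    schema = tool["inputSchema"]
    errors = []
    for name in schema.get("required", []):
        if name not in arguments:
            errors.append(f"missing required field: {name}")
    for name, value in arguments.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            continue  # unknown fields are ignored in this sketch
        expected = _TYPES.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
    return errors

bad_call = {"control_id": "c-1", "status": "finished"}
```

With `with_schema`, `check_arguments` rejects `bad_call` before it reaches the service, naming the missing `task_id` and the invalid `status`. With `description_only`, the same call passes the check, because there is nothing structured to check against.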

What we built with these principles

The server I’ve been describing — trentai-mcp — is how Trent shows up inside Claude Code. It runs the full Scan → Judge → Mitigate → Evaluate loop in your editor: surfacing threats relevant to your application’s architecture, prioritizing them against the real risk profile, generating a remediation plan that becomes tasks Claude Code can implement, and tracking how your security posture changes session over session.

MCP is still young, and the patterns for designing servers well are still being worked out across the industry. The three principles above come from what we’ve learned in production, and they’re what I’d share with a new teammate on day one of building a new server.

Originally published on the Trent AI blog — the full piece includes the worked example of the four consolidated tools.




AI Evals, Part 2: Error Analysis, the Unglamorous Superpower Behind Good Evals



Part 2 of a series on building production AI on .NET. Part 1 covered what evals are and the Analyze → Measure → Improve lifecycle. This post is about the step everyone wants to skip: Analyze.

When a team decides to “take evals seriously,” the first thing they usually do is wrong. They open a dashboard tool, wire up a generic “correctness” score, and watch a number. It feels productive. It produces a chart. And it tells them almost nothing, because they skipped the step that decides what the chart should even measure.

That step is error analysis: reading your AI’s actual outputs and naming, precisely, the ways they go wrong. It’s unglamorous — no library, no dashboard, just you and a few dozen real examples. It is also, by a wide margin, the highest-leverage thing you will do in evals: error analysis is where the signal comes from. Everything downstream is just operationalising what you find here.

Why you can’t skip straight to metrics

There’s a gap between you and your running system that’s easy to underestimate. Thousands of inputs flow through your AI feature daily, in shapes you never anticipated, and you have no realistic way to see them at scale. Call it the comprehension gap — the distance between the developer and a true understanding of what the data and the model are actually doing.

Metrics don’t bridge that gap; they presuppose it’s already bridged. To measure “conciseness” you must first have noticed that verbosity is a failure mode worth caring about. If you pick your metrics before you’ve read your data, you’re measuring your assumptions, not your product. The classic result: a dashboard glowing green while users quietly churn over a problem your metrics were never designed to catch.

Error analysis is how you cross the gap. You trade scale for truth: you can’t read everything, so you read a sample, carefully.

How error analysis actually works

It’s a three-move loop, and the moves are deliberately low-tech.

1. Get a starting dataset and read it. Pull a sample of real (or realistic) outputs — 50 to 100 is plenty to start. Not the happy-path demo cases; the real distribution, including the weird inputs. Then actually read them. Slowly.

2. Open-code the failures. For each output that’s wrong, write a short, free-text note describing what specifically is wrong — in your own words, no fixed categories yet. “Explained the word using a dictionary definition instead of the meaning it has in this sentence.” “Translation is correct but the tone is far too formal for a casual chat.” “The quiz distractor is so obviously wrong it gives the answer away.” This is open coding: you’re labelling reality, not forcing it into boxes.

3. Cluster the notes into a taxonomy. Once you have 40–50 notes, patterns emerge. Group them. Those groups are your failure taxonomy — a ranked list of how your feature fails, with rough frequencies. Now you know what to fix first (the common, severe modes) and, crucially, what your metrics should measure.
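The clustering itself is human judgement, but the bookkeeping in step 3 is small enough to sketch. The snippet below assumes each open-coded note has already been assigned a cluster label by a person; the labels and notes are illustrative, and the function only ranks the clusters by frequency.

```python
from collections import Counter

# Illustrative (cluster label, free-text note) pairs from open coding.
coded_notes = [
    ("ignored sentence context", "Dictionary definition, not the meaning in this sentence."),
    ("wrong register", "Translation is correct but far too formal for a casual chat."),
    ("ignored sentence context", "Explains the common sense of the word, not the one used here."),
    ("giveaway distractor", "Distractor is so obviously wrong it gives the answer away."),
    ("ignored sentence context", "Generic definition; the passage uses the word figuratively."),
]

def build_taxonomy(notes):
    """Rank failure clusters by how often they occur.

    Returns a list of (label, count, share_of_notes), most common first.
    """
    counts = Counter(label for label, _ in notes)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]

taxonomy = build_taxonomy(coded_notes)
for label, n, share in taxonomy:
    print(f"{label:28s} {n:3d}  {share:.0%}")
```

Frequency is only half of the ranking described above; severity still has to be weighed by whoever reads the list.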

That’s the whole secret. The taxonomy is the output, and it’s worth more than any single score, because every later step — the rubric, the golden set, the judge — is downstream of it.

A mindset note: be a detective, not a judge (yet)

The hard part of error analysis isn’t mechanical, it’s psychological. You will be tempted to immediately assign a 1–5 score, or to jump to “the fix is to add a line to the prompt.” Resist both. Scoring too early collapses rich information (“it’s a 2”) into a number that hides why. Fixing too early means you patch the first failure you see instead of the most common one.

Stay descriptive for as long as you can. Your only job in this phase is to understand and categorise. Judgement and repair come later.

A second trap is doing it alone. When two people label the same outputs, they disagree — and the disagreements are gold, because they reveal that “good” isn’t actually defined yet. A short alignment session to resolve them sharpens your definition of quality before you bake it into a rubric. (Solo founders can approximate this by labelling, sleeping on it, and re-labelling cold.)
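Surfacing those disagreements can be as simple as comparing two label sheets. A minimal sketch, assuming each labeller’s work is a dict from output id to label; the ids and the pass/fail labels are hypothetical.

```python
def compare_labels(labels_a, labels_b):
    """Return (agreement_rate, disagreements) for two labellers.

    Only outputs labelled by both people are compared. Each disagreement
    is (output_id, label_from_a, label_from_b).
    """
    shared = sorted(labels_a.keys() & labels_b.keys())
    if not shared:
        raise ValueError("no outputs were labelled by both people")
    disagreements = [
        (i, labels_a[i], labels_b[i])
        for i in shared
        if labels_a[i] != labels_b[i]
    ]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements

alice = {"o1": "pass", "o2": "fail", "o3": "pass", "o4": "fail"}
bob = {"o1": "pass", "o2": "pass", "o3": "pass", "o4": "fail"}
rate, to_discuss = compare_labels(alice, bob)
```

The agreement rate is a rough health check; the `to_discuss` list is the agenda for the alignment session.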

How error analysis shaped TextStack’s evals

This isn’t abstract for us. TextStack has seven AI surfaces, and every rubric we score against came directly out of reading failures, not out of a generic template.

Take Explain (tap a word, get a short in-context explanation). Reading real outputs surfaced a recurring failure: the model would produce a competent dictionary definition while ignoring the sentence the reader was actually looking at — useless for someone trying to understand this passage. That single observation is why the Explain rubric scores accuracy in context and usefulness to a learner as distinct axes, and explicitly penalises dictionary boilerplate under conciseness. The rubric is a direct transcription of the taxonomy.

Other surfaces produced different taxonomies, and therefore different axes:

Translate kept failing on register — accurate but wrong formality — so register became its own scored dimension alongside accuracy and fluency.

Vocabulary distractors (wrong answers in a quiz) failed by being implausible (too obviously wrong) or too similar to the right answer, so the rubric scores plausibility, distinctness, and difficulty.

We didn’t invent those dimensions in a meeting. We read outputs until the dimensions were obvious. And because every AI call is traced and viewable on an internal /ai-quality page, error analysis isn’t a one-time exercise — new production failures keep feeding new categories back into the taxonomy.

The pitfalls

Scoring before describing. A number erases the why. Open-code in words first.

Vague categories. “Bad output” isn’t a category; “ignored the sentence context” is. Specific enough to act on.

Too small a sample, or only the easy cases. If you only read successes, you’ll conclude everything is fine.

Fixing during analysis. Note the failure, move on. Triage after you can see the whole picture.

Labelling solo with no calibration. Disagreement is information; surface it before it hardens into a bad rubric.

Doing it once. Inputs drift. The taxonomy is a living document, refreshed from real traffic.

The takeaway

Error analysis is the part of evals with no tooling, no dashboard, and the highest payoff — and that’s exactly why it gets skipped. Read your failures, name them in plain language, and cluster them into a taxonomy. That taxonomy tells you what to fix and what to measure. Skip it and you’ll build a beautiful measurement system pointed at the wrong target.

Next in the series: golden datasets that don’t lie — turning your taxonomy into a curated set of cases you can score against, without quietly fooling yourself.

TextStack is a reader that helps you finish the dense technical book you keep quitting — it builds every modern AI primitive (observability, evals, RAG, agents) as a real production feature on .NET. Try it at textstack.app, or read the code at github.com/mrviduus/textstack.




We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.



By Vilius Vystartas | May 2026

Every LLM can write code that works. The question is: can they write code that’s efficient — and does telling them to be efficient actually help?

I tested 10 models on 10 coding tasks, each in two phases: unprompted (the model writes its own code) and prompted (explicitly told to write clean, DRY, efficient code). That’s 200 API calls, $0.56 total. The results are… not what most prompt engineers would predict.

GPT-5.4 got the largest boost from prompting (+0.20), with Qwen 3.6 Plus close behind (+0.17). For most other models, the “write efficient code” prompt was meaningless or actively harmful.

How the Metric Works

Each task has a known optimal token budget — the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The efficiency score is optimal_tokens / actual_tokens, capped at 1.0.

A score of 0.63 means the model used about 1.6x the optimal — not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the “write efficient code” instruction actually changes behaviour.
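The metric is simple enough to state in a few lines. This sketch follows the formula given above; the function names are my own, and `overhead_factor` is just the inverse of the score, used to reproduce the "1.6x" and "2.3x" readings.

```python
def efficiency_score(optimal_tokens, actual_tokens):
    """optimal_tokens / actual_tokens, capped at 1.0."""
    if optimal_tokens <= 0 or actual_tokens <= 0:
        raise ValueError("token counts must be positive")
    return min(1.0, optimal_tokens / actual_tokens)

def overhead_factor(score):
    """How many times the optimal token count a given score implies."""
    if not 0 < score <= 1:
        raise ValueError("score must be in (0, 1]")
    return 1 / score

# The button example: 70 optimal tokens vs 340 for ten separate blocks.
print(round(efficiency_score(70, 340), 2))  # 0.21
# Beating the budget does not score above 1.0.
print(efficiency_score(70, 60))             # 1.0
print(round(overhead_factor(0.63), 1))      # 1.6
print(round(overhead_factor(0.43), 1))      # 2.3
```

Because of the cap, a model cannot earn credit for undershooting the budget, so averages are pulled down by verbose answers but never pulled up by terse ones.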

The Leaderboard (Sorted by Prompted Efficiency)

| #  | Model              | Unprompted | Prompted | Δ     | Frugal | Cost   | Correctness |
|----|--------------------|------------|----------|-------|--------|--------|-------------|
| 🥇 | GPT-5.4            | 0.43       | 0.63     | +0.20 | 30%    | $0.096 | 78% → 85%   |
| 🥈 | Qwen 3.6 Plus      | 0.44       | 0.60     | +0.17 | 40%    | $0.158 | 78% → 87%   |
| 🥉 | Gemma 4 31B        | 0.54       | 0.58     | +0.04 | 50%    | $0.003 | 92% both    |
| 4  | DeepSeek Chat      | 0.51       | 0.55     | +0.04 | 30%    | $0.006 | 91% → 80%   |
| 5  | Claude Sonnet 4    | 0.47       | 0.52     | +0.04 | 40%    | $0.121 | 92% both    |
| 6  | LFM 2 24B A2B      | 0.54       | 0.47     | -0.06 | 30%    | $0.001 | 90% → 80%   |
| 7  | Mistral Large 2411 | 0.54       | 0.46     | -0.08 | 40%    | $0.050 | 90% → 82%   |
| 8  | Gemini 2.5 Flash   | 0.47       | 0.46     | -0.01 | 50%    | $0.020 | 92% → 90%   |
| 9  | Cohere Command A   | 0.60       | 0.44     | -0.17 | 40%    | $0.071 | 90% → 82%   |
| 10 | Kimi K2.6          | 0.34       | 0.43     | +0.09 | 30%    | $0.029 | 76% → 86%   |

What Stands Out

GPT-5.4 Is the Prompt Whisperer

GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were config-generation (+0.81 — went from 12 inline JSON blocks to a template loop), html-from-data (+0.71), and magic-strings (+0.38 — switched to an Enum). It’s the only model in the batch where the “write efficient code” instruction consistently produces different (and better) output.

The cost, $0.10 for 20 tasks, is mid-range for this batch: neither cheap nor expensive. But the efficiency gain is real.

Gemma 4 31B: The Quiet Winner

Half of Gemma 4’s tasks were already “frugal”: naturally efficient without being told. It scored 92% correctness in both phases at just $0.003 total. Going by the table’s rounded totals, that’s a cost advantage of more than 30x over GPT-5.4 ($0.096), with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.

Cohere Command A: Prompting Backfires

Cohere Command A had the highest unprompted efficiency in the batch (0.60) — it naturally writes concise code. But when told “write efficient code,” it ballooned output on several tasks. html-from-data went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.

Lesson: if a model is already efficient, don’t prompt it to be more efficient.

Qwen 3.6 Plus: Second Place, Slowest

Qwen 3.6 Plus scored second in prompted efficiency (+0.17 improvement) but took 26 minutes for 20 tasks — by far the slowest model. The efficiency gain is real (especially on html-from-data where it went from hardcoded rows to a map/join pattern), but you’re waiting for it. Batch workloads only.

The Kimi Surprise

Kimi K2.6 had the lowest unprompted efficiency (0.34, with verbose, boilerplate-heavy code) but posted the third-largest gain in the batch (+0.09). It is still in last place, but the prompt actually helped it compress, which is the opposite of the Cohere effect. Some models need the nudge.

Frugality: What Does It Mean?

“Frugal” means the model naturally produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50% — half their tasks were already efficient. GPT-5.4, DeepSeek Chat, and Kimi K2.6 were only 30% frugal — they needed the prompt to tighten up.
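A frugality rate like the one in the table can be computed from per-task unprompted scores. The article does not say how close "at or near the optimal" has to be, so the 0.9 cutoff below is purely my assumption, and the per-task scores are made up for illustration.

```python
def frugal_rate(unprompted_scores, threshold=0.9):
    """Share of tasks whose unprompted efficiency score meets the cutoff.

    threshold=0.9 is an assumed reading of "at or near optimal".
    """
    if not unprompted_scores:
        raise ValueError("need at least one task score")
    frugal = sum(1 for s in unprompted_scores if s >= threshold)
    return frugal / len(unprompted_scores)

# Hypothetical per-task scores for one model across the 10 tasks.
scores = [1.0, 0.95, 0.9, 0.3, 0.2, 0.4, 0.5, 0.35, 0.45, 0.25]
print(f"{frugal_rate(scores):.0%}")  # 30%
```

Note that the rate is sensitive to the cutoff: the same scores give a different percentage at 0.8 or 0.95, which is worth stating alongside any published frugality figure.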

The Bigger Picture

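| Group               | Models                                                                   | Behaviour                                                          |
|---------------------|--------------------------------------------------------------------------|--------------------------------------------------------------------|
| Prompt-responsive   | GPT-5.4, Qwen 3.6 Plus                                                   | Efficiency improves substantially with prompting                   |
| Prompt-neutral      | Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6 | Prompt has little effect (within ±0.04, except Kimi K2.6 at +0.09) |
| Prompt-antagonistic | LFM 2 24B A2B, Mistral Large 2411, Cohere Command A                      | Efficiency drops when prompted                                     |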

The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering — they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.

If the prompt says “write efficient code” and the model responds by writing more tokens, something in the training signal is misaligned.
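The grouping can be reproduced from the Δ column of the leaderboard. The article does not state cutoffs, so the two thresholds below (+0.15 and -0.05) are my assumption, chosen because they recover the same three groups.

```python
# Δ (prompted minus unprompted) as published in the leaderboard.
DELTAS = {
    "GPT-5.4": 0.20,
    "Qwen 3.6 Plus": 0.17,
    "Gemma 4 31B": 0.04,
    "DeepSeek Chat": 0.04,
    "Claude Sonnet 4": 0.04,
    "LFM 2 24B A2B": -0.06,
    "Mistral Large 2411": -0.08,
    "Gemini 2.5 Flash": -0.01,
    "Cohere Command A": -0.17,
    "Kimi K2.6": 0.09,
}

def classify(delta, responsive_at=0.15, antagonistic_at=-0.05):
    """Bucket a model by how its efficiency moved under the prompt."""
    if delta >= responsive_at:
        return "prompt-responsive"
    if delta <= antagonistic_at:
        return "prompt-antagonistic"
    return "prompt-neutral"

groups = {}
for model, delta in DELTAS.items():
    groups.setdefault(classify(delta), []).append(model)

for name, models in groups.items():
    print(f"{name}: {', '.join(models)}")
```

With only ten models the boundaries are fragile: LFM 2 24B A2B at -0.06 sits just past the antagonistic cutoff, so a slightly different threshold would move it to neutral.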

My Picks

Best prompted efficiency: GPT-5.4 — 0.63, $0.10 for 20 tasks. The only model where prompting reliably improves output.

Best value overall: Gemma 4 31B — 0.58 prompted, 92% correctness, $0.003. Absurd price/performance.

Best natural efficiency: Cohere Command A — 0.60 unprompted. Don’t prompt it, just let it work.

Most consistent: Claude Sonnet 4 — 92% correctness on both phases, small +0.04 efficiency gain. Reliable.

Skip if you’re in a hurry: Qwen 3.6 Plus — 26 minutes for 20 tasks. Great efficiency gains, terrible latency.

Watch list: Kimi K2.6 — low base efficiency but the prompt actually helps. Worth retesting with a better prompt.

Methodology

Ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash — each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).

Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: efficiency_ratio = optimal_tokens / actual_tokens (capped at 1.0). Correctness scored against expected output patterns.

Total cost: $0.56 for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.

Full results: benchmarks.workswithagents.dev


