DAILY NEWS

Stay Ahead, Stay Informed – Every Day

MCP Server Design: 3 Principles We Learned in Production



Exposing a tool to an agent over MCP takes ten minutes. Building an MCP server that survives a model you don’t control, on a tight token budget with limited thinking time, is the part nobody warns you about.

We learned the difference shipping our own, consumed by third-party agents whose models we don’t pick. Three principles came out of it, each one we only fully believed after it broke in production:

TL;DR — three MCP server best practices from our trenches:

Fewer tools, narrower surface. Consolidate around the workflow, not the underlying API.

Consistent verbiage everywhere. Same name for the same concept across every input, output, and value on the server.

Validate against the protocol, not just your tests. The schema is the contract; everything else is a hint.

Background

We’ve been iterating on Trent’s MCP server: one public-facing surface for the product, consumed by third-party agents whose models we don’t control. Each iteration taught us something we’d half-believed going in but only fully internalized after it broke. These three principles have crystallized from that work, and they cut against the grain of how it feels to build a server when you’re moving fast. None of these are subtle in hindsight.

1. Fewer Tools, Narrower Surface

The instinct from regular software design (small composable units, single responsibility) doesn’t transfer cleanly to MCP. The consumer of the surface is an LLM with a finite attention budget, not another piece of software. The right-sized tool is the workflow the agent is actually performing, not the smallest atomic operation in the underlying API.

Two reasons we’ve been aggressive about consolidation:

Overlap confuses tool selection. The trap usually isn’t tools that look identical; it’s tools that look distinct from the outside, with different names and different framings, but expose largely the same data with minor variations between them. The model has to decide which one is the “right” call for the workflow, and the decision is often arbitrary. On harder tasks it’s wrong in ways that are hard to debug. Consolidating those into a single tool, with the relevant slice exposed as a parameter, removes a degree of freedom the model didn’t need.

Every tool consumes context. If you’re exposing ~20 tools, the name, description, and schema of each one ride in the prompt on every turn once the tool list has been fetched. That’s a substantial chunk of context burned before the agent has done anything. Those tokens compound across a long loop and compete directly with the work the agent is actually trying to do.
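To make the consolidation idea concrete, here is a minimal sketch in plain Python. The tool name `get_assessment`, the `view` parameter, and the in-memory data are all hypothetical stand-ins, not the actual trentai-mcp surface. It shows one workflow-sized tool replacing several overlapping read tools, with the slice selected by a parameter.

```python
from typing import Literal

# Hypothetical in-memory data standing in for a real backend.
_ASSESSMENT = {
    "summary": {"open_tasks": 2, "risk": "medium"},
    "tasks": [
        {"task_id": "T-1", "status": "open"},
        {"task_id": "T-2", "status": "open"},
    ],
    "threats": [{"threat_id": "TH-1", "severity": "high"}],
}

View = Literal["summary", "tasks", "threats"]


def get_assessment(view: View = "summary") -> dict:
    """One tool for the whole "read the assessment" workflow.

    Stands in for three overlapping tools (say get_summary, list_tasks,
    list_threats, all hypothetical); `view` selects the slice to return.
    """
    if view not in _ASSESSMENT:
        # An explicit error message gives the calling model something to act on.
        raise ValueError(
            f"unknown view {view!r}; expected one of {sorted(_ASSESSMENT)}"
        )
    return {"view": view, "data": _ASSESSMENT[view]}


print(get_assessment("tasks")["data"][0]["task_id"])  # T-1
```

The model now chooses a value from a short enumerated list instead of choosing between three similar-looking tools, and only one schema rides in the prompt.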

Consolidating also tightens the loop for us as engineers. Fewer tools means a smaller surface to test, a smaller set of failure modes to observe, and a more direct path from a customer issue to the tool that caused it. The product gets simpler for the user, the workflow gets simpler for the model, and the codebase gets simpler for us. That alignment is rare; when you can find it, take it.

Concretely: we took our own MCP server from 17 tools down to 11, and the result was visibly better tool usage across the workflows that had been giving us trouble. The model spent fewer cycles on tool selection and the failure modes we were seeing on tighter constraints largely cleaned up. The current published version is trentai-mcp on PyPI.

The push to make this cut came from a pre-launch integration where Trent was exposed to end users through a third party’s chat interface. During testing we kept hitting cases where the chat couldn’t follow our instructions reliably, and tool overlap turned out to be a major contributor.

2. Consistency Across the Surface is a Correctness Property

MCP tool wording across the input schema, output schema, and the output values of every tool on a server needs to be consistent. If one tool calls a field user_id and another calls the same thing customer_id and a third returns accountId, the model has to reconcile that on every call. It mostly does, but reconciliation costs tokens, introduces ambiguity, and shows up as flaky tool calls in unpredictable conditions.

This matters more than it sounds because you don’t always control the model on the other side of the wire. When the MCP server is consumed by a third party, the agent could be running on a small model with a tight token budget and limited thinking time. Inconsistent naming that a frontier model would reason past, a smaller model just fails on. The same surface that looks fine in development collapses in a deployment you can’t see.

We ran into this during the same third-party pre-launch integration mentioned above. We exposed an update_tasks tool that let the chat write progress into a Trent security assessment, but the underlying API used control_id for the response field name and task_id for the input field name. The chat got confused between the two, the tool call failed repeatedly, and it couldn’t debug its way out. We didn’t catch this right away either; the 422s we kept seeing looked like a service-side bug, and we’d been debugging on the service end for a while before realizing the failure was upstream of the API, in the chat’s tool call. Making the naming consistent across input, output, and value cleared it up.
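A naming mismatch like this can be caught mechanically before it ships. The sketch below is one possible lint, not something the server itself contains. It assumes tool definitions are available as plain dicts with `inputSchema` and `outputSchema` keys, and that you maintain a hand-written `ALIAS_GROUPS` table (hypothetical) of names that refer to the same concept.

```python
from collections import defaultdict

# Hypothetical alias groups: names that all refer to the same concept.
ALIAS_GROUPS = {
    "task": {"task_id", "control_id", "taskId"},
    "user": {"user_id", "customer_id", "accountId"},
}


def schema_fields(schema: dict) -> set[str]:
    """Collect property names from a JSON Schema, recursing into objects and arrays."""
    names: set[str] = set()
    for name, sub in schema.get("properties", {}).items():
        names.add(name)
        if isinstance(sub, dict):
            names |= schema_fields(sub)
    if isinstance(schema.get("items"), dict):
        names |= schema_fields(schema["items"])
    return names


def find_inconsistent_names(tools: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Return {concept: {field_name: [locations]}} for concepts spelled more than one way."""
    seen: dict = defaultdict(lambda: defaultdict(list))
    for tool in tools:
        for key in ("inputSchema", "outputSchema"):
            for field in sorted(schema_fields(tool.get(key, {}))):
                for concept, aliases in ALIAS_GROUPS.items():
                    if field in aliases:
                        seen[concept][field].append(f"{tool['name']}.{key}")
    return {c: dict(names) for c, names in seen.items() if len(names) > 1}


# The update_tasks mismatch described above, reduced to its field names.
tools = [{
    "name": "update_tasks",
    "inputSchema": {"type": "object", "properties": {"task_id": {"type": "string"}}},
    "outputSchema": {"type": "object", "properties": {"control_id": {"type": "string"}}},
}]
print(find_inconsistent_names(tools))
```

Run over the example, the lint reports the `task` concept under two spellings, one per schema. Wiring a check like this into CI turns naming drift into a failing build rather than a flaky tool call in someone else's deployment.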

The frame I’ve landed on is simple: the model on the other side of the wire is a variable you don’t get to pick. So design the surface for the least capable consumer that matters. Capable models reason past inconsistent naming; smaller ones fail on it. Consistency costs you one round of cleanup before you ship; inconsistency gets paid by every consumer, every call, forever.

3. Don’t Trust the Implementation Just Because it Works

This is the principle I’d most like to have learned sooner.

We built the MCP server with an agent. It worked. The tests the agent wrote alongside the implementation passed, our engineer-driven dogfooding ran cleanly, and the manual testing we did in the workflows we cared about all came back green. Yet beyond the tool selection and naming problems we covered earlier, we kept hitting a different class of failure that we couldn’t reproduce locally: the consuming agent getting the input shape wrong and invoking the tool in ways that didn’t match what we’d documented at all.

When we looked under the hood, the implementation hadn’t actually defined input and output schemas in the JSON properties the MCP protocol specifies. The agent that wrote the server had instead stuffed the entire contract (input shape, output shape, examples) into the description string of the tool, as a long comment-like blob. Frontier models read that and inferred the right structure. Smaller models, with less budget for inference, couldn’t. The fix is structural. MCP inputSchema and outputSchema are contracts, not hints. Stuffing them into the description string opts you out of every guarantee the protocol gives you.
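For illustration, here is roughly what the structural version looks like: a tool definition whose contract lives in `inputSchema` as JSON Schema, plus a deliberately tiny validator. The field names and the `status` enum are assumptions made for this example, and the validator covers only required fields, primitive types, and enums. A real server would lean on a full JSON Schema library instead.

```python
# Tool definition with the contract in inputSchema, not in the description.
# Field names and enum values here are illustrative assumptions.
UPDATE_TASKS_TOOL = {
    "name": "update_tasks",
    "description": "Write task progress into a security assessment.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "task_id": {"type": "string"},
            "status": {"type": "string", "enum": ["open", "in_progress", "done"]},
        },
        "required": ["task_id", "status"],
        "additionalProperties": False,
    },
}

_JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}


def validate_arguments(schema: dict, args: dict) -> list[str]:
    """Minimal check of tool-call arguments against a flat object schema."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        expected = _JSON_TYPES.get(spec.get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{name}: expected {spec['type']}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: must be one of {spec['enum']}")
    return errors


schema = UPDATE_TASKS_TOOL["inputSchema"]
print(validate_arguments(schema, {"task_id": "T-1", "status": "done"}))     # []
print(validate_arguments(schema, {"control_id": "T-1", "status": "done"}))
# ['missing required field: task_id', 'unexpected field: control_id']
```

With the schema in place, a wrong field name produces a specific, machine-readable error instead of an opaque failure further downstream.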

Two lessons from that, both worth saying out loud:

Use the structure the protocol gives you. MCP defines inputSchema and outputSchema as discrete, structured fields for a reason: well-built clients use them to validate inputs, constrain agent behavior, and surface errors early. A description is a hint. A schema is a contract.

Agents get you to “working” faster than to “correct.” That gap is widest in unfamiliar territory, and a young protocol counts as unfamiliar territory, however many examples you’ve worked through. The agent picked a path that satisfied the tests it had written itself, evaluated by the same class of model that wrote them. It didn’t pick the path the protocol intended. We caught it because a stricter consumer broke; if we’d never had that consumer, we’d still be carrying the bug.
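One way to check against the protocol rather than against your own tests is to audit the tool definitions the server actually advertises. The sketch below is a heuristic, not a conformance suite. It assumes the definitions have already been fetched as dicts, and the 500-character threshold and the brace check are arbitrary choices. Note that `outputSchema` is optional in MCP and a tool with no arguments can legitimately declare no properties, so treat the findings as prompts for review.

```python
def audit_tool(tool: dict, max_description_chars: int = 500) -> list[str]:
    """Flag tool definitions whose contract may live in the description string."""
    findings = []
    schema = tool.get("inputSchema")
    if not isinstance(schema, dict):
        findings.append("inputSchema is missing")
    elif schema.get("type") != "object":
        findings.append('inputSchema "type" is not "object"')
    elif not schema.get("properties"):
        findings.append("inputSchema declares no properties")
    if "outputSchema" not in tool:
        findings.append("outputSchema is not declared")
    description = tool.get("description", "")
    # Long descriptions or embedded JSON suggest the schema was written as prose.
    if len(description) > max_description_chars or "{" in description:
        findings.append("description looks like it carries the contract")
    return findings


# A hypothetical description-stuffed definition, similar in spirit to what we found.
stuffed = {
    "name": "update_tasks",
    "description": 'Input: {"task_id": "<string>", "status": "<string>"}. Returns the updated task.',
    "inputSchema": {"type": "object"},
}
for finding in audit_tool(stuffed):
    print(finding)
```

Because this inspects the advertised definitions and not the implementation, it does not share the blind spots of tests written by the same agent that wrote the server.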

What we built with these principles

The server I’ve been describing — trentai-mcp — is how Trent shows up inside Claude Code. It runs the full Scan → Judge → Mitigate → Evaluate loop in your editor: surfacing threats relevant to your application’s architecture, prioritizing them against the real risk profile, generating a remediation plan that becomes tasks Claude Code can implement, and tracking how your security posture changes session over session.

MCP is still young, and the patterns for designing servers well are still being worked out across the industry. The three principles above come from real-world failures we hit in production, and they are what I’d share with a new teammate on day one of building a new server.

Originally published on the Trent AI blog — the full piece includes the worked example of the four consolidated tools.




Why Your Gemini Bill Doesn’t Match the Model Names



tl;dr – Across roughly 3,300 paired skill-eval runs, Gemini 3.5 Flash cost $1.05 per task against Gemini 3.1 Pro’s $0.66, for scores that were effectively identical: 88.6 versus 87.9.

The pricing is even stranger when you look at the actual task costs. Gemini 3.5 Flash and Gemini 3 Flash Preview are separated by almost 8× in per-task cost, while Gemini 3.1 Pro comes in cheaper than Gemini 3.5 Flash. The invoice does not appear to follow the naming hierarchy.

Where the numbers come from

The benchmark ran every task twice, once with the relevant skill applied and once without, across four Gemini models in OpenHands, totaling roughly 800 tasks per model. Rather than relying on dashboard estimates, we pulled per-call token counts directly from agent session logs and computed costs using Google’s published per-token prices. We then compared the resulting per-task costs across models.

The headline data

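| Model | $/task (w/ skill) | Score | Pts per $ | Input tokens (per task) | Turns (per task) | List $/Mtok |
|---|---|---|---|---|---|---|
| 3.1 Flash Lite | $0.035 | 70.2 | 2,006 | 0.31M | 17 | $0.25 |
| 3 Flash Preview | $0.135 | 85.4 | 633 | 0.63M | 24 | $0.50 |
| 3.1 Pro Preview | $0.66 | 87.9 | 132 | 0.65M | 26 | $2.00 |
| 3.5 Flash | $1.05 | 88.6 | 85 | 1.41M | 39 | $1.50 |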

A few things stand out from this data.

Cost order and name order are uncorrelated. Gemini 3.1 Pro is cheaper per task than Gemini 3.5 Flash despite carrying a higher per-token list price, while Gemini 3.5 Flash and Gemini 3.1 Flash Lite, which both carry the Flash name, differ by roughly 30× in actual spend. Model names describe intended positioning, but they are a poor guide to real-world agent costs.
Scores do improve with each model generation, which is a genuine positive trend and a good reason to track releases, but capability gains do not automatically translate to cost reductions.
Finally, the practical value pick is Gemini 3 Flash Preview, which lands within about three points of the leading models at roughly one-fifth of Pro’s per-task cost and one-eighth of 3.5 Flash’s, making it the most efficient option for workloads where a score in the 85 range is acceptable.
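The "Pts per $" column and the cost gaps can be recomputed from the two published columns. The snippet below is a quick arithmetic check using only the rounded figures from the table. Recomputing from rounded inputs gives 133 and 84 for the last two rows, not the published 132 and 85, which is consistent with the original being computed from unrounded costs.

```python
# (cost per task in USD with skill, score), copied from the table above
RESULTS = {
    "3.1 Flash Lite": (0.035, 70.2),
    "3 Flash Preview": (0.135, 85.4),
    "3.1 Pro Preview": (0.66, 87.9),
    "3.5 Flash": (1.05, 88.6),
}


def points_per_dollar(cost_usd: float, score: float) -> float:
    """Benchmark points bought per dollar of per-task spend."""
    return score / cost_usd


for model, (cost, score) in RESULTS.items():
    print(f"{model:16s} {points_per_dollar(cost, score):7.0f} pts/$")

# Cost multiple between the newest Flash and the value pick.
flash_gap = RESULTS["3.5 Flash"][0] / RESULTS["3 Flash Preview"][0]
print(f"3.5 Flash vs 3 Flash Preview: {flash_gap:.1f}x")  # 7.8x
```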

Why volume beats unit price

The cost of an agentic task is the product of two variables:

`Task cost = price-per-token × tokens the model decides to spend`


Model names establish the first variable. The second is determined at runtime by the model’s behavior on the specific task, and it only becomes visible after you read your session logs.
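A small sketch of that product, summed over the calls in one task. The record shape (`Call`) and every price and token count in the example are made up for illustration. They are not Google’s rates or figures from this benchmark. The point is only that a model at half the per-token price can still cost more once it takes more turns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per million tokens (illustrative values only)."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


@dataclass(frozen=True)
class Call:
    """Token usage for one model call, as read from a session log (hypothetical shape)."""
    input_tokens: int    # non-cached input
    cached_tokens: int   # cache-read input
    output_tokens: int   # output, including thinking


def task_cost(calls: list[Call], prices: Prices) -> float:
    """Total USD cost of one task: price per token times tokens spent, per call."""
    total = 0.0
    for c in calls:
        total += c.input_tokens * prices.input_per_mtok
        total += c.cached_tokens * prices.cached_input_per_mtok
        total += c.output_tokens * prices.output_per_mtok
    return total / 1_000_000


cheap = Prices(1.0, 0.1, 4.0)    # half the list price...
pricey = Prices(2.0, 0.2, 8.0)
four_turns = [Call(100_000, 0, 1_000)] * 4   # ...but four turns to finish
one_turn = [Call(100_000, 0, 1_000)]

print(round(task_cost(four_turns, cheap), 3))  # 0.416
print(round(task_cost(one_turn, pricey), 3))   # 0.208
```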

For Gemini 3.5 Flash, the per-task cost breaks down as follows:

Non-cached input: $0.72

Cache-read input: $0.14

Output (including thinking): $0.19

The dominant driver is input volume. Gemini 3.5 Flash sent 1.41 million tokens of context across 39 agent turns per task. Pro sent roughly half that volume across 26 turns, and even at its higher list price of $2.00 per million tokens, its lower volume resolves to a lower total bill.
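As a quick check on the breakdown above, the three components sum to the $1.05 per-task figure, and input (cached plus non-cached) accounts for roughly four-fifths of it. The snippet uses only the dollar amounts quoted above.

```python
# Per-task cost components for Gemini 3.5 Flash, in USD, as listed above.
breakdown = {
    "non_cached_input": 0.72,
    "cache_read_input": 0.14,
    "output_incl_thinking": 0.19,
}

total = sum(breakdown.values())
shares = {name: cost / total for name, cost in breakdown.items()}
input_share = shares["non_cached_input"] + shares["cache_read_input"]

print(f"total: ${total:.2f}")              # total: $1.05
for name, share in shares.items():
    print(f"{name:22s} {share:5.0%}")
print(f"input share: {input_share:.0%}")   # input share: 82%
```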

A model with a cheaper per-token rate that takes more turns to reach an answer will erode its own discount. It is also worth noting that 63-75% of input across these runs was cache-read, which means the effective sensitivity to turn count is even higher than raw list prices suggest: the multiplier is accumulating in your session logs, not on your pricing page.

Skills move cost by tier

Adding a relevant skill to each run changed per-task cost in opposite directions depending on which model ran it:

Pro saw cost drop $0.20 per task (-23%) while the score gained 20 points. The model used fewer turns and less exploratory backtracking, which suggests it was able to act on the structured guidance directly rather than discovering the solution path through iteration.
3.5 Flash was essentially flat, with cost shifting by less than $0.03 in either direction.
3 Flash Preview and Flash Lite each spent slightly more tokens for marginal score gains (+$0.03 and +$0.01 respectively).
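The Pro figures can be cross-checked against the headline table: a $0.20 drop that lands at $0.66 implies a no-skill baseline of about $0.86, and $0.20 is about 23% of that. The helper below is just that arithmetic. The baseline is derived here, not a number reported directly in the text.

```python
def skill_effect(cost_with_skill: float, drop: float) -> tuple[float, float]:
    """Return (implied baseline cost, percent change) for a given cost drop in USD."""
    baseline = cost_with_skill + drop
    percent_change = -drop / baseline * 100
    return baseline, percent_change


baseline, pct = skill_effect(cost_with_skill=0.66, drop=0.20)
print(f"implied cost without skill: ${baseline:.2f}")  # $0.86
print(f"change with skill: {pct:.0f}%")                # -23%
```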

The underlying pattern is consistent: a skill compresses the solution path for a model capable of following structured guidance precisely, reducing turn count and therefore total cost. For a model still resolving ambiguity through exploration, the same skill adds context to process rather than a shortcut to apply, and the cost holds steady or rises marginally. A skill is a shortcut for a capable model and overhead for a weaker one.

In practical terms, this produces two clear operating points. Pro with a relevant skill at $0.66 per task is the most cost-efficient route to top-tier performance. Gemini 3 Flash Preview with a skill at $0.135 per task delivers roughly five to seven times the score-per-dollar of the two leaders, for a score about three points lower, which is a reasonable trade for many workloads.

Measure, don’t assume

Four takeaways from this data that apply beyond this specific benchmark:

1/ Do not budget from the rate card. Cost your workload based on measured tokens and turns on your specific tasks, with your specific prompts, in your specific agent harness. Per-token list prices are a useful first filter for ordering candidates, not a reliable predictor of relative spend.

2/ Read cost at the session layer. Aggregate dashboards can show $0 while spend accumulates in the background. Token usage needs to come from raw API responses or agent session logs to be trusted for budgeting purposes.

3/ Watch turn count first. The 39-versus-26 turn gap between 3.5 Flash and Pro is the primary cause of the price inversion observed here, and turn count is the variable most commonly absent from observability tooling. It is the multiplier on everything else in the cost equation.

4/ Re-measure when models update. Gemini 3.5 Flash is a newer release than Gemini 3 Flash Preview and scores higher, but it costs roughly eight times more in this agentic context. Capability improvements and cost improvements are independent variables, and any cost benchmark needs to be re-run with each version update rather than assumed to hold.
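To make the turn-count takeaway measurable, the input volume per task can be split into turns multiplied by average input per turn. The sketch below does that with the rounded figures from the headline table. It assumes nothing about how context is distributed across turns, only the per-task averages.

```python
# (input tokens per task, turns per task), from the headline table
RUNS = {
    "3.1 Flash Lite": (310_000, 17),
    "3 Flash Preview": (630_000, 24),
    "3.1 Pro Preview": (650_000, 26),
    "3.5 Flash": (1_410_000, 39),
}


def per_turn_input(tokens: int, turns: int) -> float:
    """Average input tokens sent per agent turn."""
    return tokens / turns


for model, (tokens, turns) in RUNS.items():
    print(f"{model:16s} {per_turn_input(tokens, turns):9,.0f} tokens/turn")

flash_tokens, flash_turns = RUNS["3.5 Flash"]
pro_tokens, pro_turns = RUNS["3.1 Pro Preview"]

turn_ratio = flash_turns / pro_turns          # 1.5
per_turn_ratio = (per_turn_input(flash_tokens, flash_turns)
                  / per_turn_input(pro_tokens, pro_turns))
volume_ratio = flash_tokens / pro_tokens      # product of the two ratios above

print(f"turns: {turn_ratio:.2f}x, per-turn input: {per_turn_ratio:.2f}x, "
      f"total input: {volume_ratio:.2f}x")
```

Tracking both factors in your own logs shows whether a cost change came from the agent taking more turns or from each turn carrying more context.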

Caveats

These results come from a single agent harness (OpenHands), a single benchmark with explicit skill-relevance disclosure, and a specific sample window. Different tasks, prompt structures, and turn-length patterns will shift the absolute numbers and may shift the relative rankings. The finding to carry forward is not a specific model recommendation but a methodology: in agentic settings, cost rankings are not derivable from per-token rates alone, and the ranking that applies to your workload depends on that workload’s specific behavioral profile.

A model name is a pricing tier, not a cost forecast. In agentic workflows, the deciding variable is how many tokens the model chooses to spend to reach an answer, a figure visible only after you run the work and read the logs. The rate card gives you one of the two inputs; only measurement gives you both.

Next: which skills actually earn their tokens? In these runs, 42% produced significant performance gains while 5% were net overhead. We’ll follow up on this analysis in the next post.




Securing the AI supply chain 🛡️



AI agents are reading code and dependencies at scale. This changes how we think about supply chain risk and the security of our builds.

Adding security checks during the build process is essential for modern development. Here is how you can use Snyk and Upsun to protect your workflow:

Implement Snyk to scan dependencies for vulnerabilities
Add automated scans directly to your build hook
Capture risks at build time before they reach production
Understand exactly what automated scanning can and cannot catch
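As a rough sketch of what a build-time gate could look like, the script below wraps the Snyk CLI’s `snyk test` command and turns its exit code into a pass/fail decision. It assumes the CLI is installed in the build container and authenticated (for example via a `SNYK_TOKEN` environment variable). The function names and the injectable `runner` are choices made for this example, and how you invoke it from your Upsun build hook is left to the linked write-up.

```python
import subprocess
from typing import Callable, Sequence


def run_snyk(cmd: Sequence[str]) -> int:
    """Run the Snyk CLI and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def dependency_gate(
    threshold: str = "high",
    runner: Callable[[Sequence[str]], int] = run_snyk,
) -> tuple[bool, str]:
    """Scan dependencies and decide whether the build may continue.

    Relies on `snyk test` exit codes: 0 means no issues at or above the
    threshold, 1 means vulnerabilities were found, anything else means
    the scan itself did not complete.
    """
    code = runner(["snyk", "test", f"--severity-threshold={threshold}"])
    if code == 0:
        return True, f"no vulnerabilities at or above '{threshold}'"
    if code == 1:
        return False, f"vulnerabilities at or above '{threshold}' found"
    return False, f"snyk could not complete the scan (exit code {code})"
```

A build hook would call `dependency_gate()` and exit non-zero when the first element is `False`, so a failed or incomplete scan stops the build. The `runner` parameter exists so the decision logic can be tested without the CLI present.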

Check out the full technical write-up to see how to implement these build time fixes:

AI agents are reading code at scale, including your dependencies. Why supply chain risk just changed, plus a build-time fix to add on Upsun.

developer.upsun.com


