DAILY NEWS

Stay Ahead, Stay Informed – Every Day

I Ditched Vector Search for My Coding Agent’s Memory. FTS5 Won.



Every “give your agent memory” tutorial I’ve read reaches for the same stack: chunk your docs, embed them, throw the vectors in a database, do cosine similarity at query time. So when I needed my coding agent to search through indexed tool output, git logs, and fetched docs without dumping raw text into the model’s context window, I assumed I’d be standing up a vector store too.

I didn’t. I used SQLite’s FTS5 full-text search instead, and for this specific job it’s not a compromise — it’s the better tool.

What the problem actually was

The tool I built (context-mode, for routing large command output and API responses out of the model’s context) needs to answer queries like:

“failing tests”
“HTTP 500 errors”
“async route handlers”

against arbitrary shell output, JSON responses, and fetched web pages, indexed once and searched as many times as a session needs. The naive version just dumps everything into context and lets the model read it. That works until the output is 50KB of test logs and you’ve burned half your context window on text you needed three lines of.

Why vectors are the wrong default here, not just an alternative

Vector search is built to answer “what’s semantically similar to this.” That’s the right tool when you’re searching prose — support tickets, documentation, chat transcripts — where the same idea gets expressed in different words and you need “how do I reset my password” to match a doc titled “Account Recovery Steps.”

Coding-agent queries mostly aren’t that. “HTTP 500 errors” isn’t a fuzzy semantic concept I want approximated — it’s closer to a literal grep with better ranking. The content being searched is also structured and keyword-dense: stack traces, log lines, JSON keys, error codes. Embedding a stack trace and comparing cosine similarity throws away the thing that actually matters (the literal exception name, the literal line number) in favor of a vector representation that’s better at “these two paragraphs are about similar topics” than “this line contains the string ECONNREFUSED.”

FTS5 is built for exactly this: tokenized, indexed, ranked full-text search over exact and near-exact term matches, with BM25-style relevance scoring out of the box.

What it actually looks like

No embedding model, no vector database, no network round-trip to compute embeddings. It’s stdlib:

    import sqlite3

    conn = sqlite3.connect("index.db")
    # One FTS5 virtual table: where the text came from, and the text itself.
    conn.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS docs
        USING fts5(source, content)
    """)

    def index(source: str, content: str):
        conn.execute("INSERT INTO docs (source, content) VALUES (?, ?)", (source, content))
        conn.commit()

    def search(query: str, limit: int = 5):
        # snippet() column index 1 is `content`; `rank` orders by BM25, best match first.
        rows = conn.execute("""
            SELECT source, snippet(docs, 1, '(', ')', '…', 20), rank
            FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?
        """, (query, limit)).fetchall()
        return rows


That’s the whole engine. snippet() gives you highlighted context around the match for free. rank gives you BM25 ordering for free. Querying “HTTP 500 errors” against a batch of indexed test output returns the rows that contain all three terms, ranked by term frequency and rarity: not the semantically nearest paragraph, but the one that actually contains what you asked for. One caveat: the default unicode61 tokenizer does not stem, so “errors” will not match “error” unless you create the table with tokenize='porter'.
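One practical wrinkle with passing raw strings to MATCH: FTS5 has its own query syntax, so a query containing a colon, hyphen, or quote can raise a syntax error instead of matching. The following is a minimal sketch of one way to handle that, assuming your Python build ships SQLite with FTS5 enabled. The to_match_query helper and the sample rows are hypothetical, not part of context-mode; the porter tokenizer option is a standard FTS5 feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: tokenize='porter' adds Porter stemming on top of the default
# unicode61 tokenizer, so "errors" and "error" index to the same term.
conn.execute(
    "CREATE VIRTUAL TABLE docs USING fts5(source, content, tokenize='porter')"
)

def to_match_query(text: str) -> str:
    """Hypothetical helper: quote every whitespace-separated term so FTS5
    treats it as a literal string instead of parsing it as query syntax."""
    terms = [term.replace('"', '""') for term in text.split()]
    return " ".join(f'"{term}"' for term in terms)

def index(source: str, content: str) -> None:
    conn.execute("INSERT INTO docs (source, content) VALUES (?, ?)", (source, content))

def search(query: str, limit: int = 5):
    # Quoted terms are implicitly ANDed; rank orders by BM25, best match first.
    return conn.execute(
        """
        SELECT source, snippet(docs, 1, '(', ')', '…', 20)
        FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?
        """,
        (to_match_query(query), limit),
    ).fetchall()

# Made-up sample output, for illustration only.
index("pytest", "FAILED test_api.py::test_create - AssertionError: HTTP 500 error from /users")
index("git log", "a1b2c3 Refactor async route handlers")
index("curl", "connect ECONNREFUSED 127.0.0.1:5432")

for query in ("HTTP 500 errors", "ECONNREFUSED 127.0.0.1:5432"):
    print(query, "->", search(query))
```

Quoting every term trades away FTS5’s operators (OR, NEAR, prefix matching) for robustness, which seems like a reasonable default when the query text comes from a model rather than a person. The sketch assumes a non-empty query.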

Where this would fall over — and why it doesn’t here

FTS5 is a bad choice if your queries genuinely need semantic matching: “find the doc about resetting my password” needs to match “Account Recovery,” and no amount of tokenization gets you there without embeddings. If I were building search over a knowledge base of prose documentation with inconsistent terminology, I’d reach for vectors, possibly hybrid (BM25 for recall, vectors for semantic re-ranking).

But an agent’s own tool output, error logs, and fetched API responses are dense with the literal terms you’re going to search for, because you (or the agent) wrote the query with those terms in mind. “Failing tests” as a query is going to co-occur with FAIL, AssertionError, test names — words that are actually in the log. The semantic gap that justifies embeddings mostly doesn’t exist in this domain.

The generalizable lesson

“Add semantic search” has become a reflex the same way “add a cache” or “add a queue” is — reached for because it’s the default answer to “how do I search this,” not because the problem demands it. Vector infra costs you an embedding model, a vector database or extension, and a slower indexing step, in exchange for a capability — semantic similarity — that keyword-dense, structured content usually doesn’t need.

Before reaching for embeddings on your next “agent needs to search X” problem, ask what the query and the content actually look like. If both are keyword-dense and structurally similar (logs, code, JSON, stack traces), full-text search with BM25 ranking will usually match or beat vectors on relevance and cost you a fraction of the infrastructure. Save the vector database for the day your content is actually prose with vocabulary mismatch; most agent tooling isn’t there yet.




LLM Gateway vs MCP Gateway: Understanding the New AI Infrastructure Stack



As AI applications evolve from simple chatbots into autonomous agents, a new infrastructure layer is emerging. Terms like LLM Gateway, MCP Gateway, MCP Registry, LLM Router, and Agent Gateway are appearing everywhere—but what do they actually do?

Let’s break it down.

The Challenge with Modern AI Systems

Early AI applications were simple:

Application → LLM

Today’s enterprise AI systems are very different. A single AI agent may need to:

Access multiple LLM providers
Connect to GitHub, Slack, Jira, and internal APIs
Discover tools dynamically
Follow security and compliance policies
Track usage and costs

Without a centralized layer, managing these integrations quickly becomes messy and difficult to scale.

What Is an LLM Gateway?

An LLM Gateway provides a single entry point for all model interactions.

Instead of integrating separately with OpenAI, Anthropic, Gemini, or open-source models, applications connect to one gateway that handles:

Authentication
Rate limiting
Usage tracking
Cost monitoring
Security policies

For teams running multiple models, an LLM Gateway simplifies operations significantly.
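To make the “single entry point” idea concrete, here is a toy sketch of a gateway that checks an API key, enforces a quota, records usage, and only then forwards the prompt. Everything in it is an assumption for illustration: the LLMGateway class, the model names, and the lambda “providers” stand in for real provider SDK calls and are not any vendor’s API.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class LLMGateway:
    # Model name -> callable standing in for a real provider SDK call.
    providers: Dict[str, Callable[[str], str]]
    # API key -> team that owns it (authentication).
    api_keys: Dict[str, str]
    # Toy quota: total requests per team, with no time window.
    max_requests_per_team: int = 2
    # Team -> number of requests served (usage tracking).
    usage: Dict[str, int] = field(default_factory=lambda: defaultdict(int))

    def complete(self, api_key: str, model: str, prompt: str) -> str:
        team = self.api_keys.get(api_key)
        if team is None:
            raise PermissionError("unknown API key")
        if self.usage[team] >= self.max_requests_per_team:
            raise RuntimeError(f"rate limit exceeded for {team}")
        if model not in self.providers:
            raise KeyError(f"no provider registered for {model}")
        self.usage[team] += 1
        return self.providers[model](prompt)

gateway = LLMGateway(
    providers={
        "model-a": lambda prompt: f"[model-a] {prompt}",
        "model-b": lambda prompt: f"[model-b] {prompt}",
    },
    api_keys={"key-123": "search-team"},
)
print(gateway.complete("key-123", "model-a", "hello"))
```

The application only ever talks to gateway.complete, so swapping or adding a provider does not touch application code.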

If you’re exploring production-grade AI infrastructure, TrueFoundry has a detailed guide on LLM Gateways:

👉 https://www.truefoundry.com/docs/gateway

Why LLM Routers Matter

Not every request needs the same model.

A coding task may require a different model than a customer-support query. An LLM Router automatically selects the most suitable model based on factors such as:

Cost
Latency
Performance
Availability

This helps organizations optimize both quality and spending.
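A rough sketch of how such a router might choose: filter the candidates by availability, a quality floor, and a latency ceiling, then take the cheapest one left. The ModelOption fields, the model names, and all the numbers are made up for illustration; real routers typically use richer signals.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float   # hypothetical price
    p50_latency_ms: int         # hypothetical median latency
    quality: float              # hypothetical internal eval score, 0..1
    available: bool = True

def route(options: List[ModelOption], min_quality: float, max_latency_ms: int) -> ModelOption:
    """Pick the cheapest available model that meets the quality and latency limits."""
    candidates = [
        m for m in options
        if m.available and m.quality >= min_quality and m.p50_latency_ms <= max_latency_ms
    ]
    if not candidates:
        raise LookupError("no model satisfies the constraints")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)

MODELS = [
    ModelOption("small-fast", 0.2, 300, 0.70),
    ModelOption("mid", 1.0, 800, 0.85),
    ModelOption("large-accurate", 5.0, 2000, 0.95),
    ModelOption("mid-offline", 0.5, 700, 0.85, available=False),
]

# A support query tolerates lower quality but needs a fast answer;
# a coding task accepts more latency in exchange for quality.
print(route(MODELS, min_quality=0.6, max_latency_ms=500).name)
print(route(MODELS, min_quality=0.9, max_latency_ms=3000).name)
```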

Enter MCP: The Standard for AI Tools

The Model Context Protocol (MCP) is becoming the standard way for AI agents to interact with tools and external systems.

Instead of creating custom integrations for every service, developers can expose capabilities through MCP servers.

Examples include:

GitHub MCP Server
Slack MCP Server
Notion MCP Server
Internal enterprise tools

As MCP adoption grows, managing dozens or hundreds of MCP servers becomes a challenge.

What Is an MCP Gateway?

An MCP Gateway acts as a centralized access layer between agents and MCP servers.

It provides:

Unified authentication
Access control
Auditing
Observability
Governance

Rather than giving every agent direct access to every tool, organizations can enforce policies through a single gateway.
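As an illustration of policy enforcement at a single point, the sketch below checks every tool call against a per-agent allowlist and writes an audit record whether or not the call goes through. The MCPGateway class, the agent and tool names, and the plain Python callables (standing in for forwarded MCP tool calls) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set, Tuple

@dataclass
class MCPGateway:
    # "server/tool" -> callable standing in for a forwarded MCP tool call.
    tools: Dict[str, Callable[..., Any]]
    # Agent id -> set of tools that agent is allowed to call (access control).
    policy: Dict[str, Set[str]]
    # (agent id, tool, decision) for every attempt (auditing).
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def call_tool(self, agent_id: str, tool: str, **arguments: Any) -> Any:
        allowed = tool in self.policy.get(agent_id, set())
        self.audit_log.append((agent_id, tool, "allowed" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{agent_id} may not call {tool}")
        return self.tools[tool](**arguments)

mcp_gateway = MCPGateway(
    tools={
        "github/list_issues": lambda repo: [f"{repo}#1"],
        "slack/post_message": lambda channel, text: f"posted to {channel}",
    },
    policy={"triage-agent": {"github/list_issues"}},
)

print(mcp_gateway.call_tool("triage-agent", "github/list_issues", repo="acme/api"))
```

Because the denied attempt is logged before the exception is raised, the audit trail also shows what agents tried to do, not only what they did.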

Learn more about MCP Gateway architecture here:

👉 https://www.truefoundry.com/blog/introducing-truefoundry-mcp-gateway

MCP Proxy vs MCP Gateway

These terms are often confused.

An MCP Proxy primarily forwards requests between agents and MCP servers while handling authentication and connectivity.

An MCP Gateway goes further by adding:

Governance
Monitoring
Policy enforcement
Access management
Registry integration

Think of a proxy as a connectivity layer and a gateway as a complete management layer.

MCP Registry, Agent Registry, and Skills Registry

As AI ecosystems grow, discovery becomes just as important as connectivity.

MCP Registry: A centralized catalog of available MCP servers, including metadata, ownership, and versions.

Agent Registry: A directory of deployed AI agents and their capabilities.

Skills Registry: A searchable catalog of reusable skills, tools, and workflows that agents can access.

Together, these registries help organizations avoid duplication and improve governance.

Final Thoughts

The future of enterprise AI isn’t just about better models. It’s about managing how models, agents, and tools work together.

That’s why technologies such as LLM Gateway, LLM Router, MCP Gateway, MCP Proxy, MCP Registry, Agent Gateway, Agent Registry, and Skills Registry are becoming critical components of modern AI platforms.

As organizations scale from a handful of AI applications to hundreds of agents and tools, these infrastructure layers will become as important as API gateways are in traditional software systems.




MCP Server Design: 3 Principles We Learned in Production



Exposing a tool to an agent over MCP takes ten minutes. Building an MCP server that survives a model you don’t control, on a tight token budget with limited thinking time, is the part nobody warns you about.

We learned the difference shipping our own, consumed by third-party agents whose models we don’t pick. Three principles came out of it, each one we only fully believed after it broke in production:

TL;DR — three MCP server best practices from our trenches:

Fewer tools, narrower surface. Consolidate around the workflow, not the underlying API.

Consistent verbiage everywhere. Same name for the same concept across every input, output, and value on the server.

Validate against the protocol, not just your tests. The schema is the contract; everything else is a hint.

Background

We’ve been iterating on Trent’s MCP server: one public-facing surface for the product, consumed by third-party agents whose models we don’t control. Each iteration taught us something we’d half-believed going in but only fully internalized after it broke. These three principles have crystallized from that work, and they cut against the grain of how it feels to build a server when you’re moving fast. None of them is subtle in hindsight.

1. Fewer Tools, Narrower Surface

The instinct from regular software design (small composable units, single responsibility) doesn’t transfer cleanly to MCP. The consumer of the surface is an LLM with a finite attention budget, not another piece of software. The right-sized tool is the workflow the agent is actually performing, not the smallest atomic operation in the underlying API.

Two reasons we’ve been aggressive about consolidation:

Overlap confuses tool selection. The trap usually isn’t tools that look identical; it’s tools that look distinct from the outside, with different names and different framings, but expose largely the same data with minor variations between them. The model has to decide which one is the “right” call for the workflow, and the decision is often arbitrary. On harder tasks it’s wrong in ways that are hard to debug. Consolidating those into a single tool, with the relevant slice exposed as a parameter, removes a degree of freedom the model didn’t need.

Every tool consumes context. If you’re exposing ~20 tools, the name, description, and schema of each one ride in the prompt every turn (once fetched). That’s a substantial chunk of context burned before the agent has done anything. Those tokens compound across a long loop and compete directly with the work the agent is actually trying to do.

Consolidating also tightens the loop for us as engineers. Fewer tools means a smaller surface to test, a smaller set of failure modes to observe, and a more direct path from a customer issue to the tool that caused it. The product gets simpler for the user, the workflow gets simpler for the model, and the codebase gets simpler for us. That alignment is rare; when you can find it, take it.
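To show what consolidation can look like in a tool listing, here is a hypothetical before-and-after. None of these tool names come from Trent’s server; they are invented to illustrate three overlapping tools collapsing into one tool with the slice exposed as an enum parameter. The character count is only a crude stand-in for prompt size, not a token measurement.

```python
import json
from typing import Any, Dict, Iterable, List

def tool(name: str, description: str, properties: Dict[str, Any],
         required: Iterable[str] = ()) -> Dict[str, Any]:
    """Build an MCP-style tool definition (name, description, inputSchema)."""
    return {
        "name": name,
        "description": description,
        "inputSchema": {
            "type": "object",
            "properties": properties,
            "required": list(required),
        },
    }

PROJECT = {"project_id": {"type": "string"}}
SEVERITY = {"severity": {"type": "string", "enum": ["low", "medium", "high"]}}

# Before: three tools that look distinct but expose largely the same data.
before = [
    tool("list_open_findings", "List findings that are still open.", PROJECT, ["project_id"]),
    tool("list_findings_by_severity", "List findings filtered by severity.",
         {**PROJECT, **SEVERITY}, ["project_id", "severity"]),
    tool("get_findings_summary", "Summarize findings for a project.", PROJECT, ["project_id"]),
]

# After: one tool; the caller picks the slice with `view`.
after = [
    tool("get_findings", "Return findings for a project. Use `view` to choose the slice.",
         {**PROJECT,
          "view": {"type": "string", "enum": ["open", "by_severity", "summary"]},
          **SEVERITY},
         ["project_id", "view"]),
]

def prompt_chars(tools: List[Dict[str, Any]]) -> int:
    """Crude size proxy: characters in the serialized tool list."""
    return len(json.dumps(tools))

print(len(before), "tools,", prompt_chars(before), "chars")
print(len(after), "tool,", prompt_chars(after), "chars")
```

The more important change is not the size but the decision: the model no longer has to guess which of three similar-looking tools is the “right” one.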

Concretely: we took our own MCP server from 17 tools down to 11, and the result was visibly better tool usage across the workflows that had been giving us trouble. The model spent fewer cycles on tool selection, and the failure modes we were seeing under tighter constraints largely cleared up. The current published version is trentai-mcp on PyPI.

The push to make this cut came from a pre-launch integration where Trent was exposed to end users through a third party’s chat interface. During testing we kept hitting cases where the chat couldn’t follow our instructions reliably, and tool overlap turned out to be a major contributor.

2. Consistency Across the Surface is a Correctness Property

MCP tool wording across the input schema, output schema, and the output values of every tool on a server needs to be consistent. If one tool calls a field user_id and another calls the same thing customer_id and a third returns accountId, the model has to reconcile that on every call. It mostly does, but reconciliation costs tokens, introduces ambiguity, and shows up as flaky tool calls in unpredictable conditions.

This matters more than it sounds because you don’t always control the model on the other side of the wire. When the MCP server is consumed by a third party, the agent could be running on a small model with a tight token budget and limited thinking time. Inconsistent naming that a frontier model would reason past, a smaller model just fails on. The same surface that looks fine in development collapses in a deployment you can’t see.

We ran into this during the same third-party pre-launch integration mentioned above. We exposed an update_tasks tool that let the chat write progress into a Trent security assessment, but the underlying API used control_id for the response field name and task_id for the input field name. The chat got confused between the two, the tool call failed repeatedly, and it couldn’t debug its way out. We didn’t catch this right away either; the 422s we kept seeing looked like a service-side bug, and we’d been debugging on the service end for a while before realizing the failure was upstream of the API, in the chat’s tool call. Making the naming consistent across input, output, and value cleared it up.
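A mismatch like this is cheap to catch mechanically. The sketch below is one possible lint, not the check Trent runs: it walks the top-level properties of each tool’s inputSchema and outputSchema and reports any concept that is spelled more than one way. The ALIASES table is a hand-maintained assumption, and the update_tasks schema shown is a guess at the shape based on the description above.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set

# Hypothetical alias groups: names that mean the same thing on this server.
ALIASES: Dict[str, Set[str]] = {
    "task_id": {"task_id", "control_id"},
    "user_id": {"user_id", "customer_id", "accountId"},
}

def field_names(tool: Dict[str, Any]) -> Set[str]:
    """Top-level property names from a tool's input and output schemas."""
    names: Set[str] = set()
    for key in ("inputSchema", "outputSchema"):
        names |= set(tool.get(key, {}).get("properties", {}))
    return names

def find_inconsistent_names(tools: List[Dict[str, Any]]) -> Dict[str, Dict[str, List[str]]]:
    """Return {concept: {spelling: [tool names]}} for concepts with more than one spelling."""
    seen: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for tool in tools:
        for name in field_names(tool):
            for concept, spellings in ALIASES.items():
                if name in spellings:
                    seen[concept][name].append(tool["name"])
    return {c: dict(s) for c, s in seen.items() if len(s) > 1}

# Assumed shape of the tool from the story: task_id going in, control_id coming out.
tools = [{
    "name": "update_tasks",
    "inputSchema": {"type": "object", "properties": {"task_id": {"type": "string"}}},
    "outputSchema": {"type": "object", "properties": {"control_id": {"type": "string"}}},
}]

print(find_inconsistent_names(tools))
```

The lint only inspects top-level properties; nested objects and the values tools return at runtime would need their own checks.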

The frame I’ve landed on is simple: the model on the other side of the wire is a variable you don’t get to pick. So design the surface for the least capable consumer that matters. Capable models reason past inconsistent naming; smaller ones fail on it. Consistency costs you one round of cleanup before you ship; inconsistency gets paid by every consumer, every call, forever.

3. Don’t Trust the Implementation Just Because it Works

This is the principle I’d most like to have learned sooner.

We built the MCP server with an agent. It worked. The tests the agent wrote alongside the implementation passed, our engineer-driven dogfooding ran cleanly, and the manual testing we did in the workflows we cared about all came back green. Yet beyond the tool selection and naming problems we covered earlier, we kept hitting a different class of failure that we couldn’t reproduce locally: consuming agents getting the input shape wrong, invoking the tool in ways that didn’t match what we’d documented at all.

When we looked under the hood, the implementation hadn’t actually defined input and output schemas in the JSON properties the MCP protocol specifies. The agent that wrote the server had instead stuffed the entire contract (input shape, output shape, examples) into the tool’s description string, as a long comment-like blob. Frontier models read that and inferred the right structure. Smaller models, with less budget for inference, couldn’t. The fix is structural. MCP inputSchema and outputSchema are contracts, not hints. Stuffing them into the description string opts you out of every guarantee the protocol gives you.

Two lessons from that, both worth saying out loud:

Use the structure the protocol gives you. MCP defines inputSchema and outputSchema as discrete, structured fields for a reason: well-built clients use them to validate inputs, constrain agent behavior, and surface errors early. A description is a hint. A schema is a contract.

Agents get you to “working” faster than to “correct.” That gap is widest in unfamiliar territory, and a young protocol counts as unfamiliar territory, however many examples you’ve worked through. The agent picked a path that satisfied the tests it had written itself, evaluated by the same class of model that wrote them. It didn’t pick the path the protocol intended. We caught it because a stricter consumer broke; if we’d never had that consumer, we’d still be carrying the bug.
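The difference between the two shapes is easy to see side by side, and easy to lint for. The sketch below is an assumption-laden illustration rather than our actual server code: the update_tasks fields are invented, and schema_problems is a hypothetical check. Note that outputSchema is optional in MCP, so flagging its absence is a house rule here, and a tool that genuinely takes no arguments would legitimately have no properties.

```python
from typing import Any, Dict, List

def schema_problems(tool: Dict[str, Any]) -> List[str]:
    """Flag tools whose contract is not expressed in structured schema fields."""
    problems: List[str] = []
    schema = tool.get("inputSchema")
    if not isinstance(schema, dict) or schema.get("type") != "object":
        problems.append("inputSchema missing or not an object schema")
    elif not schema.get("properties"):
        problems.append("inputSchema declares no properties")
    if "outputSchema" not in tool:
        problems.append("no outputSchema")  # house rule, not a protocol requirement
    return problems

# The failure mode: the contract lives only in the description string.
described_only = {
    "name": "update_tasks",
    "description": "Input: {task_id: string, status: 'open'|'done'}. "
                   "Output: {task_id: string, status: string}.",
    "inputSchema": {"type": "object"},
}

# The structural fix: the same contract as schema fields a client can validate.
TASK_FIELDS = {
    "task_id": {"type": "string"},
    "status": {"type": "string", "enum": ["open", "done"]},
}
structured = {
    "name": "update_tasks",
    "description": "Update the status of one task.",
    "inputSchema": {"type": "object", "properties": TASK_FIELDS,
                    "required": ["task_id", "status"]},
    "outputSchema": {"type": "object", "properties": TASK_FIELDS,
                     "required": ["task_id", "status"]},
}

for candidate in (described_only, structured):
    print(candidate["name"], schema_problems(candidate))
```

Running a check like this in CI would have surfaced the problem without needing a stricter consumer to break first.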

What we built with these principles

The server I’ve been describing — trentai-mcp — is how Trent shows up inside Claude Code. It runs the full Scan → Judge → Mitigate → Evaluate loop in your editor: surfacing threats relevant to your application’s architecture, prioritizing them against the real risk profile, generating a remediation plan that becomes tasks Claude Code can implement, and tracking how your security posture changes session over session.

MCP is still young, and the patterns for designing servers well are still being worked out across the industry. The three principles above are real-world examples of what we’ve learned in production, and they are what I’d share with a new teammate on day one of building a new server.

Originally published on the Trent AI blog — the full piece includes the worked example of the four consolidated tools.


