llm – DAILY NEWS

TECH & AI

I Ditched Vector Search for My Coding Agent’s Memory. FTS5 Won.

jackminion Jul 4, 2026 0

Every “give your agent memory” tutorial I’ve read reaches for the same stack: chunk your docs, embed them, throw the vectors in a database, do cosine similarity at query time. So when I needed my coding agent to search through indexed tool output, git logs, and fetched docs without dumping raw text into the model’s context window, I assumed I’d be standing up a vector store too.

I didn’t. I used SQLite’s FTS5 full-text search instead, and for this specific job it’s not a compromise — it’s the better tool.

What the problem actually was

The tool I built (context-mode, for routing large command output and API responses out of the model’s context) needs to answer queries like:

“failing tests”
“HTTP 500 errors”
“async route handlers”

against arbitrary shell output, JSON responses, and fetched web pages — indexed once, searched however many times a session needs. The naive version just dumps everything into context and lets the model read it. That works until the output is 50KB of test logs and you’ve burned half your context window on a summary you needed three lines of.

Why vectors are the wrong default here, not just an alternative

Vector search is built to answer “what’s semantically similar to this.” That’s the right tool when you’re searching prose — support tickets, documentation, chat transcripts — where the same idea gets expressed in different words and you need “how do I reset my password” to match a doc titled “Account Recovery Steps.”

Coding-agent queries mostly aren’t that. “HTTP 500 errors” isn’t a fuzzy semantic concept I want approximated — it’s closer to a literal grep with better ranking. The content being searched is also structured and keyword-dense: stack traces, log lines, JSON keys, error codes. Embedding a stack trace and comparing cosine similarity throws away the thing that actually matters (the literal exception name, the literal line number) in favor of a vector representation that’s better at “these two paragraphs are about similar topics” than “this line contains the string ECONNREFUSED.”

FTS5 is built for exactly this: tokenized, indexed, ranked full-text search over exact and near-exact term matches, with BM25-style relevance scoring out of the box.

What it actually looks like

No embedding model, no vector database, no network round-trip to compute embeddings. It’s stdlib:

import sqlite3

conn = sqlite3.connect(“index.db”)
conn.execute(“””
CREATE VIRTUAL TABLE IF NOT EXISTS docs
USING fts5(source, content)
“””)

def index(source: str, content: str):
conn.execute(“INSERT INTO docs (source, content) VALUES (?, ?)”, (source, content))
conn.commit()

def search(query: str, limit: int = 5):
rows = conn.execute(“””
SELECT source, snippet(docs, 1, ‘(‘, ‘)’, ‘…’, 20), rank
FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?
“””, (query, limit)).fetchall()
return rows

Enter fullscreen mode

Exit fullscreen mode

That’s the whole engine. snippet() gives you highlighted context around the match for free. rank gives you BM25 ordering for free. Querying “HTTP 500 errors” against a batch of indexed test output returns the actual lines containing 500 and error, ranked by term frequency and rarity — not the semantically-nearest paragraph, the actually-relevant one.

Where this would fall over — and why it doesn’t here

FTS5 is a bad choice if your queries genuinely need semantic matching: “find the doc about resetting my password” needs to match “Account Recovery,” and no amount of tokenization gets you there without embeddings. If I were building search over a knowledge base of prose documentation with inconsistent terminology, I’d reach for vectors, possibly hybrid (BM25 for recall, vectors for semantic re-ranking).

But an agent’s own tool output, error logs, and fetched API responses are dense with the literal terms you’re going to search for, because you (or the agent) wrote the query with those terms in mind. “Failing tests” as a query is going to co-occur with FAIL, AssertionError, test names — words that are actually in the log. The semantic gap that justifies embeddings mostly doesn’t exist in this domain.

The generalizable lesson

“Add semantic search” has become a reflex the same way “add a cache” or “add a queue” is — reached for because it’s the default answer to “how do I search this,” not because the problem demands it. Vector infra costs you an embedding model, a vector database or extension, and a slower indexing step, in exchange for a capability — semantic similarity — that keyword-dense, structured content usually doesn’t need.

Before reaching for embeddings on your next “agent needs to search X” problem, ask what the query and the content actually look like. If both are keyword-dense and structurally similar (logs, code, JSON, stack traces), full-text search with BM25 ranking will outperform vectors on relevance and cost you a fraction of the infrastructure. Save the vector database for the day your content is actually prose with vocabulary mismatch — most agent tooling isn’t there yet.

Source link

TECH & AI

LLM Gateway vs MCP Gateway: Understanding the New AI Infrastructure Stack

jackminion Jun 23, 2026 0

As AI applications evolve from simple chatbots into autonomous agents, a new infrastructure layer is emerging. Terms like LLM Gateway, MCP Gateway, MCP Registry, LLM Router, and Agent Gateway are appearing everywhere—but what do they actually do?

Let’s break it down.

The Challenge with Modern AI Systems

Early AI applications were simple:

Application → LLM

Today’s enterprise AI systems are very different. A single AI agent may need to:

Access multiple LLM providers
Connect to GitHub, Slack, Jira, and internal APIs
Discover tools dynamically
Follow security and compliance policies
Track usage and costs

Without a centralized layer, managing these integrations quickly becomes messy and difficult to scale.

What Is an LLM Gateway?

An LLM Gateway provides a single entry point for all model interactions.

Instead of integrating separately with OpenAI, Anthropic, Gemini, or open-source models, applications connect to one gateway that handles:

Authentication
Rate limiting
Usage tracking
Cost monitoring
Security policies

For teams running multiple models, an LLM Gateway simplifies operations significantly.

If you’re exploring production-grade AI infrastructure, TrueFoundry has a detailed guide on LLM Gateways:

👉 https://www.truefoundry.com/docs/gateway

Why LLM Routers Matter

Not every request needs the same model.

A coding task may require a different model than a customer-support query. An LLM Router automatically selects the most suitable model based on factors such as:

Cost
Latency
Performance
Availability

This helps organizations optimize both quality and spending.

Enter MCP: The Standard for AI Tools

The** Model Context Protocol (MCP)** is becoming the standard way for AI agents to interact with tools and external systems.

Instead of creating custom integrations for every service, developers can expose capabilities through MCP servers.

Examples include:

GitHub MCP Server
Slack MCP Server
Notion MCP Server
Internal enterprise tools

As MCP adoption grows, managing dozens or hundreds of MCP servers becomes a challenge.

What Is an MCP Gateway?

An MCP Gateway acts as a centralized access layer between agents and MCP servers.

It provides:

Unified authentication
Access control
Auditing
Observability
Governance

Rather than giving every agent direct access to every tool, organizations can enforce policies through a single gateway.

Learn more about MCP Gateway architecture here:

👉 https://www.truefoundry.com/blog/introducing-truefoundry-mcp-gateway

MCP Proxy vs MCP Gateway

These terms are often confused.

An MCP Proxy primarily forwards requests between agents and MCP servers while handling authentication and connectivity.

An MCP Gateway goes further by adding:

Governance
Monitoring
Policy enforcement
Access management
Registry integration

Think of a proxy as a connectivity layer and a gateway as a complete management layer.

MCP Registry, Agent Registry, and Skills Registry

As AI ecosystems grow, discovery becomes just as important as connectivity.

*MCP Registry*A centralized catalog of available MCP servers, including metadata, ownership, and versions.

*Agent Registry*A directory of deployed AI agents and their capabilities.

*Skills Registry*A searchable catalog of reusable skills, tools, and workflows that agents can access.

Together, these registries help organizations avoid duplication and improve governance.

*Final Thoughts*The future of enterprise AI isn’t just about better models. It’s about managing how models, agents, and tools work together.

That’s why technologies such as **LLM Gateway, LLM Router, MCP Gateway, MCP Proxy, MCP Registry, Agent Gateway, Agent Registry, and Skills Registry **are becoming critical components of modern AI platforms.

As organizations scale from a handful of AI applications to hundreds of agents and tools, these infrastructure layers will become as important as API gateways are in traditional software systems.

Source link

TECH & AI

Use a flat-priced, auto-routing LLM API in Aider or Cline — one npx command

jackminion Jun 23, 2026 0

Coding assistants like Aider, Cline, and Continue all speak the OpenAI wire protocol — point them at a base_url, give them an API key, done. That makes swapping in a different LLM backend trivial… if that backend uses Authorization: Bearer.

The flat-priced, auto-routing API I’d been using doesn’t. It’s distributed through RapidAPI, which authenticates with an X-RapidAPI-Key header instead of Bearer. So I couldn’t just drop it into Aider. The fix turned out to be ~120 lines, so I open-sourced it.

modelis-openai

A zero-dependency local proxy (MIT, Node 18+). It listens on 127.0.0.1, speaks plain OpenAI, rewrites the auth header, and forwards to the upstream gateway. Streaming (stream: true) is piped straight through, so token-by-token output works exactly as with the OpenAI API.

your tool ──OpenAI(Bearer)──▶ modelis-openai (localhost) ──X-RapidAPI-Key──▶ upstream ──▶ best model

Enter fullscreen mode

Exit fullscreen mode

Quickstart

npx modelis-openai

Enter fullscreen mode

Exit fullscreen mode

Then point any OpenAI-compatible tool at it:

Setting
Value

Base URL
http://127.0.0.1:8787/v1

API key
your RapidAPI key

Model
modelis-auto

Drop it into your tool

Aider

export OPENAI_API_BASE=http://127.0.0.1:8787/v1
export OPENAI_API_KEY=
aider –model openai/modelis-auto

Enter fullscreen mode

Exit fullscreen mode

Cline / Roo Code — API Provider OpenAI Compatible, Base URL http://127.0.0.1:8787/v1, Model ID modelis-auto.

Continue (~/.continue/config.yaml)

models:
– name: Modelis
provider: openai
model: modelis-auto
apiBase: http://127.0.0.1:8787/v1
apiKey:

Enter fullscreen mode

Exit fullscreen mode

Any OpenAI SDK

from openai import OpenAI
client = OpenAI(base_url=”http://127.0.0.1:8787/v1″, api_key=””)
print(client.chat.completions.create(
model=”modelis-auto”,
messages=({“role”: “user”, “content”: “Hello”}),
).choices(0).message.content)

Enter fullscreen mode

Exit fullscreen mode

How it works

Reads the key from Authorization: Bearer (or MODELIS_RAPIDAPI_KEY).
Rewrites the request model to modelis-auto (configurable).
Forwards to the RapidAPI gateway with X-RapidAPI-Key / X-RapidAPI-Host.
Relays the response — including SSE streams and rate-limit headers — unchanged.

It also answers GET /v1/models and GET /health so tools that probe on startup don’t choke.

Honest notes

It routes to a paid API (there’s a free tier to start). The point of the proxy is to remove the integration friction, not to give anything away.

Cursor isn’t supported — it sends requests from its own servers, so a localhost endpoint can’t be reached. This is for tools that call the API from your machine.

Links

If you try it in a tool I didn’t list, I’d love to hear how it goes.

Source link