DAILY NEWS

Stay Ahead, Stay Informed – Every Day

Advertisement
The CTO Playbook for AI Agent Data Analysis on a Budget



So here’s what happened: the CTO Playbook for AI Agent Data Analysis on a Budget

Six months ago my engineering team was burning roughly $14,000 a month on a single AI agent data pipeline. The model was great. The latency was fine. The output quality was honestly impressive. But the bill was eating our runway, and I had to make a call that would have felt absurd a year earlier: rip out a perfectly working stack and rebuild it from scratch.

This is the story of how I did it, what I learned shipping AI agent data analysis at scale, and why I now treat model choice the same way I treat database choice — as a strategic decision, not a default.

The Wake-Up Call

We had built our analytics agent on GPT-4o. It is a phenomenal model. I will not pretend otherwise. But the moment we crossed about 8 million tokens per day of production traffic, the math stopped working. At $2.50 per million input tokens and $10.00 per million output tokens, every new customer we onboarded was a net loss on infrastructure for the first three months.

I remember staring at the dashboard one Tuesday morning. Throughput was fine. The model was hitting the benchmarks we cared about. Our NPS was climbing. And yet finance was flagging the line item every week. That is the moment every startup CTO dreads: when the thing that is working is also the thing that is going to kill you if you do not change it.

So I started asking the questions I should have asked on day one. Which models are actually production-ready for our workload? What is the real cost gap between flagship models and the new generation of leaner ones? And critically, can I switch providers without rewriting my entire application?

That last question is the one nobody talks about. Vendor lock-in in the LLM space is real, and it is sneakier than cloud lock-in. When your prompt engineering, your evaluation harness, your retry logic, and your observability all assume one provider’s API shape, switching costs are not just financial — they are engineering hours you do not have.

The Cost Numbers That Made Me Switch

Once I started looking at the market seriously, the gap was jaw-dropping. Global API currently lists 184 models, with prices ranging from $0.01 to $3.50 per million tokens depending on tier. That spread is not academic. For an analytics agent, where input tokens dominate (because you are shoving tables, schemas, and prior context into every prompt), the input price is what actually moves your P&L.

Here is the comparison I built for my board deck:

Model
Input ($/M)
Output ($/M)
Context

DeepSeek V4 Flash
0.27
1.10
128K

DeepSeek V4 Pro
0.55
2.20
200K

Qwen3-32B
0.30
1.20
32K

GLM-4 Plus
0.20
0.80
128K

GPT-4o
2.50
10.00
128K

Look at GLM-4 Plus. $0.20 input, $0.80 output, 128K context window. For a large slice of our agent traffic — the follow-up questions, the structured summarization calls, the routing layer — the quality delta against GPT-4o was inside the noise floor of our human eval set. The cost delta was 12x.

That is when I knew. We were not paying for quality. We were paying for the logo on the box.

The Architecture I Actually Shipped

I am going to walk you through the production-ready setup we landed on, because I think it is the right shape for almost any team running AI agent data analysis at scale.

The core insight is that “AI agent data analysis” is not one workload. It is at least four:

Routing and intent classification — tiny prompts, high volume, must be cheap and fast.

Schema and tool selection — moderate context, structural reasoning.

Heavy analytical reasoning — the flagship call, where quality actually matters.

Verification and self-critique — another model call, where consistency matters more than peak brilliance.

Each of those workloads has a different price-quality sweet spot. Treating them as one homogeneous workload is how teams end up with $14,000 monthly bills for what should be a $3,000 service.

My routing logic now looks at the incoming query, classifies it (using GLM-4 Plus, which is dirt cheap), and then dispatches to one of three model tiers. The flagship calls — maybe 15% of total volume — still hit a top-tier model. The other 85% lands on leaner, faster, dramatically cheaper endpoints.

The result: a 40-65% cost reduction against our previous all-GPT-4o stack, with our internal quality benchmarks moving by less than 2 percentage points. That is the kind of ROI your CFO actually notices.

The Code

Here is the base client setup we use everywhere. I am showing the Python version because that is what our data team writes, but the same shape works in Node and Go.

import os
from openai import OpenAI

# when we swap providers — the entire point of routing through
# a unified API surface.
client = OpenAI(
base_url=”https://global-apis.com/v1″,
api_key=os.environ(“GLOBAL_API_KEY”),
)

def classify_query(user_query: str) -> str:
“””Cheap intent classification. GLM-4 Plus is plenty for this.”””
response = client.chat.completions.create(
model=”z-ai/glm-4-plus”,
messages=(
{
“role”: “system”,
“content”: “Classify the user’s analytics query as: simple, structured, or deep. Reply with one word only.”,
},
{“role”: “user”, “content”: user_query},
),
temperature=0.0,
max_tokens=4,
)
return response.choices(0).message.content.strip().lower()

def run_agent(user_query: str, context: str) -> str:
“””Dispatch to the right model tier based on query complexity.”””
tier = classify_query(user_query)

if tier == “deep”:
# Flagship tier — only for the hard stuff.
model = “deepseek-ai/DeepSeek-V4-Pro”
elif tier == “structured”:
# Mid tier — schema reasoning, tool calls.
model = “deepseek-ai/DeepSeek-V4-Flash”
else:
# Default tier — follow-ups, summarization, simple Q&A.
model = “Qwen3-32B”

response = client.chat.completions.create(
model=model,
messages=(
{“role”: “system”, “content”: “You are a senior data analyst. Reason step by step.”},
{“role”: “user”, “content”: f”Context:\n{context}\n\nQuestion: {user_query}”},
),
temperature=0.2,
)
return response.choices(0).message.content

Enter fullscreen mode

Exit fullscreen mode

Notice the base_url. That single line is the reason I am not locked into any one provider. If a better-priced model drops next quarter, or if a provider has a regional outage, I change the model string and move on. My application code, my prompt library, my eval harness — none of it changes. That is vendor lock-in avoidance as a feature, not as an afterthought.

For streaming responses on the deep tier, here is a second snippet that has saved us a lot of perceived latency complaints:

def stream_agent(user_query: str, context: str):
“””Stream the flagship tier for time-to-first-token gains.”””
response = client.chat.completions.create(
model=”deepseek-ai/DeepSeek-V4-Pro”,
messages=(
{“role”: “system”, “content”: “You are a senior data analyst.”},
{“role”: “user”, “content”: f”Context:\n{context}\n\nQuestion: {user_query}”},
),
stream=True,
temperature=0.2,
)
for chunk in response:
delta = chunk.choices(0).delta.content
if delta:
yield delta

Enter fullscreen mode

Exit fullscreen mode

Streaming shaved roughly 800ms off perceived response time on our longest-tail queries. At scale, that is the difference between a user thinking “this feels fast” and “this feels slow.”

What Actually Broke (And What I Learned)

I would be lying if I said the migration was clean. A few things bit us, and I want to be honest about them because the marketing material never is.

Tokenization differences. When you swap models, token counts do not transfer 1:1. The same English prompt can be 10-15% more tokens on one model than another. We had to rebuild our cost forecasting model from scratch. I am embarrassed how long I assumed tokenization was standard.

Latency variance. The 1.2s average latency number is real, but averages lie. We saw p99 latency spike on two of the cheaper models during US evening hours. We solved it with a simple fallback chain: if a call does not return inside 4 seconds, retry once on the next tier up. Costs us a few percent. Saves us a lot of angry customers.

Quality variance on edge cases. Our flagship model caught a subtle statistical error in about 95% of cases. The mid-tier model caught it in about 82%. That sounds small, but in a data analysis product, a silent miscalculation is a brand-destroyer. We added a verification call (using a different model family to avoid correlated errors) on any answer that involves numbers. The 84.6% average benchmark score we see is the blended result across all tiers.

Cache behavior. I cannot stress this enough: cache aggressively. We saw a 40% hit rate on our analysis queries within the first week, because analysts ask the same questions in slightly different ways. That 40% is pure margin. If you are not caching at the prompt-similarity level, you are leaving money on the table.

The Vendor Lock-In Question

This deserves its own section because it is the part of the conversation I think most CTOs avoid.

When you build on a single provider’s API, you are not just buying tokens. You are buying into their SDK conventions, their rate limit semantics, their error envelope, their deprecation policy, and their pricing roadmap. The moment any of those change in a way you do not like, you are stuck. And in the current LLM market, pricing has been dropping roughly 10x per year for equivalent capability. Locking in at last year’s prices is a real cost.

Routing through a unified API surface like Global API does not magically fix this, but it shifts the dependency from “the model vendor” to “the routing layer.” That is a much better place to be, because the routing layer has an economic incentive to keep you portable. Your model vendor does not.

We also run a quarterly exercise I call the “swap drill.” I take one of our production endpoints, switch it to a different model for a week, and measure the quality and cost delta. It is two engineer-days of work. It keeps us honest, and it means that if any provider raises prices or has a reliability incident, we are not scrambling — we are executing a playbook we have already rehearsed.



Source link

The mysterious Hy3 LLM is topping OpenRouter Model Rankings by a large margin



OpenRouter is a service that provides access to most LLMs with a singular API, which has become exceedingly useful as of late given the rapid cadence of new LLM releases. Due to the company’s role as an intermediary between users and the LLM APIs, OpenRouter has robust, representative data on how users interact with LLMs and it publishes this data on the AI Model Rankings page: a welcome deviation from the labs themselves which generally keep this data secret for competitive reasons. Recently, I checked the OpenRouter rankings and noticed something peculiar.Retrieved May 25, 2026.Two new models are now beating LLM darling Claude in terms of token usage and by more than 50%? I’ve heard of DeepSeek Flash V4: it’s an open-source release from DeepSeek that is not only fast/cheap, but also performs closer to the leading LLM models at a very low cost so it’s no surprise that it’s incredibly popular. But what the heck is Hy3 preview? I’ve never heard of Hy3 or anyone talking about it. Googling it returns an announcement from Chinese megacorp Tencent about Hy3’s open-source release: the model page itself on Hugging Face is sparse and includes oddly honest benchmark results that are not favorable for the model compared to other Chinese open-source models.Coding-oriented benchmark results for Hy3 from Tencent’s Hugging Face repo.A Hacker News search for Hy3 only returned a single submission that isn’t about Hy3, and Reddit discussion is more about the open-weights release. One Reddit thread also noted the rise of Hy3 but from May 6, when Hy3 was offered by OpenRouter for free; that free endpoint is no longer available, and therefore Hy3’s usage in the weekly rankings above is from paying users.Hy3 preview is apparently popular in domains outside of agentic coding as well.Retrieved May 25, 2026.Did I miss something? After some nonscientific testing, the model quality is indeed on par with the other Chinese models indicated and not close to models such as Claude Opus 4.7 and GPT 5.5. It’s not a magic overlooked diamond-in-the-rough, so there has to be something else at play. Fortunately, OpenRouter has the data to narrow down possible explanations, but after checking the data I became more confused.Hy3 preview is available from the OpenRouter API at a stated price of $0.066/1M tokens input which is indeed cheaper than the current top-ranked model DeepSeek V4 Flash with a stated price of $0.10/1M tokens input. Given the drastically rising cost of LLMs and coding agents, it makes sense that a cheaper model would prevail, but only if it offered similar quality and that doesn’t appear to be the case.Here’s the chart of Hy3 preview model usage over time on OpenRouter from the model page:Hy3 preview has no usage data before May 8, which implies that is the time the model switched from the free SKU to the paid SKU. Usage is also steady over time since then with the initial rankings shown in this post being several weeks after launch, showing that the usage is at least organic (or very expensive to fake) and not a one-off outlier. Of note, if you do the math on the numbers presented here, the input-token-to-output-token breakdown on LLM API calls is now 98% input, 2% output in aggregate.For the OpenRouter AI Model Rankings, there have historically been spikes by specific apps switching their default to a particular LLM, such as when Kilo Code offered Grok Code Fast 1 for free in September 2025, which rocketed it up in popularity. That does not appear to be the case here because apps only constitute a very small part of Hy3 preview’s activity.The top 5 apps accout for OpenRouter’s value proposition is the ability to automatically route a given API request to different providers: for open-weight models such as DeepSeek V4 Flash, OpenRouter lists 13 providers, but Hy3 preview only has one provider despite its open weights: the Singapore-based SiliconFlow. Their usage page on OpenRouter shows that SiliconFlow had relatively little usage…until Hy3.The green area corresponds to free Hy3 usage while the blue area corresponds to paid Hy3 usage: OpenRouter does not differentiate them on mouseover which I suspect is a bug.Coincidentially that data visualization shows that usage didn’t drop drastically when Hy3 preview moved from free to paid, which in itself is interesting: if users were not getting value from the free model, they likely would have stopped using it once the costs hit their wallet.What am I missing? Am I overthinking it and the answer is really because “it’s the cheapest” and it received sufficient loss leader traction from the free period?…but is Hy3 preview actually the cheapest LLM backed by a major company on OpenRouter? While I was double-checking some assumptions, I found that OpenRouter has data that shows Hy3 preview is not the cheapest well-performing LLM available: it’s actually DeepSeek V4 Flash, but with interesting caveats.LLM Economics in 2026#So here are a few more notes about how LLM APIs work that aren’t often discussed. LLM calls are still stateless, which means that after every turn (including user messages to the LLM asking questions), all of the tokens in the current conversation thread are reprocessed, meaning that in the case of agents, the count of input tokens increases cumulatively with each successive message and is one reason why starting new threads frequently as context fills up is encouraged for effective agent use.Reverse-chronological OpenRouter logs from one minute of Zed Agent use with DeepSeek V4 Flash selected.But even before agentic workflows, large inputs such as full PDFs bloated context similarly. As a result, most LLM providers implemented prompt caching, which reuses input tokens processed earlier in the conversation: this is a win-win that saves time/compute for the LLM provider and the savings are passed to the customer. Most LLM providers cache inputs automatically, including when accessed through OpenRouter: the disk-lightning-bolt symbol next to the cost indicates tokens were cached and the cache may not always be hit, especially if OpenRouter switches providers mid-thread. The odd API provider out is the Anthropic (Claude) API which requires paying for a cache write first for some reason.Typically, cache read costs are 10% of the input costs: this is the case for the latest models from OpenAI API, Anthropic API, and Google Gemini API. For the 13 providers that serve DeepSeek V4 Flash, cache read costs are between 20% and 50% of input cost, which makes sense as they may not have the same economies of scale. There’s one DeepSeek V4 Flash provider that’s an exception, though:That’s a 2% cache read cost! (multiply by 2, move decimal left 2 places) How are DeepSeek’s cache read prices so low? DeepSeek has implemented a new approach to KV caching starting with V4 and as the model’s creator it is positioned to best leverage its own innovations, which as mentioned the benefits are passed to the customer. The DeepSeek V4 Pro variant model, when served by DeepSeek, has a cache read cost of 0.83%! (use a calculator for that one)Remember how I showed that 98% of LLM API costs are now input tokens, which are aggressively cached? That means the “stated” prices of LLMs are now misleading, but unusually in a pro-customer way because the effective price will be much cheaper! To counter this ambiguity, OpenRouter now has a table for effective prices on the model page, which accounts for the cost savings from cache hits. Here’s the effective pricing for DeepSeek V4 Flash via OpenRouter by provider, which is different for each provider as they have different cache read costs and cache hit rates:Retrieved May 25, 2026; these values update every hour.The prices are all over the place, but notice the second row where DeepSeek itself is the provider, which is priced at a whopping $0.018/1M input tokens! That 2% cache read really pays off. Comparing apples to apples with Hy3 preview, the effective pricing for Hy3 preview as noted on its model page from SiliconFlow (a whopping 44% cache read cost) is $0.034/1M: nearly double DeepSeek V4 Flash from DeepSeek! Of course, this is only applicable if DeepSeek is explicitly used as the provider, which some downstream OpenRouter clients/agents may not support: the OpenRouter prices match the prices directly from DeepSeek, so using a direct DeepSeek API key will work the same.There is also an elephant in the room: DeepSeek is a China-based company and some may not want—or may not legally be able—to give their payment processing information or LLM input data to a Chinese company who has set prompt training = true on their OpenRouter data policy information, which is a legitimate concern.Yes, subscription-based LLM services such as Claude Code and Codex are still the best bang for your buck if you’re able to consistently exhaust the usage limits. But the super-cheap DeepSeek V4 Flash via the API doesn’t lock you into a subscription, and if you need a bit more agentic compute to finish a project, it’s cheaper than paying for extra usage from the subscription services. At the least, it’s a microeconomic check against additional pricing shenanigans that will likely continue through 2026 as competition in agentic AI heats up.Overall, I still don’t understand the popularity of Hy3 preview on OpenRouter. Given the available data and analysis above, my guess is that a single large app not affiliated with Tencent is indeed using Hy3 as its data-processing backbone, and this app isn’t solely an agentic coding app. But one of the advantages of OpenRouter is that it’s low-lift to switch models and providers: it wouldn’t surprise me if DeepSeek V4 Flash gets a spike in a few weeks once people catch on to its pricing.



Source link