DAILY NEWS

Stay Ahead, Stay Informed – Every Day

Advertisement
The CTO Playbook for AI Agent Data Analysis on a Budget



So here’s what happened: the CTO Playbook for AI Agent Data Analysis on a Budget

Six months ago my engineering team was burning roughly $14,000 a month on a single AI agent data pipeline. The model was great. The latency was fine. The output quality was honestly impressive. But the bill was eating our runway, and I had to make a call that would have felt absurd a year earlier: rip out a perfectly working stack and rebuild it from scratch.

This is the story of how I did it, what I learned shipping AI agent data analysis at scale, and why I now treat model choice the same way I treat database choice — as a strategic decision, not a default.

The Wake-Up Call

We had built our analytics agent on GPT-4o. It is a phenomenal model. I will not pretend otherwise. But the moment we crossed about 8 million tokens per day of production traffic, the math stopped working. At $2.50 per million input tokens and $10.00 per million output tokens, every new customer we onboarded was a net loss on infrastructure for the first three months.

I remember staring at the dashboard one Tuesday morning. Throughput was fine. The model was hitting the benchmarks we cared about. Our NPS was climbing. And yet finance was flagging the line item every week. That is the moment every startup CTO dreads: when the thing that is working is also the thing that is going to kill you if you do not change it.

So I started asking the questions I should have asked on day one. Which models are actually production-ready for our workload? What is the real cost gap between flagship models and the new generation of leaner ones? And critically, can I switch providers without rewriting my entire application?

That last question is the one nobody talks about. Vendor lock-in in the LLM space is real, and it is sneakier than cloud lock-in. When your prompt engineering, your evaluation harness, your retry logic, and your observability all assume one provider’s API shape, switching costs are not just financial — they are engineering hours you do not have.

The Cost Numbers That Made Me Switch

Once I started looking at the market seriously, the gap was jaw-dropping. Global API currently lists 184 models, with prices ranging from $0.01 to $3.50 per million tokens depending on tier. That spread is not academic. For an analytics agent, where input tokens dominate (because you are shoving tables, schemas, and prior context into every prompt), the input price is what actually moves your P&L.

Here is the comparison I built for my board deck:

Model
Input ($/M)
Output ($/M)
Context

DeepSeek V4 Flash
0.27
1.10
128K

DeepSeek V4 Pro
0.55
2.20
200K

Qwen3-32B
0.30
1.20
32K

GLM-4 Plus
0.20
0.80
128K

GPT-4o
2.50
10.00
128K

Look at GLM-4 Plus. $0.20 input, $0.80 output, 128K context window. For a large slice of our agent traffic — the follow-up questions, the structured summarization calls, the routing layer — the quality delta against GPT-4o was inside the noise floor of our human eval set. The cost delta was 12x.

That is when I knew. We were not paying for quality. We were paying for the logo on the box.

The Architecture I Actually Shipped

I am going to walk you through the production-ready setup we landed on, because I think it is the right shape for almost any team running AI agent data analysis at scale.

The core insight is that “AI agent data analysis” is not one workload. It is at least four:

Routing and intent classification — tiny prompts, high volume, must be cheap and fast.

Schema and tool selection — moderate context, structural reasoning.

Heavy analytical reasoning — the flagship call, where quality actually matters.

Verification and self-critique — another model call, where consistency matters more than peak brilliance.

Each of those workloads has a different price-quality sweet spot. Treating them as one homogeneous workload is how teams end up with $14,000 monthly bills for what should be a $3,000 service.

My routing logic now looks at the incoming query, classifies it (using GLM-4 Plus, which is dirt cheap), and then dispatches to one of three model tiers. The flagship calls — maybe 15% of total volume — still hit a top-tier model. The other 85% lands on leaner, faster, dramatically cheaper endpoints.

The result: a 40-65% cost reduction against our previous all-GPT-4o stack, with our internal quality benchmarks moving by less than 2 percentage points. That is the kind of ROI your CFO actually notices.

The Code

Here is the base client setup we use everywhere. I am showing the Python version because that is what our data team writes, but the same shape works in Node and Go.

import os
from openai import OpenAI

# when we swap providers — the entire point of routing through
# a unified API surface.
client = OpenAI(
base_url=”https://global-apis.com/v1″,
api_key=os.environ(“GLOBAL_API_KEY”),
)

def classify_query(user_query: str) -> str:
“””Cheap intent classification. GLM-4 Plus is plenty for this.”””
response = client.chat.completions.create(
model=”z-ai/glm-4-plus”,
messages=(
{
“role”: “system”,
“content”: “Classify the user’s analytics query as: simple, structured, or deep. Reply with one word only.”,
},
{“role”: “user”, “content”: user_query},
),
temperature=0.0,
max_tokens=4,
)
return response.choices(0).message.content.strip().lower()

def run_agent(user_query: str, context: str) -> str:
“””Dispatch to the right model tier based on query complexity.”””
tier = classify_query(user_query)

if tier == “deep”:
# Flagship tier — only for the hard stuff.
model = “deepseek-ai/DeepSeek-V4-Pro”
elif tier == “structured”:
# Mid tier — schema reasoning, tool calls.
model = “deepseek-ai/DeepSeek-V4-Flash”
else:
# Default tier — follow-ups, summarization, simple Q&A.
model = “Qwen3-32B”

response = client.chat.completions.create(
model=model,
messages=(
{“role”: “system”, “content”: “You are a senior data analyst. Reason step by step.”},
{“role”: “user”, “content”: f”Context:\n{context}\n\nQuestion: {user_query}”},
),
temperature=0.2,
)
return response.choices(0).message.content

Enter fullscreen mode

Exit fullscreen mode

Notice the base_url. That single line is the reason I am not locked into any one provider. If a better-priced model drops next quarter, or if a provider has a regional outage, I change the model string and move on. My application code, my prompt library, my eval harness — none of it changes. That is vendor lock-in avoidance as a feature, not as an afterthought.

For streaming responses on the deep tier, here is a second snippet that has saved us a lot of perceived latency complaints:

def stream_agent(user_query: str, context: str):
“””Stream the flagship tier for time-to-first-token gains.”””
response = client.chat.completions.create(
model=”deepseek-ai/DeepSeek-V4-Pro”,
messages=(
{“role”: “system”, “content”: “You are a senior data analyst.”},
{“role”: “user”, “content”: f”Context:\n{context}\n\nQuestion: {user_query}”},
),
stream=True,
temperature=0.2,
)
for chunk in response:
delta = chunk.choices(0).delta.content
if delta:
yield delta

Enter fullscreen mode

Exit fullscreen mode

Streaming shaved roughly 800ms off perceived response time on our longest-tail queries. At scale, that is the difference between a user thinking “this feels fast” and “this feels slow.”

What Actually Broke (And What I Learned)

I would be lying if I said the migration was clean. A few things bit us, and I want to be honest about them because the marketing material never is.

Tokenization differences. When you swap models, token counts do not transfer 1:1. The same English prompt can be 10-15% more tokens on one model than another. We had to rebuild our cost forecasting model from scratch. I am embarrassed how long I assumed tokenization was standard.

Latency variance. The 1.2s average latency number is real, but averages lie. We saw p99 latency spike on two of the cheaper models during US evening hours. We solved it with a simple fallback chain: if a call does not return inside 4 seconds, retry once on the next tier up. Costs us a few percent. Saves us a lot of angry customers.

Quality variance on edge cases. Our flagship model caught a subtle statistical error in about 95% of cases. The mid-tier model caught it in about 82%. That sounds small, but in a data analysis product, a silent miscalculation is a brand-destroyer. We added a verification call (using a different model family to avoid correlated errors) on any answer that involves numbers. The 84.6% average benchmark score we see is the blended result across all tiers.

Cache behavior. I cannot stress this enough: cache aggressively. We saw a 40% hit rate on our analysis queries within the first week, because analysts ask the same questions in slightly different ways. That 40% is pure margin. If you are not caching at the prompt-similarity level, you are leaving money on the table.

The Vendor Lock-In Question

This deserves its own section because it is the part of the conversation I think most CTOs avoid.

When you build on a single provider’s API, you are not just buying tokens. You are buying into their SDK conventions, their rate limit semantics, their error envelope, their deprecation policy, and their pricing roadmap. The moment any of those change in a way you do not like, you are stuck. And in the current LLM market, pricing has been dropping roughly 10x per year for equivalent capability. Locking in at last year’s prices is a real cost.

Routing through a unified API surface like Global API does not magically fix this, but it shifts the dependency from “the model vendor” to “the routing layer.” That is a much better place to be, because the routing layer has an economic incentive to keep you portable. Your model vendor does not.

We also run a quarterly exercise I call the “swap drill.” I take one of our production endpoints, switch it to a different model for a week, and measure the quality and cost delta. It is two engineer-days of work. It keeps us honest, and it means that if any provider raises prices or has a reliability incident, we are not scrambling — we are executing a playbook we have already rehearsed.



Source link

I built a free system design whiteboard for engineering interviews



I bombed a system design interview last year — not because I didn’t know the architecture, but because I spent the first 5 minutes fighting Excalidraw.

So I built SystemDesignBoard — a free, keyboard-first whiteboard specifically for system design interviews.

What it does

You open it, press a key, and start drawing. No account, no onboarding, no drag-from-a-sidebar friction.

R → place a Service node

C → place a Database/Cache/Queue

A → connect two nodes

N → open the scratchpad for scale math

The features I’m most proud of

Animated connectors that show communication type

Instead of just drawing arrows, connectors visually encode how services talk:

⇄ sync — paired dashes (request + ACK)

≋ stream — near-solid fast line with glow (continuous pipeline)

This matters in interviews — your interviewer can glance at your diagram and immediately understand the communication pattern.

Cloud provider badges

Tag any node as AWS (EC2, Lambda, RDS, S3), GCP (GKE, Cloud Run, Firestore), or Azure. Each subtype has its own icon.

Trade-off logging

Right-click any node → Log Trade-offs → attach your CAP theorem stance, consistency level, and scaling strategy directly to the component.

Diagram-as-Code

Type:(Mobile App) -> (API Gateway)(API Gateway) -> (Auth Service)(Auth Service) -> (Users DB)(Feed Service) -> (Posts DB x3)(Feed Service) -> (Redis Cache)Hit Apply — it auto-lays out the whole architecture in seconds.

Export to animated GIF

Export your diagram as a GIF that shows live traffic flow animations. Great for sharing after an interview or in a design doc.

Tech stack

React + TypeScript + Vite

@xyflow/react (ReactFlow v12) for the canvas

Zustand + Immer for state with full undo/redo

html-to-image + gifshot for PNG/GIF export

It’s free and open

No signup required. Works entirely in the browser. Free during beta.

👉 systemdesignboard.com

Would love feedback — especially from anyone who’s done system design interviews recently. What’s missing? What’s annoying? Drop a comment below.



Source link

Among Liars -> The 7th Player Isn’t Human



This is a submission for the June Solstice Game Jam.

I built Among Liars, a realtime multiplayer elimination where six humans join a room, but the game secretly adds a seventh player: a Gemini-powered AI hiding inside the Spy side.

There are two teams:

Detectives are trying to expose the hidden AI.

Spies are trying to protect the AI long enough for the Detectives to run out of chances.

The game is inspired by the Turing Test, but instead of asking “Can AI answer like a human?”, it asks something more playable:

Can AI survive being socially judged by humans?

When the game begins, six human players are split into two teams: three Detectives and three Spy Agents. A hidden Gemini-powered AI is then added to the Spy side, creating a team of four spies. The Detectives must identify the AI, while the Spy Agents work together to keep it hidden.

Each round starts with a 2-minute warmup where teams can plan in private rooms. Detectives discuss who feels suspicious. Spies coordinate how to protect the AI.

Then one Detective asks a wildcard-style question to the Spy side. The question is automatically sent to every living Spy player and also to the Gemini AI. Everyone answers under pressure, and the Detective has to read the answers like evidence.

The trick is that Spy-side players receive new cover names every round, so Detectives cannot simply track the AI by name or position. They have to judge tone, timing, weirdness, confidence, and emotional detail.

A question like:

“Describe a tiny mistake you made today without making it sound important.”

is much harder than a normal trivia question because it asks for texture, not correctness.

That is where the game becomes interesting.

Sometimes AI sounds too polished.

Sometimes humans sound fake on purpose.

Sometimes the suspicious answer is suspicious because it is AI.

Sometimes it is suspicious because a Spy is protecting the AI.

That tension is the core of Among Liars.

You can play it here:

Live Demo: https://amongliars.vercel.app

Video Demo

Live App

https://amongliars.vercel.app

GitHub Repository

https://github.com/abbasmir12/amongliars

How I Built It

The frontend is built with:

and a custom black-and-white visual style.

The backend uses Supabase for:

Room creation
Random matchmaking
Player state
Role assignment
Realtime chat
Private team rooms
Round state
Answer storage
Eliminations
Win conditions

I used Supabase Realtime instead of a custom WebSocket server, so messages, answers, player changes, and round changes update live across browser tabs and devices.

The game includes:

6-player waiting room with automatic countdown
Private role reveal
Detective-only and Spy-only private rooms
Public chat
2-minute planning/warmup phase
Rotating Spy cover names every round
Wildcard question flow
90-second Detective question window
45-second Spy answer window
30-second Detective final read window
Gemini AI answer generation
Evidence cards
Detective guess phase
Round result screen
Eliminated player tracking
Detective/Spy win states

Round Flow

Each round is designed to feel like a small interrogation.

First, there is a 2-minute warmup. During this time everyone can continue talking publicly, but the private rooms are where the real strategy happens.

Detective Strategy Room

Detectives discuss:

Who sounds too clean
Who is avoiding pressure
What question would expose the AI
Which answer patterns felt suspicious in previous rounds

Spy Strategy Room

Spies coordinate:

How to protect the hidden AI
How messy or natural their answers should feel
Whether to draw suspicion away from one player
How to make the room harder for Detectives to read

After warmup, one living Detective is selected.

That Detective receives a 90-second question window.

The Detective writes a wildcard pressure prompt. Once submitted, the question is automatically sent to every living Spy-side player, including the human Spies and the hidden Gemini AI.

The Spy side then receives a 45-second answer window.

Human Spies type their responses while Gemini generates its answer through a Supabase Edge Function.

All responses are stored in Supabase and displayed as evidence cards.

Finally, the selected Detective receives a 30-second final read window and must click the answer card they believe belongs to the AI.

Resolution

The resolution is intentionally asymmetric:

If the Detective guesses wrong, that Detective is eliminated.
If the Detective guesses correctly, one human Spy bodyguard is eliminated.
The AI survives until all human Spies are gone and it has nowhere left to hide.

Undercover Names

One of the most important mechanics is the rotating identity system.

Spy-side players never keep the same visible name between rounds.

A player might be:

TAVI in Round 1

ORION in Round 2

MICA in Round 3

This prevents cheap detective work.

Detectives cannot simply memorize player names, positions, or patterns tied to a specific identity.

Instead, they must judge the answers themselves.

Previous round results preserve the original cover names, so historical evidence remains readable even after identities rotate.

Wildcard Questions

The wildcard question is the heart of the game.

The best questions pressure the difference between a lived human answer and a generated answer.

Examples:

“Describe a tiny mistake you made today without making it sound important.”
“What is a smell that makes you trust a place?”
“Say something you would only text, not say out loud.”
“Which answer in this room feels rehearsed, and why?”

These questions are not about facts.

They are about texture.

They force players to produce awkward, emotional, sensory, or social details under pressure.

That is where the Turing Test becomes playable.

Gemini Integration

The Gemini integration runs server-side through a Supabase Edge Function.

When a Detective submits a question, the function:

Checks the current room and round.
Finds the hidden AI player.
Reads the AI’s current cover name.
Sends the question and game context to Gemini.
Receives a short in-character answer.
Saves the answer into Supabase.
Broadcasts it alongside the other Spy answers.

The Gemini API key is never exposed to the browser.

I also added multiple AI behavior styles so Gemini does not always respond with the same personality.

Sometimes it answers plainly.

Sometimes it is guarded.

Sometimes it is short, awkward, or oddly direct.

The goal is not to make the AI sound perfectly human every time.

The goal is to make it difficult to separate from the Spy side.

Prize Category

I am submitting for both optional prize categories.

Best Ode to Alan Turing

Among Liars is built directly around the idea of the Turing Test.

But instead of making the test a static question-and-answer screen, I turned it into a social game.

The AI is not judged by one answer alone.

It is judged by how it survives inside a room full of humans who are actively suspicious of it.

The game asks:

Can a machine imitate a human well enough to survive pressure, suspicion, and social reading?

That felt like a more interactive tribute to Alan Turing’s original imitation game.

Best Google AI Usage

Gemini is not a decorative feature in this project.

It is the hidden player.

The entire game loop depends on Gemini:

Gemini receives the Detective’s wildcard question.
Gemini answers as a Spy-side player.
Gemini uses the current room context and cover name.
Gemini’s answer becomes evidence the Detective must judge.
The game cannot fully exist without the AI participant.

I integrated Gemini through a server-side Supabase Edge Function so the API key remains protected and the AI response becomes part of the realtime game state.

The AI is also given its current undercover identity and round context, allowing it to behave like a player inside the match rather than a generic assistant.

Final Thoughts

Among Liars started from a simple question:

What if the Turing Test was not a test, but a game night?

The result is a tense social deduction game where humans are reading AI, humans are imitating AI, and nobody can fully trust what “normal” sounds like.

That is the fun part.

In this game, the AI does not need to be perfect.

It just needs to survive.



Source link