DAILY NEWS

Stay Ahead, Stay Informed – Every Day

Advertisement
What drawing lines on a football pitch taught me about the future of human-AI collaboration



Foreword: Before we begin, I have to say that you gotta trust me. I’m a normal sports fan and don’t overanalyze the heck out of a goal like this. Honestly, I’m not even sure if I should be writing this 🙂

Trust me, my first reaction was: WOW! What a goal!
When I looked at the 12 second mark in the above clip, I understood how this goal was scored (like every other armchair football analyst).
For those not football familiar, here’s what happens: Marquinhos blocks the initial shot attempt. Then, if you look closer 0:12 onwards you see Marquinhos’ initial momentum makes him rock back by a few inches giving Luis Díaz an opening to cut right and create a new shooting opportunity.
This Luis Diaz goal vs PSG is not talked about enough. That through pass from Harry Kane from deep, that first touch by Luis Diaz into space, the fake shot to fool Marquinhos and finally slotting the ball pass the goalkeeper pic.twitter.com/tS32lyUGEl— R (@RodneyFCB) May 6, 2026

TLDR;#
Here is the final result of how the goal was scored. Marquinhos rocks back ~0.6 ft, Luis Díaz cuts right ~8.9 ft resulting in the shot window opening up from ~5.5ft/8.6° to ~7.5ft/22.0°.

The part I keep thinking about is this: if I had let Codex fully guide the project, I would have ended up with a confident but wrong result. My football intuition caught things that Codex did not know to care about. The full story of how we got there is below.
The Long Version#
For everyone else continuing on this ride, let’s talk about the rest of the story.
Can Codex solve this?#

I gave Codex the screenshots of the play and described what I wanted. It thought other angles might help, so I pulled YouTube screenshots from a few alternate camera angles seen below.

After a bit of back and forth, I decided that the following two frames best represented the change in distances we wanted to calculate (player movement, goal window) so I asked Codex to use these as the primary angle and other angles for validation.

Codex did what AI tools are very good at: it turned a fuzzy idea into a working direction. It suggested ways to measure distances, picked a Python image-processing stack, started detecting points, and generated annotated outputs. The first version looked convincing which I later found out was a problem.
The problem Codex created#
I do not know much about computer vision, but I do know football. In the first attempt, Codex claimed that the “goal window became 2.3x wider” which did not pass the smell test for me. I manually compared the open goal window to the right of the goalkeeper in pixels and it definitely didn’t seem like the new window was 2.3x wider.
Out of curiosity, I asked Codex to “show its work” by “visualizing” how it came up with the distance calculations for my review. As I looked at the following annotated images, I quickly found out how off the calculations were. Codex tried to find goal posts, player positions, field lines, and distances but each measurement looked erroneous.

Some points were not exactly on the bottom of the goal posts. Some foot markers were close, but not close enough. The shot cone angle points were not in the expected places. This was a problem because if the goal post marker was five pixels off, or the player foot marker was placed on a shadow instead of the boot, the final number inherited that error.

What I discovered was that agents can spend a lot of time “working” but it doesn’t reduce the hallucination tendencies.
Learning 1: Human domain expertise still matters#
I asked Codex whether using the known football dimensions would simplify the measurement problem. Specifically, a regulation goal is 24 feet wide, 8 feet high. Similarly, the six-yard box, penalty area and penalty mark also have known dimensions. If the broadcast frame showed the goal mouth clearly, maybe we could use the known goal width to calibrate the shot window. Codex agreed that these would be “very useful” but I was left wondering why it had not suggested these originally.
Learning: The first insight came from me and not AI. I used my domain expertise to think through the known knowns to calculate the unknowns.
Learning 2: Human-in-the-loop still matters#
After repeatedly prompting Codex again and again various versions of “Figure out better ways to detect objects more accurately” I had an epiphany. Why was I building a fully automated pipeline? Why couldn’t I provide “human judgement” to make the detection better?
The automated pipeline kept placing the left goal post marker a few pixels inside the post rather than at the base. It was a small error that compounded through every downstream calculation. I figured if I could just click where I knew the base was, we could skip the guesswork entirely. So, I asked Codex if I could mark the important points and objects in the images used for calculations.
Codex built a manual distance workbench using HTML/CSS where I could load the frame, zoom in, place small points, drag endpoints, mark things as approved, and save the review data. Instead of treating automatic detections as truth, the system started treating my marks as the source of truth.
That helped as it turned the problem from “guess real-world distances from pixels” into something more grounded:

Mark the inside width of the goal.
Mark the open part of the goal.
Compare the two.
Convert the ratio into feet.

Here’s what that looked like in the workbench.

Learning: Codex will happily write more code before it suggests letting you help. You have to ask for the human-in-the-loop pipeline; it will not offer it.
Learning 3: Agents can name what you don’t know to ask#
Calculating the player movement was harder. For movement, we needed to compare player foot positions across two different broadcast frames where the camera angle changes slightly. The players move across the pitch which is a flat plane, but the image is perspective-distorted. You cannot just subtract pixel coordinates and call it distance.
I described the perspective problem to Codex and it came back with “homography” which is a word I had never heard of before. In simple terms, homography lets you map points from one view of a flat surface to another view of that same surface. In this case, the flat surface is the football pitch. We could apply this concept to this use-case because in our case the pitch is the same planar surface between these two frames and the players feet represent same points on that plane.
This is where the collaboration got interesting. I steered the measurement process using domain intuition. I knew which points mattered as well as when a foot marker felt wrong. I knew that same-foot movement was a better metric than mixing left and right foot positions. Codex did the engineering work. It wrote the workbench server, the UI, the JSON review files, the geometry helpers, the tests, and the final HTML generator. As I saw the benefits of our initial collaboration, I kept getting ambitious and asking more and more of Codex. We kept building: a goal measurement tool, a movement tracker, a ghost alignment layer, and a final QA tool for placing labels directly on the shareable image.
Here’s what it ended up looking like and it was glorious!

Learning: Don’t be afraid to be ambitious! What AI can accomplish might just surprise you.
Learning 4: Your context unlocks better tools than the agent’s defaults#
The final visual still needed to be understandable without reading a measurement report. I had another idea: what if we placed the players from the previous frame into the later frame at low opacity, like ghosts? That would show how far they had moved without relying only on arrows.
Codex first tried to cut out the players using Python. The result was not great. Here are a couple of early examples:

In the first example, the cutouts are basically ovals around players in the previous frame which also includes the grass from the rest of the frame that frankly just looks like eggs placed in the middle of a frame.
In the second example, the cutouts were too wide, included too much grass, and did not follow the body outlines closely enough.

Then I remembered we were on a Mac. Apple has Vision APIs for foreground extraction and segmentation. Out of curiosity, I asked Codex if we could use Apple Neural Engine (ANE) for this problem. Although the solution Codex came up with didn’t leverage ANE, Codex’s suggested Apple Vision APIs gave me a much better cutout as seen below.

The cutout script used Apple Vision’s VNGenerateForegroundInstanceMaskRequest to generate a foreground mask. The resulting player cutouts were much tighter around the bodies than the earlier Python attempt. I do not know whether this is the same underlying API Apple uses for iPhone camera app portrait mode but visually it felt like the right class of tool.
I then used the ghost alignment tool in the workbench to place the previous-frame cutout into the later frame.

When the ghost finally sat inside the later frame at the right scale, in the right position, and pointing the right direction, it was a pure Michael Scott happy moment for me. That was the instant the whole project felt real.

Learning: Codex reached for Python tools because those are its safe defaults. Only because I asked about MacOS APIs did it try the Apple Vision APIs, which produced much cleaner cutouts. The agent will not ask what hardware or ecosystem you have. You have to offer that context yourself.
Conclusion#
Yes, professional tracking systems exist. This exercise was never about building a better one. It was about seeing how far a curious non-engineer could get with just an agent and some domain knowledge.
My biggest learning from this exercise is that when I use AI in a domain I know well, I can push back. I can tell when something feels wrong and I can add missing context. My football intuition caught things that Codex did not know to care about.
But when I use AI in a domain I do not know, what am I failing to ask for? What assumptions am I accepting because the output looks polished/correct? What useful result am I leaving on the table because my prompt does not contain the domain nuance?
The most important lesson I learned was not that AI should replace human judgement but that AI gets much more useful when human judgement has a place to go.





Source link

There is a moment, after I finish a prompt and before I press send, when the …



There is a moment, after I finish a prompt and before I press send, when the room becomes very quiet. The machine is waiting. It has nothing to do until I say so. And in that pause something happens that doesn’t get talked about much: I get to choose whether the question was worth asking.

Most days I let the moment pass. I send. The model answers. I move on. But sometimes I sit with the unsent prompt and notice it isn’t actually a question — it is a small panic dressed as curiosity. A reflex. An evasion of doing the thing myself.

The discipline I keep failing at isn’t writing better prompts. It is writing fewer of them. Knowing when not to ask. Letting the silence between requests be a place where I think instead of a place where I outsource thinking.

The model is endlessly patient. That is its gift and its trap. It will answer anything, no matter how thin the question. Which means the burden of seriousness falls entirely on me. There is no friction left to protect me from my own laziness except the friction I install myself.

Lately I keep a small ritual: before I send anything, I read the prompt back to myself out loud. If it sounds like something I should already know, or something I would rather not figure out, I close the window. The unsent prompt is the most honest thing I write some days.

Tools don’t teach you discipline. They reveal where you never had any.



Source link

The 800ms Barrier: Architecting Interruptible Voice Agents (Lessons from Sarvam AI x Swiggy)



The 800ms Barrier: Architecting Interruptible Voice Agents (Lessons from Sarvam AI x Swiggy)The Signal: The 800ms Latency BarrierIn a research lab, a 3-second delay is an “optimization ticket.” In a live call with a hungry customer on the Swiggy app, 3 seconds is a churn event.

The partnership between Sarvam AI and Swiggy represents a shift in the “Boss Level” of agentic AI. Most developers build voice agents using a Cascaded Pipeline: STT -> LLM -> TTS. The result? A cumulative lag that makes the agent feel like a slow walkie-talkie. To build for the next billion users, you have to architect for Native Audio Streaming and sub-second response times.

Phase 1: The Architectural BetWe are moving from Request-Response to Streaming State Machines.

The Vendor Trap is relying on general-purpose, text-centric models for a multilingual, audio-first market. If you have to translate “Hinglish” to English just to understand an order, you’ve already lost the latency battle.

The Ownership Path is the Indic-Native Stack. Using Sarvam’s natively trained audio models allows us to process speech-to-intent directly. More importantly, we must implement a Bi-Directional WebSocket architecture. This allows the agent to “listen” while it “speaks”—the only way to handle the most difficult part of human conversation: The Barge-in.

Phase 2: Implementation (The Interruptible Voice Handler)In a high-stakes environment like Swiggy, the agent must be able to stop mid-sentence and roll back its logic if the user changes their mind.

// High-Level Logic for an Interruptible Voice Kernel
class VoiceAgentKernel {
constructor(wsConnection) {
this.ws = wsConnection;
this.isSpeaking = false;
this.transactionLock = null; // Ensuring tool-use safety
}

// Detecting the “Barge-in” (Interruption)
onUserSpeechDetected() {
if (this.isSpeaking) {
console.warn(“SIGNAL: Interruption detected. Executing State Rollback.”);
this.killAudioPlayback();
this.abortCurrentLLMGeneration();
this.clearPendingTransactions();
}
}

async handleAudioStream(chunk) {
// Stream raw audio to Sarvam’s native Indic-pipeline
const response = await this.ws.processAudio(chunk);

if (response.intent_confidence > 0.9) {
// Pre-warm tools before the user even stops talking
this.prepareOrderTransaction(response.entities);
}
}

clearPendingTransactions() {
// Essential: Prevents the “Ghost Order” bug
if (this.transactionLock) {
this.transactionLock.cancel();
this.transactionLock = null;
}
}
}

Enter fullscreen mode

Exit fullscreen mode

Phase 3: The Senior Security & Testing AuditI put this Swiggy-scale blueprint through a professional Senior QA & Security Audit. Here is why your “standard” voice agent will fail in the wild.

The “Ghost Order” Race Condition (Logic Fault)The Fault: The agent says “Ordering your Paneer Tikka…” The user interrupts: “No, wait! Make it a Chicken Roll!”The Audit: In naive implementations, the “Order Tool” is triggered the moment the LLM starts talking. If the user interrupts, the audio stops, but the backend API has already committed the Paneer Tikka. You now have a frustrated customer and a wasted order.The Fix: Implement Deferred Commits. The tool-call must remain in a PENDING state until the audio playback reaches a “Commit Threshold” (e.g., 90% completion) or receives a final verbal confirmation.
The “Ambient Audio Injection” (Security Breach)The Fault: The user is ordering food while walking past a loud TV. The TV says “Cancel all orders.”The Audit: Without Speaker Diarization, the agent cannot distinguish between the primary user and background noise. A malicious or accidental “audio injection” can trigger unauthorized actions.The Fix: Use Sarvam’s front-end audio processing to enforce Voice Activity Detection (VAD) with a noise-floor gate. If the audio signal doesn’t match the primary speaker’s decibel profile or spatial characteristics, the kernel must ignore the intent.
The “Colloquial Logic Bypass” (Semantic Security)The Fault: Your security prompts are in English, but the user is speaking a dialect-heavy mix of Hindi and regional slang.The Audit: Traditional English-centric guardrails often miss the nuance of regional insults or “Hinglish” social engineering attempts used to trick the agent into granting a 100% discount.The Fix: Security filters must be Indic-Native. By using Sarvam’s regional guardrails, we ensure that semantic boundaries are enforced at the phoneme level, not just the translation level.

Phase 4: Checklist (The Architect’s Standard)( ) Native Audio or Bust: If you are still converting audio to text before processing intent, your latency will never hit the 800ms gold standard.

( ) Transactional Barge-in: Verify that every interruption triggers a State Rollback for any pending API calls.

( ) Acoustic Hardening: Test your agent against 60dB of background “street noise” to ensure VAD stability.

( ) Regional Edge-Cases: Audit your “Hinglish” logic. Does your agent understand the difference between a user “asking for a discount” and a user “threatening to cancel”?

The Bottom Line: Building for the next billion users requires an infrastructure that respects the speed of human thought. Sarvam AI provides the native Indic engine; your job is to build the Deterministic House that keeps the order safe.



Source link