DAILY NEWS

Stay Ahead, Stay Informed – Every Day

AI Bots Auditioning For Wall Street Trading Are Mostly Losing



AI isn’t ready to replace your fund manager — and the public experiments testing it are showing why.

Across a series of new trading contests between the world’s leading AI models, the verdict so far is unflattering. Most of the systems lose money. They trade too much. They make wildly different decisions when given identical instructions. And no one yet knows if these shortcomings will fade with more powerful iterations — or if they reveal something fundamental about the gap between large language models and how markets actually work.

Take Alpha Arena, run by tech startup Nof1. It pitted eight major frontier AI systems — including Anthropic’s Claude, Google’s Gemini, OpenAI’s ChatGPT and Elon Musk’s Grok — against each other in four separate competitions. Each was handed $10,000 per contest before being turned loose on US tech stocks for two weeks. The challenges involved trading on a variety of signals, acting defensively, reacting to the competition, and using high leverage. 


The portfolio as a whole lost about a third of its capital. Across all 32 sets of results, a model finished in profit only six times. Grok 4.20 delivered the best performance during the challenge in which it was aware of its rivals’ performance. It placed only 158 trades; under the same prompt, Alibaba’s Qwen traded 1,418 times.

Alpha Arena is one of a growing number of experiments testing whether LLMs can do the hardest job in finance: beat the market. While these contests are far from academically rigorous, they’re the most public demonstration yet of what happens when the systems try to take on some of the most lucrative and high-stakes work on Wall Street.

The early results matter because trading is one job the financial industry has been cautious about handing entirely to AI. Over the past few years, heavyweights from JPMorgan Chase & Co. to Balyasny Asset Management have put the technology to work nearly everywhere else. LLMs now parse news at quant shops, draft memos at hedge funds, and detect fraud at big banks, among other tasks. But “human in the loop” remains the motto when it comes to trading real money. Perhaps for good reason.

“LLMs can’t really make money by themselves,” said Jay Azhang, founder of Nof1. “You need basically a very sophisticated harness and scaffolding and data platform in order to even give them a chance.”

LLMs are good at doing research and finding and deploying the correct tools for certain tasks, he said. But they don’t yet know how much each of the many variables that swing stocks — including things like analyst ratings, insider transactions, and sentiment shifts — actually matters. They tend to mistime their trades, incorrectly size positions and buy and sell too often.

The AI blog Flat Circle tracked 11 markets-related arenas, and all had at least one model that made money. But in only two of the arenas was the median model profitable, showing how most struggled to beat the market. 

That outcome mirrors human performance, since a majority of actively managed funds famously also lag the broad market. And just like people, the models can be prone to obvious bias. The arenas show the AI systems making very different decisions with identical instructions, which has big implications for any firm deploying them. For instance, Azhang said that in Alpha Arena’s latest run, Claude mostly wanted to go long, Gemini had no problem being short, and Qwen was comfortable taking risks with big leverage. 

“They have personalities that you have to manage almost like a human analyst,” said Doug Clinton, who runs Intelligent Alpha, a firm with an LLM-driven fund that publishes its own benchmark for how well AI predicts corporate earnings. Results can be improved by letting the model know it’s showing some bias, he said.

Intelligent Alpha’s benchmark gives 10 AI models access to financial filings, analyst forecasts, earnings transcripts, macroeconomic data and up to 10 web searches. With its narrower focus, the results are more positive for LLMs. In the fourth quarter of 2025, OpenAI’s ChatGPT correctly predicted the direction of earnings estimates 68% of the time — the best results yet. And the models, Clinton said, tend to improve with every new release.

Hedge Fund Secrets

Evaluating any of this is hard. Design choices in everything from how often the models run to what assets they trade make a big difference. And the default test for a trading strategy — running it backward through history to see how it would have performed — doesn’t really work for AI.

A model asked in 2026 how it would have traded in March 2020 already knows what March 2020 looked like. That contamination, known as lookahead bias, has challenged the frameworks underlying academic and quantitative finance for decades. LLMs have to be assessed in live markets instead, hence the proliferation of benchmarks and arenas.

Perhaps because they mostly lose money, AI trading arenas tend to run for only short periods of time. With the low barriers to entry, many are set up by individuals or startups using the platforms as a launchpad for other products.

Nof1 is preparing season two of Alpha Arena, which will give each AI model the ability to search the web, ponder for longer, access more data sources and take multiple steps. But ultimately the firm’s business is a system enabling retail traders to build AI trading agents for their own strategies.

“Giving an LLM money right now and just having it go — that’s not a thing yet,” said Azhang.

Most of the public experiments are still too short and too noisy to support firm conclusions, reckons Jim Moran, who writes the Flat Circle blog and who previously co-founded alternative-data provider YipitData. These arenas also have natural disadvantages, including limited access to proprietary stock research and inferior execution.

“If you took one of these agents from one of these arenas and you just moved it over to operate inside of a high-end hedge fund, they should perform better,” he said.

Alexander Izydorczyk, formerly head of data science at the hedge fund Coatue Management and now at NX1 Capital, recently wrote that no AI trading bot he tracks has yet shown a lasting edge. He argued the arenas are limited by what they cannot see in their training data: the practical quant techniques used inside secretive trading shops.

He suggested the same secrecy is also a preview of where any AI that does begin to work will eventually go.

“But beginners sometimes see things incumbents cannot,” Izydorczyk wrote on his personal blog. “The outsiders, if successful, will also learn quickly that success in liquid, competitive markets pays better than the marginal X follower. When LLM agent trading strategies start working, you will not hear about it for a while.” 

This article was provided by Bloomberg News.

 




The 800ms Barrier: Architecting Interruptible Voice Agents (Lessons from Sarvam AI x Swiggy)



The Signal: The 800ms Latency Barrier

In a research lab, a 3-second delay is an “optimization ticket.” In a live call with a hungry customer on the Swiggy app, 3 seconds is a churn event.

The partnership between Sarvam AI and Swiggy represents a shift in the “Boss Level” of agentic AI. Most developers build voice agents using a Cascaded Pipeline: STT -> LLM -> TTS. The result? A cumulative lag that makes the agent feel like a slow walkie-talkie. To build for the next billion users, you have to architect for Native Audio Streaming and sub-second response times.

Phase 1: The Architectural Bet

We are moving from Request-Response to Streaming State Machines.

The Vendor Trap is relying on general-purpose, text-centric models for a multilingual, audio-first market. If you have to translate “Hinglish” to English just to understand an order, you’ve already lost the latency battle.

The Ownership Path is the Indic-Native Stack. Using Sarvam’s natively trained audio models allows us to process speech-to-intent directly. More importantly, we must implement a Bi-Directional WebSocket architecture. This allows the agent to “listen” while it “speaks”—the only way to handle the most difficult part of human conversation: The Barge-in.

Phase 2: Implementation (The Interruptible Voice Handler)

In a high-stakes environment like Swiggy, the agent must be able to stop mid-sentence and roll back its logic if the user changes their mind.

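// High-level logic for an interruptible voice kernel.
// Assumption: wsConnection is an application-level wrapper that exposes
// processAudio(chunk); a raw WebSocket object has no such method.
class VoiceAgentKernel {
  constructor(wsConnection) {
    this.ws = wsConnection;
    this.isSpeaking = false;
    this.transactionLock = null; // Ensures tool-use safety
  }

  // Detect the "barge-in" (interruption)
  onUserSpeechDetected() {
    if (this.isSpeaking) {
      console.warn("SIGNAL: Interruption detected. Executing State Rollback.");
      this.killAudioPlayback();
      this.abortCurrentLLMGeneration();
      this.clearPendingTransactions();
      this.isSpeaking = false; // Playback has stopped, so reset the flag
    }
  }

  async handleAudioStream(chunk) {
    // Stream raw audio to Sarvam's native Indic pipeline
    const response = await this.ws.processAudio(chunk);

    if (response.intent_confidence > 0.9) {
      // Pre-warm tools before the user even stops talking
      this.prepareOrderTransaction(response.entities);
    }
  }

  clearPendingTransactions() {
    // Essential: prevents the "Ghost Order" bug
    if (this.transactionLock) {
      this.transactionLock.cancel();
      this.transactionLock = null;
    }
  }

  // Placeholder hooks. The original sketch calls these without defining
  // them; real versions would stop TTS output and cancel the LLM request.
  killAudioPlayback() {}
  abortCurrentLLMGeneration() {}

  prepareOrderTransaction(entities) {
    // Assumed shape: a pending, cancellable handle. Nothing is committed here.
    this.transactionLock = { entities, cancel() {} };
  }
}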


Phase 3: The Senior Security & Testing Audit

I put this Swiggy-scale blueprint through a professional Senior QA & Security Audit. Here is why your “standard” voice agent will fail in the wild.

The “Ghost Order” Race Condition (Logic Fault)
The Fault: The agent says “Ordering your Paneer Tikka…” The user interrupts: “No, wait! Make it a Chicken Roll!”
The Audit: In naive implementations, the “Order Tool” is triggered the moment the LLM starts talking. If the user interrupts, the audio stops, but the backend API has already committed the Paneer Tikka. You now have a frustrated customer and a wasted order.
The Fix: Implement Deferred Commits. The tool-call must remain in a PENDING state until the audio playback reaches a “Commit Threshold” (e.g., 90% completion) or receives a final verbal confirmation.

The “Ambient Audio Injection” (Security Breach)
The Fault: The user is ordering food while walking past a loud TV. The TV says “Cancel all orders.”
The Audit: Without Speaker Diarization, the agent cannot distinguish between the primary user and background noise. A malicious or accidental “audio injection” can trigger unauthorized actions.
The Fix: Use Sarvam’s front-end audio processing to enforce Voice Activity Detection (VAD) with a noise-floor gate. If the audio signal doesn’t match the primary speaker’s decibel profile or spatial characteristics, the kernel must ignore the intent.

The “Colloquial Logic Bypass” (Semantic Security)
The Fault: Your security prompts are in English, but the user is speaking a dialect-heavy mix of Hindi and regional slang.
The Audit: Traditional English-centric guardrails often miss the nuance of regional insults or “Hinglish” social engineering attempts used to trick the agent into granting a 100% discount.
The Fix: Security filters must be Indic-Native. By using Sarvam’s regional guardrails, we ensure that semantic boundaries are enforced at the phoneme level, not just the translation level.
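
The Deferred Commits fix for the “Ghost Order” race condition can be sketched roughly as follows. This is a minimal illustration, not Swiggy’s or Sarvam’s implementation: PendingOrder, commitFn and COMMIT_THRESHOLD are hypothetical names, and the 0.9 threshold is taken from the “90% completion” example above.

```javascript
// Sketch of a deferred commit: an order stays PENDING until confirmed.
// All names here (PendingOrder, commitFn, COMMIT_THRESHOLD) are hypothetical.
const COMMIT_THRESHOLD = 0.9; // fraction of confirmation audio played

class PendingOrder {
  constructor(entities, commitFn) {
    this.entities = entities;
    this.commitFn = commitFn; // stand-in for the real backend order call
    this.state = "PENDING";
  }

  // Called as the confirmation audio plays; progress is a fraction from 0 to 1.
  onPlaybackProgress(progress) {
    if (this.state === "PENDING" && progress >= COMMIT_THRESHOLD) {
      this.commit();
    }
  }

  // Called when the user gives an explicit verbal confirmation.
  onVerbalConfirmation() {
    if (this.state === "PENDING") this.commit();
  }

  // Called on barge-in; has no effect once the order is committed.
  cancel() {
    if (this.state === "PENDING") this.state = "CANCELLED";
  }

  commit() {
    this.state = "COMMITTED";
    this.commitFn(this.entities);
  }
}

// Usage: the user interrupts at 40% playback, so the first order never commits.
const committed = [];
const paneer = new PendingOrder({ item: "Paneer Tikka" }, (e) => committed.push(e.item));
paneer.onPlaybackProgress(0.4);
paneer.cancel();
paneer.onPlaybackProgress(1.0); // ignored: the order is already CANCELLED

const roll = new PendingOrder({ item: "Chicken Roll" }, (e) => committed.push(e.item));
roll.onVerbalConfirmation();
console.log(committed); // [ 'Chicken Roll' ]
```

The point of the design is that the backend call lives behind a single commit() that only fires from the PENDING state, so an interruption and a late playback event cannot both win.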

Phase 4: Checklist (The Architect’s Standard)

( ) Native Audio or Bust: If you are still converting audio to text before processing intent, your latency will never hit the 800ms gold standard.

( ) Transactional Barge-in: Verify that every interruption triggers a State Rollback for any pending API calls.

( ) Acoustic Hardening: Test your agent against 60dB of background “street noise” to ensure VAD stability.

( ) Regional Edge-Cases: Audit your “Hinglish” logic. Does your agent understand the difference between a user “asking for a discount” and a user “threatening to cancel”?
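
For the Acoustic Hardening item, a noise-floor gate of the kind described in Phase 3 might look something like the sketch below. It is a simple energy-based gate, not Sarvam’s front-end processing and not speaker diarization: it cannot reject a TV that is louder than the user. NoiseFloorGate, marginDb and smoothing are hypothetical names and values, and levels are in dBFS (relative to digital full scale), which is a different scale from the 60dB of street noise mentioned above.

```javascript
// Sketch of an energy-based noise-floor gate. Names and defaults are hypothetical.

// RMS level of a frame of samples in [-1, 1], expressed in dBFS.
function frameLevelDb(samples) {
  let sumSquares = 0;
  for (const s of samples) sumSquares += s * s;
  const rms = Math.sqrt(sumSquares / samples.length);
  return 20 * Math.log10(Math.max(rms, 1e-10)); // clamp avoids log10(0)
}

class NoiseFloorGate {
  constructor(marginDb = 12, smoothing = 0.95) {
    this.marginDb = marginDb;   // how far above the floor counts as speech
    this.smoothing = smoothing; // weight given to the previous floor estimate
    this.noiseFloorDb = null;
  }

  // Returns true when the frame is loud enough to be treated as the user.
  accept(samples) {
    const level = frameLevelDb(samples);
    if (this.noiseFloorDb === null) {
      this.noiseFloorDb = level; // the first frame seeds the floor estimate
      return false;
    }
    const isSpeech = level > this.noiseFloorDb + this.marginDb;
    if (!isSpeech) {
      // Track the floor only on frames judged to be background
      this.noiseFloorDb =
        this.smoothing * this.noiseFloorDb + (1 - this.smoothing) * level;
    }
    return isSpeech;
  }
}

// Usage with synthetic frames: quiet background, then a much louder speaker.
const tone = (amplitude) =>
  Float32Array.from({ length: 160 }, (_, i) => amplitude * Math.sin(i / 5));

const gate = new NoiseFloorGate();
gate.accept(tone(0.01));                       // seeds the noise floor
const background = gate.accept(tone(0.012));   // slightly louder noise
const speaker = gate.accept(tone(0.3));        // close-talking user
console.log(background, speaker);              // false true
```

A production gate would combine this with the speaker-matching checks the audit calls for; on its own it only filters out audio that sits near the ambient level.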

The Bottom Line: Building for the next billion users requires an infrastructure that respects the speed of human thought. Sarvam AI provides the native Indic engine; your job is to build the Deterministic House that keeps the order safe.




ChatGPT Adds ‘Trusted Contact’ Feature to Send Alerts When Conversations Get Dangerous




OpenAI announced today that it’s rolling out a new mental health-focused safety feature for adult ChatGPT users. Starting today, ChatGPT users can add what the company calls a “trusted contact” who may be notified if the AI’s automated systems and trained reviewers determine that the user has engaged in discussions about self-harm.

The new feature arrives amid growing scrutiny over the impact AI and other digital platforms can have on mental health. Last year, OpenAI disclosed that 0.07% of its weekly users displayed signs of “mental health emergencies related to psychosis or mania,” while 0.15% expressed risk of “self-harm or suicide,” and another 0.15% showed signs of “emotional reliance on AI.” Considering the company claims that roughly 10% of the world’s population uses ChatGPT weekly, that could amount to nearly three million people.

The trusted contact feature expands on ChatGPT’s existing parental safety notifications, which alert parents when a linked teen account shows signs of distress. Instagram introduced similar parental alerts earlier this year. Now, OpenAI is offering these alerts to its adult users. The company said the feature was developed with guidance from mental health and suicide prevention clinicians, researchers, and organizations.

“Trusted Contact is designed to encourage connection with someone the user already trusts,” the company said in its announcement. “It does not replace professional care or crisis services, and is one of several layers of safeguards to support people in distress.” OpenAI added that ChatGPT will still encourage users to contact crisis hotlines or emergency services when necessary.

The feature can be enabled by any user 18 years or older through ChatGPT’s settings. From there, users can nominate another adult to serve as their trusted contact by submitting details such as the contact’s phone number and email address. The trusted contact will then receive an invitation explaining the feature and will have one week to accept. If they decline, the initial user can nominate another contact instead.

Once the feature is active, OpenAI’s automated monitoring systems can flag when a user may be discussing self-harm in a manner that suggests a serious safety concern. The system will then notify the user that their trusted contact may be alerted and encourage them to reach out directly. It will even provide some recommended conversation starters. The company said a small team of specially trained reviewers will then assess the situation and determine whether notifying the trusted contact is appropriate.

If OpenAI decides to send an alert, the trusted contact could receive it through email, text message, or an in-app notification. The alert will only explain the general reason self-harm was mentioned and encourage the trusted contact to check in. It will also include guidance on how to navigate those conversations. OpenAI noted that the notifications will not include specific details or chat transcripts to protect user privacy.
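
The “nearly three million” estimate cited earlier in this article can be reproduced with a quick back-of-envelope calculation. The sketch below rests on two assumptions that are not in the article: a world population of roughly 8 billion, and no overlap between the three categories. If the same users appear in more than one category, the true total would be lower.

```javascript
// Back-of-envelope check of the "nearly three million" figure.
// Assumption (not from the article): world population of about 8 billion.
const WORLD_POPULATION = 8e9;
const WEEKLY_USER_SHARE = 0.10; // "roughly 10% of the world's population"

// Shares of weekly users reported by OpenAI, as quoted in the article.
const shares = {
  psychosisOrMania: 0.0007,  // 0.07%
  selfHarmOrSuicide: 0.0015, // 0.15%
  emotionalReliance: 0.0015, // 0.15%
};

// Assumption: the categories do not overlap, so the shares simply add up.
const combinedShare = Object.values(shares).reduce((sum, s) => sum + s, 0);
const weeklyUsers = WORLD_POPULATION * WEEKLY_USER_SHARE;
const estimatedPeople = weeklyUsers * combinedShare;

console.log((estimatedPeople / 1e6).toFixed(2) + " million"); // 2.96 million
```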


