agents – DAILY NEWS

TECH & AI

Codex – a.k.a. ChatGPT’s AI Agent

jackminion Jul 2, 2026 0

Codex is OpenAI’s AI coding agent, and ChatGPT is the interface you can use to interact with it. That’s the difference.

As a software engineer, I have watched software development go through drastic shifts over the decades. We moved from assembly language to high-level programming languages, from waterfall to Agile, from on-premise infrastructure to cloud computing, and from manual deployments to DevOps and continuous delivery.

The next major shift is the emergence of AI coding agents.

Rather than simply generating code snippets, modern coding agents can understand an entire codebase, plan changes, execute them, run tests, fix issues, and explain their reasoning. One of the leading tools in this space is Codex.

What is Codex?

Codex is an AI-powered software engineering agent designed to help developers work directly with their source code.

Unlike traditional AI assistants that answer questions or generate isolated functions, Codex operates much more like another engineer on your team. It can:

Explore an existing repository
Understand project architecture
Make changes across multiple files
Execute commands
Run tests
Fix compilation errors
Refactor code
Generate documentation
Create pull-request-ready changes

Instead of asking “How do I implement JWT authentication?”, you can ask Codex:

“Implement JWT authentication across this Express application using our existing middleware patterns.”

Codex then performs the work inside your repository rather than simply describing how it could be done.

From AI Assistant to AI Engineer

Many developers have used AI chatbots to generate code snippets.

That workflow typically looks like this:

Developer
│
▼
Copy code into ChatGPT
│
▼
Receive code
│
▼
Paste into IDE
│
▼
Fix compilation errors
│
▼
Repeat


Codex changes the workflow entirely.

Developer
│
▼
Describe the task
│
▼
Codex explores repository
│
▼
Implements changes
│
▼
Runs tests
│
▼
Fixes issues
│
▼
Produces ready-to-review changes


The interaction becomes goal-oriented instead of code-oriented.

Understanding the Entire Codebase

One of Codex’s biggest strengths is repository awareness.

Rather than treating every prompt independently, Codex understands:

project structure
frameworks
existing coding conventions
dependency management
architecture
naming conventions
testing framework
deployment configuration

For example, in a large Node.js monorepo, Codex can recognize:

apps/
packages/
shared/
infra/
docs/
.github/


It understands how these components interact and modifies only the areas relevant to the requested task.

This dramatically reduces the amount of context developers need to manually provide.

Working Like a Real Engineer

A typical software task rarely involves writing one function.

Consider a request such as:

“Add audit logging whenever an invoice is approved.”

A human engineer would likely:

locate the approval endpoint
identify the service layer
update the database model
modify unit tests
update integration tests
document the API
verify linting
run the test suite

Codex follows a remarkably similar workflow. Rather than generating a single function, it works through the complete implementation.
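To make the scope of such a task concrete, here is a minimal Python sketch of what the service-layer part of the change might look like. Everything in it is hypothetical: `Invoice`, `AuditLog`, and `InvoiceService` are illustrative names, and the in-memory list stands in for whatever database model a real project would use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Invoice:
    invoice_id: str
    status: str = "pending"


@dataclass
class AuditLog:
    """In-memory stand-in for an audit table (hypothetical)."""
    entries: list[dict] = field(default_factory=list)

    def record(self, action: str, invoice_id: str, actor: str) -> None:
        self.entries.append({
            "action": action,
            "invoice_id": invoice_id,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })


class InvoiceService:
    """Service layer: the approval rule and its audit entry live together."""

    def __init__(self, audit_log: AuditLog) -> None:
        self.audit_log = audit_log

    def approve(self, invoice: Invoice, actor: str) -> Invoice:
        if invoice.status != "pending":
            raise ValueError(f"cannot approve invoice in state {invoice.status!r}")
        invoice.status = "approved"
        # The audit entry is written only after the state change succeeds.
        self.audit_log.record("invoice.approved", invoice.invoice_id, actor)
        return invoice
```

Even this small sketch touches a model, a service, and a rule that needs its own test, which is why the task spans several files in a real repository.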

Skills and Project Memory

One of the most useful capabilities of Codex is its support for project-specific guidance.

Teams can provide instructions that describe:

coding standards
architectural principles
testing requirements
security practices
repository structure
naming conventions

This allows Codex to behave consistently across an organization.

For example, instructions may specify:

Always use dependency injection.
Never access the database directly from controllers.
Write unit tests before integration tests.
Use repository pattern.
Follow Domain-Driven Design boundaries.
Never commit generated files.

Instead of repeating these instructions in every prompt, Codex reads them from project configuration.

What is an AGENTS.md?

Many teams create an AGENTS.md file that acts as an operating manual for AI coding agents. An AGENTS.md file can include:

project overview
architecture
folder structure
coding conventions
build commands
testing commands
deployment process
common pitfalls
review checklist

For example:

# Project Rules

- Node.js 22
- TypeScript only
- Use Prisma ORM
- No direct SQL
- Unit tests required
- Follow Clean Architecture
- Run npm test before completion


The better this document is maintained, the more consistently Codex performs.

Practical Use Cases

Codex excels at repetitive and complex engineering tasks.

Some examples I’ve used Codex for include:

Feature development

REST APIs
GraphQL resolvers
UI components
database migrations

Refactoring

rename services
split large classes
introduce dependency injection
improve architecture

Bug fixing

investigate failing tests
locate regressions
repair compilation errors
resolve lint issues

Documentation

generate API documentation
update README files
explain complex modules
document infrastructure

Testing

create unit tests
generate mocks
improve coverage
fix broken test suites

Infrastructure

AWS CDK
Terraform
GitHub Actions
Docker
Kubernetes

Strengths

Codex offers several advantages over traditional AI-assisted coding.

1. Repository Awareness

It understands your project’s structure instead of treating every prompt in isolation.

2. Multi-file Editing

Real-world features often require coordinated changes across many files. Codex can handle those changes in one workflow.

3. Command Execution

Codex can build projects, execute tests, run linters, and validate its own work.

4. Consistency

When provided with project instructions, it follows the team’s engineering standards.

5. Reduced Context Switching

Developers spend less time copying code into chat windows and more time reviewing completed work.

I Don’t Trust AI Agents 100%

I am praising Codex, and yet I still don’t fully trust it. Contradictory? Probably. Despite their capabilities, Codex and other AI agents are not a replacement for seasoned software engineers.

Human judgment remains essential for:

system architecture
product design
business requirements
security decisions
trade-off analysis
stakeholder communication
technical leadership

The best results come from treating Codex as an engineering partner rather than an autonomous replacement.

AI coding agents represent a significant evolution in software development.

Just as integrated development environments displaced plain text editors for much day-to-day work, and CI/CD transformed software delivery, AI agents are reshaping how engineers interact with code.

Rather than focusing on writing every line manually, developers increasingly define objectives, review implementations, and guide architectural decisions while AI handles much of the repetitive engineering work.

Codex exemplifies this shift. It combines repository understanding, code generation, automated validation, and project-specific guidance into a workflow that feels less like using an autocomplete tool and more like collaborating with another engineer.

For organizations willing to invest in clear architecture, strong engineering practices, and well-maintained project documentation, AI coding agents like Codex can significantly accelerate development while allowing engineers to concentrate on solving the problems that require human creativity, judgment, and experience.

Best Practices

Teams adopting Codex tend to achieve better results when they:

Keep repositories well organized.
Maintain clear documentation.
Define coding standards.
Write comprehensive tests.
Provide architectural guidance through AGENTS.md.
Review AI-generated changes before merging.
Use small, well-defined tasks.
Encourage iterative collaboration rather than one-shot prompts.

These practices improve not only AI-generated code but also the overall quality of the software project.



hack with Hyd 2.0 – DEV Community

jackminion Jun 28, 2026 0

Support bots that forget every conversation aren’t support bots. They’re expensive FAQ pages.

I built SupportMind to fix that: a customer support agent that actually remembers.

The architecture is two layers:

Memory (Hindsight): After every interaction, the agent stores structured context in a vector namespace per user. Next session, it recalls semantically: “payment problem” retrieves “Visa charge failing” even if the words don’t match.

Routing (cascadeflow): Not every query needs GPT-4. Password resets go to Groq’s free tier. Complex billing disputes escalate. Every decision is logged with model, cost, latency, and reason.

The delta that matters:

Session 1: “Can you tell me your card details and the error you’re seeing?”

Session 3 (same user, same issue): “I see you’ve had recurring issues with your Visa ending in 4242. Last time, clearing billing cache fixed it. Want to try that first?”

Same infrastructure. Completely different agent.

On a typical support workload: ~80% simple queries handled by the cheap model. Cost per query dropped from ~$0.012 to ~$0.002.

The part I didn’t expect: routing and memory compound. When Hindsight shows a user has had the same issue four times, cascadeflow automatically classifies their next message as complex, even without explicit signals. That fell out of the architecture.

👇 https://lnkd.in/gn8NwP6Z
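The routing idea can be sketched in a few lines of Python. This is not the Hindsight or cascadeflow API; the model names, keyword list, prices, and the repeat threshold of four are all assumptions chosen to mirror the figures in the post, and latency (which the post also logs) would be measured at call time and is left out here.

```python
from dataclasses import dataclass

# All names, thresholds, and prices below are illustrative assumptions.
CHEAP_MODEL = "cheap-model"
STRONG_MODEL = "strong-model"
COST_PER_QUERY = {CHEAP_MODEL: 0.0, STRONG_MODEL: 0.012}
SIMPLE_KEYWORDS = ("password reset", "reset my password")
REPEAT_ISSUE_THRESHOLD = 4


@dataclass
class RoutingDecision:
    model: str
    cost: float
    reason: str


def route(query: str, prior_occurrences: int) -> RoutingDecision:
    """Pick a model; memory of a recurring issue overrides the keyword check."""
    if prior_occurrences >= REPEAT_ISSUE_THRESHOLD:
        model, reason = STRONG_MODEL, "recurring issue in user memory"
    elif any(keyword in query.lower() for keyword in SIMPLE_KEYWORDS):
        model, reason = CHEAP_MODEL, "matched simple-query keyword"
    else:
        model, reason = STRONG_MODEL, "no simple-query match"
    return RoutingDecision(model, COST_PER_QUERY[model], reason)


def blended_cost(simple_share: float) -> float:
    """Average cost per query when simple_share of traffic goes to the cheap model."""
    return (simple_share * COST_PER_QUERY[CHEAP_MODEL]
            + (1 - simple_share) * COST_PER_QUERY[STRONG_MODEL])
```

Under these assumed prices, sending 80% of traffic to a free cheap tier gives `blended_cost(0.8)` of about $0.0024 per query, in the same range as the drop the post reports.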

#AIAgents #AgentMemory #Hindsight #cascadeflow #LLM #AI



My trading bot said it was trading for four days… he was lying

jackminion Jun 26, 2026 0

Twenty-five days on Hyperliquid. Sixty-five closed trades. P&L: -$9.21.

Turns out that was the smallest wrong thing about it.

The landing page showed -$7.72 because it uses a different P&L formula and excludes two open positions. Either number is small. Both numbers were also wrong about what they were telling me.

I spent yesterday auditing every trade. The audit produced three findings I did not expect. Each one was a different kind of wrong.

This is the first post in a series about ziom trader, my small AI-assisted crypto trading bot. “Ziom” is Polish for buddy, mate, or dude depending on who’s talking. The name is unserious on purpose. The system is not.

This is not a “watch me print money” series. The number is negative. Good.

The point of the series is to track what happens when an LLM-assisted trading system moves from backtests and dashboards into live execution: where the bot is wrong, where the dashboard is wrong, where I am wrong, and which layer gets to prove it.

Frame

The natural first read of -$9.21 is “the strategy is losing money.” That read assumes the displayed P&L attributes to the strategy. It does not.

The number that shows up at the surface is the sum of at least three different layers: the strategy itself, the execution wrapper around it, and the monitoring layer that observes both. Each layer can author its own kind of failure. The displayed number compresses all three into a single dollar figure and loses the attribution on the way up.

The framing that landed for me, from Daniel Nevoigt, is that methodology overview without forward-correlation disclosure is a log with good intentions. Same applies to P&L: total P&L without layer-attribution disclosure is a log with good intentions. You see the number. You do not see where it came from.

Here is what I found when I forced the attribution.

Layer 1: Shadow does not equal live

Before deploying any lane, the system runs against backtested data. The shadow says “this strategy returns X over Y trades.” The deploy decision is taken when the shadow looks healthy. The live then runs and produces a different number.

The label for that difference is not “the strategy disappointed.” The shadow is one authority. The live is a different authority. The market authored the failure criterion, not the strategy.

This is the version of the seam Christopher Maher named: the bite check did not catch itself, a different rail caught it. Shadow data cannot author its own failure. Only the live market can. And the live market does not tell you which part of the gap is variance, which part is regime drift, and which part is a parameter you forgot to tune.

In this window the funding_divergence_long lane had a shadow edge of +0.355%/trade across n=660 backtested trades, CI95 (+0.085, +0.625). The live for the same lane was -1.10% / trade across 29 live trades. The gap is 1.46 percentage points. At sigma about 2% per trade and n=29, that gap is 3.9 standard errors. Statistically significant negative.

That does not prove the strategy is broken. It proves the shadow and the live disagreed by more than variance would explain. Three explanations remain in play, and the audit can narrow but not resolve them:

June 15 ADA outlier was -$2.25, -5.64%, which is 3.6 sigma from shadow mean. One trade is doing structural work in a small sample.
Edge is not durable across this BTC window. June saw recovery to reversal.
Exit configuration choices let losers run.

50 to 100 more trades are needed to separate these. I am not separating them today. The label for this section is AMBIGUOUS and I am pinning it to that label until the sample doubles.
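The 3.9 standard-error figure for the shadow-vs-live gap can be reproduced with a short calculation. This sketch assumes the per-trade sigma of 2% quoted above and treats the shadow mean as a fixed reference point; the variable names are illustrative.

```python
import math

shadow_edge_pct = 0.355   # shadow mean return per trade, in percent
live_edge_pct = -1.10     # live mean return per trade, in percent
sigma_pct = 2.0           # assumed per-trade standard deviation, in percent
n_live = 29               # number of live trades

# Gap between what the shadow promised and what the live lane delivered
gap_pct = shadow_edge_pct - live_edge_pct

# Standard error of the live mean under the assumed sigma
standard_error = sigma_pct / math.sqrt(n_live)

# How many standard errors the gap spans
z_score = gap_pct / standard_error

print(f"gap = {gap_pct:.3f} pp, SE = {standard_error:.3f} pp, z = {z_score:.1f}")
```

Folding in the shadow estimate's own uncertainty would lower the figure somewhat, which is one more reason to keep the AMBIGUOUS label until the sample grows.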

Layer 2: Live displayed does not equal strategy true

Inside the -$9.21, 60% is not strategy. It is system overhead with git commit refs.

The breakdown:

| Cause | Trades | Loss | Commit ref |
|---|---|---|---|
| oi_surge LONG with no regime gate, ran in bear | 3 | -$1.45 | gate added 2d10e326 Jun 11 |
| whale lane missing max_per_coin cap | 6 | -$0.95 | cap added 5bd9eaaf Jun 9 |
| whale_footprint as dead lane before disarm | 26 | -$2.71 | disarmed 18d937aa Jun 13 |
| oi_surge LONG as dead lane, 1 trade Jun 12 | 1 | -$0.38 | not explicitly disarmed in this window |

Total system overhead: -$5.49 across 36 trades, 60% of the loss.
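The overhead total and its share of the displayed loss follow directly from the four rows of the breakdown. A minimal sketch of that arithmetic, using `Decimal` to avoid float rounding on dollar amounts (the short row labels are paraphrased for illustration):

```python
from decimal import Decimal

# (cause, trades, loss in USD) taken from the breakdown rows
overhead = [
    ("oi_surge LONG, no regime gate", 3, Decimal("-1.45")),
    ("whale lane, no max_per_coin cap", 6, Decimal("-0.95")),
    ("whale_footprint dead lane", 26, Decimal("-2.71")),
    ("oi_surge LONG dead lane", 1, Decimal("-0.38")),
]
displayed_pnl = Decimal("-9.21")

overhead_loss = sum(loss for _, _, loss in overhead)
overhead_trades = sum(trades for _, trades, _ in overhead)
overhead_share = overhead_loss / displayed_pnl

print(f"overhead: {overhead_loss} USD across {overhead_trades} trades "
      f"({overhead_share:.0%} of displayed loss)")
```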

Sixty percent of the loss has an audit trail. Most of it has a git commit. All of it is a different kind of wrong than “the signal failed.”

Each line has either a commit hash that closes the gap or a seam that the audit made visible. None of it is the strategy in the sense of “the signal was wrong.” All of it is the system in the sense of “the rail that would have stopped this did not exist yet.”

Sean Burn names it right: show the seam, do not hide it. Show that 60% of this loss is closed by commits that exist now and did not exist on June 6. Do not collapse “system” and “strategy” into one bucket called “the bot lost money.” They are different authors of the same dollar.

The remaining 40% is funding_divergence_long (-$4.15 across 32 trades) and oi_surge_fade (+$0.13 across 2 trades). The funding_long line is the one with the shadow-vs-live gap from Layer 1. Without the ADA outlier and without the execution gap I will describe next, the lane runs at -$1.47 across 28 trades, or -$0.05 / trade. That is noise floor for this sample size, not strategy quality. Treat it that way.

Layer 3: Visible live does not equal what the driver attempted

The third finding had no warning. The first two were inventory work. This one was structural.

Between June 18 10:01 UTC and June 22 16:01 UTC, the funding_divergence_long driver was armed. The run_summary events in the database show armed=true, placed=1 for the entire 4-day window, roughly 20 to 30 cycles. The positions table for the same window shows zero new fills. The events table shows zero execution_error events.

The dashboard read placed=1. The exchange acknowledgement layer wrote placed_ok=0. The error path that would have written an execution_error row never ran, because the code that throws the exception was caught somewhere upstream without incrementing the error counter.

For four days, the driver said it was trading. The exchange said it was not. The events table said nothing.

The audit trail itself was lying.
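A hypothetical Python sketch of this failure shape, not the bot's actual code: `ExchangeRejected`, `CycleLog`, `send_order`, and both `run_cycle_*` functions are invented for illustration. The first version swallows the exception, so the attempt counter moves while the error log stays empty. The second records the failure.

```python
from dataclasses import dataclass, field


class ExchangeRejected(Exception):
    """Hypothetical error raised when the exchange refuses an order."""


@dataclass
class CycleLog:
    placed: int = 0        # orders the driver attempted
    placed_ok: int = 0     # orders the exchange acknowledged
    errors: list[str] = field(default_factory=list)


def send_order(accepted: bool) -> None:
    """Stand-in for the exchange call."""
    if not accepted:
        raise ExchangeRejected("order rejected")


def run_cycle_silent(log: CycleLog, accepted: bool) -> None:
    log.placed += 1
    try:
        send_order(accepted)
        log.placed_ok += 1
    except Exception:
        pass  # swallowed: no error row is written, placed still reads 1


def run_cycle_recorded(log: CycleLog, accepted: bool) -> None:
    log.placed += 1
    try:
        send_order(accepted)
        log.placed_ok += 1
    except ExchangeRejected as exc:
        # The failure becomes a visible row instead of disappearing
        log.errors.append(f"execution_error: {exc}")
```

After one rejected order, the silent version ends with `placed=1`, `placed_ok=0`, and an empty error list, which is the same shape the dashboard showed for four days.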

The framing from L. Cordero applies: trust retrieval, verify recall. The placed=1 counter was the system retrieving its own belief. The actual position state was the recall, and the recall path was broken. The two layers diverged silently, and the dashboard was reading the wrong one.

The framing from Todd Hendricks applies: big number, wrong metric. placed=1 is a big number. placed_ok=0 is the meaningful one. The system displayed the big one. I deployed the wrong dashboard.

The fix landed today, after the audit, after a peer who runs a different read-the-chain product confirmed independently that the seam between an attempted read and a verified read is where this class of bug lives. His phrase for the right default: incomplete by default. Anything not explicitly classified as a verified result is unknown, not zero. Zero and unknown render visually distinct. The pipeline carries the distinction all the way to the surface.
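One way to express "incomplete by default" is a three-state result where only an explicit exchange answer moves an attempt out of unknown. The sketch below is an assumption-laden illustration: the `FillStatus` enum, the acknowledgement dict with a `"status"` key, and the `render` format are invented here and do not reflect any real exchange API.

```python
from enum import Enum
from typing import Optional


class FillStatus(Enum):
    CONFIRMED = "confirmed"
    REJECTED = "rejected"
    UNKNOWN = "unknown"


def classify(ack: Optional[dict]) -> FillStatus:
    """Anything not explicitly verified stays UNKNOWN, never zero."""
    if ack is None:
        return FillStatus.UNKNOWN
    if ack.get("status") == "filled":
        return FillStatus.CONFIRMED
    if ack.get("status") == "rejected":
        return FillStatus.REJECTED
    return FillStatus.UNKNOWN


def render(attempted: int, acks: list[Optional[dict]]) -> str:
    """Show unknown as its own count so it cannot be mistaken for zero."""
    statuses = [classify(ack) for ack in acks]
    # Attempts that produced no acknowledgement at all are also unknown
    statuses += [FillStatus.UNKNOWN] * (attempted - len(acks))
    confirmed = statuses.count(FillStatus.CONFIRMED)
    rejected = statuses.count(FillStatus.REJECTED)
    unknown = statuses.count(FillStatus.UNKNOWN)
    return (f"attempted={attempted} confirmed={confirmed} "
            f"rejected={rejected} unknown={unknown}")
```

With five attempts and three acknowledged fills, `render` reports two unknowns as a separate count, so the surface never collapses "not seen" into "zero".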

Impact: ESTIMATED. Roughly 20 to 30 signals were missed at ~$15 notional each. If the shadow edge held, that is plus or minus $1 to $1.50 that never reached the displayed P&L. The honest label is ESTIMATED because I cannot know which way the missed trades would have gone.

What the audit changes

The displayed loss is -$9.21. The strategy contribution to that loss, after subtracting system overhead and the execution gap and the single 3.6-sigma outlier, is approximately -$1.47 across 28 trades, or -$0.05 per trade. That is noise. The sample is too small to call the strategy good or bad. Forward-test budget: 50 to 100 more trades before any strategy-quality verdict.

The system overhead is closed. The commits exist. The next 50 to 100 trades will run with the regime gate, the max_per_coin cap, the disarmed dead lanes, the corrected verification rail, and the current active lane configuration. If those run and the lane is still -$0.10/trade or worse, the strategy is the problem, not the rails. If they run and the lane comes in at +$0.05/trade or better, the shadow edge held and the previous loss was the rails.

I am locking the test budget in advance: if the next 50 trades come in at -$0.10/trade or worse, I retract the post-fix optimism in this post. The bet is on the rails being the issue, not the signal. I will publish the next breakdown either way.
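Locking a test budget in advance amounts to writing the verdict rule down before the data arrives. A minimal sketch of that rule, using the thresholds stated above; the function name, the 50-trade minimum as a parameter, and the "inconclusive" label for results between the two thresholds are assumptions added for illustration.

```python
def lane_verdict(total_pnl_usd: float, n_trades: int, min_trades: int = 50) -> str:
    """Apply the pre-committed thresholds to a forward-test sample."""
    if n_trades < min_trades:
        return "insufficient sample"
    per_trade = total_pnl_usd / n_trades
    if per_trade <= -0.10:
        return "strategy is the problem"
    if per_trade >= 0.05:
        return "shadow edge held"
    # Between the two thresholds the post makes no commitment (assumed label)
    return "inconclusive"
```

Writing the rule as code before the sample exists removes the temptation to move the thresholds once the result is known.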

Post-audit check

Added 2026-06-25 around 19:15 CEST, roughly 12 hours after the audit opened. I checked.

The first post-audit window did not reproduce the previous failure pattern.

The oi_surge_fade_live SHORT lane produced approximately +$1.38 across 12 post-audit trades, with 10 of 12 green.

That includes AVAX, UNI, ADA, ATOM, FIL, and TIA. The important part is not that the number is green. The important part is that the result came after the audit separated attempted placement from exchange-confirmed placement.

The early read is positive, but narrow.

This is not “the fixes worked.” It is “the first post-audit window did not immediately repeat the old bug shape, and the active lane produced a green early window under the new reporting rail.”

Those are different claims.

I am only making that narrow claim.

What this is not

This is not a how-I-made-money post. The number is negative. It is not large. The strategy is unverified. The audit caught real bugs with commit refs but did not prove the strategy works.

This is also not a how-AI-coded-my-bot post. Claude Code wrote large parts of this system. The audit found multiple places where the same author, me with model assistance, wrote both the action layer and the layer that was supposed to verify the action. Single-author audit trails lie. That part is on the system design, not on the model.

What this is, is the breakdown that should sit underneath any small displayed number from any algorithmic trading or autonomous agent system. Three different kinds of wrong. Three different authors of the same dollar. The displayed number is one of them. The other two are invisible by default.

Series contract

This series will track ziom trader as a live system, not as a performance claim.

I will publish the boring parts: small losses, missed fills, broken counters, stale assumptions, dashboard lies, audit fixes, and retractions when the next sample contradicts the previous read.

No alpha claims. No “the bot works” until the forward sample earns that sentence. No hiding the layer that authored the failure.

Peer credits

The vocabulary that made this audit possible came from people writing about adjacent problems in adjacent domains.

None of these people were writing about trading bots. Some were writing about incident reports, some about agent systems, one about a read-chain product.

The overlap was not planned. That’s the point.

Daniel Nevoigt: “methodology overview without forward-correlation disclosure is a log with good intentions”
Christopher Maher: “the bite check did not catch itself, a different rail caught it”
L. Cordero: “trust retrieval, verify recall”
Sean Burn: “show the seam, do not hide it”
Todd Hendricks: “big number, wrong metric”
TxDesk, ratifying the placed=1/placed_ok=0 framing in a different domain this morning: “incomplete by default”

That is why I am leaving the credits in the post. The vocabulary did not decorate the audit. It changed what the audit could see.

What you can take from this

If you run a live system, look for the layer where your own code writes both the action and the verification. That is where this class of bug lives. The fix is not only better testing. The fix is making the action layer and the verification layer be authored by different code paths, ideally by different authors, with the verification path explicitly classifying anything it did not see as incomplete by default.

Render the difference, not the success. Five attempted and three succeeded is a normal display state. Five attempted and unknown succeeded is the state your dashboard probably hides today.

That is the line the audit drew.

If you are the bot, you do not get to be the auditor.
