DAILY NEWS

Stay Ahead, Stay Informed – Every Day

We Asked 10 LLMs to Write Efficient Code. Only 4 Got Better.



By Vilius Vystartas | May 2026

Every LLM can write code that works. The question is: can they write code that’s efficient — and does telling them to be efficient actually help?

I tested 10 models on 10 coding tasks, each in two phases: unprompted (the model writes its own code) and prompted (explicitly told to write clean, DRY, efficient code). That’s 200 API calls, $0.56 total. The results are… not what most prompt engineers would predict.

GPT-5.4 got the largest boost from prompting (+0.20), with Qwen 3.6 Plus close behind (+0.17). For most models, the “write efficient code” prompt was meaningless or actively harmful.

How the Metric Works

Each task has a known optimal token budget — the minimum tokens needed to produce correct, DRY code for that task (e.g., 70 tokens for 10 styled buttons using CSS classes vs 340 tokens for 10 separate button blocks). The efficiency score is optimal_tokens / actual_tokens, capped at 1.0.

A score of 0.63 means the model used about 1.6x the optimal — not bad. A score of 0.43 means it used about 2.3x the optimal. The gap between unprompted and prompted tells you whether the “write efficient code” instruction actually changes behaviour.
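As a minimal sketch of the scoring rule described above (the function name and the input validation are my own additions, not the benchmark's actual code), the ratio and its cap look like this in Python:

```python
def efficiency_score(optimal_tokens: int, actual_tokens: int) -> float:
    """Return optimal_tokens / actual_tokens, capped at 1.0."""
    if optimal_tokens <= 0 or actual_tokens <= 0:
        raise ValueError("token counts must be positive")
    return min(1.0, optimal_tokens / actual_tokens)


# The button example from the text: 70 tokens with CSS classes,
# 340 tokens with 10 separate button blocks.
print(round(efficiency_score(70, 70), 2))   # 1.0
print(round(efficiency_score(70, 340), 2))  # 0.21

# Reading a score backwards: 1 / score is the multiple of the optimum used.
print(round(1 / 0.63, 1))  # 1.6
print(round(1 / 0.43, 1))  # 2.3
```

The cap means a model that undershoots the budget is not rewarded beyond 1.0, so the score only measures bloat.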

The Leaderboard (Sorted by Prompted Efficiency)

| # | Model | Unprompted | Prompted | Δ | Frugal | Cost | Correctness |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 🥇 | GPT-5.4 | 0.43 | 0.63 | +0.20 | 30% | $0.096 | 78% → 85% |
| 🥈 | Qwen 3.6 Plus | 0.44 | 0.60 | +0.17 | 40% | $0.158 | 78% → 87% |
| 🥉 | Gemma 4 31B | 0.54 | 0.58 | +0.04 | 50% | $0.003 | 92% both |
| 4 | DeepSeek Chat | 0.51 | 0.55 | +0.04 | 30% | $0.006 | 91% → 80% |
| 5 | Claude Sonnet 4 | 0.47 | 0.52 | +0.04 | 40% | $0.121 | 92% both |
| 6 | LFM 2 24B A2B | 0.54 | 0.47 | -0.06 | 30% | $0.001 | 90% → 80% |
| 7 | Mistral Large 2411 | 0.54 | 0.46 | -0.08 | 40% | $0.050 | 90% → 82% |
| 8 | Gemini 2.5 Flash | 0.47 | 0.46 | -0.01 | 50% | $0.020 | 92% → 90% |
| 9 | Cohere Command A | 0.60 | 0.44 | -0.17 | 40% | $0.071 | 90% → 82% |
| 10 | Kimi K2.6 | 0.34 | 0.43 | +0.09 | 30% | $0.029 | 76% → 86% |

What Stands Out

GPT-5.4 Is the Prompt Whisperer

GPT-5.4 improved on 7 of 10 tasks when prompted for efficiency. The biggest wins were config-generation (+0.81 — went from 12 inline JSON blocks to a template loop), html-from-data (+0.71), and magic-strings (+0.38 — switched to an Enum). It’s the only model in the batch where the “write efficient code” instruction consistently produces different (and better) output.
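The article does not show the model's actual output for magic-strings, so the following is only a sketch of what “switched to an Enum” typically means. Every name here (`OrderStatus`, `can_cancel`, the status values) is hypothetical and chosen for illustration:

```python
from enum import Enum


# Before: the same string literals repeated wherever a status is checked.
def can_cancel_before(status: str) -> bool:
    return status == "pending" or status == "processing"


# After: the literals live in one place and typos fail loudly.
class OrderStatus(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    SHIPPED = "shipped"


CANCELLABLE = frozenset({OrderStatus.PENDING, OrderStatus.PROCESSING})


def can_cancel(status: OrderStatus) -> bool:
    return status in CANCELLABLE


print(can_cancel(OrderStatus("pending")))  # True
print(can_cancel(OrderStatus.SHIPPED))     # False
```

Looking up a member by value, as in `OrderStatus("pending")`, raises `ValueError` for an unknown string, which is the practical benefit over scattered literals.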

The cost is worth noting: at $0.10 for 20 tasks, GPT-5.4 is the third most expensive of the ten models, well below Qwen 3.6 Plus but far above the cheapest options. But the efficiency gain is real.

Gemma 4 31B: The Quiet Winner

Half of Gemma 4’s tasks were already “frugal”: naturally efficient without being told. It scored 92% correctness on both phases at just $0.003 total. At the rounded figures in the leaderboard, that’s roughly a 30x cost advantage over GPT-5.4 ($0.096 vs $0.003), with higher correctness and competitive efficiency. For high-volume production where you want concise, correct code, Gemma 4 31B is the value pick of this batch.

Cohere Command A: Prompting Backfires

Cohere Command A had the highest unprompted efficiency in the batch (0.60) — it naturally writes concise code. But when told “write efficient code,” it ballooned output on several tasks. html-from-data went from a tight 45-token solution to a 600+-token monstrosity (-0.92 gap). The prompt made it overthink.

Lesson: if a model is already efficient, don’t prompt it to be more efficient.

Qwen 3.6 Plus: Second Place, Slowest

Qwen 3.6 Plus placed second in prompted efficiency (0.60, a +0.17 improvement) but took 26 minutes for 20 tasks, by far the slowest model. The efficiency gain is real (especially on html-from-data, where it went from hardcoded rows to a map/join pattern), but you’re waiting for it. Batch workloads only.
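The map/join pattern mentioned here is a JavaScript idiom. Its Python counterpart is a generator expression passed to `str.join`. This is a sketch of the idea only, not Qwen's output; the row fields `name` and `price` are hypothetical:

```python
from html import escape


def render_rows(rows: list[dict]) -> str:
    """Build one <tr> per data row instead of hardcoding each row by hand."""
    return "\n".join(
        f"<tr><td>{escape(str(row['name']))}</td>"
        f"<td>{escape(str(row['price']))}</td></tr>"
        for row in rows
    )


rows = [{"name": "Tea & Co", "price": 4}, {"name": "Mug", "price": 9}]
print(render_rows(rows))
```

The token count of this version stays flat whether there are 2 rows or 20, which is why the hardcoded variant scores so poorly on the metric.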

The Kimi Surprise

Kimi K2.6 had the lowest unprompted efficiency (0.34, with verbose, boilerplate-heavy code) but posted the third-largest gain in the batch (+0.09). It still finished last, but the prompt actually helped it compress, which is the opposite of the Cohere effect. Some models need the nudge.

Frugality: What Does It Mean?

“Frugal” is the share of tasks where the model produced code at or near the optimal token count without being asked. Gemma 4 31B and Gemini 2.5 Flash led at 50%: half their tasks were already efficient. GPT-5.4, DeepSeek Chat, LFM 2 24B A2B, and Kimi K2.6 were only 30% frugal. Of those four, GPT-5.4 and Kimi K2.6 tightened up noticeably when prompted.
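The article does not say how close “at or near” has to be, so the sketch below assumes a cutoff of 0.9 on the unprompted efficiency score. Both the threshold and the sample scores are illustrative assumptions, not benchmark data:

```python
def frugal_rate(unprompted_scores: list[float], threshold: float = 0.9) -> float:
    """Fraction of tasks whose unprompted score meets the (assumed) cutoff."""
    if not unprompted_scores:
        raise ValueError("need at least one score")
    frugal = sum(1 for score in unprompted_scores if score >= threshold)
    return frugal / len(unprompted_scores)


# Made-up scores for one model across 10 tasks, for illustration only.
sample = [1.0, 0.95, 0.9, 0.92, 1.0, 0.2, 0.3, 0.1, 0.4, 0.25]
print(f"{frugal_rate(sample):.0%}")  # 50%
```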

The Bigger Picture

| Group | Models | Behaviour |
| --- | --- | --- |
| Prompt-responsive | GPT-5.4, Qwen 3.6 Plus | Efficiency improves substantially with prompting (+0.17 to +0.20) |
| Prompt-neutral | Gemma 4 31B, DeepSeek Chat, Claude Sonnet 4, Gemini 2.5 Flash, Kimi K2.6 | Prompt has little effect (within ±0.04, except Kimi K2.6 at +0.09) |
| Prompt-antagonistic | LFM 2 24B A2B, Mistral Large 2411, Cohere Command A | Efficiency drops when prompted (-0.06 to -0.17) |
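The grouping can be reproduced from the Δ column of the leaderboard. The article does not state its cutoffs, so the two thresholds below (+0.10 and -0.05) are assumptions picked because they happen to yield the same three groups:

```python
# Prompted-minus-unprompted deltas, copied from the leaderboard.
DELTAS = {
    "GPT-5.4": 0.20,
    "Qwen 3.6 Plus": 0.17,
    "Gemma 4 31B": 0.04,
    "DeepSeek Chat": 0.04,
    "Claude Sonnet 4": 0.04,
    "LFM 2 24B A2B": -0.06,
    "Mistral Large 2411": -0.08,
    "Gemini 2.5 Flash": -0.01,
    "Cohere Command A": -0.17,
    "Kimi K2.6": 0.09,
}


def classify(delta: float, responsive_at: float = 0.10,
             antagonistic_at: float = -0.05) -> str:
    """Bucket a model by its delta; both cutoffs are assumed, not sourced."""
    if delta >= responsive_at:
        return "prompt-responsive"
    if delta <= antagonistic_at:
        return "prompt-antagonistic"
    return "prompt-neutral"


groups: dict[str, list[str]] = {}
for model, delta in DELTAS.items():
    groups.setdefault(classify(delta), []).append(model)

for name, models in groups.items():
    print(f"{name}: {', '.join(models)}")
```

Note that the cutoffs are asymmetric: Kimi K2.6 at +0.09 lands in the neutral group while LFM 2 24B A2B at -0.06 counts as antagonistic.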

The prompt-antagonistic group is the most interesting. These models know how to write efficient code (0.54-0.60 unprompted), but the explicit instruction triggers over-engineering — they add abstractions, comments, error handling, and other bloat that makes the output less efficient by the metric.

If the prompt says “write efficient code” and the model responds by writing more tokens, something in the training signal is misaligned.

My Picks

Best prompted efficiency: GPT-5.4 — 0.63, $0.10 for 20 tasks. The only model where prompting reliably improves output.

Best value overall: Gemma 4 31B — 0.58 prompted, 92% correctness, $0.003. Absurd price/performance.

Best natural efficiency: Cohere Command A — 0.60 unprompted. Don’t prompt it, just let it work.

Most consistent: Claude Sonnet 4 — 92% correctness on both phases, small +0.04 efficiency gain. Reliable.

Skip if you’re in a hurry: Qwen 3.6 Plus — 26 minutes for 20 tasks. Great efficiency gains, terrible latency.

Watch list: Kimi K2.6 — low base efficiency but the prompt actually helps. Worth retesting with a better prompt.

Methodology

Ten real-world coding tasks across CSS, JavaScript, Python, SQL, and bash — each with a known optimal token budget for a correct, DRY solution. Tasks included: styling 10 buttons (CSS), rendering 20 data rows as HTML (JS/HTML), bulk renaming (shell), form validation (Python), parametrized tests (Python), unit conversion (Python), SQL reporting queries, config generation (JSON), magic string replacement (Python/Enum), and middleware decorator pattern (Python/Flask).

Each model ran 10 tasks unprompted, then the same 10 tasks with an efficiency prompt appended. Scoring: efficiency_ratio = optimal_tokens / actual_tokens (capped at 1.0). Correctness scored against expected output patterns.

Total cost: $0.56 for 200 API calls (10 models × 10 tasks × 2 phases). Temperature: 0.1. Max tokens: 600.
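As a quick sanity check on those totals, the per-model costs in the leaderboard can be summed directly. The sketch keeps costs in thousandths of a dollar (an arbitrary choice on my part, to avoid floating-point rounding noise):

```python
# Per-model cost in thousandths of a dollar, copied from the leaderboard.
COST_MILLIDOLLARS = {
    "GPT-5.4": 96,
    "Qwen 3.6 Plus": 158,
    "Gemma 4 31B": 3,
    "DeepSeek Chat": 6,
    "Claude Sonnet 4": 121,
    "LFM 2 24B A2B": 1,
    "Mistral Large 2411": 50,
    "Gemini 2.5 Flash": 20,
    "Cohere Command A": 71,
    "Kimi K2.6": 29,
}

MODELS, TASKS, PHASES = 10, 10, 2
total_calls = MODELS * TASKS * PHASES
total_millidollars = sum(COST_MILLIDOLLARS.values())

print(total_calls)                       # 200
print(total_millidollars / 1000)         # 0.555, i.e. about $0.56
print(total_millidollars / total_calls)  # 2.775 thousandths of a dollar per call
```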

Full results: benchmarks.workswithagents.dev




When APIs Lie: A Lesson in Defensive Debugging



Last week, I spent six hours chasing a ghost in our payment gateway integration. The API returned a 200 OK status, yet the transaction failed silently. My code assumed success based on the status code and moved on, only to discover later that the response body contained an error message buried under a success: false flag. The API wasn’t broken—it was my assumptions that were flawed. I’d treated the integration like a black box, trusting surface-level signals instead of validating the entire payload.

The fix was simple once I found it: I added strict validation for both status codes and response data, logging every field to catch discrepancies early. Now, I treat all API responses like potential liars—checking for hidden errors, rate limits, and unexpected formats, even when everything looks fine on the surface. This experience taught me that defensive programming isn’t paranoia; it’s preparation. APIs are complex ecosystems, and the most “reliable” ones can still surprise you. Always read the fine print in their documentation, and never assume a 200 means everything is okay.
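Here is a minimal sketch of that kind of check, using only the Python standard library. The `success` flag comes from the story above. The `error` field name, the `ApiResponseError` class, and the function name are assumptions for illustration, not the real gateway's API:

```python
import json


class ApiResponseError(Exception):
    """Raised when a response looks fine on the surface but is not a success."""


def parse_payment_response(status_code: int, raw_body: str) -> dict:
    """Validate both the status code and the payload before trusting a response."""
    if not 200 <= status_code < 300:
        raise ApiResponseError(f"unexpected HTTP status {status_code}")
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as exc:
        raise ApiResponseError("body is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise ApiResponseError("expected a JSON object")
    # A 200 with success: false is still a failure.
    if payload.get("success") is not True:
        detail = payload.get("error", "no error message provided")
        raise ApiResponseError(f"API reported failure: {detail}")
    return payload


try:
    parse_payment_response(200, '{"success": false, "error": "card declined"}')
except ApiResponseError as err:
    print(err)  # API reported failure: card declined
```

Treating a missing `success` field as a failure is a deliberate choice here: an unexpected format is exactly the case where assuming success is most dangerous.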




Experienced devs are slower with AI tools. Nobody wants to admit it.



A recent study found that experienced open-source developers were 19% slower when using AI coding assistants. Yet those same developers believed they were 20% faster.

Read it and weep: the disconnect between perception and reality is nearing 40 percentage points.

Why This Should Bother You

This wasn’t just another survey. It compared real task completion times with self-reported productivity. And the senior engineers, the ones we trust to make the big architectural decisions, were confidently wrong about their own speed.

And the industry is building its entire tooling strategy around the opposite assumption.

The Perception Trap

I have a hypothesis to explain this. As a senior dev, a lot of your cognitive load goes to context-switching costs that you aren’t aware of.

You prompt the AI. You read the output. You notice it got the abstraction wrong. You fix it. You re-prompt. You read again. You realize it missed an edge case you would have caught on line three. You fix that too.

Every single step appears to be helping. There is less typing on your part. More code on the screen. More dopamine. But wall-clock time, in total? It takes longer to finish the task.

→ AI output creates an illusion of velocity because characters appear fast
→ Senior devs spend more time reviewing and correcting than they realize
→ The cognitive load of evaluating generated code is real work that doesn’t feel like work

Who Actually Benefits?

This is not meant to be anti-AI. It’s a more nuanced perspective.

AI assistants are genuinely helpful when you are learning a new language or framework. They save you real minutes when you’re writing the boilerplate for the hundredth time. They are a decent starting point when you’re working with an unfamiliar API.

However, if you are familiar with the codebase, already understand the patterns, and can type as fast as you think? The AI is essentially adding an intermediary layer between your brain and the editor. That intermediary has a cost.

The study indicates that for skilled developers, the cost is about 19%. This is almost a full day per week.
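The arithmetic behind both headline numbers is short enough to spell out. The only assumption added here is a five-day working week:

```python
WORKDAYS_PER_WEEK = 5        # assumption: a standard five-day week
SLOWDOWN_PCT = 19            # measured: 19% slower with AI assistants
PERCEIVED_SPEEDUP_PCT = 20   # self-reported: 20% faster

# Gap between what developers believed and what was measured.
perception_gap_points = PERCEIVED_SPEEDUP_PCT + SLOWDOWN_PCT

# Share of the working week consumed by the slowdown.
days_lost_per_week = WORKDAYS_PER_WEEK * SLOWDOWN_PCT / 100

print(perception_gap_points)  # 39
print(days_lost_per_week)     # 0.95
```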

The Industry Doesn’t Want to Hear This

These findings set off a discussion that drew more than 800 comments from experienced developers. Reactions were polarized but revealing. Many senior engineers acknowledged they had sensed this friction but had assumed they were outliers.

They were not the outlier. They were the average.

Meanwhile, every company is mandating AI tool adoption. Every job ad lists Copilot. Every engineering blog is publishing “how we 10x’d with AI” stories. The incentive structure punishes anyone who says “actually, this is slowing me down.”

Nobody wants to be the person who looks like they can’t adapt. So everybody nods along. 🤷

What I Think We Should Do

Stop treating AI coding assistants as universally beneficial. Start treating them like any other tool — useful in specific contexts, counterproductive in others.

→ Measure actual output, not vibes
→ Let senior engineers opt out without stigma
→ Stop conflating “uses AI tools” with “is a modern developer”

The best developers I know are ruthless about removing friction from their workflow. If a tool becomes a hindrance, they eliminate it. We need to allow them to do so.

I have a question for you: have you ever turned off Copilot or a similar tool, felt faster, and been too embarrassed to tell your team?


