DAILY NEWS

Stay Ahead, Stay Informed – Every Day

Advertisement
Why Your Gemini Bill Doesn’t Match the Model Names


Why Your Gemini Bill Doesn’t Match the Model Names

tl;dr – Across roughly 3,300 paired skill-eval runs, Gemini 3.5 Flash cost $1.05 per task against Gemini 3.1 Pro’s $0.66, for scores that were effectively identical: 88.6 versus 87.9.

The pricing is even stranger when you look at the actual task costs. Gemini 3.5 Flash and Gemini 4.5 Flash are separated by almost 8× in per-task cost, while Gemini 3.1 Pro comes in cheaper than both. The invoice does not appear to follow the naming hierarchy.

Where the numbers come from?

The benchmark ran every task twice, once with the relevant skill applied and once without, across four Gemini models in OpenHands, totaling roughly 800 tasks per model. Rather than relying on dashboard estimates, we pulled per-call token counts directly from agent session logs and computed costs using Google’s published per-token prices. We then compared the resulting per-task costs across models.

The headline data

Model
$/task (w/ skill)
Score
Pts per $
Input tokens
Turns
List $/Mtok

3.1 Flash Lite
$0.035
70.2
2,006
0.31M
17
$0.25

3 Flash Preview
$0.135
85.4
633
0.63M
24
$0.50

3.1 Pro Preview
$0.66
87.9
132
0.65M
26
$2.00

3.5 Flash
$1.05
88.6
85
1.41M
39
$1.50

A few things stand out from this data.

Cost order and name order are uncorrelated. Gemini 3.1 Pro is cheaper per task than Gemini 3.5 Flash despite carrying a higher per-token list price, while Gemini 4.5 Flash and Gemini 4.5 Flash-Lite, which sit in the same product family, differ dramatically in actual spend. Model names describe intended positioning, but they are a poor guide to real-world agent costs.
Scores do improve with each model generation, which is a genuine positive trend and a good reason to track releases, but capability gains do not automatically translate to cost reductions.
Finally, the practical value pick is Gemini 3 Flash Preview, which lands within three points of the leading models at roughly one-fifth the per-task cost, making it the most efficient option for workloads where a score in the 85 range is acceptable.

Why volume beats unit price

The cost of an agentic task is the product of two variables:

`Task cost = price-per-token × tokens the model decides to spend`

Enter fullscreen mode

Exit fullscreen mode

Model names establish the first variable. The second is determined at runtime by the model’s behavior on the specific task, and it only becomes visible after you read your session logs.

For Gemini 3.5 Flash, the per-task cost breaks down as follows:

Non-cached input: $0.72

Cache-read input: $0.14

Output (including thinking): $0.19

The dominant driver is input volume. Gemini 3.5 Flash sent 1.41 million tokens of context across 39 agent turns per task. Pro sent roughly half that volume across 26 turns, and even at its higher list price of $2.00 per million tokens, its lower volume resolves to a lower total bill.

A model with a cheaper per-token rate that takes more turns to reach an answer will erode its own discount. It is also worth noting that 63-75% of input across these runs was cache-read, which means the effective sensitivity to turn count is even higher than raw list prices suggest: the multiplier is accumulating in your session logs, not on your pricing page.

Skills move cost by tier

Adding a relevant skill to each run changed per-task cost in opposite directions depending on which model ran it:

Pro saw cost drop $0.20 per task (-23%) while the score gained 20 points. The model used fewer turns and less exploratory backtracking, which suggests it was able to act on the structured guidance directly rather than discovering the solution path through iteration.
3.5 Flash was essentially flat, with cost shifting by less than $0.03 in either direction.
3 Flash Preview and Flash Lite each spent slightly more tokens for marginal score gains (+$0.03 and +$0.01 respectively).

The underlying pattern is consistent: a skill compresses the solution path for a model capable of following structured guidance precisely, reducing turn count and therefore total cost. For a model still resolving ambiguity through exploration, the same skill adds context to process rather than a shortcut to apply, and the cost holds steady or rises marginally. A skill is a shortcut for a capable model and overhead for a weaker one.

In practical terms, this produces two clear operating points. Pro with a relevant skill at $0.66 per task is the most cost-efficient route to top-tier performance. Gemini 3 Flash Preview with a skill at $0.135 per task delivers roughly five times the score-per-dollar of either leader, for a score three points lower, which is a reasonable trade for many workloads.

Measure, don’t assume

Four takeaways from this data that apply beyond this specific benchmark:

1/ Do not budget from the rate card. Cost your workload based on measured tokens and turns on your specific tasks, with your specific prompts, in your specific agent harness. Per-token list prices are a useful first filter for ordering candidates, not a reliable predictor of relative spend.

2/ Read cost at the session layer. Aggregate dashboards can show $0 while spend accumulates in the background. Token usage needs to come from raw API responses or agent session logs to be trusted for budgeting purposes.

3/ Watch turn count first. The 39-versus-26 turn gap between 3.5 Flash and Pro is the primary cause of the price inversion observed here, and turn count is the variable most commonly absent from observability tooling. It is the multiplier on everything else in the cost equation.

4/ Re-measure when models update. Gemini 3.5 Flash is a newer release than Gemini 3 Flash Preview and scores higher, but it costs roughly eight times more in this agentic context. Capability improvements and cost improvements are independent variables, and any cost benchmark needs to be re-run with each version update rather than assumed to hold.

Caveats

These results come from a single agent harness (OpenHands), a single benchmark with explicit skill-relevance disclosure, and a specific sample window. Different tasks, prompt structures, and turn-length patterns will shift the absolute numbers and may shift the relative rankings. The finding to carry forward is not a specific model recommendation but a methodology: in agentic settings, cost rankings are not derivable from per-token rates alone, and the ranking that applies to your workload depends on that workload’s specific behavioral profile.

A model name is a pricing tier, not a cost forecast. In agentic workflows, the deciding variable is how many tokens the model chooses to spend to reach an answer, a figure visible only after you run the work and read the logs. The rate card gives you one of the two inputs; only measurement gives you both.

Next: which skills actually earn their tokens? In these runs, 42% produced significant performance gains while 5% were net overhead. We’ll follow up on this analysis in the next post.



Source link

Introducing Joanium: An Open-Source AI Desktop App That Can Actually Get Things Done



AI assistants today live in browser tabs. You open one for writing, another for code, a third because it’s better at research, and a fourth because your team uses it. Each one is locked to its own provider, its own pricing, and its own walled-off context. None of them can touch your files, run a command, or carry what they learned from one session into the next.

Joanium is built to change that.

Joanium is a local-first, open-source AI desktop app that runs on your machine, connects to nearly every major AI provider, and gives that AI the tools to actually act — not just respond.

Website: joanium.comSource code: github.com/Joanium/Joanium

One App, Every Model

Joanium supports Gemini, Claude, OpenAI models, and a growing list of additional providers including Fireworks, SambaNova, AI21, Lambda, and Hyperbolic — with live model fetching, so new releases show up without waiting on an update. You bring your own API keys and choose the right model for the task, the budget, or the moment, without rebuilding your workflow every time you switch.

An AI With Hands: 160+ Tool Integrations

Most AI tools can describe what they’d do. Joanium’s agents can go do it. With 160+ built-in tool integrations — covering platforms like GitHub, Gmail, YouTube, Linear, Netlify, Canva, and Stripe — Joanium can read your repos, draft and send emails, manage tickets, and interact with the services you already use, directly from a single interface.

Multiple Agents, Working Together

Rather than relying on one general-purpose assistant for everything, Joanium supports multi-agent execution along with a skills and personas system. You can configure agents for specific roles and let them collaborate on a task — closer to delegating work across a small team than managing a single chat window.

Full Visibility Into What Your AI Did

Two features sit at the core of how Joanium handles transparency:

Execution Replay — step through exactly what an agent did, in order, after the fact.Conversation forking with provenance tracking — branch a conversation in a new direction without losing the trail of where it came from.

When an AI is taking real actions on your behalf, being able to see — and audit — those actions isn’t a nice-to-have. It’s the baseline.

Built-In Browser, Git Integration, and a Marketplace

Joanium also includes an inbuilt browser for agents to use directly, native Git integration for working with repositories, a Daily Digest Agent that surfaces relevant updates automatically, and a Marketplace for discovering and installing additional tools and skills as the ecosystem grows.

Why Local-First, and Why Open Source

Most AI tools today route everything — your prompts, your files, your context — through someone else’s servers, under someone else’s terms. That tradeoff has been easy to ignore when AI was mostly a chat window. It gets harder to ignore once that AI is reading your emails, browsing on your behalf, and pushing to your repositories.

Joanium runs locally, and its source is fully open under the Apache 2.0 license. That means the code doing all of this is inspectable, forkable, and not contingent on a company’s roadmap or pricing decisions.

It’s also worth being precise about one thing: Joanium itself is free and open source, but using it still involves paying AI providers for API usage — there’s no way around that, and no product can honestly claim otherwise. What changes is who you pay and how much control you have over that. Instead of stacking multiple subscriptions regardless of use, you pay per request, choose your providers, and switch whenever it makes sense. For many people juggling several AI subscriptions today, that alone is a meaningful shift.

Get Started

Joanium is available now:

Website: joanium.comGitHub: github.com/Joanium/Joanium

Download it, connect an API key for a provider you already use, and explore what it can do. As an open-source project, feedback, issues, and contributions are welcome — and actively shape what gets built next.



Source link

Open-Source AI, Hugging Face, and the Building Blocks of Modern AI Development



Open-source AI has made it much easier for developers to experiment with powerful models without building everything from scratch.

Today, we have access to platforms, libraries, and tools that allow us to run text models, audio models, image-generation models, and even large language models with just a few lines of code. One of the biggest names in this ecosystem is Hugging Face.

Hugging Face has become a central place for working with open-source AI models, datasets, and applications. But to use it properly, it is important to understand the ecosystem around it — models, datasets, pipelines, tokenizers, transformers, quantization, and tools like Google Colab.

This blog gives a simple overview of these concepts and how they fit together.

What is Hugging Face?

Hugging Face is an open-source AI platform that provides access to pre-trained models, datasets, and demo applications.

It has three major parts:

1. Models

Models are pre-trained AI systems that can perform specific tasks.

For example, there are models for:

Text generation
Sentiment analysis
Translation
Question answering
Image generation
Speech recognition
Code generation

Instead of training a model from scratch, developers can use these pre-trained models and build applications on top of them.

2. Datasets

Datasets are collections of data used to train, fine-tune, or evaluate models.

Hugging Face provides access to many public datasets for NLP, vision, audio, and other AI tasks.

3. Spaces

Spaces are demo applications hosted on Hugging Face.

They are often built using tools like Gradio or Streamlit and allow developers to showcase AI projects directly in the browser.

Hugging Face Libraries

Hugging Face is not just a website. It also provides Python libraries that make AI development easier.

Some of the most important libraries are:

Transformers

The transformers library is used to load and run pre-trained models.

It supports many model families and tasks, including text generation, classification, summarization, translation, question answering, speech recognition, and image-related tasks.

Datasets

The datasets library is used to load and process datasets efficiently.

It helps when working with training data, evaluation data, or custom datasets.

Hub

The Hugging Face Hub allows developers to access, upload, and share models, datasets, and applications.

Together, these libraries make it easier to build AI applications with less boilerplate code.

Why Google Colab is Useful for AI Development

One major challenge in AI development is hardware.

Many models require GPUs, and not every developer has a powerful machine. Google Colab helps solve this problem by providing a browser-based Python environment with access to free or paid GPUs.

Colab is useful for:

Running AI/ML notebooks
Testing Hugging Face models
Running GPU-based experiments
Training or fine-tuning smaller models
Trying image, audio, and text models without local setup

For beginners, Colab is especially useful because it removes a lot of installation and hardware-related friction.

Running AI Models with Pipelines

One of the easiest ways to use Hugging Face models is through pipelines.

A pipeline is a high-level API that combines multiple steps into one simple interface.

Usually, running a model involves:

Loading the tokenizer
Loading the model
Preparing the input
Running inference
Processing the output

A pipeline hides much of this complexity.

Example:

from transformers import pipeline

classifier = pipeline(“sentiment-analysis”)

result = classifier(“Open-source AI is making development more accessible.”)
print(result)

Enter fullscreen mode

Exit fullscreen mode

This can return an output showing whether the sentence is positive or negative.

Pipelines are available for many tasks, including:

Sentiment analysis
Text generation
Named Entity Recognition
Question answering
Summarization
Translation
Speech recognition
Image classification

This makes pipelines one of the best starting points for quickly testing AI capabilities.

Common NLP Tasks: Sentiment Analysis, NER, and Question Answering

Hugging Face models can be used for many practical NLP tasks.

Sentiment Analysis

Sentiment analysis detects whether a piece of text is positive, negative, or neutral.

It is commonly used in:

Product reviews
Customer feedback
Social media analysis
Brand monitoring

Named Entity Recognition

Named Entity Recognition, or NER, identifies important entities in text.

For example, it can detect:

Person names
Organizations
Locations
Dates
Skills
Products

NER is useful in resume parsing, document processing, search systems, and information extraction.

Question Answering

Question-answering models can extract answers from a given context.

For example, if a paragraph says that Google Colab provides GPU access, the model can answer:

Question: What does Google Colab provide?Answer: GPU access.

This is useful for document assistants, search tools, and chatbot systems.

Audio Models: Whisper

Open-source AI is not limited to text.

Whisper is a speech recognition model used to convert audio into text.

It can be used for:

Meeting transcription
Podcast transcription
Subtitle generation
Voice assistants
Audio note-taking

A basic voice AI workflow can look like this:

User speech → Whisper → Text → LLM → Response

Enter fullscreen mode

Exit fullscreen mode

This is the foundation of many voice-based AI applications.

Image Generation with Stable Diffusion and FLUX

Image-generation models allow users to create images from text prompts.

Two popular examples are:

These models can be used for:

Content creation
Design
Concept art
Marketing visuals
Product mockups
Creative experiments

Because image-generation models can be resource-heavy, they are commonly run on GPUs using platforms like Google Colab.

What are Tokenizers?

Large language models do not directly understand raw text.

Before text is passed into a model, it is converted into smaller units called tokens. These tokens are then converted into numerical IDs.

This process is called tokenization.

A simple flow looks like this:

Text → Tokens → Token IDs → Model

Enter fullscreen mode

Exit fullscreen mode

Tokenizers usually provide two important methods:

encode() converts text into token IDs.

decode() converts token IDs back into readable text.

Tokenization matters because model input limits are measured in tokens, not words. When people say a model has an 8k, 32k, or 128k context window, they are talking about token capacity.

Special Tokens and Chat Templates

Some tokens have special meaning.

These are called special tokens.

They can represent things like:

Start of text
End of text
System message
User message
Assistant message

Chat models also use chat templates to structure conversations properly.

For example, a chat template helps the model understand which part of the input is the system instruction, which part is the user’s message, and where the assistant should respond.

Using the wrong chat template can reduce model performance because different models expect different input formats.

Why Different Tokenizers Matter

Different models use different tokenizers.

The same sentence may be split differently by LLaMA, DeepSeek, Qwen, or other model families.

This affects:

Token count
Speed
Context usage
Cost
Model behavior

For example, if one tokenizer converts a sentence into fewer tokens than another, it may use less context and run slightly more efficiently.

This becomes important when working with long prompts, documents, or retrieval-augmented generation systems.

Transformers: The Architecture Behind Modern LLMs

Transformers are the foundation of modern large language models.

The key idea behind transformers is attention.

Attention allows a model to focus on relevant tokens while processing input and generating output.

This is what helps models understand relationships between words, context, and meaning.

Transformers are used in:

Chatbots
Text generation
Translation
Summarization
Code generation
Multimodal AI systems

Most modern LLMs are based on transformer architecture.

Quantization: Making Models Smaller

AI models contain millions or billions of parameters.

These parameters are stored as numbers. Usually, they may be stored in formats like 32-bit or 16-bit precision.

Quantization reduces the precision of these numbers.

For example:

32-bit → 16-bit → 8-bit → 4-bit

Enter fullscreen mode

Exit fullscreen mode

The goal is to make models smaller and easier to run.

Benefits of quantization:

Lower memory usage
Faster inference
Easier deployment on limited hardware
Ability to run larger models on smaller GPUs

The trade-off is that extreme quantization may reduce output quality slightly. But in many practical cases, quantized models work well enough for real applications.

LLaMA-Style Model Architecture

LLaMA-style models follow the general transformer-based language model flow.

A simplified version looks like this:

Text → Tokens → Token IDs → Embeddings → Decoder Layers → Output

Enter fullscreen mode

Exit fullscreen mode

The important parts are:

Token Embeddings

Token IDs are converted into vectors called embeddings.

These embeddings help the model represent the meaning of tokens numerically.

Decoder Layers

Decoder layers process the input step by step and help the model generate the next token.

Attention

Attention helps the model decide which tokens are important in the current context.

Together, these parts allow the model to generate coherent and context-aware responses.

How These Concepts Connect

All these concepts are connected in the AI development workflow.

For example, if you are building a chatbot, the flow may look like this:

User input → Tokenizer → Model → Generated output → Decoding → Response

Enter fullscreen mode

Exit fullscreen mode

If you are building a voice assistant, the flow may become:

User speech → Whisper → Text → Tokenizer → LLM → Response

Enter fullscreen mode

Exit fullscreen mode

If you are building an image-generation tool:

Prompt → Text encoder/model → Diffusion model → Generated image

Enter fullscreen mode

Exit fullscreen mode

Platforms like Hugging Face and Google Colab make these workflows easier to experiment with and build upon.

Final Thoughts

Open-source AI has made powerful AI development more accessible than ever.

With platforms like Hugging Face, developers can use pre-trained models, datasets, and demo applications without starting from zero. With Google Colab, they can run experiments on GPUs without needing expensive local hardware.

But using these tools effectively requires understanding the basics behind them.

Concepts like tokenizers, pipelines, transformers, quantization, embeddings, and model architecture are not just theoretical terms. They directly affect how AI models are used, optimized, and deployed.

The more clearly we understand these building blocks, the better we can use open-source AI to build practical applications across text, audio, images, and automation.



Source link