{"id":6116,"date":"2026-06-26T03:29:52","date_gmt":"2026-06-25T20:29:52","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=6116"},"modified":"2026-06-26T03:29:52","modified_gmt":"2026-06-25T20:29:52","slug":"which-tokens-does-a-hybrid-model-predict-better","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=6116","title":{"rendered":"Which tokens does a hybrid model predict better?"},"content":{"rendered":"<p> <br \/>\n<br \/> <br \/>\n\ud83d\udcc4 Tech report: https:\/\/arxiv.org\/abs\/2606.20936<\/p>\n<p>Which kinds of tokens does a model predict well, and which does it not? That question is especially intriguing in the case of hybrids, a language model architecture that\u2019s begun to challenge the standard transformer and that we\u2019ve been investigating with Olmo Hybrid.<br \/>\nHybrids can match or beat transformers on standard benchmarks, but the headline numbers don\u2019t reveal much about what specific advantages hybrid models have over transformers.<br \/>\nIn an attempt to shed light on these token-level behaviors, we recently conducted experiments comparing our own strongest 7B transformer, Olmo 3, and hybrid model, Olmo Hybrid, head-to-head. Specifically, we compare the differences in model predictions in a fine-grained way across different types of tokens, or units of information that appear as input to an LLM.<br \/>\nBecause Olmo 3 and Olmo Hybrid were built to be as alike as possible outside their architectures \u2014 closely matched in data, tokenizer, and training recipe \u2014 any difference in their predictions mostly reflects the architecture itself. Viewing these differences at the token level allows us to glean insights about the specific strengths of hybrid models over transformers.<br \/>\nOur results show that the hybrid\u2019s advantage is real across many tokens, but not all. 
Olmo Hybrid is strongest on tokens that carry meaning, such as nouns, verbs, and adjectives, and on tokens that can only be predicted by following what\u2019s going on, like which person a pronoun refers to. But the hybrid\u2019s advantage almost disappears on tokens that simply repeat something already in the input \u2014 a word or phrase reproduced verbatim from earlier \u2014 where the answer is sitting right there to be looked up. That\u2019s where the transformer\u2019s strength lies.<\/p>\n<p>\t\tAttention versus recurrence, and measuring the difference<\/p>\n<p>A language model is built from a stack of repeated layers, each one refining its representation of every token using the tokens around it.<br \/>\nA transformer uses attention in every layer. The model can draw directly on every earlier token at once, weighing how relevant each is to the current prediction. That makes attention good at recalling a specific earlier token exactly, even when that token appeared far back in the input. The catch is that every token is compared against all the earlier ones, so attention\u2019s cost climbs steeply as the input grows. Additionally, although attention is strong at recalling and aggregating information, it struggles to represent information that evolves sequentially over time.<br \/>\nA hybrid model keeps a few attention layers but swaps the rest for recurrent layers. Unlike an attention layer, a recurrent layer reads tokens left to right and carries a fixed-size memory, folding each new token into memory as it goes so the cost of processing each token stays flat however long the input gets. That memory is compressed and lossy, so a recurrent layer can\u2019t reach back for an exact earlier token the way attention can. 
But it is well suited to keeping a running account of anything that changes as the model reads tokens, providing a complementary strength to attention.<br \/>\nTo isolate the areas of strength and weakness for attention and recurrent layers, we fed Olmo 3 and Olmo Hybrid passages of text: articles, Wikipedia entries, books, and scientific papers, as well as structured text like Python, HTML, and LaTeX. We scored each model on how well it predicted each token from the tokens before it in a given sample.<br \/>\nBoth models saw the same earlier tokens and assigned a probability to every possible next token. We recorded the probability each gave to the token that actually followed. We then summarize the difference between the two models token by token by computing the loss gap, or the difference in loss between the two models. A positive gap means the hybrid predicted the real next token better. A negative gap means the transformer did.<br \/>\nTo find where the loss gaps might concentrate, we ran several analyses. First, we sorted each token into a category and averaged the loss gap within these categories. 
Because a raw average can be skewed by other factors, such as a category\u2019s rarity or how often tokens repeat in a sample of text, we re-checked each pattern with a regression that estimates the category\u2019s own effect while holding other factors constant.<\/p>\n<p>\t\tWhat real text shows<\/p>\n<p>We find that Olmo Hybrid has lower loss than Olmo 3 on most kinds of tokens, though not by the same amount on each.<br \/>\nIn prose, the clearest divide is between content words \u2014 meaning-bearing nouns, verbs, and adjectives \u2014 and function words like \u201cthe,\u201d \u201cof,\u201d and \u201cis.\u201d The hybrid predicts content words better than the transformer, with a loss gap around 0.04, whereas the gap is closer to 0.02 on function words.<br \/>\nIn particular, on content-word categories like adverbs and adjectives, the advantage of hybrid models is especially pronounced, though some function-word categories like existentials, such as \u201cthere,\u201d also show a large advantage for hybrid models. In short, the hybrid\u2019s edge is biggest on the words that say what a sentence is about and smallest on the grammatical words any model can nearly guess from syntax.<br \/>\nIn contrast, we find some specific contexts where the advantage of hybrid models over transformers disappears. The first is closing, but not opening, braces, a pattern that is robust across brackets in language, code, and markup. Why? It\u2019s known that attention suffices for representing bracket matching, which suggests attention alone suffices for closing brace prediction.<\/p>\n<p>The second place where the hybrid\u2019s advantage all but disappears is when the next token simply repeats something already in the passage. We spot these cases by looking for repeated n-grams: runs of text where the token that completes a sequence has appeared, verbatim, earlier in the same passage. 
The longer the repeated run, the smaller the hybrid\u2019s lead, until it approaches zero.<br \/>\nFinally, inspired by these findings, we explore using filtered losses on specific types of tokens as an evaluation to better compare different architectures in pretraining experiments. We use three 1B-parameter models from our earlier Olmo Hybrid work: a transformer, a hybrid, and a pure recurrent model with no attention at all.<br \/>\nOn meaning-bearing tokens that aren\u2019t repeats, the hybrid and pure recurrent model overtake the transformer, with the hybrid performing the best. On repeated tokens, the pure recurrent model \u2014 with no attention to reach back for the copy \u2014 falls behind both the hybrid and the transformer.<br \/>\nThus, these filtered token losses reveal fine-grained differences between architectures, including copying ability and performance on content words, early in training, in a way that would not otherwise be visible.<\/p>\n<p>\t\tWhere this leaves us<\/p>\n<p>Filtered token losses surface architecture differences during 1B pretraining. Token-loss curves at WSD-annealed checkpoints for a transformer, a hybrid, and a pure recurrent neural network, or RNN.<br \/>\nTwo lessons follow from this work.<br \/>\nFirst, a single overall loss \u2014 the model\u2019s average error across all tokens \u2014 is too blunt to compare transformer and hybrid architectures. Scoring the loss on just the tokens that test a specific model ability surfaces key differences.<br \/>\nSecond, specifically for hybrid models, we found evidence of particular advantages on open-class tokens, which may be related to the state-tracking capabilities of RNN layers.<br \/>\nAs a next step, we\u2019re taking these findings into our ongoing hybrid modeling work. We believe the best hybrid architectures will come from understanding, token by token, what each component of a model does well. 
We hope studies like this help that understanding grow across the whole AI community.<br \/>\nWe encourage you to read our full report, explore Olmo 3, try Olmo Hybrid, and dig into their associated open artifacts.<br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/huggingface.co\/blog\/allenai\/hybrid-token-prediction\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>\ud83d\udcc4 Tech report: https:\/\/arxiv.org\/abs\/2606.20936 Which kinds of tokens does a model predict well, and which does it not? That question is especially intriguing in the case of hybrids, a language model architecture that\u2019s begun to challenge the standard transformer and that we\u2019ve been investigating with Olmo Hybrid. Hybrids can match or beat transformers on standard [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6117,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[],"class_list":["post-6116","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6116","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6116"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6116\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/6117"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%
2Fwp%2Fv2%2Fmedia&parent=6116"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6116"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6116"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}