{"id":5591,"date":"2026-06-16T15:17:13","date_gmt":"2026-06-16T08:17:13","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=5591"},"modified":"2026-06-16T15:17:13","modified_gmt":"2026-06-16T08:17:13","slug":"bassimeledath-kitchen-rush-kitchen-rush-a-benchmark-for-accurate-and-fast-native-tool-calling-%c2%b7-github","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=5591","title":{"rendered":"bassimeledath\/kitchen-rush: Kitchen Rush: a benchmark for accurate AND fast native tool calling \u00b7 GitHub"},"content":{"rendered":"<p> <br \/>\n<br \/>\nAn agent tool-calling benchmark where latency matters as much as intelligence.<\/p>\n<p>Most tool-calling benchmarks (BFCL, \u03c4-bench, ToolSandbox, AppWorld) check whether a model<br \/>\nmakes the right calls \u2014 and the world politely waits while it thinks. That&#8217;s fine for offline<br \/>\ntasks. But if you&#8217;re building a voice assistant, a live-ops agent, or anything realtime, you<br \/>\ncare about two things at once: does the model do the right thing, and does it do it fast<br \/>\nenough? A model that finds the perfect answer after thirty seconds of reasoning is, for you,<br \/>\nthe wrong model.<br \/>\nKitchen Rush measures both at once, by construction: the time a model spends thinking is<br \/>\nconverted into game time that passes before its actions land. While the model deliberates,<br \/>\nfood keeps cooking, food burns, and order deadlines slip away. Speed and accuracy aren&#8217;t two<br \/>\ncharts you squint at \u2014 they&#8217;re one score, experienced the way a deployment would experience<br \/>\nthem.<\/p>\n<p>The model plays a chef in an Overcooked-style<br \/>\nkitchen. Orders stream in (burgers, soups, ramen\u2026), and the model fulfils them with ordinary<br \/>\nnative function calls \u2014 collect, chop, cook, plate, serve \u2014 racing deadlines,<br \/>\nburn timers, and a combo bonus for consecutive successful dishes. 
Three deliberate changes from<br \/>\nOvercooked:<\/p>\n<p>Latency is the game. Every model response first charges its thinking time to the shared<br \/>\nworld clock, then its actions execute. (You can chain several calls in one response and pay<br \/>\nthe latency once \u2014 decisiveness is rewarded.)<br \/>\nNo joystick skills. The chef walks itself to the right station automatically; travel<br \/>\ntime is charged inside the action. What&#8217;s being tested is choosing the right action<br \/>\nsequence under time pressure, not video-game reflexes.<br \/>\nFully deterministic. Same seed, same actions, same latencies \u2192 exactly the same episode,<br \/>\nevery time, on any machine. Every run can be replayed in a browser viewer and audited.<\/p>\n<p>Every episode produces a single 0\u2013100 score we call KR (the Kitchen Rush score). It&#8217;s<br \/>\ngraded on a curve between two fixed anchors: KR 0 means &#8220;no better than doing nothing and<br \/>\nletting every order expire,&#8221; and KR 100 means &#8220;matched a scripted reference chef that plays<br \/>\nthe same kitchen with zero latency.&#8221;<br \/>\nA worked example makes it concrete. Say that on one kitchen the do-nothing chef finishes at<br \/>\n\u221260 points (every order expired), the zero-latency reference chef finishes at +140,<br \/>\nand your model finishes at +40. There are 200 points between the two anchors and your<br \/>\nmodel covered 100 of them, so its KR is 50 \u2014 it closed half the gap to the reference.<br \/>\nAverage that over many seeded kitchens and you have the leaderboard number<br \/>\n(docs\/METHODOLOGY.md has the full formula).<\/p>\n<p>Here&#8217;s the knob that makes Kitchen Rush flexible: every kitchen is generated at a latency<br \/>\nbudget B (--latency-budget, in seconds per decision). 
Think of B as the pace the<br \/>\nkitchen is priced for: order deadlines are set so that a chef spending exactly B seconds on<br \/>\neach decision can finish every order, with roughly 1.4\u20131.6\u00d7 headroom to spare. Each B gets its<br \/>\nown leaderboard \u2014 results at different budgets are never averaged together.<br \/>\nFor the mathematically inclined, the pricing is exact:<br \/>\ndeadline = arrival + \u2308\u03c3 \u00b7 C(B)\u2309,   where C(B) = A + K\u00b7B<\/p>\n<p>A is the order&#8217;s intrinsic cooking\/walking time, K is how many decisions a competent plan<br \/>\nneeds, and \u03c3 is the headroom (1.4\u20131.6 by tier). So a model that actually decides in \u2113 seconds<br \/>\ngains or loses K\u00b7(B \u2212 \u2113) seconds of breathing room per order. Faster than B? You bank slack<br \/>\nand serve while orders are still worth full value. Slower? You eat through the headroom, and<br \/>\norders start becoming unfinishable at around \u2113 \u2248 B + (\u03c3\u22121)\u00b7C(B)\/K \u2014 about 3\u20134 s\/decision at<br \/>\nB=1 on the current tiers, which is exactly where our calibration sweep shows the reference<br \/>\nchef collapsing (docs\/METHODOLOGY.md \u00a72,<br \/>\ndocs\/CALIBRATION.md).<br \/>\nAnd in plain deployment terms: the model that wins at B=1s is the best pick when every<br \/>\ndecision has to land in about a second \u2014 on the benchmark&#8217;s reproducible clock that&#8217;s a<br \/>\nbudget of roughly 65 output tokens per decision, i.e. terse, single-shot tool dispatch \u2014 what a<br \/>\nvoice agent needs. B=5s buys about 730 tokens per decision \u2014 enough for a short burst of<br \/>\nreasoning, what an interactive assistant can afford. The same model can rank very differently on the<br \/>\ntwo boards, and that reordering is precisely what the benchmark is for.<\/p>\n<p>17 model configurations \u00d7 12 seeds \u00d7 {medium, hard} kitchens \u00d7 two latency budgets \u2014 816<br \/>\nepisodes so far. 
Each chart is one latency budget; bars are mean KR, whiskers are 95%<br \/>\nconfidence intervals. The full per-tier table (with costs, reasoning tokens, and serve rates)<br \/>\nis at leaderboard\/results\/board.md.<\/p>\n<p>The left board (B=1s) is the realtime test: the kitchen is priced for one second per<br \/>\ndecision, which on the benchmark&#8217;s clock buys about 65 output tokens \u2014 terse, single-shot tool<br \/>\ndispatch. Winning here means &#8220;the model I&#8217;d trust to drive a voice agent or a live dashboard.&#8221;<br \/>\nThe right board (B=5s) prices the same kitchens for five seconds per decision (~730<br \/>\ntokens \u2014 room for a short burst of reasoning), what an interactive assistant can afford.<br \/>\nRead them side by side \u2014 that contrast is the product. Under tight realtime pressure (B=1s)<br \/>\nthe fast no-reasoning models hold the podium: gemini-3.1-flash-lite runs nearly even with<br \/>\nclaude-sonnet-4.6 (32 vs 37). Give every decision five seconds instead and the board<br \/>\nreorders: gpt-5.4-mini with low reasoning rockets from near-zero to a dead heat with<br \/>\nsonnet (44 vs 44) at about a fifth of the cost, while flash-lite drops to half its B=1<br \/>\nstanding. The same mini with reasoning fully off scores 0.0 at both budgets \u2014 reasoning it<br \/>\ncan&#8217;t afford at B=1 is exactly what makes it a frontier-level tool caller at B=5. That&#8217;s the<br \/>\nlatency tax, made visible. (\u00b7think rows ran with reasoning on at low effort; everything<br \/>\nelse with reasoning off \u2014 fast single-shot dispatch is the honest realtime default. 
One row<br \/>\nyou might expect is missing: there is no claude-sonnet-4.6\u00b7think, because Anthropic&#8217;s API<br \/>\ndoes not allow extended thinking when tool calls are forced, and the harness forces tool<br \/>\ncalls \u2014 sonnet competes thinking-off only.)<\/p>\n<p>The flip, watched live: the same two models from the clip at the top,<br \/>\nbut in a kitchen priced at B=5s. Now the mini&#8217;s reasoning burst is affordable \u2014 it finishes<br \/>\nevery order at 99 raw points (KR 86) while sonnet is still cooking at 40. This is the<br \/>\nmini&#8217;s best kitchen \u2014 the chart above shows the average, a 44\u201344 tie across all 24 \u2014 but the<br \/>\ndirection is real: it wins the medium tier at B=5 outright (59 vs 52). Same models, different<br \/>\nlatency budget, different winner: that&#8217;s exactly what the two boards measure.<\/p>\n<p>Two minutes \u2014 run the scripted reference chef locally (no model calls):<br \/>\npip install -e .                          # the core has zero dependencies<br \/>\nkitchenrush bench --baseline random --tier easy --seeds 12 --trials 2<br \/>\nkitchenrush calibrate --tier easy --latency-budget 1   # see how the reference chef degrades with latency<\/p>\n<p># watch a game in the browser (scripted chef):<br \/>\nkitchenrush replay --oracle --tier easy --seed 0       # writes ui\/replays\/easy_seed0.json<br \/>\ncd ui &#038;&#038; python3 -m http.server 8000                   # then open http:\/\/localhost:8000<br \/>\n# &#8230;or race up to 4 models side-by-side on one clock: ?replays=a.json,b.json (see ui\/README.md)<br \/>\nTo benchmark a real model, add provider support and your API key:<br \/>\npip install -e '.[providers]'<br \/>\nkitchenrush bench --model anthropic:claude-sonnet-4-6 --tier medium --latency-budget 1<br \/>\nAny LiteLLM-routable model works via provider:model. 
You can also plug in a fully custom<br \/>\nclient \u2014 it only needs a name and a generate(system, messages, tools) -> ModelResponse<br \/>\nmethod, registered with register_adapter. CLI commands: run, bench, replay, seeds,<br \/>\ncalibrate.<\/p>\n<p>If you use Kitchen Rush in your work, please cite it (machine-readable copy in<br \/>\nCITATION.cff):<br \/>\n@software{kitchenrush2026,<br \/>\n  author = {Eledath, Bassim},<br \/>\n  title  = {Kitchen Rush: A Benchmark for Accurate and Fast Tool Calling},<br \/>\n  url    = {https:\/\/github.com\/bassimeledath\/kitchen-rush},<br \/>\n  year   = {2026}<br \/>\n}<\/p>\n<p>Apache-2.0. See LICENSE.<br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/github.com\/bassimeledath\/kitchen-rush\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>An agent tool-calling benchmark where latency matters as much as intelligence. Most tool-calling benchmarks (BFCL, \u03c4-bench, ToolSandbox, AppWorld) check whether a model makes the right calls \u2014 and the world politely waits while it thinks. That&#8217;s fine for offline tasks. 
But if you&#8217;re building a voice assistant, a live-ops agent, or anything realtime, you care [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5592,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[],"class_list":["post-5591","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/5591","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5591"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/5591\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/5592"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5591"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5591"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5591"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}