{"id":6459,"date":"2026-07-03T00:22:38","date_gmt":"2026-07-02T17:22:38","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=6459"},"modified":"2026-07-03T00:22:38","modified_gmt":"2026-07-02T17:22:38","slug":"hugging-face-and-cerebras-bring-gemma-4-to-real-time-voice-ai","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=6459","title":{"rendered":"Hugging Face and Cerebras bring Gemma 4 to real-time voice AI"},"content":{"rendered":"<p>For voice AI, latency is a critical parameter. Developers have made tremendous progress in model quality, but the user experience is still often limited by response times. Hugging Face and Cerebras are changing that experience. Today, we demonstrate what becomes possible when an open, modular voice AI architecture is paired with industry-leading inference speed.<br \/>\nThe result is a speech-to-speech experience that feels dramatically more natural. Instead of waiting for an AI to respond, conversations flow with the responsiveness users expect from human interaction.<\/p>\n<p>\t\tArchitecture: an Open, Cascaded Speech-to-Speech stack<\/p>\n<p>The demo is built as a real-time speech-to-speech pipeline. Each part of the system is modular, open, and replaceable, making it easy for developers to adapt the stack for different assistants, robots, products, or research projects.<br \/>\nThis creates a fully open speech-to-speech loop:<br \/>\nSpeech input<br \/>\n  -> speech recognition with NVIDIA&#8217;s Parakeet<br \/>\n  -> Gemma 4 VLM inference on Cerebras<br \/>\n  -> text-to-speech with Alibaba&#8217;s Qwen3TTS<br \/>\n  -> spoken response<\/p>\n<p>The architecture brings together the strengths of the open-source AI ecosystem: NVIDIA&#8217;s Parakeet for speech recognition, Cerebras for fast inference, Google DeepMind\u2019s Gemma 4 31B for the language model, and Qwen for text-to-speech. 
Every layer can be inspected, modified, and extended by developers.<\/p>\n<p>\t\tCerebras and Hugging Face Partnership<\/p>\n<p>Today, some production systems see a reasonable median latency while still experiencing frustrating multi-second delays at the P95. Those delays become even more noticeable when tool calls or multimodal steps require multiple turns.<br \/>\nCerebras helps solve one of the most important bottlenecks in the stack: the language-model response time. By making inference dramatically faster and more stable, Cerebras allows the rest of the Hugging Face pipeline to shine.<br \/>\nThat stability is especially important in the long tail. Many systems can deliver acceptable median response times, but occasional slow responses still make conversations feel unreliable.<\/p>\n<p>\t\tBuilt for real-world interaction<\/p>\n<p>This same Hugging Face speech-to-speech pipeline already powers Reachy Mini robots, with more than 9,000 robots in the wild. For robots, voice assistants, and embodied AI, responsiveness is not a cosmetic improvement. It is what makes the interaction feel alive.<br \/>\nThe motivation to use Cerebras is therefore not simply cost reduction. It is low latency, predictable performance, and the ability to create real-time experiences that feel natural at scale.<br \/>\nThis collaboration reflects a shared belief that the future of AI will be both open and performant. 
Open-source models, open infrastructure, and breakthrough inference speed together create a foundation for the next generation of conversational AI.<br \/>\nWe invite developers to explore the demo, experiment with the code, and help shape what comes next for real-time voice AI.<br \/>\nDemo: Hugging Face Space<br \/>\nRepository: huggingface\/speech-to-speech<br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/huggingface.co\/blog\/cerebras-gemma4-voice-ai\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>For voice AI, latency is a critical parameter. Developers have made tremendous progress in model quality, but the user experience is still often limited by response times. Hugging Face and Cerebras are changing that experience. Today, we demonstrate what becomes possible when an open, modular voice AI architecture is paired with industry-leading inference speed. The [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6460,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[],"class_list":["post-6459","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6459","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6459"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6459\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index
.php?rest_route=\/wp\/v2\/media\/6460"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6459"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6459"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6459"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}