{"id":4100,"date":"2026-05-20T03:16:11","date_gmt":"2026-05-19T20:16:11","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=4100"},"modified":"2026-05-20T03:16:11","modified_gmt":"2026-05-19T20:16:11","slug":"how-i-built-a-6-node-12-gpu-on-prem-ai-cluster-running-1000-agents","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=4100","title":{"rendered":"How I built a 6-node 12-GPU on-prem AI cluster running 1000+ agents"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<p>TL;DR \u2014 6 machines, 12 GPUs, 1,000+ concurrent agents, P95 18 ms, voice <\/p>\n<p>  Why I built this<\/p>\n<p>I&#8217;m Franck. Toulouse, France. Over 3 years I paid roughly \u20ac280,000 to Azure + OpenAI before doing the math properly:<\/p>\n<p>Latency: 1.2s voice round-trip \u2014 incompatible with the voice-first UX I wanted.<\/p>\n<p>Compliance: customer data on US servers. Not GDPR-native, just GDPR-compliant-on-paper.<\/p>\n<p>Quotas: random throttling at the worst times.<\/p>\n<p>Lock-in: Azure outage = my product offline.<\/p>\n<p>I decided to rebuild everything on-prem. This is the result.<\/p>\n<p>  The cluster<\/p>\n<p>6 machines, 3 tiers, 12 GPUs total, <\/p>\n<p>  Tier 1 \u2014 GPU compute (heavy inference)<\/p>\n<p>M1 &#8220;La Cr\u00e9atrice&#8221; \u2014 Ryzen 5700X3D, 6\u00d7 RTX 3080+, 46 GB RAM. Primary LLM node, runs qwen3.5-9b, qwen3.5-35b-a3b, deepseek-r1, the Claude 4.5\/4.6 distillations, and the Whisper CUDA pipeline.<\/p>\n<p>M2 &#8220;Le Forge&#8221; \u2014 multi-GPU NVIDIA, secondary inference, failover from M1 in 1.3s.<\/p>\n<p>  Tier 2 \u2014 CPU\/RAM (orchestration, memory)<\/p>\n<p>M3 &#8220;Le Cerveau&#8221; \u2014 high-RAM CPU node. PostgreSQL + Redis + Pinecone. 
Runs the orchestrator, the 3-quorum consensus engine (M1+M2+M3), and the analytics\/monitoring agents.<\/p>\n<p>  Tier 3 \u2014 production \/ work<\/p>\n<p>M4 &#8220;Bridge Windows&#8221; \u2014 Windows 11, 2 GPUs, trading bot live.<\/p>\n<p>M5 &#8220;Interface Relay&#8221; \u2014 Linux i5-6500, 15 GB RAM. Dev interface, 15+ MCP servers, Claude Code.<\/p>\n<p>M6 &#8220;Mobile Ops&#8221; \u2014 laptop. SSH + VPN. Client demos and on-site ops.<\/p>\n<p>  The 9 layers I added on top of Ubuntu<\/p>\n<p>L9 \u2014 Vocal \/ conversational (Whisper CUDA STT, Piper TTS, wake word, 50+ languages)<br \/>\nL8 \u2014 Multi-agent orchestration (MCP-native, consensus engine)<br \/>\nL7 \u2014 Trading consensus engine (multi-model voting GPT\/Gemini\/Claude)<br \/>\nL6 \u2014 Browser + web automation (Chrome DevTools Protocol)<br \/>\nL5 \u2014 MCP tool registry (88+ handlers)<br \/>\nL4 \u2014 GPU cluster management (Docker Swarm, failover)<br \/>\nL3 \u2014 Domino pipeline engine (835 chains)<br \/>\nL2 \u2014 systemd service layer (98 units)<br \/>\nL1 \u2014 Linux boot integration (GRUB hooks, ZRAM, kernel params)<\/p>\n<p>  Real numbers<\/p>\n<table>\n<thead>\n<tr><th>Metric<\/th><th>Value<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Concurrent agents<\/td><td>1,000+<\/td><\/tr>\n<tr><td>P95 latency (cluster internal)<\/td><td>18 ms<\/td><\/tr>\n<tr><td>Voice pipeline end-to-end<\/td><td><\/td><\/tr>\n<tr><td>Aggregate throughput<\/td><td>67 tok\/s<\/td><\/tr>\n<tr><td>Python lines<\/td><td>280,741<\/td><\/tr>\n<tr><td>Public repos<\/td><td>44 (all MIT)<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>  Cost comparison (1M tokens\/day, team of 10)<\/p>\n<table>\n<thead>\n<tr><th>Provider<\/th><th>\u20ac\/month<\/th><th>P95<\/th><th>Concurrent agents<\/th><th>Data residency<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Azure OpenAI<\/td><td>1,500<\/td><td>800ms-3s<\/td><td>~20<\/td><td>US<\/td><\/tr>\n<tr><td>AWS Bedrock<\/td><td>1,800<\/td><td>700ms-2.5s<\/td><td>~15<\/td><td>US<\/td><\/tr>\n<tr><td>Mistral Cloud<\/td><td>800<\/td><td>400-800ms<\/td><td>~30<\/td><td>EU<\/td><\/tr>\n<tr><td>JARVIS OS<\/td><td>0<\/td><td>18 ms<\/td><td>1,000+<\/td><td>Air-gapped<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>For a \u20ac50K turn-key deployment, break-even vs Azure is 7 months, and the marginal cost is 
zero after that.<\/p>\n<p>  What I sell now<\/p>\n<p>JARVIS OS turn-key \u2014 \u20ac20K to \u20ac250K depending on scope.<\/p>\n<p>62 PDF trainings \u2014 from \u20ac39, 293h of content based on production code (+48 private).<\/p>\n<p>AI infra audit \u2014 \u20ac1,500, report in 48h.<\/p>\n<p>1-to-1 mentorship \u2014 \u20ac250\/h.<\/p>\n<p>Fractional CTO \u2014 daily rate (TJM) \u20ac1,000-1,150 \/ permanent contract (CDI) \u20ac85-95K. Toulouse \/ remote.<\/p>\n<p>  Honest weaknesses<\/p>\n<p>Consensus voting is empirical. No formal verification of the agreement function.<\/p>\n<p>Tier-2 failure (M3 down) is the weakest scenario \u2014 orchestrator dies, cluster keeps inferring but loses persistent memory.<\/p>\n<p>MCP bet \u2014 if Anthropic deprecates parts of MCP, I have 88 handlers to refactor.<\/p>\n<p>kWh-per-token efficiency \u2014 cloud probably wins on aggregate energy per token, on-prem wins on marginal cost.<\/p>\n<p>  Links<\/p>\n<p><a href=\"https:\/\/dev.to\/turbo31150\/how-i-built-a-6-node-12-gpu-on-prem-ai-cluster-running-1000-agents-3203\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>TL;DR \u2014 6 machines, 12 GPUs, 1,000+ concurrent agents, P95 18 ms, voice Why I built this I&#8217;m Franck. Toulouse, France. Over 3 years I paid roughly \u20ac280,000 to Azure + OpenAI before doing the math properly: Latency: 1.2s voice round-trip \u2014 incompatible with the voice-first UX I wanted. Compliance: customer data on US servers. 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":4101,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[835,761,765,762,763,764,989,1523,937,760],"class_list":["post-4100","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai","tag-ai","tag-coding","tag-community","tag-development","tag-engineering","tag-inclusive","tag-infrastructure","tag-llm","tag-opensource","tag-software"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/4100","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4100"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/4100\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/4101"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4100"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4100"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4100"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}