{"id":6518,"date":"2026-07-04T05:22:03","date_gmt":"2026-07-03T22:22:03","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=6518"},"modified":"2026-07-04T05:22:03","modified_gmt":"2026-07-03T22:22:03","slug":"mlx-serve-run-any-llm-on-your-mac-%c2%b7-mlx-gguf-%c2%b7-faster-than-lm-studio-%c2%b7-openai-anthropic-api","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=6518","title":{"rendered":"mlx-serve \u2014 Run any LLM on your Mac \u00b7 MLX + GGUF \u00b7 faster than LM Studio \u00b7 OpenAI + Anthropic API"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<p>        Is mlx-serve faster than LM Studio?<br \/>\n        Yes \u2014 every cell, every model we&#8217;ve benchmarked. On identical 4-bit MLX weights mlx-serve wins by +39% geomean across 18 workloads (Gemma 4 E2B\/E4B\/31B\/26B-A4B-MoE and Qwen 3.6 27B\/35B-A3B-MoE). On the same .gguf file as LM Studio (gemma-4-E4B-it-Q4_K_M.gguf), mlx-serve&#8217;s embedded llama.cpp wrapper still wins +12-15% on decode and +5% on prefill. Speculative decoding pushes the lead further on echo-heavy and code-completion workloads \u2014 up to 2.65\u00d7 on Gemma 4 E4B echo.<\/p>\n<p>        Does mlx-serve replace LM Studio?<br \/>\n        For most use cases, yes. mlx-serve runs the same MLX and GGUF models, exposes an OpenAI-compatible API on the same kind of port, and ships a native menu-bar app instead of an Electron one. 
It also adds things LM Studio doesn&#8217;t have: a real Anthropic Messages API (works with Claude Code), the OpenAI Responses API + WebSockets, MCP tool calling, agent mode with 10 built-in tools, KV-cache quantization, continuous batching, and the antirez\/ds4 engine for DeepSeek V4 Flash.<\/p>\n<p>        Does mlx-serve replace Ollama on Mac?<br \/>\n        On Apple Silicon, yes \u2014 mlx-serve speaks the Ollama API natively (\/api\/chat, \/api\/generate, \/api\/tags, \/api\/embed, \/api\/pull\u2026), so Raycast, Obsidian, Enchanted, Open WebUI, and ollama-python\/js work unchanged: drop in http:\/\/localhost:11234 wherever you had http:\/\/localhost:11434. The CLI matches too \u2014 mlx-serve run gemma4 downloads, serves, and chats in one command. Underneath, it runs llama.cpp and native MLX with the Mac-specific optimizations Ollama doesn&#8217;t ship \u2014 Metal kernels through mlx-c, speculative decoding, and a shared-prefix KV cache.<\/p>\n<p>        Can I run GGUF models on Mac without Python?<br \/>\n        Yes. mlx-serve embeds llama.cpp&#8217;s inference library (libllama) inside the same signed, notarized binary. Point --model at any .gguf file and the server auto-detects the format and routes to the right engine \u2014 no pip, no venv, no llama-server to install separately. DeepSeek V4 Flash GGUFs go through the dedicated antirez\/ds4 engine instead, also embedded.<\/p>\n<p>        Does mlx-serve work with Claude Code?<br \/>\n        Yes \u2014 natively. mlx-serve implements Anthropic&#8217;s \/v1\/messages endpoint including streaming, tool calling, and extended thinking. Point Claude Code at it with ANTHROPIC_BASE_URL=http:\/\/localhost:11234. The MLX Core app ships a one-click Launch Claude Code button that wires up the env vars for you.<\/p>\n<p>        What about the OpenAI SDK, Continue, Cursor, Open WebUI?<br \/>\n        All work \u2014 anything that talks the OpenAI chat-completions or Anthropic Messages wire protocol does. 
mlx-serve also implements the newer OpenAI Responses API (\/v1\/responses) for clients that want stateful chains via previous_response_id, plus a WebSocket transport on the same endpoint.<\/p>\n<p>        Can mlx-serve run DeepSeek V4 Flash locally?<br \/>\n        Yes, on 96 GB+ Apple Silicon Macs. Open the MLX Core Model Browser, pick DeepSeek-V4-Flash, hit Download \u2014 the server routes the GGUF through the embedded ds4 engine (native Metal kernels, byte-validated against the reference forward). Agent mode and MCP tools work on DSV4 too.<\/p>\n<p>        What models are supported?<br \/>\n        Native MLX dispatch for Gemma 3\/4, Qwen 3 \/ 3.5 \/ 3.6 \/ 3-Next, Llama 3.x, Mistral, Nemotron-H, LFM2.5, and DeepSeek V4 Flash. Anything else as GGUF via embedded llama.cpp \u2014 Qwen, Llama, Mistral, Gemma, DeepSeek, Phi, Yi, and thousands more from HuggingFace. On the media side: FLUX.2 and Krea-2-Turbo for images, LTX-Video 2.3 for video, and Qwen3-TTS for speech and voice cloning \u2014 all running natively on-device.<\/p>\n<p>        Does it support tools \/ function calling?<br \/>\n        Yes, on both API surfaces. The server detects tool-call patterns across architectures (Hermes XML, Gemma 4 , raw JSON, ChatML), repairs common Qwen 3.5\/3.6 escape quirks, and emits OpenAI-style tool_calls deltas in the SSE stream. The MLX Core app ships 10 built-in tools (shell, file I\/O, search, browse, web search, memory) and connects to MCP servers from a curated marketplace. Malformed tool-call JSON from small models is repaired at the API layer.<\/p>\n<p>        How does it stay this small \/ fast?<br \/>\n        Zig with direct mlx-c FFI \u2014 no Python runtime, no Electron, no IPC bridge. The release binary is ~4.5 MB. Eager warmup at boot page-faults weights and pre-compiles decode kernels (first request 3.5\u00d7 faster). 
Multi-turn agent loops reuse KV across turns and skip re-prefilling system prompts via a shared-prefix cache that survives interleaved subagent traffic; a Claude Code-sized prompt tokenizes in 8 ms, so a warm agent turn round-trips in ~0.1 s end to end.<\/p>\n<p>        Is the inference exact, or does quantized output drift?<br \/>\n        For greedy decoding (temp=0), mlx-serve is byte-identical to the reference for the first ~30-80 generated tokens, with long-tail divergence inherent to INT4 float-reduction order. For temp > 0, the Leviathan probability-ratio sampler keeps speculative decoding mathematically exact in distribution. Equivalence is pinned by automated tests on every release.<\/p>\n<p>        Can mlx-serve generate images, video, and audio locally?<br \/>\n        Yes \u2014 all on-device, no Python. Image: Krea-2-Turbo (a 12.9B photorealistic model) and FLUX.2 run natively on MLX, validated pixel-faithful to the reference \u2014 and it edits photos from a plain instruction (&#8220;make the hair blue&#8221;) while keeping subject and scene, does image-to-image variations with a strength slider, and takes runtime style LoRAs. Video: LTX-Video 2.3 turns a prompt, a photo, or a soundtrack into a clip with synced audio \u2014 put spoken lines in quotes and characters talk, lips synced to the voice. Audio: Qwen3-TTS does zero-shot voice cloning from a few seconds of reference audio \u2014 no transcript needed. Chat and every media type share one local server and one memory budget: a model loads on demand and unloads when done, so a chat model and a media model can coexist.<\/p>\n<p>        Where does my data go?<br \/>\n        Nowhere. Everything runs locally on your Mac \u2014 no analytics, no telemetry, no cloud calls. The HTTP server binds to 127.0.0.1 by default. Open source under MIT.<\/p>\n<p>        How do I install it?<br \/>\n        The easiest way is the MLX Core app from GitHub Releases (signed and notarized DMG). 
Or via Homebrew: brew tap ddalcu\/mlx-serve https:\/\/github.com\/ddalcu\/mlx-serve &#038;&#038; brew install --cask mlx-core. CLI server alone: brew install mlx-serve.<\/p>\n<p>      Have another question? Open an issue \u00b7 \u2605 Star the repo if mlx-serve saved you from spinning up another Electron app.<\/p>\n<p><br \/>\n<br \/><a href=\"https:\/\/mlxserve.com\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Is mlx-serve faster than LM Studio? Yes \u2014 every cell, every model we&#8217;ve benchmarked. On identical 4-bit MLX weights mlx-serve wins by +39% geomean across 18 workloads (Gemma 4 E2B\/E4B\/31B\/26B-A4B-MoE and Qwen 3.6 27B\/35B-A3B-MoE). On the same .gguf file as LM Studio (gemma-4-E4B-it-Q4_K_M.gguf), mlx-serve&#8217;s embedded llama.cpp wrapper still wins +12-15% on decode and +5% on [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6519,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[2325,2323,2315,2306,2316,2317,2318,2313,2324,2312,2308,2322,2307,2320,2303,2304,2305,2321,2309,2310,2311,2314,2319],"class_list":["post-6518","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai","tag-ai-agent-sandbox","tag-ai-photo-editing-local","tag-anthropic-api-local","tag-apple-silicon-llm","tag-claude-code-local","tag-deepseek-v4-flash","tag-gemma-4","tag-gguf-apple-silicon","tag-image-to-video-mac","tag-llama-cpp-mac","tag-lm-studio-alternative","tag-local-image-generation-mac","tag-local-llm-mac","tag-m1-m2-m3-m4-llm","tag-mlx","tag-mlx-server","tag-mlx-serve","tag-native-mlx-inference","tag-ollama-alternative","tag-ollama-api-compatible","tag-ollama-drop-in-replacement","tag-openai-compatible-local","tag-qwen-3-6"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6518","targetHints":{"allow":["GET"]}}],"collection":[{"href"
:"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6518"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6518\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/6519"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6518"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6518"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6518"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}