M1 M2 M3 M4 LLM – DAILY NEWS

TECH & AI

mlx-serve — Run any LLM on your Mac · MLX + GGUF · faster than LM Studio · OpenAI + Anthropic API

jackminion Jul 4, 2026 0

Is mlx-serve faster than LM Studio?
Yes — every cell, every model we’ve benchmarked. On identical 4-bit MLX weights mlx-serve wins by +39% geomean across 18 workloads (Gemma 4 E2B/E4B/31B/26B-A4B-MoE and Qwen 3.6 27B/35B-A3B-MoE). On the same .gguf file as LM Studio (gemma-4-E4B-it-Q4_K_M.gguf), mlx-serve’s embedded llama.cpp wrapper still wins +12-15% on decode and +5% on prefill. Speculative decoding pushes the lead further on echo-heavy and code-completion workloads — up to 2.65× on Gemma 4 E4B echo.

Does mlx-serve replace LM Studio?
For most use cases, yes. mlx-serve runs the same MLX and GGUF models, exposes an OpenAI-compatible API on the same kind of port, and ships a native menu-bar app instead of an Electron one. It also adds things LM Studio doesn’t have: a real Anthropic Messages API (works with Claude Code), the OpenAI Responses API + WebSockets, MCP tool calling, agent mode with 10 built-in tools, KV-cache quantization, continuous batching, and the antirez/ds4 engine for DeepSeek V4 Flash.

Does mlx-serve replace Ollama on Mac?
On Apple Silicon, yes — mlx-serve speaks the Ollama API natively (/api/chat, /api/generate, /api/tags, /api/embed, /api/pull…), so Raycast, Obsidian, Enchanted, Open WebUI, and ollama-python/js work unchanged: drop in http://localhost:11234 wherever you had http://localhost:11434. The CLI matches too — mlx-serve run gemma4 downloads, serves, and chats in one command. Underneath, it runs llama.cpp and native MLX with the Mac-specific optimizations Ollama doesn’t ship — Metal kernels through mlx-c, speculative decoding, and a shared-prefix KV cache.

Can I run GGUF models on Mac without Python?
Yes. mlx-serve embeds llama.cpp’s inference library (libllama) inside the same signed, notarized binary. Point –model at any .gguf file and the server auto-detects the format and routes to the right engine — no pip, no venv, no llama-server to install separately. DeepSeek V4 Flash GGUFs go through the dedicated antirez/ds4 engine instead, also embedded.

Does mlx-serve work with Claude Code?
Yes — natively. mlx-serve implements Anthropic’s /v1/messages endpoint including streaming, tool calling, and extended thinking. Point Claude Code at it with ANTHROPIC_BASE_URL=http://localhost:11234. The MLX Core app ships a one-click Launch Claude Code button that wires up the env vars for you.

What about the OpenAI SDK, Continue, Cursor, Open WebUI?
All work — anything that talks the OpenAI chat-completions or Anthropic Messages wire protocol does. mlx-serve also implements the newer OpenAI Responses API (/v1/responses) for clients that want stateful chains via previous_response_id, plus a WebSocket transport on the same endpoint.

Can mlx-serve run DeepSeek V4 Flash locally?
Yes, on 96 GB+ Apple Silicon Macs. Open the MLX Core Model Browser, pick DeepSeek-V4-Flash, hit Download — the server routes the GGUF through the embedded ds4 engine (native Metal kernels, byte-validated against the reference forward). Agent mode and MCP tools work on DSV4 too.

What models are supported?
Native MLX dispatch for Gemma 3/4, Qwen 3 / 3.5 / 3.6 / 3-Next, Llama 3.x, Mistral, Nemotron-H, LFM2.5, and DeepSeek V4 Flash. Anything else as GGUF via embedded llama.cpp — Qwen, Llama, Mistral, Gemma, DeepSeek, Phi, Yi, and thousands more from HuggingFace. On the media side: FLUX.2 and Krea-2-Turbo for images, LTX-Video 2.3 for video, and Qwen3-TTS for speech and voice cloning — all running natively on-device.

Does it support tools / function calling?
Yes, on both API surfaces. The server detects tool-call patterns across architectures (Hermes XML, Gemma 4 , raw JSON, ChatML), repairs common Qwen 3.5/3.6 escape quirks, and emits OpenAI-style tool_calls deltas in the SSE stream. The MLX Core app ships 10 built-in tools (shell, file I/O, search, browse, web search, memory) and connects to MCP servers from a curated marketplace. Malformed tool-call JSON from small models is repaired at the API layer.

How does it stay this small / fast?
Zig with direct mlx-c FFI — no Python runtime, no Electron, no IPC bridge. The release binary is ~4.5 MB. Eager warmup at boot page-faults weights and pre-compiles decode kernels (first request 3.5× faster). Multi-turn agent loops reuse KV across turns and skip re-prefilling system prompts via a shared-prefix cache that survives interleaved subagent traffic; a Claude Code-sized prompt tokenizes in 8 ms, so a warm agent turn round-trips in ~0.1 s end to end.

Is the inference exact, or quantized output drift?
For greedy decoding (temp=0), mlx-serve is byte-identical to the reference for the first ~30-80 generated tokens, with long-tail divergence inherent to INT4 float-reduction order. For temp > 0, the Leviathan probability-ratio sampler keeps speculative decoding mathematically exact in distribution. Equivalence is pinned by automated tests on every release.

Can mlx-serve generate images, video, and audio locally?
Yes — all on-device, no Python. Image: Krea-2-Turbo (a 12.9B photorealistic model) and FLUX.2 run natively on MLX, validated pixel-faithful to the reference — and it edits photos from a plain instruction (“make the hair blue”) while keeping subject and scene, does image-to-image variations with a strength slider, and takes runtime style LoRAs. Video: LTX-Video 2.3 turns a prompt, a photo, or a soundtrack into a clip with synced audio — put spoken lines in quotes and characters talk, lips synced to the voice. Audio: Qwen3-TTS does zero-shot voice cloning from a few seconds of reference audio — no transcript needed. Chat and every media type share one local server and one memory budget: a model loads on demand and unloads when done, so a chat model and a media model can coexist.

Where does my data go?
Nowhere. Everything runs locally on your Mac — no analytics, no telemetry, no cloud calls. The HTTP server binds to 127.0.0.1 by default. Open source under MIT.

How do I install it?
The easiest way is the MLX Core app from GitHub Releases (signed and notarized DMG). Or via Homebrew: brew tap ddalcu/mlx-serve https://github.com/ddalcu/mlx-serve && brew install –cask mlx-core. CLI server alone: brew install mlx-serve.

Have another question? Open an issue · ★ Star the repo if mlx-serve saved you from spinning up another Electron app.

Source link