{"id":6129,"date":"2026-06-26T09:48:28","date_gmt":"2026-06-26T02:48:28","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=6129"},"modified":"2026-06-26T09:48:28","modified_gmt":"2026-06-26T02:48:28","slug":"run-a-vllm-server-on-hf-jobs-in-one-command","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=6129","title":{"rendered":"Run a vLLM Server on HF Jobs in One Command"},"content":{"rendered":"<p>You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command: no servers to provision, no Kubernetes, pay-per-second. Once it&#8217;s up, you can query it from your laptop, a notebook, or anywhere else.<br \/>\nIt&#8217;s the quickest way to stand up a model for tests, evals, or batch generation. (If you&#8217;re after a managed, production-ready service instead, that&#8217;s what Inference Endpoints are for; more on when to pick which at the end.)<br \/>\nHere&#8217;s the whole thing end to end.<\/p>\n<p>\t\tPrerequisites<\/p>\n<p>A payment method or a positive prepaid credit balance (Jobs is billed per-minute by hardware usage).<br \/>\nhuggingface_hub &gt;= 1.20.0: pip install -U &quot;huggingface_hub&gt;=1.20.0&quot;.<br \/>\nLogged in locally: hf auth login.<\/p>\n<p>\t\tLaunch the server<\/p>\n<p>hf jobs run is docker run for HF infrastructure. We use the official vllm\/vllm-openai image, ask for a GPU with --flavor, and expose vLLM&#8217;s port with --expose:<br \/>\nhf jobs run --flavor a10g-large --expose 8000 --timeout 2h \\<br \/>\n  vllm\/vllm-openai:latest \\<br \/>\n  vllm serve Qwen\/Qwen3-4B --host 0.0.0.0 --port 8000<\/p>\n<p>--expose 8000 routes the container&#8217;s port through HF&#8217;s public jobs proxy (see the Serve Models guide for the full reference). 
The command prints the URL your server is reachable at:<br \/>\n\u2713 Job started<br \/>\n  id: 6a381ca1953ed90bfb947332<br \/>\n  url: https:\/\/huggingface.co\/jobs\/qgallouedec\/6a381ca1953ed90bfb947332<br \/>\nHint: Exposed ports are reachable at (requires an HF token with read access to the job):<br \/>\n  https:\/\/6a381ca1953ed90bfb947332--8000.hf.jobs<\/p>\n<p>6a381ca1953ed90bfb947332 is your job ID. Keep track of it; we&#8217;ll need it. We&#8217;ll use &lt;JOB_ID&gt; as a placeholder for it in the rest of the post.<br \/>\nGive it a couple of minutes to download weights and boot. When the logs show Application startup complete, you&#8217;re live.<\/p>\n<p>\t\tQuery it from anywhere<\/p>\n<p>vLLM speaks the OpenAI API, and every request just needs your HF token as a bearer token. The quickest way to hit it is curl:<br \/>\ncurl https:\/\/&lt;JOB_ID&gt;--8000.hf.jobs\/v1\/chat\/completions \\<br \/>\n  -H &quot;Authorization: Bearer $(hf auth token)&quot; \\<br \/>\n  -H &quot;Content-Type: application\/json&quot; \\<br \/>\n  -d '{<br \/>\n    &quot;model&quot;: &quot;Qwen\/Qwen3-4B&quot;,<br \/>\n    &quot;messages&quot;: [{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Hello!&quot;}],<br \/>\n    &quot;chat_template_kwargs&quot;: {&quot;enable_thinking&quot;: false}<br \/>\n  }'<\/p>\n<p>This returns the usual OpenAI-style JSON, with choices[0].message.content holding &#8220;Hello! How can I assist you today? 
\ud83d\ude0a&#8221;.<br \/>\nOr, from Python, point the OpenAI client at the exposed URL and pass the token as the API key:<br \/>\nfrom huggingface_hub import get_token<br \/>\nfrom openai import OpenAI<\/p>\n<p>client = OpenAI(<br \/>\n    base_url=&quot;https:\/\/&lt;JOB_ID&gt;--8000.hf.jobs\/v1&quot;,<br \/>\n    api_key=get_token(),<br \/>\n)<br \/>\nresp = client.chat.completions.create(<br \/>\n    model=&quot;Qwen\/Qwen3-4B&quot;,<br \/>\n    messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;Hello!&quot;}],<br \/>\n    extra_body={&quot;chat_template_kwargs&quot;: {&quot;enable_thinking&quot;: False}},<br \/>\n)<br \/>\nprint(resp.choices[0].message.content)<\/p>\n<p>Hello! How can I assist you today? \ud83d\ude0a<\/p>\n<p>Quick health check before you start: curl https:\/\/&lt;JOB_ID&gt;--8000.hf.jobs\/v1\/models -H &quot;Authorization: Bearer $(hf auth token)&quot; should list the model.<\/p>\n<p>\ud83d\udd10 The endpoint is gated, not public. Every request must carry an HF token with read access to the job&#8217;s namespace. A plain browser visit will be rejected. In effect, the jobs proxy is your API gate: access is scoped to you (and your org). That&#8217;s fine for private use, but treat the URL accordingly: don&#8217;t share it expecting it to be open, and don&#8217;t paste your token into untrusted places. If you need finer-grained or public access, put a proper gateway in front instead, or see the HF Jobs or Inference Endpoints? section below.<\/p>\n<p>\t\tClean up<\/p>\n<p>Jobs are billed per second, so stop the server when you&#8217;re done:<br \/>\nhf jobs cancel &lt;JOB_ID&gt;<\/p>\n<p>The --timeout you set is a safety net (it&#8217;ll auto-stop), but cancelling explicitly is cheaper. 
An a10g-large runs at $1.50\/hour; check hf jobs hardware for the full price list and pick the smallest flavor that fits your model.<\/p>\n<p>\t\tGoing further: bigger models<\/p>\n<p>The same command scales to much larger models: pick a beefier --flavor and tell vLLM to shard the model across the GPUs with --tensor-parallel-size. For example, the 122B Qwen3.5 mixture-of-experts model on 2\u00d7 H200:<br \/>\nhf jobs run --flavor h200x2 --expose 8000 --timeout 2h \\<br \/>\n  vllm\/vllm-openai:latest \\<br \/>\n  vllm serve Qwen\/Qwen3.5-122B-A10B \\<br \/>\n  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \\<br \/>\n  --max-model-len 32768 --max-num-seqs 256<\/p>\n<p>--tensor-parallel-size should match the number of GPUs in the flavor (h200x2 \u2192 2, h200x8 \u2192 8). Run hf jobs hardware to see what&#8217;s available, and give bigger models a longer --timeout, since they take longer to download and load. For large models, H200 flavors are usually the best value.<br \/>\nThe --max-model-len 32768 --max-num-seqs 256 flags are specific to this model: Qwen3.5-122B is a hybrid Mamba\/attention architecture with a 256K-token default context, which doesn&#8217;t leave enough memory for vLLM&#8217;s default batch settings. Capping the context length and concurrent-sequence count keeps it within the GPUs&#8217; memory. If a model fails to start with an out-of-memory or cache-block error, dialing these two down is the first thing to try. Everything else (the exposed URL, the OpenAI client, the token auth) stays exactly the same.<\/p>\n<p>\t\tGoing further: Chat with it in a UI<\/p>\n<p>Prefer a chat window over curl? A few lines of Gradio point at the same endpoint. 
Add --reasoning-parser deepseek_r1 to the vllm serve command so Qwen3&#8217;s thinking comes back as a separate field (not necessary, but helpful), then run this code locally (you&#8217;ll just need the job ID):<br \/>\nimport gradio as gr<br \/>\nfrom gradio import ChatMessage<br \/>\nfrom huggingface_hub import get_token<br \/>\nfrom openai import OpenAI<\/p>\n<p>client = OpenAI(base_url=&quot;https:\/\/&lt;JOB_ID&gt;--8000.hf.jobs\/v1&quot;, api_key=get_token())<\/p>\n<p>def chat(message, history):<br \/>\n    messages = [{&quot;role&quot;: m[&quot;role&quot;], &quot;content&quot;: m[&quot;content&quot;]} for m in history if not m.get(&quot;metadata&quot;)]<br \/>\n    messages.append({&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: message})<br \/>\n    stream = client.chat.completions.create(model=&quot;Qwen\/Qwen3-4B&quot;, messages=messages, stream=True)<\/p>\n<p>    thinking, answer = &quot;&quot;, &quot;&quot;<br \/>\n    for chunk in stream:<br \/>\n        delta = chunk.choices[0].delta<br \/>\n        thinking += delta.model_extra.get(&quot;reasoning&quot;, &quot;&quot;)<br \/>\n        answer += delta.content or &quot;&quot;<br \/>\n        out = []<br \/>\n        if thinking.strip():<br \/>\n            status = &quot;done&quot; if answer.strip() else &quot;pending&quot;<br \/>\n            out.append(ChatMessage(role=&quot;assistant&quot;, content=thinking, metadata={&quot;title&quot;: &quot;\ud83d\udcad Thinking&quot;, &quot;status&quot;: status}))<br \/>\n        if answer.strip():<br \/>\n            out.append(ChatMessage(role=&quot;assistant&quot;, content=answer))<br \/>\n        yield out<\/p>\n<p>gr.ChatInterface(chat).launch()<\/p>\n<p>Run it, open http:\/\/127.0.0.1:7860, and chat: reasoning streams into the collapsible panel, with the answer below.<\/p>\n<p>\t\tGoing further: SSH into the running server<\/p>\n<p>Need to debug a startup failure, watch GPU memory, or tail logs interactively? 
You can open a shell straight into the running job. Launch it with --ssh and make sure your public key is registered at huggingface.co\/settings\/keys:<br \/>\nhf jobs run --flavor a10g-large --expose 8000 --timeout 2h --ssh \\<br \/>\n  vllm\/vllm-openai:latest \\<br \/>\n  vllm serve Qwen\/Qwen3-4B --host 0.0.0.0 --port 8000<\/p>\n<p>Then connect with the job ID:<br \/>\nhf jobs ssh &lt;JOB_ID&gt;<\/p>\n<p>You&#8217;re now inside the container, where you can run nvidia-smi, inspect the process, or poke at the model directly, which makes debugging and monitoring much easier than reading logs from the outside. SSH support requires huggingface_hub &gt;= 1.20.0.<\/p>\n<p>\t\tGoing further: Use it as a coding-agent backend with Pi<\/p>\n<p>The same endpoint can back a terminal coding agent. Pi is a provider-agnostic agent harness. Point it at the job and you get a Read\/Write\/Edit\/Bash agent running on your own self-hosted model.<br \/>\nOne thing to set up first: agents drive the model through tool calls, and vLLM only accepts those if the server is launched with tool calling enabled. So relaunch with --enable-auto-tool-choice and a --tool-call-parser matching the model family (hermes for Qwen3). 
Agents also benefit from a stronger model, so this is a good place to bring in the bigger one:<br \/>\nhf jobs run --flavor h200x2 --expose 8000 --timeout 2h \\<br \/>\n  vllm\/vllm-openai:latest \\<br \/>\n  vllm serve Qwen\/Qwen3.5-122B-A10B \\<br \/>\n  --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \\<br \/>\n  --max-model-len 32768 --max-num-seqs 256 \\<br \/>\n  --reasoning-parser deepseek_r1 \\<br \/>\n  --enable-auto-tool-choice --tool-call-parser hermes<\/p>\n<p>Then add the job as a custom provider in ~\/.pi\/agent\/models.json:<br \/>\n{<br \/>\n  &quot;providers&quot;: {<br \/>\n    &quot;hf-jobs&quot;: {<br \/>\n      &quot;baseUrl&quot;: &quot;https:\/\/&lt;JOB_ID&gt;--8000.hf.jobs\/v1&quot;,<br \/>\n      &quot;api&quot;: &quot;openai-completions&quot;,<br \/>\n      &quot;apiKey&quot;: &quot;!hf auth token&quot;,<br \/>\n      &quot;models&quot;: [<br \/>\n        { &quot;id&quot;: &quot;Qwen\/Qwen3.5-122B-A10B&quot; }<br \/>\n      ]<br \/>\n    }<br \/>\n  }<br \/>\n}<\/p>\n<p>Then launch the agent against it:<br \/>\npi<\/p>\n<p>The model you spun up a couple of commands ago is now driving an interactive coding agent in your terminal.<\/p>\n<p>\t\tHF Jobs or Inference Endpoints?<\/p>\n<p>HF Jobs isn&#8217;t the only way to serve a model on Hugging Face. Inference Endpoints are our managed product for the same job, and which one fits depends on what you&#8217;re after.<br \/>\nReach for HF Jobs when you want maximum flexibility and control: it&#8217;s just docker run on HF infrastructure, so you pick the image, the exact vllm serve flags, and the hardware, and you pay per second for as long as the job runs. That makes it a great fit for experiments, one-off evals, batch generation, or kicking the tires on a model before committing to anything.<br \/>\nReach for Inference Endpoints when you want something more production-ready. 
They add the operational niceties a long-lived service needs: finer-grained access control (an endpoint can be public, protected, or private), and scale-to-zero, so you&#8217;re not billed during periods of inactivity. If you&#8217;re standing up a durable endpoint rather than running a job, that&#8217;s the tool to grab.<\/p>\n<p>\t\tFurther reading<\/p>\n<p>This post sticks to vLLM, but the same expose-a-port pattern works with any OpenAI-compatible server. To serve GGUFs with llama.cpp or run SGLang instead, see the Serve Models on Jobs guide, which walks through those backends.<br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/huggingface.co\/blog\/vllm-jobs\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command \u2014 no servers to provision, no Kubernetes, pay-per-second. Once it&#8217;s up, you can query it from your laptop, a notebook, or anywhere else. It&#8217;s the quickest way to stand up a model for tests, evals, or batch generation. 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5331,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[],"class_list":["post-6129","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6129","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6129"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6129\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/5331"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6129"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6129"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6129"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}