{"id":5567,"date":"2026-06-16T04:59:14","date_gmt":"2026-06-15T21:59:14","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=5567"},"modified":"2026-06-16T04:59:14","modified_gmt":"2026-06-15T21:59:14","slug":"running-local-llms-with-ollama-for-private-development","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=5567","title":{"rendered":"Running Local LLMs With Ollama For Private Development"},"content":{"rendered":"<p> <br \/>\n<br \/>\n                Here&#8217;s a thing that catches almost everyone the first week they run a model locally. You paste a 600-line file into your shiny new local assistant, ask it to find the bug, and it confidently rewrites a function that isn&#8217;t even in the part it read. No error. No warning. It just&#8230; silently dropped most of your file on the floor before the model ever saw it.<\/p>\n<p>That&#8217;s not the model being dumb. That&#8217;s Ollama doing exactly what it was told. By default it gives every model a context window of 2048 tokens and quietly truncates anything past that. It&#8217;s one of a handful of small surprises that separate &#8220;I installed Ollama&#8221; from &#8220;I actually understand what&#8217;s running on my machine.&#8221; Let&#8217;s go through the ones that matter: how the thing works under the hood, what hardware you really need, the gotchas, and the honest answer to &#8220;should I even bother instead of just calling an API?&#8221;<\/p>\n<p>  What Ollama actually is<\/p>\n<p>Ollama gets described as &#8220;Docker for LLMs,&#8221; and that&#8217;s a decent first approximation. You pull a model, you run it, there&#8217;s a registry. But it hides what&#8217;s doing the heavy lifting. Underneath, Ollama is a friendly wrapper around llama.cpp, the C\/C++ inference engine that made running these models on consumer hardware practical in the first place. 
When you type ollama run, you&#8217;re really booting a llama.cpp runtime with a sane default config and a tidy HTTP server bolted on.<\/p>\n<p>The models it runs are in a format called GGUF (GPT-Generated Unified Format). A GGUF file isn&#8217;t just weights. It&#8217;s a self-contained package that bundles the tensors, the tokenizer config, the architecture details, and hyperparameters like the trained context length, all in one file. That&#8217;s why ollama pull llama3.1 gives you something that just works: everything the runtime needs to reconstruct the model is in the box.<\/p>\n<p>Ollama itself is young. The project shipped its first release in early July 2023, and it rode the wave of open-weight models (Llama 2 landed that same month) that suddenly made &#8220;run a real LLM on your laptop&#8221; a thing normal developers could do. Before that, local inference meant compiling things and reading a lot of GitHub issues. Ollama&#8217;s whole pitch is removing that friction.<\/p>\n<p>  The hardware math nobody explains up front<\/p>\n<p>The number that decides whether a model runs well on your machine isn&#8217;t its parameter count. It&#8217;s how much memory the weights occupy after quantization. This is the single most important concept for running models locally, so it&#8217;s worth slowing down for.<\/p>\n<p>A model&#8217;s weights are originally stored in 16-bit floating point. Quantization squeezes them down to a lower precision, commonly 4-bit integers, which shrinks the file and, just as importantly, eases the memory-bandwidth pressure that bottlenecks inference. The format you&#8217;ll see by default in Ollama is Q4_K_M, part of llama.cpp&#8217;s &#8220;K-quant&#8221; family. The trade is genuinely good: Q4_K_M cuts memory use by roughly 75% versus the 16-bit original while losing well under 1% of quality on most benchmarks. 
That&#8217;s not a free lunch exactly, but it&#8217;s close enough that most people never run anything else.<\/p>\n<p>Here&#8217;s the rule of thumb that actually helps you size hardware: budget about 0.6 GB per billion parameters at Q4_K_M, then add headroom for context. So:<\/p>\n<table>\n<tr><th>Model size<\/th><th>Q4_K_M footprint<\/th><th>Fits comfortably on<\/th><\/tr>\n<tr><td>7B<\/td><td>~4-6 GB<\/td><td>8 GB GPU, or any M-series Mac<\/td><\/tr>\n<tr><td>13B<\/td><td>~8-10 GB<\/td><td>12 GB GPU<\/td><\/tr>\n<tr><td>32B<\/td><td>~22-24 GB<\/td><td>RTX 4090 (24 GB)<\/td><\/tr>\n<tr><td>70B<\/td><td>~38-48 GB<\/td><td>2x 24 GB GPUs, or a 64 GB Mac<\/td><\/tr>\n<\/table>\n<p>The memory you want this to live in is VRAM, your GPU&#8217;s memory, because that&#8217;s where inference is fast. If the model doesn&#8217;t fit in VRAM, Ollama will happily run it on the CPU using system RAM instead, and it&#8217;ll work, just slowly. On Apple Silicon the line blurs in a nice way: unified memory means the GPU and CPU share one pool, so a 64 GB Mac can run models that would need multiple discrete GPUs on a PC.<\/p>\n<p>What does this buy you in speed? Be realistic about it. On CPU-only inference you&#8217;re looking at roughly 10-25 tokens per second, usable for short answers, painful for long ones. Put the same model fully on a decent GPU and you jump to 40-80+ tokens\/sec; an RTX 4090 can hit 130-160 tokens\/sec, which is in the same league as a cloud API. The hardware is the whole game here. A local model on the wrong hardware isn&#8217;t a cheaper API, it&#8217;s a worse one.<\/p>\n<p>  The silent context-window trap<\/p>\n<p>Back to the gotcha from the opener, because it&#8217;s the one that wastes the most hours. Ollama defaults num_ctx, the context window, to 2048 tokens for every model, regardless of what that model was actually trained to handle. Llama 3.1 supports 128k tokens of context; out of the box, Ollama gives it 2048.<\/p>\n<p>This default is deliberate, not a bug. 
It lets Ollama boot any model instantly on any hardware, including an 8 GB laptop, without forcing you to calculate your memory budget first. The problem is what happens when you exceed it: Ollama silently clips the input. No error, no warning. The tokens past your limit simply never reach the model. If you&#8217;ve ever fed a local model a big file and watched it &#8220;forget&#8221; the beginning, this is almost always why.<\/p>\n<p>You fix it in one of two places. For a one-off, pass num_ctx in the request options:<\/p>\n<p>Per-request override<\/p>\n<p>curl http:\/\/localhost:11434\/api\/generate -d &#39;{<br \/>\n  &quot;model&quot;: &quot;llama3.1&quot;,<br \/>\n  &quot;prompt&quot;: &quot;Summarize this file&#8230;&quot;,<br \/>\n  &quot;options&quot;: { &quot;num_ctx&quot;: 16384 }<br \/>\n}&#39;<\/p>\n<p>For a permanent per-model default, bake it into a Modelfile and create your own variant:<\/p>\n<p>Modelfile<\/p>\n<p>FROM llama3.1<br \/>\nPARAMETER num_ctx 16384<\/p>\n<p>Build it once<\/p>\n<p>ollama create llama3.1-16k -f Modelfile<br \/>\nollama run llama3.1-16k<\/p>\n<p>But there&#8217;s a cost, and it&#8217;s not optional: the context window lives in the KV cache, and that grows linearly with num_ctx. Bumping a 7B model to a 32k window can add around 6 GB of VRAM on top of the weights. So context length isn&#8217;t a free dial you crank to maximum. It competes directly with the model for the same memory. Pick the smallest window that fits your actual workload.<\/p>\n<p>Warning: The 2048 default plus silent truncation is the single most common reason people conclude &#8220;local models are dumb.&#8221; They&#8217;re usually not. They&#8217;re just being shown a fraction of the input. 
Check your num_ctx before you blame the model.<\/p>\n<p>  Wiring it into your editor<\/p>\n<p>The reason most developers reach for this in the first place is a private coding assistant: autocomplete and chat that never sends a line of your code anywhere. Ollama exposes a local HTTP API on port 11434, and editor extensions like Continue talk to it directly. Your code goes from your editor, to a process on your own machine, and back. Nothing leaves the machine.<\/p>\n<p>The wiring is small. Point your Continue config at the local model:<\/p>\n<p>Continue config (shape may vary by version)<\/p>\n<p>{<br \/>\n  &quot;models&quot;: [<br \/>\n    {<br \/>\n      &quot;title&quot;: &quot;Llama 3.1 8B (local)&quot;,<br \/>\n      &quot;provider&quot;: &quot;ollama&quot;,<br \/>\n      &quot;model&quot;: &quot;llama3.1:8b&quot;<br \/>\n    }<br \/>\n  ]<br \/>\n}<\/p>\n<p>That&#8217;s the whole privacy story, and it&#8217;s a real one: with the model pulled, you can pull the ethernet cable out and it keeps working. Ollama doesn&#8217;t phone home during normal inference: no telemetry upload, no cloud sync, no prompts shipped to a third party. The model files sit on your disk until you delete them, and only the initial ollama pull needs the internet. For anyone working under HIPAA, PCI-DSS, or GDPR data-residency rules, that&#8217;s not a nice-to-have. It&#8217;s frequently the only arrangement that&#8217;s even allowed, because no amount of vendor paperwork beats the data physically never leaving your machine.<\/p>\n<p>  The memory-management gotcha<\/p>\n<p>One more behavior worth knowing before it confuses you. After you finish a request, Ollama keeps the model loaded in VRAM for 5 minutes by default, so your next prompt answers instantly instead of paying the load cost again. 
Handy, until you&#8217;re trying to run a second large model and discover the first one is still squatting on your GPU memory.<\/p>\n<p>You control this with keep_alive. Set it to 0 to unload the moment a response finishes, or to something like &#8220;24h&#8221; to pin a model in memory all day:<\/p>\n<p>Unload immediately after responding<\/p>\n<p>curl http:\/\/localhost:11434\/api\/generate -d &#39;{<br \/>\n  &quot;model&quot;: &quot;llama3.1&quot;,<br \/>\n  &quot;prompt&quot;: &quot;quick question&quot;,<br \/>\n  &quot;keep_alive&quot;: 0<br \/>\n}&#39;<\/p>\n<p>You can check what&#8217;s currently resident with ollama ps and evict a model by hand with ollama stop. If you&#8217;re juggling several models on a memory-tight machine, managing keep_alive is the difference between smooth switching and constant out-of-memory errors.<\/p>\n<p>  When local actually beats an API<\/p>\n<p>Now the honest part, because the answer isn&#8217;t &#8220;always.&#8221; Running locally is a real engineering trade, and plenty of the time the cloud is just the better call.<\/p>\n<p>Cost is the trap people get wrong in both directions. The rough crossover: under about 1M tokens a day, a cloud API is usually cheaper once you account for the hardware you&#8217;d have to buy and run. Past roughly 5M tokens a day, owning the hardware starts paying for itself. At the low end of that range, a $1,600 GPU sitting mostly idle is a worse deal than per-token pricing. Buying a 4090 to occasionally autocomplete is a hobby, not a saving.<\/p>\n<p>Latency can favor local, especially for short, frequent calls where the network round-trip dominates. But only if your hardware keeps up. Remember the numbers: a top GPU matches cloud throughput, while CPU-only inference is 4-10x slower. Local isn&#8217;t automatically faster. It&#8217;s faster when the GPU is there.<\/p>\n<p>Capability still favors the cloud at the top end. 
The biggest frontier models you reach through an API are stronger than anything you&#8217;ll fit on a single machine. For routine work (autocomplete, summarizing, boilerplate, straightforward refactors) a good local 8B or 32B model is more than enough. For genuinely hard reasoning, the gap is still real.<\/p>\n<p>Privacy and compliance is where local stops being a preference and becomes a requirement. If your data legally can&#8217;t leave a boundary (patient records, payment data, regulated EU data) then keeping inference on hardware you control isn&#8217;t a tradeoff, it&#8217;s the entire point. No enterprise agreement substitutes for the data simply never being transmitted.<\/p>\n<p>The pattern a lot of teams land on isn&#8217;t all-or-nothing. It&#8217;s a blend: local models for the private, high-volume, latency-sensitive, offline work, and a cloud API for the occasional heavy request that needs the strongest model available. You don&#8217;t have to pick a side. You have to know which job each tool is actually good at.<\/p>\n<p>So start small. Pull an 8B model, point your editor at it, write some real code through it for a week, and watch your token meter not move. Then decide what&#8217;s worth keeping local, now that you know what&#8217;s actually running on your machine, and why.<\/p>\n<p>Originally published at nazarboyko.com.<\/p>\n<p><br \/>\n<br \/><a href=\"https:\/\/dev.to\/nazar_boyko\/running-local-llms-with-ollama-for-private-development-4924\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#8217;s a thing that catches almost everyone the first week they run a model locally. You paste a 600-line file into your shiny new local assistant, ask it to find the bug, and it confidently rewrites a function that isn&#8217;t even in the part it read. No error. No warning. 
It just&#8230; silently dropped most [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":5568,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[835,761,765,762,1088,763,764,1977,1976,760],"class_list":["post-5567","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai","tag-ai","tag-coding","tag-community","tag-development","tag-devops","tag-engineering","tag-inclusive","tag-local","tag-ollama","tag-software"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/5567","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=5567"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/5567\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/5568"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=5567"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=5567"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=5567"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}