Cloud-based models like ChatGPT and Claude are trained on massive datasets using computing infrastructure most organizations could never run locally. Local models are smaller by necessity: they have to fit and run on hardware you own. That size difference has real consequences for output quality, but the gap is narrowing fast and is often overstated for the kinds of tasks teams need to perform.

What "smaller" actually means

  • Models are measured in parameters, the numerical weights learned during training that determine how a model processes input. Larger parameter counts generally mean better reasoning and more nuanced outputs.
  • Cloud models run in the hundreds of billions of parameters. A capable local model runs 7 to 70 billion. For creative synthesis or complex reasoning across unfamiliar domains, that gap shows. For writing R documentation, explaining legacy code, or drafting report sections from structured data, local models perform well.
  • Models also have a context window: a limit on how much text they can hold in working memory at once. Smaller local models have shorter context windows, meaning they can lose track of earlier content in long documents. Chunking (breaking documents into smaller, overlapping segments) is the primary mitigation.
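
As a sketch of what chunking involves, the helper below splits text into fixed-size segments with a configurable overlap, so content near a boundary appears in two adjacent chunks instead of being cut off. The sizes are illustrative; real pipelines often chunk by tokens or paragraphs rather than raw characters.

```python
def chunk_text(text, chunk_size=500, overlap=100):
    """Split text into overlapping character chunks.

    Each chunk repeats the last `overlap` characters of the previous
    one, so a sentence straddling a boundary is never lost entirely.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk is then small enough to fit comfortably inside a local model's context window, and the overlap preserves continuity between segments.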

Hardware required

  • Small models (7 to 14 billion parameters): Run on a standard laptop or mini PC with 16GB of RAM. No GPU required, though one helps with speed. Entry-level setup.
  • Medium models (30 to 40 billion parameters): Require 32GB or more RAM. A workstation or a well-specced mini PC. Noticeably more capable on complex tasks.
  • Large models (70 billion parameters): Require 48GB or more of unified memory or GPU VRAM. A Mac Mini M4 Pro with 48GB of unified memory (around $2,000) or a workstation with a high-end GPU reaches this threshold. Expect roughly 15 to 20 tokens per second: slower than cloud tools but functional for document querying and drafting.
  • Models are free to download. The only cost is hardware and electricity.
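
The RAM figures above follow from simple arithmetic: a model's weight memory is roughly its parameter count times the bytes stored per parameter. The sketch below shows that estimate; it deliberately ignores activation and context-cache overhead, which push real requirements somewhat higher.

```python
def model_memory_gb(params_billion, bits_per_param):
    """Rough weight-only memory estimate in GB:
    parameters x (bits per parameter / 8)."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 7B model stored in 16-bit weights needs about 14 GB just for
# weights, which is why quantization (e.g. 4-bit, ~3.5 GB) is what
# lets such models fit on a 16GB laptop.
```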

How to make local models better

  • RAG (Retrieval-Augmented Generation): Rather than loading entire documents into the model, RAG retrieves only the most relevant chunks before generating a response. This extends effective context, reduces errors, and makes outputs traceable to source documents. It is the most practical improvement for document-heavy workflows.
  • Prompt engineering: Local models respond well to explicit, structured prompts. Providing context, specifying output format, and giving examples improves output quality significantly without changing the model itself.
  • Model selection: Quantized models (compressed versions that trade minor accuracy for dramatically reduced hardware requirements) have closed the gap with older, larger models. Choosing the right model for the task matters more than raw parameter count.
  • Fine-tuning: A model can be trained further on domain-specific data (water quality reports, tribal resolutions, regulatory documents) to improve performance on that specific content. This requires technical expertise and is not a starting point, but it is achievable with moderate infrastructure.
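
To make the RAG idea concrete, here is a minimal toy sketch. It ranks chunks by simple word overlap with the query (a stand-in for the embedding-based similarity search a real RAG pipeline would use) and assembles the top matches into a prompt. All function names are illustrative, not from any particular library.

```python
def retrieve(query, chunks, top_k=2):
    """Rank chunks by how many query words they share.
    Real pipelines use embedding similarity instead of word overlap."""
    query_words = set(query.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )[:top_k]

def build_prompt(query, chunks):
    """Insert only the most relevant chunks into the model's prompt."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Because the model only ever sees the retrieved chunks, the effective document size is unlimited, and an answer can be traced back to the specific passages that produced it.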
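
The prompt-engineering advice above can be captured in a small template helper. The structure shown here (role, context, task, output format, optional examples) is one common pattern, not a fixed standard:

```python
def structured_prompt(role, task, context, output_format, examples=None):
    """Assemble an explicit, structured prompt from labeled parts."""
    parts = [
        f"You are {role}.",
        f"Context:\n{context}",
        f"Task: {task}",
        f"Output format: {output_format}",
    ]
    if examples:
        parts.append("Examples:\n" + "\n".join(examples))
    return "\n\n".join(parts)
```

For example, `structured_prompt("a technical writer", "Summarize the attached report", report_text, "three bullet points")` yields a far more reliable result from a small local model than pasting the report with "summarize this".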

Open-source tools for getting started

  • Ollama is the most accessible way to download and run open-source models locally. Free, runs on Mac, Windows, and Linux. One command installs a model; another runs it. Can be configured to serve an entire local network, not just a single machine.
  • Open WebUI is a browser-based chat interface that connects to Ollama. Gives non-technical users a familiar chat experience without touching a command line. Runs entirely on your own hardware.
  • LM Studio is a graphical interface for downloading and running local models. A good starting point for teams without command-line experience.
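
As an illustration of how a script might talk to a locally running Ollama server, the sketch below posts a prompt to Ollama's generate endpoint on its default port, 11434. It assumes Ollama is installed, running, and has a model pulled (here "llama3", used as an example name):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_request(model, prompt):
    """Build the JSON request Ollama's generate endpoint expects."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )

def ask(model, prompt):
    """Send the prompt to the local Ollama server and return its answer."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Because the endpoint is plain HTTP on the local network, the same call works from any machine Ollama is configured to serve, which is how one mini PC can back a whole team.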

Sources

  • Local LLM deployment privacy guide, evaluating Ollama, LM Studio, vLLM, and llama.cpp: Digital Applied, 2025
  • Full list of freely downloadable open-source models with size and hardware requirements: Ollama Model Library