Run Hugging Face Models Locally with Ollama: Step-by-Step Python Guide

With over 45,000 models available on Hugging Face in GGUF format (a binary file format for storing large language models), Ollama has become the go-to tool for running LLMs on local hardware. This guide walks you through setting up Ollama, selecting models, and integrating them into Python applications, even if you're new to AI deployment.

Prerequisites

  • Python 3.8+ installed
  • Hugging Face CLI tools
  • Ollama v0.1.44+
  • 8GB+ RAM (16GB recommended)
  • Basic terminal familiarity

Step 1: Install Ollama (Linux/macOS)

Download and set up the Ollama framework:

# For Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
      

Step 1: Install Ollama (Windows)

  1. Open a web browser on Windows.
  2. Go to Ollama's official website (ollama.com).
  3. Download the Windows installer and run it.

Verify installation:

ollama --version
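
Ollama typically runs as a background service after installation. If the local API at http://localhost:11434 isn't reachable later (Step 6 depends on it), you can start the server manually:

ollama serve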

Step 2: Install Hugging Face CLI

pip install huggingface_hub

Authenticate with your Hugging Face account:

huggingface-cli login

Step 3: Select a GGUF Model

Popular options for beginners:

  • TinyLlama-1.1B (Good for CPU)
  • Mistral-7B (Balanced performance)
  • Llama-3.1-8B (High quality)

Browse GGUF-format models on the Hugging Face Hub (filter by the GGUF library tag). To list models you have already downloaded to your local cache, run:

huggingface-cli scan-cache
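
Step 5 below builds a custom model from a local GGUF file, so you may also want to download one directly with the Hugging Face CLI. This is a sketch; the repository and filename are examples, so check the model page for the exact quantization files and adjust the FROM path in Step 5 to match:

huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .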

Step 4: Run Your First Model

ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M

Test with a query:

>>> What's quantum computing?
Quantum computing uses quantum-mechanical phenomena...

Step 5: Customize with Modelfile

Create custom-model.Modelfile:

FROM ./mistral-7b-q4_k_m.gguf
PARAMETER temperature 0.7
SYSTEM "You're a helpful coding assistant"
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}{{ end }}
<|user|>
{{ .Prompt }}
<|assistant|>"""

Build and run:

ollama create my-mistral -f custom-model.Modelfile
ollama run my-mistral
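
To confirm the custom model was registered, list the models available locally; my-mistral should appear in the output:

ollama list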

Step 6: Python Integration

Basic API interaction:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-mistral",
        "prompt": "Explain Python decorators",
        "stream": False
    }
)
print(response.json()["response"])
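
If you prefer not to hand-roll HTTP calls, the official ollama Python client (installed with pip install ollama) wraps the same local API. A minimal sketch, assuming the my-mistral model built in Step 5:

import ollama  # pip install ollama

# Chat-style request against the local Ollama server
reply = ollama.chat(
    model="my-mistral",
    messages=[{"role": "user", "content": "Explain Python decorators"}]
)
print(reply["message"]["content"])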

Streaming response handler:

import json
import requests

def stream_response(prompt):
    # Ollama streams newline-delimited JSON objects, one per generated chunk
    with requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "my-mistral",
            "prompt": prompt,
            "stream": True
        },
        stream=True
    ) as r:
        for line in r.iter_lines():
            if line:
                chunk = json.loads(line)
                yield chunk.get("response", "")
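
Usage example (printing tokens as they arrive):

for chunk in stream_response("Explain Python decorators"):
    print(chunk, end="", flush=True)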

Troubleshooting Common Issues

  • Model not found: verify the repository name and quantization tag (e.g., :Q4_K_M)
  • CUDA out of memory: switch to a smaller quantization (e.g., Q2_K)
  • Slow responses: offload more layers to the GPU (e.g., PARAMETER num_gpu 35 in the Modelfile)
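
Runtime options can also be set per request instead of editing the Modelfile. A minimal sketch; the num_gpu and num_ctx values below are illustrative and should be tuned to your hardware:

import requests

# Per-request options override the model's defaults for this call only
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-mistral",
        "prompt": "Summarize GGUF quantization in one sentence",
        "stream": False,
        "options": {"num_gpu": 35, "num_ctx": 4096}  # GPU layers, context window
    }
)
print(response.json()["response"])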

Next Steps

  • Experiment with different quantization levels
  • Implement RAG architecture
  • Explore multi-model pipelines

With this setup, you're ready to explore 45,000+ models while maintaining complete data privacy. The Ollama-Hugging Face integration brings enterprise-grade AI capabilities to local machines.


Category: GenAI
