With over 45,000 models available on Hugging Face in GGUF, a binary file format for storing large language models (LLMs), Ollama has become the go-to tool for running LLMs on local hardware. This guide walks you through setting up Ollama, selecting models, and integrating them into Python applications, even if you're new to AI deployment.
Prerequisites
- Python 3.8+ installed
- Hugging Face CLI tools
- Ollama v0.1.44+
- 8GB+ RAM (16GB recommended)
- Basic terminal familiarity
Step 1: Install Ollama (Linux/macOS)
Download and set up the Ollama framework:
# For Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
Step 1: Install Ollama (Windows)
- Open a web browser in Windows.
- Go to Ollama's official website.
- Download the Windows installer and run it.
Verify installation:
ollama --version
Step 2: Install Hugging Face CLI
pip install huggingface_hub
Authenticate with your Hugging Face account:
huggingface-cli login
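If you prefer a non-interactive login (for example in scripts or CI), the CLI also accepts the token directly; this sketch assumes your Hugging Face access token is exported as HF_TOKEN:
huggingface-cli login --token $HF_TOKEN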
Step 3: Select a GGUF Model
Popular options for beginners:
- TinyLlama-1.1B (Good for CPU)
- Mistral-7B (Balanced performance)
- Llama-3.1-8B (High quality, needs more RAM)
Browse GGUF models on the Hugging Face Hub by filtering the model list at https://huggingface.co/models for the GGUF format. To see which models are already in your local Hugging Face cache, run:
huggingface-cli scan-cache
Step 4: Run Your First Model
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
Test with a query:
>>> What's quantum computing?
Quantum computing uses quantum-mechanical phenomena...
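To confirm the model downloaded and see everything Ollama has stored locally:
ollama list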
Step 5: Customize with Modelfile
Create a file named custom-model.Modelfile:
FROM ./mistral-7b-q4_k_m.gguf
PARAMETER temperature 0.7
SYSTEM "You're a helpful coding assistant"
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}{{ end }}
<|user|>
{{ .Prompt }}
<|assistant|>"""
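The FROM line above expects a GGUF file in the current directory. If you don't have one yet, you can pull it from the Hub first; the repository and filename here are illustrative, so substitute the model you actually want and adjust the FROM path to match:
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .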
Build and run:
ollama create my-mistral -f custom-model.Modelfile
ollama run my-mistral
Step 6: Python Integration
Basic API interaction:
import requests

# Send a single, non-streaming generation request to the local Ollama server
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-mistral",
        "prompt": "Explain Python decorators",
        "stream": False
    }
)
print(response.json()["response"])
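If you'd rather not hand-roll HTTP calls, there is also an official ollama Python package (pip install ollama) that wraps the same local API; a minimal sketch, assuming the my-mistral model built in Step 5 is available:
import ollama  # pip install ollama

# Non-streaming call against the local Ollama server
result = ollama.generate(model="my-mistral", prompt="Explain Python decorators")
print(result["response"])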
Streaming response handler:
import json
import requests

def stream_response(prompt):
    # Ollama streams newline-delimited JSON objects, one per generated chunk
    with requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "my-mistral",
            "prompt": prompt,
            "stream": True
        },
        stream=True
    ) as r:
        for line in r.iter_lines():
            if line:
                chunk = json.loads(line)
                if not chunk.get("done"):
                    yield chunk["response"]
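A quick way to exercise the generator and print tokens as they arrive:
for chunk in stream_response("Explain Python decorators"):
    print(chunk, end="", flush=True)
print()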
Troubleshooting Common Issues
| Issue | Solution |
|---|---|
| Model not found | Check the quantization tag in the model name (e.g., :Q4_K_M) |
| CUDA out of memory | Use a smaller quantization (e.g., Q2_K) |
| Slow responses | Offload more layers to the GPU via the num_gpu option (see below) |
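As a sketch of that last fix: num_gpu is the Ollama option controlling how many model layers are offloaded to the GPU, and it can be set per request through the API's options field (35 is only an example value; tune it to your VRAM):
import requests

# Ask Ollama to offload 35 layers to the GPU for this request
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-mistral",
        "prompt": "Explain Python decorators",
        "stream": False,
        "options": {"num_gpu": 35}  # example value; tune to your hardware
    }
)
print(response.json()["response"])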
Next Steps
- Experiment with different quantization levels
- Implement a RAG architecture (a minimal sketch follows this list)
- Explore multi-model pipelines
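For the RAG item, here is a minimal sketch of the idea using Ollama's local embeddings endpoint together with the my-mistral model from Step 5. The embedding model (nomic-embed-text, pulled separately with ollama pull nomic-embed-text) and the tiny in-memory document list are illustrative assumptions, not part of the setup above:
import requests

OLLAMA = "http://localhost:11434"

def embed(text, model="nomic-embed-text"):
    # Get an embedding vector from the local Ollama server
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": model, "prompt": text})
    return r.json()["embedding"]

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Tiny in-memory "document store" for illustration only
docs = [
    "Ollama exposes a local REST API on port 11434.",
    "GGUF is a binary format for storing quantized LLM weights.",
    "Q4_K_M is a 4-bit quantization that balances size and quality.",
]
index = [(doc, embed(doc)) for doc in docs]

def answer(question, k=2):
    # Retrieve the k most relevant documents and ground the prompt in them
    q_vec = embed(question)
    top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:k]
    context = "\n".join(doc for doc, _ in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "my-mistral", "prompt": prompt, "stream": False})
    return r.json()["response"]

print(answer("What port does Ollama listen on?"))
This keeps everything local: both retrieval (embeddings) and generation run against the same Ollama server.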
With this setup, you're ready to explore 45,000+ models while maintaining complete data privacy. The Ollama-Hugging Face integration brings enterprise-grade AI capabilities to local machines.