With over 45,000 models available on Hugging Face in GGUF, a binary file format for storing large language models (LLMs), Ollama has become the go-to tool for running LLMs on local hardware. This guide walks you through setting up Ollama, selecting models, and integrating them into Python applications, even if you're new to AI deployment.
Prerequisites
- Python 3.8+ installed
- Hugging Face CLI tools
- Ollama v0.1.44+
- 8GB+ RAM (16GB recommended)
- Basic terminal familiarity
Step 1: Install Ollama (Linux/macOS)
Download and set up the Ollama framework:
# For Linux/macOS
curl -fsSL https://ollama.com/install.sh | sh
Step 1: Install Ollama (Windows)
- Open a web browser in Windows.
- Go to Ollama's official website.
- Download the Windows installer and run it.
Verify installation:
ollama --version
Step 2: Install Hugging Face CLI
pip install huggingface_hub
Authenticate with your Hugging Face account:
huggingface-cli login
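If you prefer a non-interactive login (for example in scripts or CI), the CLI also accepts the token directly; this sketch assumes your Hugging Face access token is exported as HF_TOKEN:
huggingface-cli login --token $HF_TOKEN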
Step 3: Select a GGUF Model
Popular options for beginners:
- TinyLlama-1.1B (Good for CPU)
- Mistral-7B (Balanced performance)
- Llama-3.1-8B (High quality, needs more RAM)
Browse GGUF models on the Hugging Face Hub by filtering the model list at https://huggingface.co/models for the GGUF format. To see which models are already in your local Hugging Face cache, run:
huggingface-cli scan-cache
Step 4: Run Your First Model
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF:Q4_K_M
Test with a query:
>>> What's quantum computing?
Quantum computing uses quantum-mechanical phenomena...
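To confirm the model downloaded and see everything Ollama has stored locally:
ollama list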
Step 5: Customize with Modelfile
Create a file named custom-model.Modelfile:
FROM ./mistral-7b-q4_k_m.gguf
PARAMETER temperature 0.7
SYSTEM "You're a helpful coding assistant"
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}{{ end }}
<|user|>
{{ .Prompt }}
<|assistant|>"""
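The FROM line above expects a GGUF file in the current directory. If you don't have one yet, you can pull it from the Hub first; the repository and filename here are illustrative, so substitute the model you actually want and adjust the FROM path to match:
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir .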
Build and run:
ollama create my-mistral -f custom-model.Modelfile
ollama run my-mistral
Step 6: Python Integration
Basic API interaction:
import requests

# Send a single, non-streaming generation request to the local Ollama server
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-mistral",
        "prompt": "Explain Python decorators",
        "stream": False
    }
)
print(response.json()["response"])
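If you'd rather not hand-roll HTTP calls, there is also an official ollama Python package (pip install ollama) that wraps the same local API; a minimal sketch, assuming the my-mistral model built in Step 5 is available:
import ollama  # pip install ollama

# Non-streaming call against the local Ollama server
result = ollama.generate(model="my-mistral", prompt="Explain Python decorators")
print(result["response"])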
Streaming response handler:
import json
import requests

def stream_response(prompt):
    # Ollama streams newline-delimited JSON objects, one per generated chunk
    with requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "my-mistral",
            "prompt": prompt,
            "stream": True
        },
        stream=True
    ) as r:
        for line in r.iter_lines():
            if line:
                chunk = json.loads(line)
                if not chunk.get("done"):
                    yield chunk["response"]
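A quick way to exercise the generator and print tokens as they arrive:
for chunk in stream_response("Explain Python decorators"):
    print(chunk, end="", flush=True)
print()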
Troubleshooting Common Issues
| Issue | Solution |
|---|---|
| Model not found | Check the quantization tag in the model name (e.g., :Q4_K_M) |
| CUDA out of memory | Use a smaller quantization (e.g., Q2_K) |
| Slow responses | Offload more layers to the GPU via the num_gpu option (see below) |
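As a sketch of that last fix: num_gpu is the Ollama option controlling how many model layers are offloaded to the GPU, and it can be set per request through the API's options field (35 is only an example value; tune it to your VRAM):
import requests

# Ask Ollama to offload 35 layers to the GPU for this request
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-mistral",
        "prompt": "Explain Python decorators",
        "stream": False,
        "options": {"num_gpu": 35}  # example value; tune to your hardware
    }
)
print(response.json()["response"])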
Next Steps
- Experiment with different quantization levels
- Implement a RAG architecture (a minimal sketch follows this list)
- Explore multi-model pipelines
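For the RAG item, here is a minimal sketch of the idea using Ollama's local embeddings endpoint together with the my-mistral model from Step 5. The embedding model (nomic-embed-text, pulled separately with ollama pull nomic-embed-text) and the tiny in-memory document list are illustrative assumptions, not part of the setup above:
import requests

OLLAMA = "http://localhost:11434"

def embed(text, model="nomic-embed-text"):
    # Get an embedding vector from the local Ollama server
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": model, "prompt": text})
    return r.json()["embedding"]

def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

# Tiny in-memory "document store" for illustration only
docs = [
    "Ollama exposes a local REST API on port 11434.",
    "GGUF is a binary format for storing quantized LLM weights.",
    "Q4_K_M is a 4-bit quantization that balances size and quality.",
]
index = [(doc, embed(doc)) for doc in docs]

def answer(question, k=2):
    # Retrieve the k most relevant documents and ground the prompt in them
    q_vec = embed(question)
    top = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:k]
    context = "\n".join(doc for doc, _ in top)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "my-mistral", "prompt": prompt, "stream": False})
    return r.json()["response"]

print(answer("What port does Ollama listen on?"))
This keeps everything local: both retrieval (embeddings) and generation run against the same Ollama server.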
With this setup, you're ready to explore 45,000+ models while maintaining complete data privacy. The Ollama-Hugging Face integration brings enterprise-grade AI capabilities to local machines.