Offline AI Deployment: Run LLMs Locally with Ollama & Hugging Face - Complete Tutorial

Private AI Power: Run Language Models Offline with Ollama & Hugging Face

This tutorial shows how to deploy large language models (LLMs) entirely offline, combining Hugging Face's model hub with Ollama's optimized runtime. No cloud dependencies, no API costs - just private AI processing on your own hardware.

Why Go Offline?

  • 100% data privacy
  • No internet required
  • No per-token API or cloud inference fees
  • Full model control

1. System Requirements & Setup

Hardware Recommendations

Minimum:
- 8GB RAM
- 20GB Storage
- x64 CPU

Recommended:
- 16GB+ RAM
- NVIDIA GPU (Optional)
- SSD Storage

Install Dependencies

# For Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Install Python requirements
pip install transformers torch sentencepiece
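
A quick sanity check confirms that the Python packages import cleanly and whether a GPU is visible to PyTorch. This is a minimal sketch; the versions printed will depend on your environment.

# check_env.py - verify the Python dependencies installed correctly
import torch
import transformers

print(f"transformers {transformers.__version__}")
print(f"torch {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")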

2. Convert Hugging Face Models to Ollama Format

from transformers import AutoModelForCausalLM, AutoTokenizer

# torch_dtype="auto" keeps the weights in their native 16-bit precision
# instead of upcasting to float32, roughly halving memory use
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Save a local copy of the weights and tokenizer (Hugging Face format)
model.save_pretrained("./zephyr-ollama")
tokenizer.save_pretrained("./zephyr-ollama")

This step downloads the weights and tokenizer and saves a local copy. Note that this is still the Hugging Face checkpoint format: recent Ollama releases can import a Safetensors directory like this directly, but older versions require converting the weights to GGUF first (for example with llama.cpp's conversion script).
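
If your Ollama version cannot import the Safetensors directory, the usual route is to convert the saved checkpoint to GGUF with llama.cpp's converter. The sketch below assumes you have cloned the llama.cpp repository into ./llama.cpp and installed its Python requirements; the script name and flags can differ between llama.cpp releases, so check the copy you have checked out.

# convert_to_gguf.py - hedged sketch: convert the saved checkpoint to GGUF
# Assumes llama.cpp is cloned at ./llama.cpp with its requirements installed
import subprocess

subprocess.run(
    [
        "python", "./llama.cpp/convert_hf_to_gguf.py",
        "./zephyr-ollama",                                   # directory saved in the previous step
        "--outfile", "./zephyr-ollama/zephyr-7b-beta.gguf",  # GGUF output for the Modelfile
    ],
    check=True,
)

The resulting .gguf file can then be referenced from the Modelfile's FROM line in place of the directory.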

3. Create Ollama Model File

# zephyr-ollama/Modelfile
FROM ./zephyr-ollama
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
TEMPLATE """<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""

This configuration file controls sampling behavior and prompt formatting. Ollama templates use Go template syntax ({{ .Prompt }} and friends) rather than Jinja; the template above follows Zephyr's <|user|>/<|assistant|> chat format, and a system-prompt block can be added the same way.

4. Build & Run Local Model

# Build model package
ollama create zephyr -f ./zephyr-ollama/Modelfile

# Start the inference server (the Linux install script usually registers this as a service already)
ollama serve &

# Run model interactively
ollama run zephyr "Explain quantum physics simply"
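
Before sending requests from Python, it helps to confirm the server is actually up. The sketch below polls Ollama's model-listing endpoint until it responds; the default port 11434 matches the API calls used later in this tutorial.

# wait_for_ollama.py - poll the local Ollama server until it responds
import time
import requests

def wait_for_ollama(url="http://localhost:11434/api/tags", timeout=60):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not listening yet
        time.sleep(1)
    return False

print("Ollama is up" if wait_for_ollama() else "Ollama did not start in time")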

Common Installation Issues

  • Missing CUDA drivers for GPU acceleration
  • Insufficient disk space for large models
  • Permission errors on Linux (re-run the install script with sudo or fix ownership of ~/.ollama)

5. Python API Integration

import requests

def ask_ollama(prompt, model="zephyr"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        }
    )
    return response.json()["response"]

print(ask_ollama("Write Python code for bubble sort"))
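
For longer generations, the same endpoint can stream tokens as they are produced by setting "stream" to true; Ollama then returns one JSON object per line. A minimal streaming variant of ask_ollama looks like this.

import json
import requests

def ask_ollama_stream(prompt, model="zephyr"):
    """Yield response fragments as the model generates them."""
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk.get("response", "")

for fragment in ask_ollama_stream("Explain recursion in one paragraph"):
    print(fragment, end="", flush=True)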

6. Build a Private Chat Interface

from flask import Flask, request, render_template_string
# Reuses the ask_ollama helper defined in section 5

app = Flask(__name__)

HTML_TEMPLATE = '''
<form method="POST">
    <input name="query" placeholder="Ask me anything...">
    <button type="submit">Ask</button>
</form>
{% if response %}<div>{{ response }}</div>{% endif %}
'''

@app.route('/', methods=['GET', 'POST'])
def chat():
    response = None
    if request.method == 'POST':
        response = ask_ollama(request.form['query'])
    return render_template_string(HTML_TEMPLATE, response=response)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
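
The form above is stateless: each question is answered in isolation. Ollama also exposes a /api/chat endpoint that accepts a list of role/content messages, which makes multi-turn conversations straightforward. A minimal sketch:

import requests

def chat_with_history(history, model="zephyr"):
    """Send the full conversation to Ollama's chat endpoint and return the reply."""
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": history, "stream": False},
    )
    return response.json()["message"]["content"]

history = [{"role": "user", "content": "Who wrote The Hobbit?"}]
reply = chat_with_history(history)
history.append({"role": "assistant", "content": reply})
history.append({"role": "user", "content": "What else did they write?"})
print(chat_with_history(history))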

7. Optimize Performance

  • 4-bit quantization - build a quantized variant with ollama create zephyr -q q4_K_M -f ./zephyr-ollama/Modelfile (roughly 2.5x faster on CPU with a much smaller memory footprint)
  • GPU acceleration - offload layers to the GPU by adding PARAMETER num_gpu 20 to the Modelfile (roughly 5-8x faster when a supported NVIDIA GPU is available)
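
To verify that quantization or GPU offload actually helps on your hardware, you can read the timing fields Ollama includes in non-streaming responses. The sketch below assumes the eval_count and eval_duration fields (tokens and nanoseconds, respectively) are present, as in current Ollama releases.

import requests

def generation_speed(prompt, model="zephyr"):
    """Return generated tokens per second as reported by Ollama."""
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    tokens = data["eval_count"]
    seconds = data["eval_duration"] / 1e9  # eval_duration is reported in nanoseconds
    return tokens / seconds

print(f"{generation_speed('Summarize the plot of Hamlet'):.1f} tokens/sec")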

8. Model Management

# List installed models
ollama list

# Remove unused models
ollama rm zephyr

# Update existing models
ollama pull mistral
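
The same management information is available over the HTTP API, which is handy for scripts and monitoring. The /api/tags endpoint returns the installed models and their sizes.

import requests

# List locally installed models through the API instead of the CLI
models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")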

Production Checklist

  • Enable auto-start on boot
  • Set memory limits
  • Implement API rate limiting (see the sketch after this list)
  • Regularly update models
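
For the rate-limiting item, a minimal per-client limiter can be added to the Flask app from section 6 without extra dependencies. This is only a sketch using an in-memory counter, suitable for a single-process deployment.

import time
from collections import defaultdict, deque
from flask import request, jsonify

WINDOW_SECONDS = 60
MAX_REQUESTS = 20
_request_log = defaultdict(deque)  # client IP -> timestamps of recent requests

@app.before_request
def rate_limit():
    now = time.time()
    log = _request_log[request.remote_addr]
    # Drop timestamps that fell outside the sliding window
    while log and now - log[0] > WINDOW_SECONDS:
        log.popleft()
    if len(log) >= MAX_REQUESTS:
        return jsonify({"error": "rate limit exceeded"}), 429
    log.append(now)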

Next Steps

  • Explore multi-model ensembles
  • Implement RAG (Retrieval Augmented Generation) - a minimal starting point is sketched after this list
  • Set up monitoring with Prometheus
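
As a starting point for the RAG item, the sketch below embeds a handful of documents with Ollama's embeddings endpoint, retrieves the closest match by cosine similarity, and feeds it into the prompt via the ask_ollama helper from section 5. It assumes an embedding model such as nomic-embed-text has been pulled (ollama pull nomic-embed-text); swap in whichever embedding model you have available.

import math
import requests

def embed(text, model="nomic-embed-text"):
    # /api/embeddings returns {"embedding": [...]} for a single prompt
    return requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text},
    ).json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
]
doc_vectors = [(doc, embed(doc)) for doc in documents]

question = "How long do I have to return an item?"
q_vec = embed(question)
best_doc = max(doc_vectors, key=lambda dv: cosine(q_vec, dv[1]))[0]

prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
print(ask_ollama(prompt))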

You've now created a fully private AI deployment capable of complex language tasks without internet connectivity. This setup forms the foundation for secure enterprise AI solutions and personal projects requiring absolute data privacy.


Category: GenAI
