This tutorial shows how to deploy large language models (LLMs) entirely offline by combining Hugging Face's model hub with Ollama's optimized runtime. No cloud dependencies, no API costs - just private AI inference on your own hardware.
Minimum:
- 8GB RAM
- 20GB Storage
- x64 CPU
Recommended:
- 16GB+ RAM
- NVIDIA GPU (for faster inference)
- SSD Storage
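If you're unsure whether your machine meets these specs, a quick check is easy once Python and torch are installed (see the setup step below). A minimal sketch, not part of the core setup:

```python
import shutil

import torch  # installed in the setup step below

# Report GPU availability and free disk space before downloading models.
print("CUDA GPU available:", torch.cuda.is_available())
total, used, free = shutil.disk_usage(".")
print(f"Free disk space: {free / 1e9:.1f} GB")
```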
```bash
# For Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Install Python requirements
pip install transformers torch sentencepiece
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Download the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Save locally in Hugging Face (safetensors) format for Ollama to import
model.save_pretrained("./zephyr-ollama")
tokenizer.save_pretrained("./zephyr-ollama")
```
`save_pretrained` writes the model in Hugging Face's standard safetensors layout. Recent Ollama releases can import such a directory directly through a Modelfile for supported architectures (Zephyr is Mistral-based, which qualifies); older releases require converting to GGUF first.
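If your Ollama build can't import safetensors directly, llama.cpp ships a converter that produces a GGUF file. A minimal sketch, assuming you've cloned llama.cpp next to your project; the converter script's name and flags have changed between llama.cpp releases, so check your copy:

```python
import subprocess

# Convert the saved Hugging Face model into a single GGUF file that any
# Ollama version can load via a Modelfile FROM line.
subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "./zephyr-ollama",
        "--outfile", "./zephyr-ollama/zephyr.gguf",
        "--outtype", "f16",
    ],
    check=True,
)
```

If you go this route, point the Modelfile's `FROM` line at the `.gguf` file instead of the directory.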
```
# zephyr-ollama/Modelfile
FROM ./zephyr-ollama
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}</s>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}</s>
{{ end }}<|assistant|>
"""
```
This configuration file controls model behavior and prompt formatting. Note that Ollama templates use Go template syntax rather than Jinja, and the template above wraps each turn in Zephyr's `<|system|>`, `<|user|>`, and `<|assistant|>` chat tokens.
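You can sanity-check this format against the tokenizer's built-in chat template, which encodes the same Zephyr conventions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum physics simply."},
]

# Render the conversation exactly as the model expects to see it.
print(tokenizer.apply_chat_template(messages, tokenize=False,
                                    add_generation_prompt=True))
```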
```bash
# Build model package
ollama create zephyr -f ./zephyr-ollama/Modelfile

# Start inference server
ollama serve &

# Run model interactively
ollama run zephyr "Explain quantum physics simply"
```
```python
import requests

def ask_ollama(prompt, model="zephyr"):
    # Send a single non-streaming generation request to the local server
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
        },
    )
    return response.json()["response"]

print(ask_ollama("Write Python code for bubble sort"))
```
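For longer outputs you'll usually want streaming, so tokens appear as they're generated. A minimal sketch using the same endpoint with streaming enabled; each response line is a standalone JSON chunk:

```python
import json

import requests

def stream_ollama(prompt, model="zephyr"):
    # With "stream": True the server returns newline-delimited JSON chunks
    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as response:
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break

stream_ollama("Explain quantum physics simply")
```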
```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

HTML_TEMPLATE = '''
<form method="POST">
    <input name="query" placeholder="Ask me anything...">
    <button type="submit">Ask</button>
</form>
{% if response %}<div>{{ response }}</div>{% endif %}
'''

@app.route('/', methods=['GET', 'POST'])
def chat():
    response = None
    if request.method == 'POST':
        # Reuses the ask_ollama() helper defined above
        response = ask_ollama(request.form['query'])
    return render_template_string(HTML_TEMPLATE, response=response)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```
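The web form above is stateless; each question stands alone. For multi-turn conversations, Ollama also exposes a chat endpoint that accepts a message history. A minimal sketch (the `/api/chat` endpoint is available in recent Ollama versions):

```python
import requests

def chat_ollama(messages, model="zephyr"):
    # messages is a list of {"role": ..., "content": ...} dicts;
    # the server applies the Modelfile's TEMPLATE to the whole history.
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages, "stream": False},
    )
    return response.json()["message"]["content"]

history = [{"role": "user", "content": "What is entropy?"}]
answer = chat_ollama(history)
history += [{"role": "assistant", "content": answer},
            {"role": "user", "content": "Give a one-line summary."}]
print(chat_ollama(history))
```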
Two optimizations give the largest gains (exact flags vary by Ollama version):

| Technique | How | Speed Boost |
|---|---|---|
| 4-bit Quantization | `ollama create zephyr-q4 --quantize q4_K_M -f ./zephyr-ollama/Modelfile` | ~2.5x |
| GPU Acceleration | `PARAMETER num_gpu 20` in the Modelfile | 5-8x |
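To verify these gains on your own hardware, the generate endpoint's non-streaming response includes token counts and timing metadata you can turn into a tokens-per-second figure. A minimal sketch; `tokens_per_second` is a hypothetical helper, and `eval_count`/`eval_duration` are fields Ollama reports (durations in nanoseconds):

```python
import requests

def tokens_per_second(model, prompt="Explain quantum physics simply"):
    # eval_count = generated tokens, eval_duration = generation time in ns
    data = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
    ).json()
    return data["eval_count"] / data["eval_duration"] * 1e9

print(f"zephyr: {tokens_per_second('zephyr'):.1f} tokens/sec")
```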
```bash
# List installed models
ollama list

# Remove unused models
ollama rm zephyr

# Update existing models (requires a network connection)
ollama pull mistral
```
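The same management operations are available over the REST API, which is handy for scripts and dashboards. A minimal sketch using the tags endpoint, which lists locally installed models:

```python
import requests

# GET /api/tags returns the locally installed models and their sizes
models = requests.get("http://localhost:11434/api/tags").json()["models"]
for m in models:
    print(f"{m['name']}: {m['size'] / 1e9:.1f} GB")
```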
You've now built a fully private AI deployment that handles complex language tasks without any internet connectivity. This setup forms the foundation for secure enterprise AI solutions and personal projects that demand strict data privacy.