Building a Private AI Chatbot for Your Company with Ollama and Open WebUI
Deploy a fully private AI chatbot using Ollama and Open WebUI on your own servers. No data leaves your network. Complete setup guide with Docker Compose.
Why Your Company Needs a Private AI Chatbot
Every enterprise wants the productivity gains of ChatGPT. Few want their proprietary data flowing through OpenAI's servers. Internal chat logs, customer data, financial projections, and trade secrets have no business leaving your network perimeter.
<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 200" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="200" rx="12" fill="#1a1a2e"/><text x="80" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Input</text><circle cx="80" cy="50" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><circle cx="80" cy="100" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><circle cx="80" cy="150" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><text x="230" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Hidden</text><circle cx="230" cy="45" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="85" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="125" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="165" r="14" fill="#6366f1" opacity="0.8"/><text x="380" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Hidden</text><circle cx="380" cy="55" r="14" fill="#a855f7" opacity="0.8"/><circle cx="380" cy="100" r="14" fill="#a855f7" opacity="0.8"/><circle cx="380" cy="145" r="14" fill="#a855f7" opacity="0.8"/><text x="520" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Output</text><circle cx="520" cy="80" r="14" fill="none" stroke="#2dd4bf" stroke-width="2"/><circle cx="520" cy="130" r="14" fill="none" stroke="#2dd4bf" stroke-width="2"/><line x1="94" y1="50" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="85" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="85" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="85" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="55" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="55" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="100" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="100" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="145" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="145" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">Neural network architecture: data flows through input, hidden, and output layers.</p></div>
The solution is straightforward: run a large language model locally using Ollama and give your team a polished web interface with Open WebUI. At TechSaaS, we deploy this stack for clients in a single afternoon.
The Architecture
The stack is deceptively simple:
All traffic stays internal. Zero external API calls. Zero per-token costs after setup.
Hardware Requirements
For a team of 10-50 people with sequential queries:
|-----------|-----------|----------|-------|
A single NVIDIA RTX 4090 (24GB VRAM) handles 13B models with room to spare. For CPU-only, expect 5-10 tokens/second on a modern Ryzen or Xeon.
Docker Compose Setup
Create a docker-compose.yml:
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
restart: unless-stopped
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
networks:
- ai-net
open-webui:
image: ghcr.io/open-webui/open-webui:main
container_name: open-webui
restart: unless-stopped
environment:
- OLLAMA_BASE_URL=http://ollama:11434
- WEBUI_AUTH=true
- WEBUI_SECRET_KEY=change-this-to-a-random-string
volumes:
- webui_data:/app/backend/data
ports:
- "3000:8080"
depends_on:
- ollama
networks:
- ai-net
volumes:
ollama_data:
webui_data:
networks:
ai-net:
driver: bridgeStart everything:
docker compose up -dPulling Your First Model
Once Ollama is running, pull a model:
# Fast and capable general-purpose model
docker exec ollama ollama pull llama3.1:8b
# Excellent for code generation
docker exec ollama ollama pull codellama:13b
# Small but fast for simple tasks
docker exec ollama ollama pull phi3:miniOpen WebUI at http://localhost:3000, create an admin account, and start chatting. Models appear automatically.
<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 180" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="180" rx="12" fill="#1a1a2e"/><rect x="30" y="60" width="80" height="50" rx="25" fill="#3b82f6" opacity="0.85"/><text x="70" y="90" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">Prompt</text><rect x="145" y="50" width="90" height="70" rx="8" fill="#6366f1" opacity="0.85"/><text x="190" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Embed</text><text x="190" y="95" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">[0.2, 0.8...]</text><rect x="270" y="50" width="90" height="70" rx="8" fill="#a855f7" opacity="0.85"/><text x="315" y="75" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Vector</text><text x="315" y="90" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Search</text><text x="315" y="105" text-anchor="middle" fill="#ffffff" font-size="9" font-family="system-ui" opacity="0.7">top-k=5</text><rect x="395" y="50" width="90" height="70" rx="8" fill="#2dd4bf" opacity="0.85"/><text x="440" y="80" text-anchor="middle" fill="#1a1a2e" font-size="11" font-family="system-ui" font-weight="bold">LLM</text><text x="440" y="95" text-anchor="middle" fill="#1a1a2e" font-size="9" font-family="system-ui">+ context</text><rect x="520" y="60" width="55" height="50" rx="25" fill="#f59e0b" opacity="0.85"/><text x="547" y="90" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Reply</text><defs><marker id="arrow4" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6" fill="#e2e8f0"/></marker></defs><line x1="112" y1="85" x2="143" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="237" y1="85" x2="268" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="362" y1="85" x2="393" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="487" y1="85" x2="518" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><text x="300" y="155" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Retrieval-Augmented Generation (RAG) Flow</text></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">RAG architecture: user prompts are embedded, matched against a vector store, then fed to an LLM with retrieved context.</p></div>
Adding RAG for Internal Documents
Open WebUI has built-in RAG (Retrieval-Augmented Generation). Upload PDFs, markdown files, or paste text into the knowledge base:
1. Go to Workspace > Knowledge 2. Create a new collection (e.g., "Engineering Docs") 3. Upload your documents 4. In any chat, click the + button and attach the knowledge base 5. The model now answers questions using your documents as context
For larger document sets, configure an external vector database:
# Add to docker-compose.yml
chromadb:
image: chromadb/chroma:latest
container_name: chromadb
volumes:
- chroma_data:/chroma/chroma
networks:
- ai-netThen point Open WebUI's RAG settings to http://chromadb:8000.
Putting It Behind Authentication
If you are running Traefik and Authelia (like we do at TechSaaS), add labels to the Open WebUI service:
labels:
- "traefik.enable=true"
- "traefik.http.routers.ai-chat.rule=Host(`chat.internal.company.com`)"
- "traefik.http.routers.ai-chat.middlewares=authelia@file"
- "traefik.http.services.ai-chat.loadbalancer.server.port=8080"This ensures only authenticated employees can access the chatbot while keeping it off the public internet.
Performance Tuning
A few tips to maximize throughput:
deploy block is present in your compose fileOLLAMA_NUM_CTX=4096 prevents memory exhaustion on long conversationsnvidia-smi -l 1 watches GPU memory in real time# Check model memory usage
docker exec ollama ollama ps
# See available models
docker exec ollama ollama listCost Comparison
Running Llama 3.1 8B locally for a 50-person team:
Equivalent with OpenAI GPT-4o:
The self-hosted option pays for itself in month one.
<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 160" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="160" rx="12" fill="#1a1a2e"/><rect x="20" y="40" width="80" height="60" rx="6" fill="#3b82f6" opacity="0.85"/><text x="60" y="65" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Raw</text><text x="60" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Data</text><rect x="125" y="40" width="80" height="60" rx="6" fill="#6366f1" opacity="0.85"/><text x="165" y="65" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Pre-</text><text x="165" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">process</text><rect x="230" y="40" width="80" height="60" rx="6" fill="#a855f7" opacity="0.85"/><text x="270" y="65" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Train</text><text x="270" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Model</text><rect x="335" y="40" width="80" height="60" rx="6" fill="#2dd4bf" opacity="0.85"/><text x="375" y="65" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Evaluate</text><text x="375" y="80" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Metrics</text><rect x="440" y="40" width="80" height="60" rx="6" fill="#f59e0b" opacity="0.85"/><text x="480" y="65" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Deploy</text><text x="480" y="80" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Model</text><rect x="545" y="40" width="40" height="60" rx="6" fill="#6366f1" opacity="0.6"/><text x="565" y="75" text-anchor="middle" fill="#ffffff" font-size="9" font-family="system-ui">Mon</text><defs><marker id="arrow3" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6" fill="#e2e8f0"/></marker></defs><line x1="102" y1="70" x2="123" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="207" y1="70" x2="228" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="312" y1="70" x2="333" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="417" y1="70" x2="438" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="522" y1="70" x2="543" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><path d="M375,102 L375,130 L270,130 L270,102" stroke="#f59e0b" stroke-width="1" stroke-dasharray="4,3" fill="none" marker-end="url(#arrow3b)"/><defs><marker id="arrow3b" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto-start-reverse"><path d="M0,0 L8,3 L0,6" fill="#f59e0b"/></marker></defs><text x="322" y="143" text-anchor="middle" fill="#f59e0b" font-size="9" font-family="system-ui">retrain loop</text></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">ML pipeline: from raw data collection through training, evaluation, deployment, and continuous monitoring.</p></div>
What We Deploy for Clients
At TechSaaS, our private AI chatbot package includes Ollama, Open WebUI, ChromaDB for RAG, Traefik routing, Authelia authentication, automated backups, and monitoring via Grafana. The entire stack runs on a single server alongside your other services. Contact [email protected] to get started.
Need help with ai & machine learning?
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.