
Building a Private AI Chatbot for Your Company with Ollama and Open WebUI

Deploy a fully private AI chatbot using Ollama and Open WebUI on your own servers. No data leaves your network. Complete setup guide with Docker Compose.

Yash Pritwani
14 min read

Why Your Company Needs a Private AI Chatbot

Every enterprise wants the productivity gains of ChatGPT. Few want their proprietary data flowing through OpenAI's servers. Internal chat logs, customer data, financial projections, and trade secrets have no business leaving your network perimeter.


Neural network architecture: data flows through input, hidden, and output layers.

The solution is straightforward: run a large language model locally using Ollama and give your team a polished web interface with Open WebUI. At TechSaaS, we deploy this stack for clients in a single afternoon.

The Architecture

The stack is refreshingly simple:

  • Ollama: Model runtime that manages downloading, quantizing, and serving LLMs
  • Open WebUI: Feature-rich chat interface with multi-user support, conversation history, RAG, and model management
  • Your hardware: Any machine with 16GB+ RAM (CPU inference) or an NVIDIA GPU (fast inference)

All traffic stays internal. Zero external API calls. Zero per-token costs after setup.

Hardware Requirements

For a team of 10-50 people with sequential queries:

Model Size                  RAM Needed   GPU VRAM   Speed
7B (Mistral, Llama 3.1)     8GB          6GB        Fast
13B (CodeLlama, Llama 2)    16GB         10GB       Good
70B (Llama 3.1 70B)         48GB         40GB       Moderate

A single NVIDIA RTX 4090 (24GB VRAM) handles 13B models with room to spare. For CPU-only, expect 5-10 tokens/second on a modern Ryzen or Xeon.


Docker Compose Setup

Create a docker-compose.yml:

version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    networks:
      - ai-net

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=change-this-to-a-random-string
    volumes:
      - webui_data:/app/backend/data
    ports:
      - "3000:8080"
    depends_on:
      - ollama
    networks:
      - ai-net

volumes:
  ollama_data:
  webui_data:

networks:
  ai-net:
    driver: bridge

Start everything:

docker compose up -d
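Before pulling models, it is worth confirming both containers came up cleanly. A quick check, assuming the service names from the compose file above:

```shell
# Both services should report a running state
docker compose ps

# Scan Ollama's startup logs; with a GPU present you should see
# CUDA/driver detection messages rather than a CPU-only fallback
docker compose logs --tail 30 ollama
```

Note that the compose file above does not publish Ollama's port 11434 to the host; only Open WebUI reaches it over the internal `ai-net` network, which is exactly the point of the setup.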

Pulling Your First Model

Once Ollama is running, pull a model:

# Fast and capable general-purpose model
docker exec ollama ollama pull llama3.1:8b

# Excellent for code generation
docker exec ollama ollama pull codellama:13b

# Small but fast for simple tasks
docker exec ollama ollama pull phi3:mini

Browse to http://localhost:3000, create an admin account (the first signup becomes the administrator), and start chatting. Pulled models appear in the model selector automatically.
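You can also talk to a model straight from the command line, which is handy for smoke tests and scripting. The `ollama run` call below is safe as-is; the HTTP example assumes `curl` is present inside the image (if not, publish port 11434 and run it from the host):

```shell
# One-shot prompt via the ollama CLI inside the container
docker exec ollama ollama run llama3.1:8b "Write a one-line description of RAG."

# Or hit Ollama's HTTP API from inside the compose network
docker exec ollama curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Hello", "stream": false}'
```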


RAG architecture: user prompts are embedded, matched against a vector store, then fed to an LLM with retrieved context.

Adding RAG for Internal Documents

Open WebUI has built-in RAG (Retrieval-Augmented Generation). Upload PDFs, markdown files, or paste text into the knowledge base:

  1. Go to Workspace > Knowledge
  2. Create a new collection (e.g., "Engineering Docs")
  3. Upload your documents
  4. In any chat, click the + button and attach the knowledge base
  5. The model now answers questions using your documents as context

For larger document sets, configure an external vector database:

# Add under services: in docker-compose.yml
chromadb:
  image: chromadb/chroma:latest
  container_name: chromadb
  restart: unless-stopped
  volumes:
    - chroma_data:/chroma/chroma
  networks:
    - ai-net

# Remember to also declare chroma_data in the top-level volumes: block

Then point Open WebUI's RAG settings to http://chromadb:8000.

Putting It Behind Authentication

If you are running Traefik and Authelia (like we do at TechSaaS), add labels to the Open WebUI service:

labels:
  - "traefik.enable=true"
  - "traefik.http.routers.ai-chat.rule=Host(`chat.internal.company.com`)"
  - "traefik.http.routers.ai-chat.middlewares=authelia@file"
  - "traefik.http.services.ai-chat.loadbalancer.server.port=8080"

This ensures only authenticated employees can access the chatbot while keeping it off the public internet.

Performance Tuning

A few tips to maximize throughput:

  • Enable GPU acceleration: Ensure the NVIDIA Container Toolkit is installed and the deploy block is present in your compose file
  • Use quantized models: 4-bit quantized (Q4_K_M) models run 2-3x faster with minimal quality loss
  • Set context limits: capping the context window (Ollama's num_ctx parameter, e.g. 4096) prevents memory exhaustion on long conversations
  • Monitor VRAM: nvidia-smi -l 1 watches GPU memory in real time
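As an example of the quantization tip above, Ollama publishes explicitly quantized tags for many models. The exact tag below is an assumption; tag names vary per model, so check the model's page in the Ollama library first (`ollama pull` will simply error on an unknown tag):

```shell
# Pull an explicitly 4-bit-quantized build instead of the default tag
docker exec ollama ollama pull llama3.1:8b-instruct-q4_K_M

# Compare on-disk sizes against the default tag
docker exec ollama ollama list
```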

# Check model memory usage
docker exec ollama ollama ps

# See available models
docker exec ollama ollama list

Cost Comparison

Running Llama 3.1 8B locally for a 50-person team:

  • Hardware: ~$300/month (dedicated GPU server) or $0 if you own the hardware
  • Per-query cost: $0.00
  • Monthly queries: Unlimited

Equivalent usage with OpenAI GPT-4o:

  • 50 users x 100 queries/day x ~1,000 tokens per query: roughly $3,000/month at typical API rates
  • Plus the data-privacy exposure, which no invoice captures

The self-hosted option pays for itself in month one.
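The arithmetic behind that estimate, using an assumed blended rate of $20 per million tokens (actual GPT-4o pricing differs between input and output tokens, so treat this as a rough ceiling):

```shell
# 50 users x 100 queries/day x 1,000 tokens/query x 30 days
tokens=$((50 * 100 * 1000 * 30))
echo "tokens/month: $tokens"
# → tokens/month: 150000000

# Assumed blended API price: $20 per 1M tokens
awk -v t="$tokens" 'BEGIN { printf "API cost: $%.0f/month\n", t / 1000000 * 20 }'
# → API cost: $3000/month
```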


ML pipeline: from raw data collection through training, evaluation, deployment, and continuous monitoring.

What We Deploy for Clients

At TechSaaS, our private AI chatbot package includes Ollama, Open WebUI, ChromaDB for RAG, Traefik routing, Authelia authentication, automated backups, and monitoring via Grafana. The entire stack runs on a single server alongside your other services. Contact [email protected] to get started.

#ollama #open-webui #private-ai #self-hosted #llm #docker
