AI & Machine Learning

Building Semantic Search for Your Documentation with AI

Build an AI-powered semantic search engine for your docs using embeddings, pgvector, and a simple API. Find answers by meaning, not just keywords.

Yash Pritwani
14 min read

The Problem with Keyword Search

Your documentation is growing. Engineers add pages daily. But finding the right answer is still painful. Search for "how to restart the queue" and you get nothing — because the docs say "reset the job processor." Keyword search fails when users and authors use different words for the same concept.

<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 200" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="200" rx="12" fill="#1a1a2e"/><text x="80" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Input</text><circle cx="80" cy="50" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><circle cx="80" cy="100" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><circle cx="80" cy="150" r="14" fill="none" stroke="#3b82f6" stroke-width="2"/><text x="230" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Hidden</text><circle cx="230" cy="45" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="85" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="125" r="14" fill="#6366f1" opacity="0.8"/><circle cx="230" cy="165" r="14" fill="#6366f1" opacity="0.8"/><text x="380" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Hidden</text><circle cx="380" cy="55" r="14" fill="#a855f7" opacity="0.8"/><circle cx="380" cy="100" r="14" fill="#a855f7" opacity="0.8"/><circle cx="380" cy="145" r="14" fill="#a855f7" opacity="0.8"/><text x="520" y="25" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Output</text><circle cx="520" cy="80" r="14" fill="none" stroke="#2dd4bf" stroke-width="2"/><circle cx="520" cy="130" r="14" fill="none" stroke="#2dd4bf" stroke-width="2"/><line x1="94" y1="50" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="85" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="50" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="85" 
stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="100" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="45" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="85" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="125" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="94" y1="150" x2="216" y2="165" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="45" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="85" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="125" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="55" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="100" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="244" y1="165" x2="366" y2="145" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="55" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="55" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="100" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" 
y1="100" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="145" x2="506" y2="80" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/><line x1="394" y1="145" x2="506" y2="130" stroke="#e2e8f0" stroke-width="0.5" opacity="0.3"/></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">Neural network architecture: data flows through input, hidden, and output layers.</p></div>

Semantic search understands meaning. It knows that "restart the queue" and "reset the job processor" are about the same thing. At TechSaaS, we build semantic search into every documentation platform we deploy.
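The "same meaning, different words" idea boils down to cosine similarity between embedding vectors. The tiny 4-dimensional vectors below are made up purely for illustration (real embeddings have hundreds of dimensions), but they show the point: phrases that mean the same thing land close together, while an unrelated phrase does not:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real model emits 384+ dimensions.
restart_queue = [0.81, 0.10, 0.52, 0.05]
reset_processor = [0.78, 0.15, 0.55, 0.08]
pricing_page = [0.05, 0.90, 0.02, 0.41]

print(cosine_similarity(restart_queue, reset_processor))  # high, near 1.0
print(cosine_similarity(restart_queue, pricing_page))     # low
```

Keyword search scores "restart the queue" against "reset the job processor" as zero overlap; in embedding space they are nearly parallel vectors.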

Architecture Overview

The pipeline has four stages:

1. Ingest: Crawl docs, split into chunks, clean text
2. Embed: Convert chunks to vectors using an embedding model
3. Store: Save vectors in pgvector (PostgreSQL extension)
4. Query: Embed the user's question, find nearest chunks, return results

User Query → Embed → pgvector Nearest Neighbor → Top-K Chunks → Display

Step 1: Set Up pgvector

If you already run PostgreSQL (and you should), just add the extension:

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE doc_chunks (
    id SERIAL PRIMARY KEY,
    source_url TEXT NOT NULL,
    title TEXT,
    content TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(384),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create an index for fast similarity search
CREATE INDEX ON doc_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

We use 384 dimensions here to match the all-MiniLM-L6-v2 model, which is fast and free to run locally. One caveat: ivfflat builds its cluster lists from the rows already in the table, so recall is best if you create the index after loading your data.
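It also helps to know pgvector's text literal format up front: a vector can be written as a bracketed string like '[0.1,0.2,0.3]' and cast with ::vector, which is useful when you don't want a driver-level type adapter. A minimal formatter (the helper name is ours, not part of pgvector):

```python
def to_vector_literal(values: list[float]) -> str:
    """Format a Python list as a pgvector text literal, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

print(to_vector_literal([0.1, 0.2, 0.3]))  # [0.1,0.2,0.3]
```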

Step 2: Ingest and Chunk Documents

import re
from pathlib import Path

def chunk_markdown(text: str, max_tokens: int = 256) -> list[str]:
    """Split markdown into semantic chunks by ## headers."""
    # Lookahead split keeps each "## " header with its own section
    sections = re.split(r'\n(?=## )', text)
    chunks = []

    for section in sections:
        if len(section.split()) <= max_tokens:
            chunks.append(section.strip())
        else:
            # Split long sections by paragraphs
            paragraphs = section.split('\n\n')
            current_chunk = ""
            for para in paragraphs:
                # Join with a space so adjacent words aren't fused when counting
                if len((current_chunk + " " + para).split()) > max_tokens:
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                    current_chunk = para
                else:
                    current_chunk += "\n\n" + para
            if current_chunk:
                chunks.append(current_chunk.strip())

    # Drop fragments too short to be useful search results
    return [c for c in chunks if len(c.split()) > 20]


# Process all markdown files
docs_dir = Path("./docs")
all_chunks = []

for md_file in docs_dir.rglob("*.md"):
    text = md_file.read_text()
    chunks = chunk_markdown(text)
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "source": str(md_file),
            "title": md_file.stem.replace("-", " ").title(),
            "content": chunk,
            "chunk_index": i
        })

print(f"Created {len(all_chunks)} chunks from {len(list(docs_dir.rglob('*.md')))} files")

<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 180" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="180" rx="12" fill="#1a1a2e"/><rect x="30" y="60" width="80" height="50" rx="25" fill="#3b82f6" opacity="0.85"/><text x="70" y="90" text-anchor="middle" fill="#ffffff" font-size="11" font-family="system-ui">Prompt</text><rect x="145" y="50" width="90" height="70" rx="8" fill="#6366f1" opacity="0.85"/><text x="190" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Embed</text><text x="190" y="95" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">[0.2, 0.8...]</text><rect x="270" y="50" width="90" height="70" rx="8" fill="#a855f7" opacity="0.85"/><text x="315" y="75" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Vector</text><text x="315" y="90" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Search</text><text x="315" y="105" text-anchor="middle" fill="#ffffff" font-size="9" font-family="system-ui" opacity="0.7">top-k=5</text><rect x="395" y="50" width="90" height="70" rx="8" fill="#2dd4bf" opacity="0.85"/><text x="440" y="80" text-anchor="middle" fill="#1a1a2e" font-size="11" font-family="system-ui" font-weight="bold">LLM</text><text x="440" y="95" text-anchor="middle" fill="#1a1a2e" font-size="9" font-family="system-ui">+ context</text><rect x="520" y="60" width="55" height="50" rx="25" fill="#f59e0b" opacity="0.85"/><text x="547" y="90" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Reply</text><defs><marker id="arrow4" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto"><path d="M0,0 L8,3 L0,6" fill="#e2e8f0"/></marker></defs><line x1="112" y1="85" x2="143" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="237" y1="85" x2="268" y2="85" stroke="#e2e8f0" stroke-width="1.5" 
marker-end="url(#arrow4)"/><line x1="362" y1="85" x2="393" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><line x1="487" y1="85" x2="518" y2="85" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow4)"/><text x="300" y="155" text-anchor="middle" fill="#94a3b8" font-size="10" font-family="system-ui">Retrieval-Augmented Generation (RAG) Flow</text></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">RAG architecture: user prompts are embedded, matched against a vector store, then fed to an LLM with retrieved context.</p></div>

Step 3: Generate Embeddings

from sentence_transformers import SentenceTransformer
import psycopg2
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

# Batch embed all chunks
texts = [c["content"] for c in all_chunks]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

# Store in PostgreSQL. register_vector teaches psycopg2 to adapt
# numpy arrays to the vector column type; without it the INSERT fails.
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost/docs")
register_vector(conn)
cur = conn.cursor()

for chunk, embedding in zip(all_chunks, embeddings):
    cur.execute(
        """INSERT INTO doc_chunks (source_url, title, content, chunk_index, embedding)
           VALUES (%s, %s, %s, %s, %s)""",
        (chunk["source"], chunk["title"], chunk["content"],
         chunk["chunk_index"], embedding)
    )

conn.commit()
print(f"Indexed {len(all_chunks)} chunks")

Embedding 10,000 chunks takes about 30 seconds on a modern CPU. On a GPU, under 5 seconds.

Step 4: Build the Search API

from fastapi import FastAPI, Query
from sentence_transformers import SentenceTransformer
import psycopg2

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.get("/search")
def search(q: str = Query(..., min_length=3), limit: int = 5):
    # Embed the query; a bracketed string literal casts cleanly
    # to vector via %s::vector in the SQL below
    query_embedding = str(model.encode(q).tolist())

    conn = psycopg2.connect("postgresql://user:pass@localhost/docs")
    cur = conn.cursor()

    cur.execute("""
        SELECT title, content, source_url,
               1 - (embedding <=> %s::vector) AS similarity
        FROM doc_chunks
        WHERE 1 - (embedding <=> %s::vector) > 0.3
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, query_embedding, query_embedding, limit))

    results = []
    for title, content, source, similarity in cur.fetchall():
        results.append({
            "title": title,
            "content": content[:300],
            "source": source,
            "similarity": round(similarity, 3)
        })

    return {"query": q, "results": results}
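The `<=>` operator is pgvector's cosine distance, so `1 - distance` is cosine similarity — which is why the query filters on `1 - (embedding <=> ...) > 0.3`. The arithmetic behind that operator, sketched in pure Python for intuition:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """What pgvector's <=> computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 (identical direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 (orthogonal)
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 (opposite)
```

Distance ranges from 0 to 2, so the similarity threshold of 0.3 corresponds to keeping chunks with distance below 0.7.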

Step 5: Add AI-Powered Answers

Combine search results with an LLM to generate direct answers:

@app.get("/ask")
def ask(q: str = Query(..., min_length=3)):
    # Get relevant chunks
    search_results = search(q, limit=5)

    context = "\n\n".join([
        f"From {r['title']}:\n{r['content']}"
        for r in search_results["results"]
    ])

    # Generate answer using local Ollama
    import requests
    response = requests.post("http://localhost:11434/api/generate", json={
        "model": "llama3.1:8b",
        "prompt": f"Answer based ONLY on the context below. "
                  f"If the answer is not in the context, say so.\n\n"
                  f"Context:\n{context}\n\nQuestion: {q}",
        "stream": False
    })

    return {
        "question": q,
        "answer": response.json()["response"],
        "sources": search_results["results"]
    }

Keeping the Index Fresh

Set up a cron job or webhook to re-index when docs change:

# Re-index changed files only
def reindex_changed(since_hours=24):
    import time
    cutoff = time.time() - (since_hours * 3600)

    for md_file in docs_dir.rglob("*.md"):
        if md_file.stat().st_mtime > cutoff:
            # Delete old chunks
            cur.execute("DELETE FROM doc_chunks WHERE source_url = %s",
                       (str(md_file),))
            # Re-chunk and embed
            chunks = chunk_markdown(md_file.read_text())
            # ... insert new chunks

<div style="margin:2.5rem auto;max-width:600px;width:100%;text-align:center;"><svg viewBox="0 0 600 160" xmlns="http://www.w3.org/2000/svg" style="width:100%;height:auto;"><rect width="600" height="160" rx="12" fill="#1a1a2e"/><rect x="20" y="40" width="80" height="60" rx="6" fill="#3b82f6" opacity="0.85"/><text x="60" y="65" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Raw</text><text x="60" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Data</text><rect x="125" y="40" width="80" height="60" rx="6" fill="#6366f1" opacity="0.85"/><text x="165" y="65" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Pre-</text><text x="165" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">process</text><rect x="230" y="40" width="80" height="60" rx="6" fill="#a855f7" opacity="0.85"/><text x="270" y="65" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Train</text><text x="270" y="80" text-anchor="middle" fill="#ffffff" font-size="10" font-family="system-ui">Model</text><rect x="335" y="40" width="80" height="60" rx="6" fill="#2dd4bf" opacity="0.85"/><text x="375" y="65" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Evaluate</text><text x="375" y="80" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Metrics</text><rect x="440" y="40" width="80" height="60" rx="6" fill="#f59e0b" opacity="0.85"/><text x="480" y="65" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Deploy</text><text x="480" y="80" text-anchor="middle" fill="#1a1a2e" font-size="10" font-family="system-ui">Model</text><rect x="545" y="40" width="40" height="60" rx="6" fill="#6366f1" opacity="0.6"/><text x="565" y="75" text-anchor="middle" fill="#ffffff" font-size="9" font-family="system-ui">Mon</text><defs><marker id="arrow3" markerWidth="8" markerHeight="6" refX="8" refY="3" 
orient="auto"><path d="M0,0 L8,3 L0,6" fill="#e2e8f0"/></marker></defs><line x1="102" y1="70" x2="123" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="207" y1="70" x2="228" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="312" y1="70" x2="333" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="417" y1="70" x2="438" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><line x1="522" y1="70" x2="543" y2="70" stroke="#e2e8f0" stroke-width="1.5" marker-end="url(#arrow3)"/><path d="M375,102 L375,130 L270,130 L270,102" stroke="#f59e0b" stroke-width="1" stroke-dasharray="4,3" fill="none" marker-end="url(#arrow3b)"/><defs><marker id="arrow3b" markerWidth="8" markerHeight="6" refX="8" refY="3" orient="auto-start-reverse"><path d="M0,0 L8,3 L0,6" fill="#f59e0b"/></marker></defs><text x="322" y="143" text-anchor="middle" fill="#f59e0b" font-size="9" font-family="system-ui">retrain loop</text></svg><p style="margin-top:0.75rem;font-size:0.85rem;color:#94a3b8;font-style:italic;line-height:1.4;">ML pipeline: from raw data collection through training, evaluation, deployment, and continuous monitoring.</p></div>

Performance at Scale

Numbers from our deployments:

10,000 chunks: <50ms query time, 15MB index
100,000 chunks: <100ms query time, 150MB index
1,000,000 chunks: <200ms with HNSW index, 1.5GB index

pgvector handles this easily alongside your existing PostgreSQL workload. No separate vector database needed.
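If you move to an HNSW index at the larger end of that range, the swap is one statement (HNSW shipped in pgvector 0.5.0; the `m` and `ef_construction` values below are pgvector's defaults, shown explicitly):

```sql
-- Replace the ivfflat index with HNSW for large collections:
-- better recall at comparable latency, at the cost of slower builds
CREATE INDEX ON doc_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```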

This is one of the highest-impact AI features you can build. Users go from frustrated keyword searching to asking natural questions and getting accurate answers. At TechSaaS, semantic search is a standard component of our documentation platforms.

#semantic-search #embeddings #pgvector #rag #documentation #ai
