Building Semantic Search for Your Documentation with AI
Build an AI-powered semantic search engine for your docs using embeddings, pgvector, and a simple API. Find answers by meaning, not just keywords.
The Problem with Keyword Search
Your documentation is growing. Engineers add pages daily. But finding the right answer is still painful. Search for "how to restart the queue" and you get nothing — because the docs say "reset the job processor." Keyword search fails when users and authors use different words for the same concept.
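To see the failure concretely, here is a toy token-overlap scorer (the stopword list and wording are illustrative, not from any real search engine): once stopwords are discarded, the query and the doc share zero terms, so keyword search returns nothing.

```python
STOPWORDS = {"how", "to", "the", "a", "an"}

def keyword_overlap(query: str, doc: str) -> int:
    """Count content words shared between a query and a document."""
    q_terms = set(query.lower().split()) - STOPWORDS
    d_terms = set(doc.lower().split()) - STOPWORDS
    return len(q_terms & d_terms)

# The user's query and the doc's wording share no content words:
print(keyword_overlap("how to restart the queue",
                      "reset the job processor"))  # → 0
```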

Semantic search understands meaning. It knows that "restart the queue" and "reset the job processor" are about the same thing. At TechSaaS, we build semantic search into every documentation platform we deploy.
Architecture Overview
The pipeline has four stages:
- Ingest: Crawl docs, split into chunks, clean text
- Embed: Convert chunks to vectors using an embedding model
- Store: Save vectors in pgvector (PostgreSQL extension)
- Query: Embed the user's question, find nearest chunks, return results
User Query → Embed → pgvector Nearest Neighbor → Top-K Chunks → Display
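"Nearest" in that pipeline means highest cosine similarity between the query vector and each chunk vector. A minimal pure-Python version of the metric, for intuition (pgvector computes the same thing natively via its `<=>` cosine-distance operator):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1; unrelated directions score ~0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```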
Step 1: Set Up pgvector
If you already run PostgreSQL (and you should), just add the extension:
```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE doc_chunks (
    id SERIAL PRIMARY KEY,
    source_url TEXT NOT NULL,
    title TEXT,
    content TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(384),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create an index for fast similarity search
CREATE INDEX ON doc_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
We use 384 dimensions here to match the all-MiniLM-L6-v2 model, which is fast and free.
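One tuning knob worth knowing: an ivfflat index only scans a subset of its lists at query time, controlled by the `ivfflat.probes` setting (default 1). Raising it trades query speed for recall. It is a per-session setting; the value below is illustrative:

```sql
-- Check more lists per query for better recall (slower)
SET ivfflat.probes = 10;
```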
Step 2: Ingest and Chunk Documents
```python
import re
from pathlib import Path

def chunk_markdown(text: str, max_tokens: int = 256) -> list[str]:
    """Split markdown into semantic chunks by H2 headers.

    Word count is used as a rough proxy for token count.
    """
    sections = re.split(r'\n## ', text)
    chunks = []
    for section in sections:
        if len(section.split()) <= max_tokens:
            chunks.append(section.strip())
        else:
            # Split long sections by paragraphs
            paragraphs = section.split('\n\n')
            current_chunk = ""
            for para in paragraphs:
                if len((current_chunk + para).split()) > max_tokens:
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                    current_chunk = para
                else:
                    current_chunk += "\n\n" + para
            if current_chunk:
                chunks.append(current_chunk.strip())
    # Drop tiny fragments that carry little meaning
    return [c for c in chunks if len(c.split()) > 20]

# Process all markdown files
docs_dir = Path("./docs")
all_chunks = []
for md_file in docs_dir.rglob("*.md"):
    text = md_file.read_text()
    chunks = chunk_markdown(text)
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "source": str(md_file),
            "title": md_file.stem.replace("-", " ").title(),
            "content": chunk,
            "chunk_index": i,
        })

print(f"Created {len(all_chunks)} chunks from {len(list(docs_dir.rglob('*.md')))} files")
```
RAG architecture: user prompts are embedded, matched against a vector store, then fed to an LLM with retrieved context.
Step 3: Generate Embeddings
```python
from sentence_transformers import SentenceTransformer
import psycopg2

model = SentenceTransformer('all-MiniLM-L6-v2')

# Batch embed all chunks
texts = [c["content"] for c in all_chunks]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

# Store in PostgreSQL; pgvector parses the '[x, y, ...]' text format
conn = psycopg2.connect("postgresql://user:pass@localhost/docs")
cur = conn.cursor()
for chunk, embedding in zip(all_chunks, embeddings):
    cur.execute(
        """INSERT INTO doc_chunks (source_url, title, content, chunk_index, embedding)
           VALUES (%s, %s, %s, %s, %s)""",
        (chunk["source"], chunk["title"], chunk["content"],
         chunk["chunk_index"], str(embedding.tolist()))
    )
conn.commit()
print(f"Indexed {len(all_chunks)} chunks")
```
Embedding 10,000 chunks takes about 30 seconds on a modern CPU. On a GPU, under 5 seconds.
Step 4: Build the Search API
```python
from fastapi import FastAPI, Query
from sentence_transformers import SentenceTransformer
import psycopg2

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.get("/search")
def search(q: str = Query(..., min_length=3), limit: int = 5):
    # Embed the query; pgvector parses the '[x, y, ...]' text format
    query_embedding = str(model.encode(q).tolist())
    conn = psycopg2.connect("postgresql://user:pass@localhost/docs")
    cur = conn.cursor()
    cur.execute("""
        SELECT title, content, source_url,
               1 - (embedding <=> %s::vector) AS similarity
        FROM doc_chunks
        WHERE 1 - (embedding <=> %s::vector) > 0.3
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, query_embedding, query_embedding, limit))
    results = []
    for title, content, source, similarity in cur.fetchall():
        results.append({
            "title": title,
            "content": content[:300],
            "source": source,
            "similarity": round(similarity, 3),
        })
    conn.close()
    return {"query": q, "results": results}
```
Step 5: Add AI-Powered Answers
Combine search results with an LLM to generate direct answers:
@app.get("/ask")
def ask(q: str = Query(..., min_length=3)):
# Get relevant chunks
search_results = search(q, limit=5)
context = "\n\n".join([
f"From {r['title']}:\n{r['content']}"
for r in search_results["results"]
])
# Generate answer using local Ollama
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.1:8b",
"prompt": f"Answer based ONLY on the context below. "
f"If the answer is not in the context say so.\n\n"
f"Context:\n{context}\n\nQuestion: {q}",
"stream": False
})
return {
"question": q,
"answer": response.json()["response"],
"sources": search_results["results"]
}
Keeping the Index Fresh
Set up a cron job or webhook to re-index when docs change:
```python
import time

def reindex_changed(since_hours: int = 24):
    """Re-index only files modified in the last `since_hours` hours."""
    cutoff = time.time() - since_hours * 3600
    for md_file in docs_dir.rglob("*.md"):
        if md_file.stat().st_mtime <= cutoff:
            continue
        # Delete this file's old chunks
        cur.execute("DELETE FROM doc_chunks WHERE source_url = %s",
                    (str(md_file),))
        # Re-chunk, embed, and insert fresh chunks
        chunks = chunk_markdown(md_file.read_text())
        embeddings = model.encode(chunks)
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                """INSERT INTO doc_chunks (source_url, title, content, chunk_index, embedding)
                   VALUES (%s, %s, %s, %s, %s)""",
                (str(md_file), md_file.stem.replace("-", " ").title(),
                 chunk, i, str(emb.tolist()))
            )
    conn.commit()
```
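To run the re-index on a schedule, a crontab entry along these lines works (the path and module name are illustrative; adapt them to wherever your indexing code lives):

```shell
# Re-index hourly, picking up files changed in the last hour
0 * * * * cd /opt/docs-search && python -c "from indexer import reindex_changed; reindex_changed(since_hours=1)"
```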
Performance at Scale
Numbers from our deployments:
- 10,000 chunks: <50ms query time, 15MB index
- 100,000 chunks: <100ms query time, 150MB index
- 1,000,000 chunks: <200ms with HNSW index, 1.5GB index
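The HNSW index mentioned at the million-chunk scale is supported by pgvector 0.5.0+; it gives better speed/recall tradeoffs than ivfflat at the cost of slower builds and more memory. The parameters below are pgvector's defaults, shown explicitly:

```sql
-- Replace the ivfflat index with HNSW for large collections
CREATE INDEX ON doc_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```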
pgvector handles this easily alongside your existing PostgreSQL workload. No separate vector database needed.
This is one of the highest-impact AI features you can build. Users go from frustrated keyword searching to asking natural questions and getting accurate answers. At TechSaaS, semantic search is a standard component of our documentation platforms.