Building Semantic Search for Your Documentation with AI
Build an AI-powered semantic search engine for your docs using embeddings, pgvector, and a simple API. Find answers by meaning, not just keywords.
The Problem with Keyword Search
Your documentation is growing. Engineers add pages daily. But finding the right answer is still painful. Search for "how to restart the queue" and you get nothing — because the docs say "reset the job processor." Keyword search fails when users and authors use different words for the same concept.
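To see the failure concretely, here is a toy token-overlap scorer (the stopword list and wording are illustrative, not from any real search engine): once stopwords are discarded, the query and the doc share zero terms, so keyword search returns nothing.

```python
STOPWORDS = {"how", "to", "the", "a", "an"}

def keyword_overlap(query: str, doc: str) -> int:
    """Count content words shared between a query and a document."""
    q_terms = set(query.lower().split()) - STOPWORDS
    d_terms = set(doc.lower().split()) - STOPWORDS
    return len(q_terms & d_terms)

# The user's query and the doc's wording share no content words:
print(keyword_overlap("how to restart the queue",
                      "reset the job processor"))  # → 0
```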

Semantic search understands meaning. It knows that "restart the queue" and "reset the job processor" are about the same thing. At TechSaaS, we build semantic search into every documentation platform we deploy.
Architecture Overview
The pipeline has four stages:
- Ingest: Crawl docs, split into chunks, clean text
- Embed: Convert chunks to vectors using an embedding model
- Store: Save vectors in pgvector (PostgreSQL extension)
- Query: Embed the user's question, find nearest chunks, return results
User Query → Embed → pgvector Nearest Neighbor → Top-K Chunks → Display
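"Nearest" in that pipeline means highest cosine similarity between the query vector and each chunk vector. A minimal pure-Python version of the metric, for intuition (pgvector computes the same thing natively via its `<=>` cosine-distance operator):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1; unrelated directions score ~0
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```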
Step 1: Set Up pgvector
If you already run PostgreSQL (and you should), just add the extension:
```sql
-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create the documents table
CREATE TABLE doc_chunks (
    id SERIAL PRIMARY KEY,
    source_url TEXT NOT NULL,
    title TEXT,
    content TEXT NOT NULL,
    chunk_index INTEGER,
    embedding vector(384),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create an index for fast similarity search
CREATE INDEX ON doc_chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
```
We use 384 dimensions here to match the all-MiniLM-L6-v2 model, which is fast and free.
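One tuning knob worth knowing: an ivfflat index only scans a subset of its lists at query time, controlled by the `ivfflat.probes` setting (default 1). Raising it trades query speed for recall. It is a per-session setting; the value below is illustrative:

```sql
-- Check more lists per query for better recall (slower)
SET ivfflat.probes = 10;
```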
Step 2: Ingest and Chunk Documents
```python
import re
from pathlib import Path

def chunk_markdown(text: str, max_tokens: int = 256) -> list[str]:
    """Split markdown into semantic chunks by H2 headers.

    Word count is used as a rough proxy for token count.
    """
    sections = re.split(r'\n## ', text)
    chunks = []
    for section in sections:
        if len(section.split()) <= max_tokens:
            chunks.append(section.strip())
        else:
            # Split long sections by paragraphs
            paragraphs = section.split('\n\n')
            current_chunk = ""
            for para in paragraphs:
                if len((current_chunk + para).split()) > max_tokens:
                    if current_chunk:
                        chunks.append(current_chunk.strip())
                    current_chunk = para
                else:
                    current_chunk += "\n\n" + para
            if current_chunk:
                chunks.append(current_chunk.strip())
    # Drop tiny fragments that carry little meaning
    return [c for c in chunks if len(c.split()) > 20]

# Process all markdown files
docs_dir = Path("./docs")
all_chunks = []
for md_file in docs_dir.rglob("*.md"):
    text = md_file.read_text()
    chunks = chunk_markdown(text)
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "source": str(md_file),
            "title": md_file.stem.replace("-", " ").title(),
            "content": chunk,
            "chunk_index": i,
        })

print(f"Created {len(all_chunks)} chunks from {len(list(docs_dir.rglob('*.md')))} files")
```
RAG architecture: user prompts are embedded, matched against a vector store, then fed to an LLM with retrieved context.
Step 3: Generate Embeddings
```python
from sentence_transformers import SentenceTransformer
import psycopg2

model = SentenceTransformer('all-MiniLM-L6-v2')

# Batch embed all chunks
texts = [c["content"] for c in all_chunks]
embeddings = model.encode(texts, batch_size=64, show_progress_bar=True)

# Store in PostgreSQL; pgvector parses the '[x, y, ...]' text format
conn = psycopg2.connect("postgresql://user:pass@localhost/docs")
cur = conn.cursor()
for chunk, embedding in zip(all_chunks, embeddings):
    cur.execute(
        """INSERT INTO doc_chunks (source_url, title, content, chunk_index, embedding)
           VALUES (%s, %s, %s, %s, %s)""",
        (chunk["source"], chunk["title"], chunk["content"],
         chunk["chunk_index"], str(embedding.tolist()))
    )
conn.commit()
print(f"Indexed {len(all_chunks)} chunks")
```
Embedding 10,000 chunks takes about 30 seconds on a modern CPU. On a GPU, under 5 seconds.
Step 4: Build the Search API
```python
from fastapi import FastAPI, Query
from sentence_transformers import SentenceTransformer
import psycopg2

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.get("/search")
def search(q: str = Query(..., min_length=3), limit: int = 5):
    # Embed the query; pgvector parses the '[x, y, ...]' text format
    query_embedding = str(model.encode(q).tolist())
    conn = psycopg2.connect("postgresql://user:pass@localhost/docs")
    cur = conn.cursor()
    cur.execute("""
        SELECT title, content, source_url,
               1 - (embedding <=> %s::vector) AS similarity
        FROM doc_chunks
        WHERE 1 - (embedding <=> %s::vector) > 0.3
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, query_embedding, query_embedding, limit))
    results = []
    for title, content, source, similarity in cur.fetchall():
        results.append({
            "title": title,
            "content": content[:300],
            "source": source,
            "similarity": round(similarity, 3),
        })
    conn.close()
    return {"query": q, "results": results}
```
Step 5: Add AI-Powered Answers
Combine search results with an LLM to generate direct answers:
@app.get("/ask")
def ask(q: str = Query(..., min_length=3)):
# Get relevant chunks
search_results = search(q, limit=5)
context = "\n\n".join([
f"From {r['title']}:\n{r['content']}"
for r in search_results["results"]
])
# Generate answer using local Ollama
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3.1:8b",
"prompt": f"Answer based ONLY on the context below. "
f"If the answer is not in the context say so.\n\n"
f"Context:\n{context}\n\nQuestion: {q}",
"stream": False
})
return {
"question": q,
"answer": response.json()["response"],
"sources": search_results["results"]
}
Keeping the Index Fresh
Set up a cron job or webhook to re-index when docs change:
```python
import time

def reindex_changed(since_hours: int = 24):
    """Re-index only files modified in the last `since_hours` hours."""
    cutoff = time.time() - since_hours * 3600
    for md_file in docs_dir.rglob("*.md"):
        if md_file.stat().st_mtime <= cutoff:
            continue
        # Delete this file's old chunks
        cur.execute("DELETE FROM doc_chunks WHERE source_url = %s",
                    (str(md_file),))
        # Re-chunk, embed, and insert fresh chunks
        chunks = chunk_markdown(md_file.read_text())
        embeddings = model.encode(chunks)
        for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
            cur.execute(
                """INSERT INTO doc_chunks (source_url, title, content, chunk_index, embedding)
                   VALUES (%s, %s, %s, %s, %s)""",
                (str(md_file), md_file.stem.replace("-", " ").title(),
                 chunk, i, str(emb.tolist()))
            )
    conn.commit()
```
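To run the re-index on a schedule, a crontab entry along these lines works (the path and module name are illustrative; adapt them to wherever your indexing code lives):

```shell
# Re-index hourly, picking up files changed in the last hour
0 * * * * cd /opt/docs-search && python -c "from indexer import reindex_changed; reindex_changed(since_hours=1)"
```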
Performance at Scale
Numbers from our deployments:
- 10,000 chunks: <50ms query time, 15MB index
- 100,000 chunks: <100ms query time, 150MB index
- 1,000,000 chunks: <200ms with HNSW index, 1.5GB index
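The HNSW index mentioned at the million-chunk scale is supported by pgvector 0.5.0+; it gives better speed/recall tradeoffs than ivfflat at the cost of slower builds and more memory. The parameters below are pgvector's defaults, shown explicitly:

```sql
-- Replace the ivfflat index with HNSW for large collections
CREATE INDEX ON doc_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```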
pgvector handles this easily alongside your existing PostgreSQL workload. No separate vector database needed.
This is one of the highest-impact AI features you can build. Users go from frustrated keyword searching to asking natural questions and getting accurate answers. At TechSaaS, semantic search is a standard component of our documentation platforms.