How We Built AI Recruitment Matching for Skillety: Embeddings, Bias Handling, and Performance at Scale
A deep technical walkthrough of building Skillety's AI-powered candidate-job matching system. Covers embedding-based semantic scoring, bias mitigation...
The Problem with Keyword Matching in Recruitment
Traditional Applicant Tracking Systems match candidates to jobs using keyword overlap. A job description says "React" and a resume says "React" — it's a match. Simple, fast, and fundamentally broken.
Here's why: a senior frontend engineer who built design systems in Vue.js, led a team of 8, and architected a component library used by 200 developers is a better match for a "Senior React Developer" role than a junior developer who listed React in their skills section after completing a Udemy course. But keyword matching ranks them equally — or worse, ranks the junior higher because they literally typed "React" more times.
When Skillety approached us to build their AI matching engine, the brief was clear: match candidates to jobs based on what they can actually do, not what keywords appear in their resume. The system needed to handle 500,000+ candidate profiles, return results in under 100ms, and — critically — not discriminate based on gender, age, ethnicity, or educational pedigree.
This is how we built it.
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│ Skillety Matching Engine │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Ingestion │───▶│ Embedding │───▶│ Vector Store │ │
│ │ Pipeline │ │ Generation │ │ (PostgreSQL │ │
│ │ │ │ (multi- │ │ + pgvector) │ │
│ │ Resume │ │ field) │ │ │ │
│ │ Parser │ │ │ │ 500K+ vectors │ │
│ └──────────┘ └──────────────┘ └───────┬───────┘ │
│ │ │
│ ┌──────────┐ ┌──────────────┐ ┌───────▼───────┐ │
│ │ Job Desc │───▶│ Query │───▶│ Matching │ │
│ │ Parser │ │ Embedding │ │ + Scoring │ │
│ │ │ │ │ │ + Ranking │ │
│ └──────────┘ └──────────────┘ └───────┬───────┘ │
│ │ │
│ ┌──────────────────────────────────┐ ┌─────▼───────┐ │
│ │ Bias Detection Layer │ │ Results │ │
│ │ Demographic parity monitoring │ │ API │ │
│ │ Adversarial debiasing │ │ │ │
│ │ Fairness metrics dashboard │ │ │ │
│ └──────────────────────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
Step 1: Multi-Field Embedding Strategy
The naive approach is to embed the entire resume as a single vector and the entire job description as another, then compute cosine similarity. This works poorly because it collapses structured information into a single point in vector space.
A candidate who's a strong skills match but a weak experience-level match gets the same score as someone who's mediocre across the board. You lose the ability to explain why someone matched or didn't.
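A toy example makes the collapse concrete (the scores below are hypothetical, not from the production system): averaging field-level similarities into one number erases the difference between a spiky profile and a uniformly mediocre one.

```python
import numpy as np

# Hypothetical field-level similarities for two candidates (illustrative only)
candidate_a = {"skills": 0.95, "experience": 0.90, "seniority": 0.30}  # strong skills, junior
candidate_b = {"skills": 0.72, "experience": 0.72, "seniority": 0.71}  # uniformly average

avg_a = float(np.mean(list(candidate_a.values())))
avg_b = float(np.mean(list(candidate_b.values())))

# Collapsed into a single score, the two candidates look identical...
print(round(avg_a, 3), round(avg_b, 3))  # 0.717 0.717

# ...while the per-field breakdown shows exactly where each one differs.
```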
Instead, we decompose both candidates and jobs into structured fields and embed each independently:
```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class CandidateEmbeddings:
    """Multi-field embedding representation of a candidate."""
    candidate_id: str
    skills: np.ndarray      # Technical skills + tools
    experience: np.ndarray  # Work experience narrative
    domain: np.ndarray      # Industry/domain expertise
    seniority: np.ndarray   # Level indicators
    education: np.ndarray   # Educational background
    # Metadata (not embedded — used for filtering)
    years_exp: int
    location: str
    salary_range: Optional[tuple] = None


@dataclass
class JobEmbeddings:
    """Multi-field embedding representation of a job."""
    job_id: str
    skills_required: np.ndarray
    experience_needed: np.ndarray
    domain: np.ndarray
    seniority: np.ndarray
    education_req: np.ndarray
    # Metadata
    min_years: int
    max_years: int
    location: str
    remote_ok: bool
```
Generating Field Embeddings
We use a fine-tuned sentence transformer for generating embeddings. The base model is all-MiniLM-L6-v2 (384 dimensions), fine-tuned on 50K pairs of job descriptions and matching candidate profiles:
```python
from sentence_transformers import SentenceTransformer
import re


class RecruitmentEmbedder:
    def __init__(self, model_path: str = "models/skillety-match-v3"):
        self.model = SentenceTransformer(model_path)
        self.skill_extractor = SkillExtractor()  # NER-based

    def embed_candidate(self, resume: dict) -> CandidateEmbeddings:
        # Extract structured fields from parsed resume
        skills_text = self._format_skills(resume)
        experience_text = self._format_experience(resume)
        domain_text = self._extract_domain_signals(resume)
        seniority_text = self._infer_seniority(resume)
        education_text = self._format_education(resume)

        # Generate embeddings for each field
        texts = [skills_text, experience_text, domain_text,
                 seniority_text, education_text]
        embeddings = self.model.encode(texts, normalize_embeddings=True)

        return CandidateEmbeddings(
            candidate_id=resume['id'],
            skills=embeddings[0],
            experience=embeddings[1],
            domain=embeddings[2],
            seniority=embeddings[3],
            education=embeddings[4],
            years_exp=resume.get('years_experience', 0),
            location=resume.get('location', 'unknown'),
        )

    def _format_skills(self, resume: dict) -> str:
        """Extract and normalize technical skills."""
        raw_skills = resume.get('skills', [])
        # Normalize: "ReactJS" -> "React", "node" -> "Node.js"
        normalized = [self.skill_extractor.normalize(s) for s in raw_skills]
        # Add inferred skills from experience descriptions
        inferred = self.skill_extractor.extract_from_text(
            ' '.join(exp['description'] for exp in resume.get('experience', []))
        )
        all_skills = list(set(normalized + inferred))
        return f"Technical skills: {', '.join(all_skills)}"

    def _infer_seniority(self, resume: dict) -> str:
        """Infer seniority from signals beyond just years."""
        signals = []
        years = resume.get('years_experience', 0)
        titles = [exp.get('title', '') for exp in resume.get('experience', [])]

        # Title-based signals
        senior_keywords = ['senior', 'lead', 'principal', 'staff', 'architect',
                           'director', 'vp', 'head of', 'manager']
        for title in titles:
            if any(kw in title.lower() for kw in senior_keywords):
                signals.append(f"Held title: {title}")

        # Team leadership signals
        descriptions = ' '.join(
            exp.get('description', '') for exp in resume.get('experience', [])
        )
        if re.search(r'led.*team|managed.*engineers|mentored', descriptions, re.I):
            signals.append("Led or managed engineering teams")

        # Scope signals
        if re.search(r'architected|designed system|built from scratch', descriptions, re.I):
            signals.append("System architecture experience")

        signals.append(f"{years} years of professional experience")
        return f"Seniority indicators: {'; '.join(signals)}"
```
Step 2: Weighted Multi-Field Scoring
Matching isn't a single cosine similarity — it's a weighted combination of field-level similarities with configurable weights per job type:
```python
class MatchScorer:
    # Default weights — can be overridden per job or recruiter preference
    DEFAULT_WEIGHTS = {
        'skills': 0.35,
        'experience': 0.25,
        'domain': 0.20,
        'seniority': 0.15,
        'education': 0.05,
    }

    def __init__(self, weights: dict = None):
        self.weights = weights or self.DEFAULT_WEIGHTS

    def score(self, candidate: CandidateEmbeddings,
              job: JobEmbeddings) -> dict:
        """Compute weighted match score with field-level breakdown."""
        field_scores = {
            'skills': self._cosine_sim(candidate.skills, job.skills_required),
            'experience': self._cosine_sim(candidate.experience, job.experience_needed),
            'domain': self._cosine_sim(candidate.domain, job.domain),
            'seniority': self._cosine_sim(candidate.seniority, job.seniority),
            'education': self._cosine_sim(candidate.education, job.education_req),
        }

        # Weighted aggregate
        total = sum(
            field_scores[field] * self.weights[field]
            for field in field_scores
        )

        # Apply hard filters
        penalties = self._apply_penalties(candidate, job)
        adjusted_total = total * penalties['multiplier']

        return {
            'total_score': round(adjusted_total, 4),
            'raw_score': round(total, 4),
            'field_scores': {k: round(v, 4) for k, v in field_scores.items()},
            'penalties': penalties['reasons'],
            'explainability': self._explain(field_scores, penalties),
        }

    def _apply_penalties(self, candidate, job) -> dict:
        multiplier = 1.0
        reasons = []

        # Experience range penalty
        if candidate.years_exp < job.min_years:
            gap = job.min_years - candidate.years_exp
            penalty = max(0.5, 1.0 - (gap * 0.1))
            multiplier *= penalty
            reasons.append(f"Below min experience ({candidate.years_exp} < {job.min_years} years)")

        return {'multiplier': multiplier, 'reasons': reasons}

    def _explain(self, field_scores: dict, penalties: dict) -> str:
        """Generate human-readable match explanation."""
        strengths = [f for f, s in field_scores.items() if s > 0.75]
        gaps = [f for f, s in field_scores.items() if s < 0.4]

        parts = []
        if strengths:
            parts.append(f"Strong match in: {', '.join(strengths)}")
        if gaps:
            parts.append(f"Gaps in: {', '.join(gaps)}")
        if penalties['reasons']:
            parts.append(f"Considerations: {'; '.join(penalties['reasons'])}")
        return '. '.join(parts)

    @staticmethod
    def _cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        # Vectors are pre-normalized, so dot product = cosine similarity
        return float(np.dot(a, b))
```
Notice education gets only 5% weight by default. This is intentional — research consistently shows that educational pedigree is a poor predictor of job performance, and over-weighting it introduces socioeconomic bias.
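When a recruiter overrides the defaults, the weights need to be renormalized so total scores remain comparable across jobs. A minimal sketch of that merge step (the `merge_weights` helper is our own illustration, not necessarily Skillety's exact code):

```python
DEFAULT_WEIGHTS = {
    'skills': 0.35, 'experience': 0.25, 'domain': 0.20,
    'seniority': 0.15, 'education': 0.05,
}

def merge_weights(overrides: dict, base: dict = DEFAULT_WEIGHTS) -> dict:
    """Apply per-job weight overrides, then renormalize to sum to 1."""
    merged = {**base, **overrides}
    total = sum(merged.values())
    return {field: w / total for field, w in merged.items()}

# A recruiter who cares more about domain fit for a specialist role:
weights = merge_weights({'domain': 0.40})
assert abs(sum(weights.values()) - 1.0) < 1e-9  # still a proper distribution
```

Renormalizing keeps a raised weight from silently inflating every candidate's total score for that job.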
Step 3: Bias Detection and Mitigation
This is where recruitment AI gets dangerous if you're not careful. Historical hiring data is biased. If you train on it naively, you amplify those biases at scale.
The Bias Sources
- Training data bias: Historical hires over-represent certain demographics because of past human bias
- Proxy features: ZIP codes correlate with race. University names correlate with socioeconomic status. Graduation years correlate with age.
- Language bias: Gendered language in resumes ("aggressive" vs. "collaborative") can shift similarity scores
- Name bias: Embedding models trained on internet text absorb cultural associations with names
Our Mitigation Strategy
```python
class BiasAuditor:
    """Continuous bias monitoring for the matching pipeline."""

    # Fields that should NEVER influence matching scores
    PROTECTED_FIELDS = ['name', 'gender', 'age', 'date_of_birth',
                        'photo', 'marital_status', 'nationality']

    def __init__(self, db_connection):
        self.db = db_connection

    def pre_embedding_scrub(self, resume: dict) -> dict:
        """Remove protected information before embedding generation."""
        scrubbed = {k: v for k, v in resume.items()
                    if k not in self.PROTECTED_FIELDS}

        # Anonymize university tier signals
        if 'education' in scrubbed:
            for edu in scrubbed['education']:
                # Keep degree type and field, remove institution name
                # during embedding (institution is stored separately for display)
                edu['institution_for_embedding'] = edu.get('degree_type', '')

        # Remove age signals
        if 'experience' in scrubbed:
            for exp in scrubbed['experience']:
                # Remove specific years, keep duration
                exp.pop('start_year', None)
                exp.pop('end_year', None)

        return scrubbed

    def measure_demographic_parity(self, job_id: str,
                                   results: list,
                                   top_k: int = 50) -> dict:
        """Measure score distribution across demographic groups."""
        top_candidates = results[:top_k]
        all_candidates = results

        metrics = {}
        for attribute in ['gender', 'age_band', 'ethnicity_inferred']:
            group_scores = {}
            for candidate in all_candidates:
                group = self._get_demographic(candidate['candidate_id'], attribute)
                if group not in group_scores:
                    group_scores[group] = []
                group_scores[group].append(candidate['total_score'])

            # Demographic parity ratio
            group_means = {g: np.mean(scores) for g, scores in group_scores.items()}
            if group_means:
                max_mean = max(group_means.values())
                min_mean = min(group_means.values())
                parity_ratio = min_mean / max_mean if max_mean > 0 else 1.0
                metrics[attribute] = {
                    'parity_ratio': round(parity_ratio, 4),
                    'group_means': {g: round(m, 4) for g, m in group_means.items()},
                    'flag': parity_ratio < 0.8,  # 80% rule threshold
                }
        return metrics

    def adversarial_debiasing_check(self, embeddings: np.ndarray,
                                    labels: np.ndarray) -> float:
        """Can a classifier predict protected attributes from embeddings?

        If yes, the embeddings encode demographic information and need
        further debiasing.
        """
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        clf = LogisticRegression(max_iter=1000)
        scores = cross_val_score(clf, embeddings, labels, cv=5, scoring='accuracy')
        # If accuracy >> random chance, embeddings leak demographic info
        return float(np.mean(scores))
```
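A quick synthetic sanity check of the idea (toy data, not Skillety's): when a protected attribute is linearly recoverable from the embeddings, cross-validated accuracy sits well above chance; zero out the leaking dimension and it falls back toward the baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
emb = rng.normal(size=(400, 16))        # toy "embeddings"
attr = (emb[:, 0] > 0).astype(int)      # protected attribute leaked via dimension 0

clf = LogisticRegression(max_iter=1000)
leaky_acc = cross_val_score(clf, emb, attr, cv=5, scoring='accuracy').mean()

scrubbed = emb.copy()
scrubbed[:, 0] = 0.0                    # crude "debiasing": drop the leaking dimension
clean_acc = cross_val_score(clf, scrubbed, attr, cv=5, scoring='accuracy').mean()

print(leaky_acc, clean_acc)             # leaky well above 0.5; clean near chance
```

In practice the real check runs against actual self-reported demographics, and a persistent gap over chance triggers retraining rather than a one-off column drop.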
The 80% Rule
We implement the "four-fifths rule" from US employment law: the selection rate for any protected group should be at least 80% of the rate for the most-selected group. If our system recommends 40% of male candidates but only 25% of female candidates for a role, that's a 62.5% ratio — a red flag that triggers automatic review.
```python
# Another BiasAuditor method:
def check_four_fifths_rule(self, results_by_group: dict) -> dict:
    """EEOC four-fifths rule compliance check."""
    selection_rates = {}
    for group, candidates in results_by_group.items():
        total = len(candidates)
        selected = len([c for c in candidates if c['total_score'] > 0.7])
        selection_rates[group] = selected / total if total > 0 else 0

    max_rate = max(selection_rates.values()) if selection_rates else 0

    compliance = {}
    for group, rate in selection_rates.items():
        ratio = rate / max_rate if max_rate > 0 else 1.0
        compliance[group] = {
            'selection_rate': round(rate, 4),
            'adverse_impact_ratio': round(ratio, 4),
            'compliant': ratio >= 0.8,
        }
    return compliance
```
Every matching query logs its bias metrics. A weekly automated report flags any job or recruiter account where the four-fifths rule is violated.
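As a standalone illustration of the arithmetic (using the hypothetical 40%/25% selection rates mentioned above, not production data):

```python
def adverse_impact_ratios(selection_rates: dict) -> dict:
    """Four-fifths rule: each group's selection rate vs. the highest group's rate."""
    max_rate = max(selection_rates.values())
    return {group: rate / max_rate for group, rate in selection_rates.items()}

ratios = adverse_impact_ratios({'male': 0.40, 'female': 0.25})
print(ratios)  # {'male': 1.0, 'female': 0.625} -> 0.625 < 0.8, flag for review
```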
Step 4: Vector Storage and Search
With 500K+ candidates, we need fast approximate nearest neighbor (ANN) search. We use PostgreSQL with pgvector — one less database to operate.
```sql
-- Candidate embeddings table
CREATE TABLE candidate_embeddings (
    candidate_id UUID PRIMARY KEY REFERENCES candidates(id),
    skills_vec vector(384) NOT NULL,
    experience_vec vector(384) NOT NULL,
    domain_vec vector(384) NOT NULL,
    seniority_vec vector(384) NOT NULL,
    education_vec vector(384) NOT NULL,
    years_exp INTEGER,
    location TEXT,
    updated_at TIMESTAMPTZ DEFAULT now()
);

-- HNSW indexes for each field
CREATE INDEX idx_skills_hnsw ON candidate_embeddings
    USING hnsw (skills_vec vector_cosine_ops) WITH (m = 24, ef_construction = 200);
CREATE INDEX idx_experience_hnsw ON candidate_embeddings
    USING hnsw (experience_vec vector_cosine_ops) WITH (m = 16, ef_construction = 200);
CREATE INDEX idx_domain_hnsw ON candidate_embeddings
    USING hnsw (domain_vec vector_cosine_ops) WITH (m = 16, ef_construction = 200);
```
The Two-Phase Search
We use a two-phase approach: fast vector recall followed by precise re-scoring:
```python
async def match_candidates(self, job: JobEmbeddings,
                           limit: int = 50) -> list:
    """Two-phase matching: fast recall + precise scoring."""
    # Phase 1: Recall — get top 500 by skills similarity (fast, approximate)
    recall_candidates = await self.db.fetch("""
        SELECT candidate_id, skills_vec, experience_vec,
               domain_vec, seniority_vec, education_vec,
               years_exp, location
        FROM candidate_embeddings
        WHERE years_exp >= $1
        ORDER BY skills_vec <=> $2
        LIMIT 500
    """, job.min_years - 1, job.skills_required.tolist())

    # Phase 2: Score — precise multi-field scoring on recall set
    scored = []
    for row in recall_candidates:
        candidate = CandidateEmbeddings(
            candidate_id=str(row['candidate_id']),
            skills=np.array(row['skills_vec']),
            experience=np.array(row['experience_vec']),
            domain=np.array(row['domain_vec']),
            seniority=np.array(row['seniority_vec']),
            education=np.array(row['education_vec']),
            years_exp=row['years_exp'],
            location=row['location'],
        )
        score = self.scorer.score(candidate, job)
        scored.append({**score, 'candidate_id': candidate.candidate_id})

    # Sort by total score and return top N
    scored.sort(key=lambda x: x['total_score'], reverse=True)
    return scored[:limit]
```
Phase 1 uses pgvector's HNSW index for sub-10ms approximate search on 500K vectors. Phase 2 does precise multi-field scoring on the top 500 candidates. Total latency: 60-90ms.
Step 5: Performance at Scale
Benchmarks
| Metric | Value |
|---|---|
| Candidate pool | 523,000 profiles |
| Embedding dimensions | 384 per field (5 fields) |
| Index type | HNSW (m=24, ef_construction=200) |
| Phase 1 (recall) latency | 8-12ms |
| Phase 2 (scoring) latency | 45-70ms |
| Total end-to-end | 60-90ms (p95) |
| Index memory | ~3.2 GB for all 5 field indexes |
| Embedding generation | 150ms per candidate (batch: 50ms/candidate) |
Optimization Techniques
Batch embedding generation: New candidates are embedded in batches of 256, leveraging GPU parallelism. A queue worker processes new signups every 30 seconds.
Pre-filtered search: The WHERE years_exp >= $1 filter runs before the vector search, reducing the candidate pool by 30-60% depending on the seniority requirement.
Connection pooling: PgBouncer in transaction mode with 20 connections handles 200+ concurrent matching requests.
Embedding caching: Job description embeddings are cached in Redis for 24 hours. The same job matched against different candidate pools doesn't re-embed.
Incremental index updates: New candidate embeddings are inserted without rebuilding the HNSW index. PostgreSQL handles this natively.
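The job-embedding cache is keyed on the job description text itself, so editing the description naturally invalidates the cached vector. A minimal sketch with an in-memory dict standing in for Redis (the key scheme and TTL handling here are illustrative, not the production code):

```python
import hashlib
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, matching the production setup
_cache: dict = {}                 # stand-in for Redis

def job_embedding_cache_key(job_description: str) -> str:
    """Deterministic key: the same text always maps to the same cache entry."""
    digest = hashlib.sha256(job_description.encode('utf-8')).hexdigest()
    return f"job_emb:{digest[:16]}"

def get_or_embed(job_description: str, embed_fn) -> list:
    """Return the cached embedding if fresh; otherwise embed and store."""
    key = job_embedding_cache_key(job_description)
    hit = _cache.get(key)
    if hit is not None and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                   # cache hit: skip re-embedding
    vec = embed_fn(job_description)     # cache miss: compute and store
    _cache[key] = (time.time(), vec)
    return vec
```

In production the dict becomes Redis calls with the same key scheme, letting the TTL handle expiry server-side.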
Results
After deploying the matching engine:
- Recruiter satisfaction: 73% of recruiters rated AI matches as "good" or "excellent" (up from 31% with keyword matching)
- Time-to-shortlist: Reduced from 4.2 hours to 12 minutes average
- Diversity impact: Female candidate representation in shortlists increased 18% without any explicit demographic targeting — a natural result of deweighting proxy features
- False positive rate: 15% of AI-recommended candidates were rejected at phone screen (vs. 42% with keyword matching)
Lessons Learned
Multi-field beats single-vector: Decomposing into structured fields gave us both better accuracy and explainability. Recruiters trust recommendations they can understand.
Bias is a continuous problem: You don't solve bias once and move on. We run weekly audits, retrain the bias classifier monthly, and have a human review process for flagged results.
Explainability is non-negotiable: "This candidate scored 0.87" means nothing to a recruiter. "Strong skills match (92%), relevant domain experience (88%), but slightly below the seniority level you specified (65%)" — that's actionable.
PostgreSQL + pgvector was the right call: We considered Pinecone and Milvus. Keeping everything in PostgreSQL meant one backup strategy, one monitoring setup, and transactional consistency between candidate profiles and their embeddings.
Education weight should be near-zero by default: Every time we increased education weight in A/B tests, diversity metrics worsened with no improvement in hire quality. The 5% default is there for compliance, not signal.
The Bottom Line
Recruitment AI is one of the highest-stakes applications of machine learning. Get it right, and you help companies find talent they would have overlooked. Get it wrong, and you automate discrimination at scale.
The technical challenge isn't embedding generation or vector search — those are solved problems. The real challenge is building systems that are simultaneously accurate, fair, explainable, and fast. Multi-field embeddings with weighted scoring, continuous bias auditing, and transparent match explanations are how we approached it for Skillety.
The system is live, processing thousands of matches daily. But we treat every deployment as a hypothesis to be tested, every bias audit as a chance to improve, and every recruiter's feedback as training data for the next iteration.