
How We Built AI Recruitment Matching for Skillety: Embeddings, Bias Handling, and Performance at Scale

A deep technical walkthrough of building Skillety's AI-powered candidate-job matching system. Covers embedding-based semantic scoring, bias mitigation...

TechSaaS Team
13 min read

The Problem with Keyword Matching in Recruitment

Traditional Applicant Tracking Systems match candidates to jobs using keyword overlap. A job description says "React" and a resume says "React" — it's a match. Simple, fast, and fundamentally broken.


Here's why: a senior frontend engineer who built design systems in Vue.js, led a team of 8, and architected a component library used by 200 developers is a better match for a "Senior React Developer" role than a junior developer who listed React in their skills section after completing a Udemy course. But keyword matching ranks them equally — or worse, ranks the junior higher because they literally typed "React" more times.

When Skillety approached us to build their AI matching engine, the brief was clear: match candidates to jobs based on what they can actually do, not what keywords appear in their resume. The system needed to handle 500,000+ candidate profiles, return results in under 100ms, and — critically — not discriminate based on gender, age, ethnicity, or educational pedigree.

This is how we built it.

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                    Skillety Matching Engine               │
│                                                           │
│  ┌──────────┐    ┌──────────────┐    ┌───────────────┐  │
│  │ Ingestion │───▶│  Embedding   │───▶│ Vector Store  │  │
│  │ Pipeline  │    │  Generation  │    │ (PostgreSQL   │  │
│  │           │    │  (multi-     │    │  + pgvector)  │  │
│  │ Resume    │    │   field)     │    │               │  │
│  │ Parser    │    │              │    │ 500K+ vectors │  │
│  └──────────┘    └──────────────┘    └───────┬───────┘  │
│                                               │          │
│  ┌──────────┐    ┌──────────────┐    ┌───────▼───────┐  │
│  │ Job Desc  │───▶│  Query       │───▶│   Matching    │  │
│  │ Parser    │    │  Embedding   │    │   + Scoring   │  │
│  │           │    │              │    │   + Ranking   │  │
│  └──────────┘    └──────────────┘    └───────┬───────┘  │
│                                               │          │
│  ┌──────────────────────────────────┐  ┌─────▼───────┐  │
│  │       Bias Detection Layer       │  │   Results    │  │
│  │  Demographic parity monitoring   │  │   API        │  │
│  │  Adversarial debiasing           │  │             │  │
│  │  Fairness metrics dashboard      │  │             │  │
│  └──────────────────────────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────────┘

Step 1: Multi-Field Embedding Strategy

The naive approach is to embed the entire resume as a single vector and the entire job description as another, then compute cosine similarity. This works poorly because it collapses structured information into a single point in vector space.

A candidate who's a strong skills match but a weak experience-level match gets the same score as someone who's mediocre across the board. You lose the ability to explain why someone matched or didn't.

Instead, we decompose both candidates and jobs into structured fields and embed each independently:

from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class CandidateEmbeddings:
    """Multi-field embedding representation of a candidate."""
    candidate_id: str
    skills: np.ndarray          # Technical skills + tools
    experience: np.ndarray      # Work experience narrative
    domain: np.ndarray          # Industry/domain expertise
    seniority: np.ndarray       # Level indicators
    education: np.ndarray       # Educational background
    # Metadata (not embedded — used for filtering)
    years_exp: int
    location: str
    salary_range: Optional[tuple] = None

@dataclass
class JobEmbeddings:
    """Multi-field embedding representation of a job."""
    job_id: str
    skills_required: np.ndarray
    experience_needed: np.ndarray
    domain: np.ndarray
    seniority: np.ndarray
    education_req: np.ndarray
    # Metadata
    min_years: int
    max_years: int
    location: str
    remote_ok: bool

Generating Field Embeddings

We use a fine-tuned sentence transformer for generating embeddings. The base model is all-MiniLM-L6-v2 (384 dimensions), fine-tuned on 50K pairs of job descriptions and matching candidate profiles:

from sentence_transformers import SentenceTransformer
import re

class RecruitmentEmbedder:
    def __init__(self, model_path: str = "models/skillety-match-v3"):
        self.model = SentenceTransformer(model_path)
        self.skill_extractor = SkillExtractor()  # NER-based

    def embed_candidate(self, resume: dict) -> CandidateEmbeddings:
        # Extract structured fields from parsed resume
        skills_text = self._format_skills(resume)
        experience_text = self._format_experience(resume)
        domain_text = self._extract_domain_signals(resume)
        seniority_text = self._infer_seniority(resume)
        education_text = self._format_education(resume)

        # Generate embeddings for each field
        texts = [skills_text, experience_text, domain_text,
                 seniority_text, education_text]
        embeddings = self.model.encode(texts, normalize_embeddings=True)

        return CandidateEmbeddings(
            candidate_id=resume['id'],
            skills=embeddings[0],
            experience=embeddings[1],
            domain=embeddings[2],
            seniority=embeddings[3],
            education=embeddings[4],
            years_exp=resume.get('years_experience', 0),
            location=resume.get('location', 'unknown'),
        )

    def _format_skills(self, resume: dict) -> str:
        """Extract and normalize technical skills."""
        raw_skills = resume.get('skills', [])
        # Normalize: "ReactJS" -> "React", "node" -> "Node.js"
        normalized = [self.skill_extractor.normalize(s) for s in raw_skills]
        # Add inferred skills from experience descriptions
        inferred = self.skill_extractor.extract_from_text(
            ' '.join(exp['description'] for exp in resume.get('experience', []))
        )
        all_skills = list(set(normalized + inferred))
        return f"Technical skills: {', '.join(all_skills)}"

    def _infer_seniority(self, resume: dict) -> str:
        """Infer seniority from signals beyond just years."""
        signals = []
        years = resume.get('years_experience', 0)
        titles = [exp.get('title', '') for exp in resume.get('experience', [])]

        # Title-based signals
        senior_keywords = ['senior', 'lead', 'principal', 'staff', 'architect',
                          'director', 'vp', 'head of', 'manager']
        for title in titles:
            if any(kw in title.lower() for kw in senior_keywords):
                signals.append(f"Held title: {title}")

        # Team leadership signals
        descriptions = ' '.join(
            exp.get('description', '') for exp in resume.get('experience', [])
        )
        if re.search(r'led.*team|managed.*engineers|mentored', descriptions, re.I):
            signals.append("Led or managed engineering teams")

        # Scope signals
        if re.search(r'architected|designed system|built from scratch', descriptions, re.I):
            signals.append("System architecture experience")

        signals.append(f"{years} years of professional experience")
        return f"Seniority indicators: {'; '.join(signals)}"


Step 2: Weighted Multi-Field Scoring

Matching isn't a single cosine similarity — it's a weighted combination of field-level similarities with configurable weights per job type:

class MatchScorer:
    # Default weights — can be overridden per job or recruiter preference
    DEFAULT_WEIGHTS = {
        'skills': 0.35,
        'experience': 0.25,
        'domain': 0.20,
        'seniority': 0.15,
        'education': 0.05,
    }

    def __init__(self, weights: dict | None = None):
        self.weights = weights or self.DEFAULT_WEIGHTS

    def score(self, candidate: CandidateEmbeddings,
              job: JobEmbeddings) -> dict:
        """Compute weighted match score with field-level breakdown."""
        field_scores = {
            'skills': self._cosine_sim(candidate.skills, job.skills_required),
            'experience': self._cosine_sim(candidate.experience, job.experience_needed),
            'domain': self._cosine_sim(candidate.domain, job.domain),
            'seniority': self._cosine_sim(candidate.seniority, job.seniority),
            'education': self._cosine_sim(candidate.education, job.education_req),
        }

        # Weighted aggregate
        total = sum(
            field_scores[field] * self.weights[field]
            for field in field_scores
        )

        # Apply hard filters
        penalties = self._apply_penalties(candidate, job)
        adjusted_total = total * penalties['multiplier']

        return {
            'total_score': round(adjusted_total, 4),
            'raw_score': round(total, 4),
            'field_scores': {k: round(v, 4) for k, v in field_scores.items()},
            'penalties': penalties['reasons'],
            'explainability': self._explain(field_scores, penalties),
        }

    def _apply_penalties(self, candidate, job) -> dict:
        multiplier = 1.0
        reasons = []

        # Experience range penalty
        if candidate.years_exp < job.min_years:
            gap = job.min_years - candidate.years_exp
            penalty = max(0.5, 1.0 - (gap * 0.1))
            multiplier *= penalty
            reasons.append(f"Below min experience ({candidate.years_exp} < {job.min_years} years)")

        return {'multiplier': multiplier, 'reasons': reasons}

    def _explain(self, field_scores: dict, penalties: dict) -> str:
        """Generate human-readable match explanation."""
        strengths = [f for f, s in field_scores.items() if s > 0.75]
        gaps = [f for f, s in field_scores.items() if s < 0.4]

        parts = []
        if strengths:
            parts.append(f"Strong match in: {', '.join(strengths)}")
        if gaps:
            parts.append(f"Gaps in: {', '.join(gaps)}")
        if penalties['reasons']:
            parts.append(f"Considerations: {'; '.join(penalties['reasons'])}")
        return '. '.join(parts)

    @staticmethod
    def _cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
        # Vectors are pre-normalized, so dot product = cosine similarity
        return float(np.dot(a, b))
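
The `_cosine_sim` shortcut works because `encode(..., normalize_embeddings=True)` returns unit vectors, and for unit vectors the dot product equals cosine similarity. A quick numerical check:

```python
import numpy as np

# For unit-length vectors, dot product and cosine similarity coincide.
rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

full_cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

assert np.isclose(np.dot(a_unit, b_unit), full_cosine)
```

Skipping the two norm computations per field pair matters when Phase 2 rescores hundreds of candidates per request.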

Notice education gets only 5% weight by default. This is intentional — research consistently shows that educational pedigree is a poor predictor of job performance, and over-weighting it introduces socioeconomic bias.

Step 3: Bias Detection and Mitigation

This is where recruitment AI gets dangerous if you're not careful. Historical hiring data is biased. If you train on it naively, you amplify those biases at scale.

The Bias Sources

  1. Training data bias: Historical hires over-represent certain demographics because of past human bias
  2. Proxy features: ZIP codes correlate with race. University names correlate with socioeconomic status. Graduation years correlate with age.
  3. Language bias: Gendered language in resumes ("aggressive" vs. "collaborative") can shift similarity scores
  4. Name bias: Embedding models trained on internet text absorb cultural associations with names
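
Some of these proxies can be attacked mechanically before any text reaches the embedding model. A deliberately simplified, illustrative scrubber for two of them — ZIP codes and graduation years (the production scrubber handles more cases):

```python
import re

# Hypothetical patterns for two common proxy features.
ZIP_RE = re.compile(r'\b\d{5}(?:-\d{4})?\b')
GRAD_YEAR_RE = re.compile(r'\b(?:class of|graduated)\s+(?:19|20)\d{2}\b', re.I)

def scrub_proxies(text: str) -> str:
    """Replace socioeconomic/age proxies with neutral placeholders."""
    text = ZIP_RE.sub('[zip removed]', text)
    text = GRAD_YEAR_RE.sub('[year removed]', text)
    return text

scrub_proxies("BS in CS, graduated 2004. Based in 94107.")
```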

Our Mitigation Strategy

class BiasAuditor:
    """Continuous bias monitoring for the matching pipeline."""

    # Fields that should NEVER influence matching scores
    PROTECTED_FIELDS = ['name', 'gender', 'age', 'date_of_birth',
                        'photo', 'marital_status', 'nationality']

    def __init__(self, db_connection):
        self.db = db_connection

    def pre_embedding_scrub(self, resume: dict) -> dict:
        """Remove protected information before embedding generation."""
        scrubbed = {k: v for k, v in resume.items()
                    if k not in self.PROTECTED_FIELDS}

        # Anonymize university tier signals
        if 'education' in scrubbed:
            for edu in scrubbed['education']:
                # Keep degree type and field, remove institution name
                # during embedding (institution is stored separately for display)
                edu['institution_for_embedding'] = edu.get('degree_type', '')

        # Remove age signals
        if 'experience' in scrubbed:
            for exp in scrubbed['experience']:
                # Remove specific years, keep duration
                exp.pop('start_year', None)
                exp.pop('end_year', None)

        return scrubbed

    def measure_demographic_parity(self, job_id: str,
                                     results: list,
                                     top_k: int = 50) -> dict:
        """Measure score distribution across demographic groups."""
        top_candidates = results[:top_k]
        all_candidates = results

        metrics = {}
        for attribute in ['gender', 'age_band', 'ethnicity_inferred']:
            group_scores = {}
            for candidate in all_candidates:
                group = self._get_demographic(candidate['candidate_id'], attribute)
                if group not in group_scores:
                    group_scores[group] = []
                group_scores[group].append(candidate['total_score'])

            # Demographic parity ratio
            group_means = {g: np.mean(scores) for g, scores in group_scores.items()}
            if group_means:
                max_mean = max(group_means.values())
                min_mean = min(group_means.values())
                parity_ratio = min_mean / max_mean if max_mean > 0 else 1.0

                metrics[attribute] = {
                    'parity_ratio': round(parity_ratio, 4),
                    'group_means': {g: round(m, 4) for g, m in group_means.items()},
                    'flag': parity_ratio < 0.8,  # 80% rule threshold
                }

        return metrics

    def adversarial_debiasing_check(self, embeddings: np.ndarray,
                                      labels: np.ndarray) -> float:
        """Can a classifier predict protected attributes from embeddings?

        If yes, the embeddings encode demographic information and need
        further debiasing.
        """
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        clf = LogisticRegression(max_iter=1000)
        scores = cross_val_score(clf, embeddings, labels, cv=5, scoring='accuracy')
        # If accuracy >> random chance, embeddings leak demographic info
        return float(np.mean(scores))
PromptEmbed[0.2, 0.8...]VectorSearchtop-k=5LLM+ contextReplyRetrieval-Augmented Generation (RAG) Flow

RAG architecture: user prompts are embedded, matched against a vector store, then fed to an LLM with retrieved context.

The 80% Rule

We implement the "four-fifths rule" from US employment law: the selection rate for any protected group should be at least 80% of the rate for the most-selected group. If our system recommends 40% of male candidates but only 25% of female candidates for a role, that's a 62.5% ratio — a red flag that triggers automatic review.

def check_four_fifths_rule(self, results_by_group: dict) -> dict:
    """EEOC four-fifths rule compliance check (a BiasAuditor method)."""
    selection_rates = {}
    for group, candidates in results_by_group.items():
        total = len(candidates)
        selected = len([c for c in candidates if c['total_score'] > 0.7])
        selection_rates[group] = selected / total if total > 0 else 0

    max_rate = max(selection_rates.values()) if selection_rates else 0
    compliance = {}
    for group, rate in selection_rates.items():
        ratio = rate / max_rate if max_rate > 0 else 1.0
        compliance[group] = {
            'selection_rate': round(rate, 4),
            'adverse_impact_ratio': round(ratio, 4),
            'compliant': ratio >= 0.8,
        }
    return compliance

Every matching query logs its bias metrics. A weekly automated report flags any job or recruiter account where the four-fifths rule is violated.
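
The weekly report can be a thin layer over the per-query compliance output. A hypothetical sketch — `weekly_flags` and the audit record shape are illustrative, not the production job:

```python
def weekly_flags(audits: list) -> list:
    """Surface jobs where any group failed the four-fifths check.

    Each audit record is assumed to look like:
    {'job_id': ..., 'compliance': {group: {'compliant': bool, ...}}}
    """
    flagged = []
    for audit in audits:
        bad = [g for g, c in audit['compliance'].items() if not c['compliant']]
        if bad:
            flagged.append(
                f"{audit['job_id']}: adverse impact on {', '.join(sorted(bad))}"
            )
    return flagged
```

Anything this returns goes to human review before the matching weights or the job are touched.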

Step 4: Vector Storage and Search

With 500K+ candidates, we need fast approximate nearest neighbor (ANN) search. We use PostgreSQL with pgvector — one less database to operate.

-- Candidate embeddings table
CREATE TABLE candidate_embeddings (
    candidate_id UUID PRIMARY KEY REFERENCES candidates(id),
    skills_vec vector(384) NOT NULL,
    experience_vec vector(384) NOT NULL,
    domain_vec vector(384) NOT NULL,
    seniority_vec vector(384) NOT NULL,
    education_vec vector(384) NOT NULL,
    years_exp INTEGER,
    location TEXT,
    updated_at TIMESTAMPTZ DEFAULT now()
);

-- HNSW indexes for each field
CREATE INDEX idx_skills_hnsw ON candidate_embeddings
    USING hnsw (skills_vec vector_cosine_ops) WITH (m = 24, ef_construction = 200);
CREATE INDEX idx_experience_hnsw ON candidate_embeddings
    USING hnsw (experience_vec vector_cosine_ops) WITH (m = 16, ef_construction = 200);
CREATE INDEX idx_domain_hnsw ON candidate_embeddings
    USING hnsw (domain_vec vector_cosine_ops) WITH (m = 16, ef_construction = 200);

The Two-Phase Search

We use a two-phase approach: fast vector recall followed by precise re-scoring:

async def match_candidates(self, job: JobEmbeddings,
                           limit: int = 50) -> list:
    """Two-phase matching: fast recall + precise scoring."""

    # Phase 1: Recall — get top 500 by skills similarity (fast, approximate)
    recall_candidates = await self.db.fetch("""
        SELECT candidate_id, skills_vec, experience_vec,
               domain_vec, seniority_vec, education_vec,
               years_exp, location
        FROM candidate_embeddings
        WHERE years_exp >= $1
        ORDER BY skills_vec <=> $2
        LIMIT 500
    """, job.min_years - 1, job.skills_required.tolist())

    # Phase 2: Score — precise multi-field scoring on recall set
    scored = []
    for row in recall_candidates:
        candidate = CandidateEmbeddings(
            candidate_id=str(row['candidate_id']),
            skills=np.array(row['skills_vec']),
            experience=np.array(row['experience_vec']),
            domain=np.array(row['domain_vec']),
            seniority=np.array(row['seniority_vec']),
            education=np.array(row['education_vec']),
            years_exp=row['years_exp'],
            location=row['location'],
        )
        score = self.scorer.score(candidate, job)
        scored.append({**score, 'candidate_id': candidate.candidate_id})

    # Sort by total score and return top N
    scored.sort(key=lambda x: x['total_score'], reverse=True)
    return scored[:limit]

Phase 1 uses pgvector's HNSW index for sub-10ms approximate search on 500K vectors. Phase 2 does precise multi-field scoring on the top 500 candidates. Total latency: 60-90ms.
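
Stripped of the database, the two-phase shape is easy to sanity-check in plain NumPy (illustrative only — the production path goes through pgvector, and the 0.6/0.4 weights here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n, dim = 10_000, 384

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pre-normalized candidate vectors for two fields.
skills = unit(rng.normal(size=(n, dim)))
experience = unit(rng.normal(size=(n, dim)))
job_skills = unit(rng.normal(size=dim))
job_exp = unit(rng.normal(size=dim))

# Phase 1: recall the top 500 by skills similarity alone (cheap).
skills_sim = skills @ job_skills
recall_idx = np.argpartition(-skills_sim, 500)[:500]

# Phase 2: weighted multi-field rescoring, but only on the recall set.
total = 0.6 * skills_sim[recall_idx] + 0.4 * (experience[recall_idx] @ job_exp)
top = recall_idx[np.argsort(-total)][:50]
```

The key property: Phase 2's cost scales with the recall size (500), not the candidate pool (500K), which is what keeps the p95 under 100ms.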

Step 5: Performance at Scale

Benchmarks

Metric                        Value
Candidate pool                523,000 profiles
Embedding dimensions          384 per field (5 fields)
Index type                    HNSW (m=24, ef_construction=200)
Phase 1 (recall) latency      8-12ms
Phase 2 (scoring) latency     45-70ms
Total end-to-end              60-90ms (p95)
Index memory                  ~3.2 GB for all 5 field indexes
Embedding generation          150ms per candidate (batch: 50ms/candidate)

Optimization Techniques

  1. Batch embedding generation: New candidates are embedded in batches of 256, leveraging GPU parallelism. A queue worker processes new signups every 30 seconds.

  2. Pre-filtered search: The WHERE years_exp >= $1 filter runs before the vector search, reducing the candidate pool by 30-60% depending on the seniority requirement.

  3. Connection pooling: PgBouncer in transaction mode with 20 connections handles 200+ concurrent matching requests.

  4. Embedding caching: Job description embeddings are cached in Redis for 24 hours. The same job matched against different candidate pools doesn't re-embed.

  5. Incremental index updates: New candidate embeddings are inserted without rebuilding the HNSW index. PostgreSQL handles this natively.
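
Optimization #4 in miniature — the same get-or-embed pattern, with an in-process dict standing in for Redis (the TTL and hashing scheme here are illustrative, not the production implementation):

```python
import hashlib
import time

class JobEmbeddingCache:
    """Cache job-description embeddings keyed by a hash of the text."""

    def __init__(self, ttl_seconds: float = 24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, embedding)

    def _key(self, job_text: str) -> str:
        # Hashing the text means any edit to the job invalidates the entry.
        return hashlib.sha256(job_text.encode()).hexdigest()

    def get_or_embed(self, job_text: str, embed_fn):
        key = self._key(job_text)
        entry = self._store.get(key)
        if entry and entry[0] > time.monotonic():
            return entry[1]  # cache hit
        embedding = embed_fn(job_text)
        self._store[key] = (time.monotonic() + self.ttl, embedding)
        return embedding
```

The same job matched against different candidate pools hits the cache and skips the ~150ms embedding call entirely.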


Results

After deploying the matching engine:

  • Recruiter satisfaction: 73% of recruiters rated AI matches as "good" or "excellent" (up from 31% with keyword matching)
  • Time-to-shortlist: Reduced from 4.2 hours to 12 minutes average
  • Diversity impact: Female candidate representation in shortlists increased 18% without any explicit demographic targeting — a natural result of deweighting proxy features
  • False positive rate: 15% of AI-recommended candidates were rejected at phone screen (vs. 42% with keyword matching)

Lessons Learned

  1. Multi-field beats single-vector: Decomposing into structured fields gave us both better accuracy and explainability. Recruiters trust recommendations they can understand.

  2. Bias is a continuous problem: You don't solve bias once and move on. We run weekly audits, retrain the bias classifier monthly, and have a human review process for flagged results.

  3. Explainability is non-negotiable: "This candidate scored 0.87" means nothing to a recruiter. "Strong skills match (92%), relevant domain experience (88%), but slightly below the seniority level you specified (65%)" — that's actionable.

  4. PostgreSQL + pgvector was the right call: We considered Pinecone and Milvus. Keeping everything in PostgreSQL meant one backup strategy, one monitoring setup, and transactional consistency between candidate profiles and their embeddings.

  5. Education weight should be near-zero by default: Every time we increased education weight in A/B tests, diversity metrics worsened with no improvement in hire quality. The 5% default is there for compliance, not signal.


The Bottom Line

Recruitment AI is one of the highest-stakes applications of machine learning. Get it right, and you help companies find talent they would have overlooked. Get it wrong, and you automate discrimination at scale.

The technical challenge isn't embedding generation or vector search — those are solved problems. The real challenge is building systems that are simultaneously accurate, fair, explainable, and fast. Multi-field embeddings with weighted scoring, continuous bias auditing, and transparent match explanations are how we approached it for Skillety.

The system is live, processing thousands of matches daily. But we treat every deployment as a hypothesis to be tested, every bias audit as a chance to improve, and every recruiter's feedback as training data for the next iteration.

#ai#recruitment#embeddings#machine-learning#bias-mitigation#vector-search#skillety
