
Building an AI Screening Pipeline With Embeddings

Build an AI resume screening pipeline with BGE-M3 embeddings, four-fifths rule bias detection, and EU AI Act compliance. Production code and legal citations.

TechSaaS Team
12 min read


Automated resume screening is one of the most impactful — and most dangerous — applications of embeddings in production. Get it right and you process thousands of candidates in minutes. Get it wrong and you deploy a system that systematically discriminates, with fines up to EUR 35 million starting August 2026.

This article walks through the technical architecture of an embedding-based screening pipeline: model selection, scoring, bias detection, and automated decision-making. Every code example uses current libraries and models. Every legal citation is specific. We built this for Skillety, and we're sharing what we learned.

The Embedding Pipeline: From Resume to Vector

The core idea is straightforward: encode job descriptions and resumes into the same vector space, then measure similarity. The devil is in the model choice.

Model Selection: Stop Using MiniLM for Production

The most common tutorial recommendation — all-MiniLM-L6-v2 — has a critical flaw for resume screening: a 256-token context window. A typical resume is 400-800 tokens. MiniLM silently truncates everything past token 256. No warning, no error. The second half of every resume simply disappears.

This isn't a minor issue. Candidates with extensive experience (more text) lose more information to truncation — creating a systematic bias against senior candidates.

Here's what the model landscape actually looks like in March 2026:

| Model | Dimensions | Max Tokens | MTEB Score | Params | License | Best For |
|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 | ~56 | 22.7M | Apache 2.0 | Prototyping only |
| e5-large-v2 | 1024 | 512 | ~62 | 335M | MIT | Budget production |
| BGE-M3 | 1024 | 8192 | ~64.6 | 568M | MIT | Recommended production |
| Snowflake Arctic-Embed-L-v2.0 | 1024 | 8192 | ~55.6 | 303M | Apache 2.0 | Enterprise production |
| Jina-embeddings-v3 | 1024 | 8192 | ~65 | 570M | Open | Task-adaptive |

BGE-M3 is our production recommendation. 8192-token context means you can embed entire resumes without chunking. MIT license. Supports 100+ languages. Published by BAAI with active maintenance.

The 8K context window changes the architecture fundamentally. With MiniLM, you'd need to chunk resumes into sections, embed each chunk, then aggregate scores — introducing complexity and losing cross-section context. With BGE-M3, you embed the whole document:

from sentence_transformers import SentenceTransformer
import numpy as np

# sentence-transformers v5.3.0 (March 2026)
model = SentenceTransformer('BAAI/bge-m3')

# BGE models require instruction prefixes for optimal performance
# Omitting these drops retrieval accuracy by 5-15%
jd_embedding = model.encode(
    "Represent this job description for retrieval: " + jd_text,
    normalize_embeddings=True
)
resume_embedding = model.encode(
    "Represent this document for retrieval: " + resume_text,
    normalize_embeddings=True
)

Instruction prefix trap: BGE and e5 models use specific prefixes ("Represent this..." or "query: "/"passage: ") for asymmetric retrieval. Forgetting these drops accuracy by 5-15% on retrieval benchmarks. Always check the model card.

When You Must Chunk

If you're constrained to shorter-context models (budget, latency requirements), use structure-aware section-level chunking rather than naive recursive splitting. Vecta's 2026 benchmark showed 87% accuracy for section-level chunking vs. 69% for recursive 512-token splitting. Parse resumes into sections (experience, education, skills) and embed each section separately.
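To make this concrete, here is a minimal sketch of section-level parsing. The heading list and regex are illustrative assumptions; production resumes need a richer, layout-aware parser:

```python
import re

# Illustrative heading set; real resumes vary widely and need a
# layout-aware extractor on top of a pattern like this
SECTION_PATTERN = re.compile(
    r'^(experience|education|skills|projects|certifications)\b',
    re.IGNORECASE | re.MULTILINE,
)

def split_resume_sections(resume_text: str) -> dict:
    """Split a resume into sections keyed by lowercase heading.

    Text before the first recognized heading lands under 'summary'.
    """
    matches = list(SECTION_PATTERN.finditer(resume_text))
    if not matches:
        return {'summary': resume_text.strip()}
    sections = {'summary': resume_text[:matches[0].start()].strip()}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(resume_text)
        sections[m.group(1).lower()] = resume_text[m.start():end].strip()
    return {k: v for k, v in sections.items() if v}
```

Each section is then embedded separately with the short-context model, and the per-section similarities aggregated, for example a weighted mean across sections, or a max for must-have skills.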


Scoring: Beyond Raw Cosine Similarity

The simplest approach — cosine similarity between JD and resume vectors — works as a baseline but has known failure modes in high-dimensional spaces. For embedding fundamentals, see our embedding models explainer: from Word2Vec to text-embedding-3.

The hubness problem (Radovanovic et al., 2010): some vectors become "nearest neighbor" to a disproportionate number of others. In resume screening, this means generic/boilerplate resumes score high against everything — they're discriminative of nothing.

Magnitude information loss (You 2025, arXiv:2504.16318): cosine similarity discards vector magnitude, which carries semantic signals like "confidence" and "specificity." A highly specific resume and a vague one can have identical cosine scores.

We use weighted skill-cluster scoring to decompose the match:

from sklearn.metrics.pairwise import cosine_similarity

def weighted_skill_score(model, jd_text, resume_text, skill_clusters):
    """
    Score a resume against a JD using weighted skill clusters.
    skill_clusters: [{'description': 'Python backend development',
                      'weight': 0.3}, ...]
    """
    resume_emb = model.encode(
        "Represent this document for retrieval: " + resume_text,
        normalize_embeddings=True
    ).reshape(1, -1)

    cluster_scores = []
    for cluster in skill_clusters:
        cluster_emb = model.encode(
            "Represent this query for retrieval: " + cluster['description'],
            normalize_embeddings=True
        ).reshape(1, -1)
        sim = cosine_similarity(cluster_emb, resume_emb)[0][0]
        cluster_scores.append(sim * cluster['weight'])

    return sum(cluster_scores)

Critical: embedding similarity scores are not probabilities. A cosine similarity of 0.75 does not mean "75% match." Scores from different models have completely different distributions. If you change models, all your thresholds need recalibration against a held-out dataset.
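One workable recalibration pattern is to convert raw cosine scores into percentile ranks over a held-out score distribution, so thresholds are expressed in percentiles rather than raw similarities. A minimal sketch; the class name and interface are ours, not a library API:

```python
import numpy as np

class PercentileCalibrator:
    """Map raw similarity scores to percentile ranks in [0, 1]
    against a held-out reference distribution."""

    def __init__(self, reference_scores):
        # Sorted held-out scores produced by the *current* model
        self.reference = np.sort(np.asarray(reference_scores, dtype=float))

    def percentile(self, score: float) -> float:
        # Fraction of reference scores at or below this score
        rank = np.searchsorted(self.reference, score, side='right')
        return rank / len(self.reference)
```

Re-fit the calibrator on the new model's held-out scores after any model swap; a "top 5% of candidates" threshold then carries over even though the raw score scale changed completely.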

Bias Detection and the Four-Fifths Rule

This is where most AI recruitment articles stop. They mention "bias" abstractly and move on. We're going to be specific, because the legal consequences are specific.

The Evidence

  • University of Washington, October 2024: AI resume screening preferred White-associated names 85% of the time across three state-of-the-art LLMs — with identical qualifications.
  • Brookings/AAAI, 2024: GPT-4o, Claude 3.5, Gemini 1.5 Flash, and Llama 3-70b all systematically disadvantaged Black male applicants while favoring female candidates across 361,000 fictitious resumes.
  • Bolukbasi et al., 2016: Word2Vec encodes "computer programmer → man, homemaker → woman."
  • Caliskan et al., 2017 (Science): GloVe embeddings replicate the full spectrum of human implicit biases via the Word Embedding Association Test.
  • Deshpande et al., 2020: Even after removing names, writing style patterns (sociolinguistics) correlate with ethnicity and gender. You cannot just strip PII and call it fair.

These aren't edge cases. They're the default behavior of embedding spaces trained on internet text.

The Four-Fifths Rule

The legal standard comes from 29 CFR Part 1607 (Uniform Guidelines on Employee Selection Procedures, 1978), jointly adopted by the EEOC, DOL, DOJ, and Civil Service Commission. The rule: if the selection rate for any demographic group is less than 80% (four-fifths) of the highest group's selection rate, there is evidence of adverse impact.

Important: the four-fifths rule is a rule of thumb, not a safe harbor. The EEOC's May 2023 Technical Assistance explicitly states that smaller disparities may constitute adverse impact if statistically significant. Always pair the ratio with a statistical test.

from scipy.stats import chi2_contingency

def adverse_impact_ratio(selection_rates: dict) -> dict:
    """
    selection_rates: {'white': 0.52, 'black': 0.31,
                      'hispanic': 0.44, 'asian': 0.58}
    Returns AIR for each group vs highest-performing group.
    Four-fifths rule: AIR < 0.80 indicates adverse impact.
    """
    max_rate = max(selection_rates.values())
    max_group = [k for k, v in selection_rates.items() if v == max_rate][0]
    results = {}
    for group, rate in selection_rates.items():
        air = rate / max_rate if max_rate > 0 else 0
        results[group] = {
            'selection_rate': rate,
            'air': round(air, 3),
            'adverse_impact': air < 0.80,
            'compared_to': max_group,
        }
    return results

def chi_squared_significance(pass_a, fail_a, pass_b, fail_b):
    """EEOC recommends statistical significance alongside 4/5 rule."""
    table = [[pass_a, fail_a], [pass_b, fail_b]]
    chi2, p_value, _, _ = chi2_contingency(table)
    return {'chi2': round(chi2, 3), 'p_value': round(p_value, 4),
            'significant': p_value < 0.05}

# Example: 200 applicants per group
rates = adverse_impact_ratio({
    'white': 0.52, 'black': 0.31,
    'hispanic': 0.44, 'asian': 0.58
})
print(rates['black'])
# {'selection_rate': 0.31, 'air': 0.534, 'adverse_impact': True,
#  'compared_to': 'asian'}
# AIR 0.534 < 0.80 — adverse impact detected

# Statistical significance check
sig = chi_squared_significance(
    pass_a=116, fail_a=84,   # asian: 58% selected
    pass_b=62,  fail_b=138   # black: 31% selected
)
print(sig)  # {'chi2': 28.434, 'p_value': 0.0, 'significant': True}
# (chi2_contingency applies Yates' continuity correction to 2x2 tables)

Real Enforcement

This isn't theoretical:

  • EEOC v. iTutorGroup (2023): First EEOC AI discrimination settlement — $365K for age-based auto-rejection by an AI screening tool.
  • Mobley v. Workday (N.D. Cal., July 2024): Court ruled AI vendors (not just employers) can be directly liable under Title VII. Landmark precedent — if you build the screening tool, you share liability.
  • NYC Local Law 144: In effect since July 2023. A December 2025 NY State Comptroller audit found 75% of complaints were misrouted and DCWP identified only 1 violation vs. auditors' 17.
  • Illinois HB 3773 (effective January 1, 2026): Explicitly prohibits using zip code as a feature in AI employment decisions — a specific proxy variable now legally banned.

Mitigation Strategies

Stripping names and demographic fields is necessary but insufficient — embedding spaces encode bias through proxies (writing style, university names, zip codes, activity patterns).

Approaches that work:

  • Feature masking: Remove known proxy features before embedding. But new proxies emerge.
  • Post-hoc score debiasing: QB-Norm adjusts similarity scores to reduce the hubness effect — no model retraining required.
  • Calibrated re-ranking: After initial scoring, compute AIR per threshold band. If adverse impact is detected, adjust thresholds per demographic group to equalize selection rates. Document the adjustment.
  • Embedding space auditing: Project embeddings onto demographic dimensions and measure clustering. If resumes from different groups cluster separately despite equal qualifications, the model is encoding bias.
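The embedding-space audit in the last bullet can be sketched as a within-group versus cross-group similarity comparison. Assumptions: vectors are L2-normalized, and group labels come from voluntary self-identification collected for audit purposes, never inferred:

```python
import numpy as np

def group_separation_score(embeddings: np.ndarray, groups: list) -> dict:
    """Compare mean cosine similarity within demographic groups vs across.

    embeddings: (n, d) L2-normalized resume vectors.
    groups: one group label per row (voluntary self-identification).
    A within/across gap near zero suggests groups are not separable in
    the space; a large positive gap means the model clusters by group.
    """
    sims = embeddings @ embeddings.T        # cosine sims (normalized vectors)
    labels = np.asarray(groups)
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(n, dtype=bool)       # drop self-similarity
    within = sims[same & off_diag].mean()
    across = sims[~same].mean()
    return {'within': float(within), 'across': float(across),
            'gap': float(within - across)}
```

Run this on a benchmark set of equally qualified synthetic resumes; if the gap is large, the embedding model is encoding group membership and should not reach production unmitigated.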

Automated Screening: From Score to Decision

The scoring function produces a number. Turning that into a hiring decision requires thresholds, human oversight, and feedback loops. We detail explainability patterns in our AI candidate scoring guide.

Three-Band Threshold Pattern

THRESHOLDS = {
    'auto_advance': 0.82,  # Top candidates — advance to interview
    'human_review': 0.65,  # Middle band — recruiter reviews
    # Scores below 'human_review' are rejected
}

def screen_candidate(score: float) -> str:
    if score >= THRESHOLDS['auto_advance']:
        return 'advance'
    elif score >= THRESHOLDS['human_review']:
        return 'human_review'
    else:
        return 'reject'

# CRITICAL: Run AIR check on each band before deploying
# auto_advance band must pass four-fifths rule
# auto_reject band must pass four-fifths rule
# If either fails, adjust thresholds or add human review

EU AI Act Art. 14 — Human Oversight: Fully automated reject decisions without human review may violate the EU AI Act's human oversight requirement for high-risk AI systems. The safest architecture: auto-advance top candidates, but require human review for all rejections.

The EU AI Act Deadline

Recruitment AI is explicitly classified high-risk under Annex III of the EU AI Act (entered into force August 1, 2024). The core high-risk system requirements become enforceable August 2, 2026 — five months from today.


Specific requirements:

  • Art. 9: Risk management system covering bias, accuracy, and robustness
  • Art. 10: Data governance — training data must be representative and free from systematic bias
  • Art. 11: Technical documentation of the entire pipeline
  • Art. 12: Automatic logging of all screening decisions for auditability
  • Art. 13: Transparency — candidates must be informed they're being assessed by AI
  • Art. 14: Human oversight — ability to override any AI decision

Penalties: up to EUR 35 million or 7% of global annual turnover, whichever is higher.

If you're building a screening pipeline today, compliance isn't optional and the deadline isn't far away.
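Article 12's automatic-logging duty maps naturally onto an append-only decision log. Here is a sketch under our own schema; the Act mandates logging, not these particular field names:

```python
import hashlib
import json
from datetime import datetime, timezone

def log_screening_decision(log_path, candidate_id, jd_id, score,
                           decision, model_name, threshold_version):
    """Append one screening decision as a JSON line.

    The candidate ID is hashed so the log carries no direct PII,
    while decisions stay traceable for an audit.
    """
    record = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'candidate_hash': hashlib.sha256(candidate_id.encode()).hexdigest(),
        'jd_id': jd_id,
        'score': round(score, 4),
        'decision': decision,
        'model': model_name,
        'threshold_version': threshold_version,
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return record
```

Logging the model name and threshold version alongside each decision is what makes the record auditable: a score of 0.75 is meaningless later unless you know which model and calibration produced it.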

Edge Cases That Will Bite You

  1. Multi-language resume bias: BGE-M3 supports 100+ languages, but embedding quality isn't uniform across languages. Resumes from non-native English speakers may cluster differently, creating a proxy for nationality. Test cross-lingual retrieval quality if your candidate pool is multilingual.

  2. Score calibration drift: When you update the embedding model, all thresholds break. A 0.75 on BGE-M3 means something completely different than 0.75 on MiniLM. Build a calibration step that maps cosine scores to percentiles using a held-out dataset, and re-run it on every model change.

  3. The homogeneous top-of-funnel problem: If your historical hiring data reflects a homogeneous workforce (e.g., 90% male engineers), the model learns that male-signaling features predict "success" — not because they're better, but because that's all it's seen. This isn't detectable by the four-fifths rule alone; it requires causal analysis of the training data.

  4. Token truncation is silent and discriminatory: MiniLM and e5 truncate without warning. Senior candidates with longer resumes lose more information, creating systematic bias against experience. Use 8K-token models or verify that your chunking strategy doesn't penalize length.
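For point 4, a cheap guard is to count tokens with the model's own tokenizer before encoding, so truncation is at least loud. A sketch assuming a sentence-transformers model object, which exposes `max_seq_length` and `tokenizer`:

```python
def check_truncation(model, texts, label='resume'):
    """Flag any text that exceeds the model's max sequence length.

    Uses the model's own tokenizer so the count matches what
    encode() will actually see. Returns (index, token_count) pairs
    for every text that would be silently truncated.
    """
    max_len = model.max_seq_length
    flagged = []
    for i, text in enumerate(texts):
        n_tokens = len(model.tokenizer.tokenize(text))
        if n_tokens > max_len:
            flagged.append((i, n_tokens))
            print(f"WARNING: {label} {i} has {n_tokens} tokens; "
                  f"model truncates at {max_len}, dropping "
                  f"{n_tokens - max_len} tokens")
    return flagged
```

Run this over your full candidate pool before deployment; if the flagged set skews toward senior candidates, you have measured the truncation bias directly rather than discovering it in production.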

Conclusion

Embedding-based screening is powerful, fast, and deployable. It's also one of the highest-risk AI applications in production — legally, ethically, and technically.

Use production-grade models (BGE-M3, not MiniLM). Measure adverse impact ratios with statistical tests, not just the four-fifths rule. Maintain human oversight on every rejection. Document everything for the EU AI Act deadline in August 2026.

The technology works. The question is whether you've built the guardrails to use it responsibly.

