
Building an AI Screening Pipeline With Embeddings

Build an AI resume screening pipeline with BGE-M3 embeddings, four-fifths rule bias detection, and EU AI Act compliance. Production code and legal citations.

TechSaaS Team
12 min read


Automated resume screening is one of the most impactful — and most dangerous — applications of embeddings in production. Get it right and you process thousands of candidates in minutes. Get it wrong and you deploy a system that systematically discriminates, with fines up to EUR 35 million starting August 2026.

This article walks through the technical architecture of an embedding-based screening pipeline: model selection, scoring, bias detection, and automated decision-making. Every code example uses current libraries and models. Every legal citation is specific. We built this for Skillety, and we're sharing what we learned.

The Embedding Pipeline: From Resume to Vector

The core idea is straightforward: encode job descriptions and resumes into the same vector space, then measure similarity. The devil is in the model choice.

Model Selection: Stop Using MiniLM for Production

The most common tutorial recommendation — all-MiniLM-L6-v2 — has a critical flaw for resume screening: a 256-token context window. A typical resume is 400-800 tokens. MiniLM silently truncates everything past token 256. No warning, no error. The second half of every resume simply disappears.

This isn't a minor issue. Candidates with extensive experience (more text) lose more information to truncation — creating a systematic bias against senior candidates.

Here's what the model landscape actually looks like in March 2026:

| Model | Dimensions | Max Tokens | MTEB Score | Params | License | Best For |
|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 | ~56 | 22.7M | Apache 2.0 | Prototyping only |
| e5-large-v2 | 1024 | 512 | ~62 | 335M | MIT | Budget production |
| BGE-M3 | 1024 | 8192 | ~64.6 | 568M | MIT | Recommended production |
| Snowflake Arctic-Embed-L-v2.0 | 1024 | 8192 | ~55.6 | 303M | Apache 2.0 | Enterprise production |
| Jina-embeddings-v3 | 1024 | 8192 | ~65 | 570M | Open | Task-adaptive |

BGE-M3 is our production recommendation. 8192-token context means you can embed entire resumes without chunking. MIT license. Supports 100+ languages. Published by BAAI with active maintenance.

The 8K context window changes the architecture fundamentally. With MiniLM, you'd need to chunk resumes into sections, embed each chunk, then aggregate scores — introducing complexity and losing cross-section context. With BGE-M3, you embed the whole document:

from sentence_transformers import SentenceTransformer
import numpy as np

# sentence-transformers v5.3.0 (March 2026)
model = SentenceTransformer('BAAI/bge-m3')

# BGE models require instruction prefixes for optimal performance
# Omitting these drops retrieval accuracy by 5-15%
jd_embedding = model.encode(
    "Represent this job description for retrieval: " + jd_text,
    normalize_embeddings=True
)
resume_embedding = model.encode(
    "Represent this document for retrieval: " + resume_text,
    normalize_embeddings=True
)

Instruction prefix trap: BGE and e5 models use specific prefixes ("Represent this..." or "query: "/"passage: ") for asymmetric retrieval. Forgetting these drops accuracy by 5-15% on retrieval benchmarks. Always check the model card.

When You Must Chunk

If you're constrained to shorter-context models (budget, latency requirements), use structure-aware section-level chunking rather than naive recursive splitting. Vecta's 2026 benchmark showed 87% accuracy for section-level chunking vs. 69% for recursive 512-token splitting. Parse resumes into sections (experience, education, skills) and embed each section separately.
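To make this concrete, here is a minimal sketch of section-level parsing. The heading list and regex are illustrative assumptions; production resumes need a richer, layout-aware parser:

```python
import re

# Illustrative heading set; real resumes vary widely and need a
# layout-aware extractor on top of a pattern like this
SECTION_PATTERN = re.compile(
    r'^(experience|education|skills|projects|certifications)\b',
    re.IGNORECASE | re.MULTILINE,
)

def split_resume_sections(resume_text: str) -> dict:
    """Split a resume into sections keyed by lowercase heading.

    Text before the first recognized heading lands under 'summary'.
    """
    matches = list(SECTION_PATTERN.finditer(resume_text))
    if not matches:
        return {'summary': resume_text.strip()}
    sections = {'summary': resume_text[:matches[0].start()].strip()}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(resume_text)
        sections[m.group(1).lower()] = resume_text[m.start():end].strip()
    return {k: v for k, v in sections.items() if v}
```

Each section is then embedded separately with the short-context model, and the per-section similarities aggregated, for example a weighted mean across sections, or a max for must-have skills.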


Scoring: Beyond Raw Cosine Similarity

The simplest approach — cosine similarity between JD and resume vectors — works as a baseline but has known failure modes in high-dimensional spaces. For embedding fundamentals, see our embedding models explainer: from Word2Vec to text-embedding-3.

The hubness problem (Radovanovic et al., 2010): some vectors become "nearest neighbor" to a disproportionate number of others. In resume screening, this means generic/boilerplate resumes score high against everything — they're discriminative of nothing.

Magnitude information loss (You 2025, arXiv:2504.16318): cosine similarity discards vector magnitude, which carries semantic signals like "confidence" and "specificity." A highly specific resume and a vague one can have identical cosine scores.

We use weighted skill-cluster scoring to decompose the match:

from sklearn.metrics.pairwise import cosine_similarity

def weighted_skill_score(model, jd_text, resume_text, skill_clusters):
    """
    Score a resume against a JD using weighted skill clusters.
    skill_clusters: [{'description': 'Python backend development',
                      'weight': 0.3}, ...]
    """
    resume_emb = model.encode(
        "Represent this document for retrieval: " + resume_text,
        normalize_embeddings=True
    ).reshape(1, -1)

    cluster_scores = []
    for cluster in skill_clusters:
        cluster_emb = model.encode(
            "Represent this query for retrieval: " + cluster['description'],
            normalize_embeddings=True
        ).reshape(1, -1)
        sim = cosine_similarity(cluster_emb, resume_emb)[0][0]
        cluster_scores.append(sim * cluster['weight'])

    return sum(cluster_scores)

Critical: embedding similarity scores are not probabilities. A cosine similarity of 0.75 does not mean "75% match." Scores from different models have completely different distributions. If you change models, all your thresholds need recalibration against a held-out dataset.
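One workable recalibration pattern is to convert raw cosine scores into percentile ranks over a held-out score distribution, so thresholds are expressed in percentiles rather than raw similarities. A minimal sketch; the class name and interface are ours, not a library API:

```python
import numpy as np

class PercentileCalibrator:
    """Map raw similarity scores to percentile ranks in [0, 1]
    against a held-out reference distribution."""

    def __init__(self, reference_scores):
        # Sorted held-out scores produced by the *current* model
        self.reference = np.sort(np.asarray(reference_scores, dtype=float))

    def percentile(self, score: float) -> float:
        # Fraction of reference scores at or below this score
        rank = np.searchsorted(self.reference, score, side='right')
        return rank / len(self.reference)
```

Re-fit the calibrator on the new model's held-out scores after any model swap; a "top 5% of candidates" threshold then carries over even though the raw score scale changed completely.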

Bias Detection and the Four-Fifths Rule

This is where most AI recruitment articles stop. They mention "bias" abstractly and move on. We're going to be specific, because the legal consequences are specific.

The Evidence

  • University of Washington, October 2024: AI resume screening preferred White-associated names 85% of the time across three state-of-the-art LLMs — with identical qualifications.
  • Brookings/AAAI, 2024: GPT-4o, Claude 3.5, Gemini 1.5 Flash, and Llama 3-70b all systematically disadvantaged Black male applicants while favoring female candidates across 361,000 fictitious resumes.
  • Bolukbasi et al., 2016: Word2Vec encodes "computer programmer → man, homemaker → woman."
  • Caliskan et al., 2017 (Science): GloVe embeddings replicate the full spectrum of human implicit biases via the Word Embedding Association Test.
  • Deshpande et al., 2020: Even after removing names, writing style patterns (sociolinguistics) correlate with ethnicity and gender. You cannot just strip PII and call it fair.

These aren't edge cases. They're the default behavior of embedding spaces trained on internet text.

The Four-Fifths Rule

The legal standard comes from 29 CFR Part 1607 (Uniform Guidelines on Employee Selection Procedures, 1978), jointly adopted by the EEOC, DOL, DOJ, and Civil Service Commission. The rule: if the selection rate for any demographic group is less than 80% (four-fifths) of the highest group's selection rate, there is evidence of adverse impact.

Important: the four-fifths rule is a rule of thumb, not a safe harbor. The EEOC's May 2023 Technical Assistance explicitly states that smaller disparities may constitute adverse impact if statistically significant. Always pair the ratio with a statistical test.

from scipy.stats import chi2_contingency

def adverse_impact_ratio(selection_rates: dict) -> dict:
    """
    selection_rates: {'white': 0.52, 'black': 0.31,
                      'hispanic': 0.44, 'asian': 0.58}
    Returns AIR for each group vs highest-performing group.
    Four-fifths rule: AIR < 0.80 indicates adverse impact.
    """
    max_rate = max(selection_rates.values())
    max_group = [k for k, v in selection_rates.items() if v == max_rate][0]
    results = {}
    for group, rate in selection_rates.items():
        air = rate / max_rate if max_rate > 0 else 0
        results[group] = {
            'selection_rate': rate,
            'air': round(air, 3),
            'adverse_impact': air < 0.80,
            'compared_to': max_group,
        }
    return results

def chi_squared_significance(pass_a, fail_a, pass_b, fail_b):
    """EEOC recommends statistical significance alongside 4/5 rule."""
    table = [[pass_a, fail_a], [pass_b, fail_b]]
    chi2, p_value, _, _ = chi2_contingency(table)
    return {'chi2': round(chi2, 3), 'p_value': round(p_value, 4),
            'significant': p_value < 0.05}

# Example: 200 applicants per group
rates = adverse_impact_ratio({
    'white': 0.52, 'black': 0.31,
    'hispanic': 0.44, 'asian': 0.58
})
print(rates['black'])
# {'selection_rate': 0.31, 'air': 0.534, 'adverse_impact': True,
#  'compared_to': 'asian'}
# AIR 0.534 < 0.80 — adverse impact detected

# Statistical significance check
sig = chi_squared_significance(
    pass_a=116, fail_a=84,   # asian: 58% selected
    pass_b=62,  fail_b=138   # black: 31% selected
)
print(sig)  # {'chi2': 28.434, 'p_value': 0.0, 'significant': True}
# (chi2_contingency applies Yates' continuity correction to 2x2 tables)

Real Enforcement

This isn't theoretical:

  • EEOC v. iTutorGroup (2023): First EEOC AI discrimination settlement — $365K for age-based auto-rejection by an AI screening tool.
  • Mobley v. Workday (N.D. Cal., July 2024): Court ruled AI vendors (not just employers) can be directly liable under Title VII. Landmark precedent — if you build the screening tool, you share liability.
  • NYC Local Law 144: In effect since July 2023. A December 2025 NY State Comptroller audit found 75% of complaints were misrouted and DCWP identified only 1 violation vs. auditors' 17.
  • Illinois HB 3773 (effective January 1, 2026): Explicitly prohibits using zip code as a feature in AI employment decisions — a specific proxy variable now legally banned.

Mitigation Strategies

Stripping names and demographic fields is necessary but insufficient — embedding spaces encode bias through proxies (writing style, university names, zip codes, activity patterns).

Approaches that work:

  • Feature masking: Remove known proxy features before embedding. But new proxies emerge.
  • Post-hoc score debiasing: QB-Norm adjusts similarity scores to reduce the hubness effect — no model retraining required.
  • Calibrated re-ranking: After initial scoring, compute AIR per threshold band. If adverse impact is detected, adjust thresholds per demographic group to equalize selection rates. Document the adjustment.
  • Embedding space auditing: Project embeddings onto demographic dimensions and measure clustering. If resumes from different groups cluster separately despite equal qualifications, the model is encoding bias.
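The embedding-space audit in the last bullet can be sketched as a within-group versus cross-group similarity comparison. Assumptions: vectors are L2-normalized, and group labels come from voluntary self-identification collected for audit purposes, never inferred:

```python
import numpy as np

def group_separation_score(embeddings: np.ndarray, groups: list) -> dict:
    """Compare mean cosine similarity within demographic groups vs across.

    embeddings: (n, d) L2-normalized resume vectors.
    groups: one group label per row (voluntary self-identification).
    A within/across gap near zero suggests groups are not separable in
    the space; a large positive gap means the model clusters by group.
    """
    sims = embeddings @ embeddings.T        # cosine sims (normalized vectors)
    labels = np.asarray(groups)
    n = len(labels)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(n, dtype=bool)       # drop self-similarity
    within = sims[same & off_diag].mean()
    across = sims[~same].mean()
    return {'within': float(within), 'across': float(across),
            'gap': float(within - across)}
```

Run this on a benchmark set of equally qualified synthetic resumes; if the gap is large, the embedding model is encoding group membership and should not reach production unmitigated.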

Automated Screening: From Score to Decision

The scoring function produces a number. Turning that into a hiring decision requires thresholds, human oversight, and feedback loops. We detail explainability patterns in our AI candidate scoring guide.

Three-Band Threshold Pattern

THRESHOLDS = {
    'auto_advance': 0.82,  # Top candidates — advance to interview
    'human_review': 0.65,  # Middle band — recruiter reviews
    # Scores below 'human_review' are rejected
}

def screen_candidate(score: float) -> str:
    if score >= THRESHOLDS['auto_advance']:
        return 'advance'
    elif score >= THRESHOLDS['human_review']:
        return 'human_review'
    else:
        return 'reject'

# CRITICAL: Run AIR check on each band before deploying
# auto_advance band must pass four-fifths rule
# auto_reject band must pass four-fifths rule
# If either fails, adjust thresholds or add human review

EU AI Act Art. 14 — Human Oversight: Fully automated reject decisions without human review may violate the EU AI Act's human oversight requirement for high-risk AI systems. The safest architecture: auto-advance top candidates, but require human review for all rejections.

The EU AI Act Deadline

Recruitment AI is explicitly classified high-risk under Annex III of the EU AI Act (entered into force August 1, 2024). The core high-risk system requirements become enforceable August 2, 2026 — five months from today.


Specific requirements:

  • Art. 9: Risk management system covering bias, accuracy, and robustness
  • Art. 10: Data governance — training data must be representative and free from systematic bias
  • Art. 11: Technical documentation of the entire pipeline
  • Art. 12: Automatic logging of all screening decisions for auditability
  • Art. 13: Transparency — candidates must be informed they're being assessed by AI
  • Art. 14: Human oversight — ability to override any AI decision

Penalties: up to EUR 35 million or 7% of global annual turnover, whichever is higher.

If you're building a screening pipeline today, compliance isn't optional and the deadline isn't far away.
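Article 12's automatic-logging duty maps naturally onto an append-only decision log. Here is a sketch under our own schema; the Act mandates logging, not these particular field names:

```python
import hashlib
import json
from datetime import datetime, timezone

def log_screening_decision(log_path, candidate_id, jd_id, score,
                           decision, model_name, threshold_version):
    """Append one screening decision as a JSON line.

    The candidate ID is hashed so the log carries no direct PII,
    while decisions stay traceable for an audit.
    """
    record = {
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'candidate_hash': hashlib.sha256(candidate_id.encode()).hexdigest(),
        'jd_id': jd_id,
        'score': round(score, 4),
        'decision': decision,
        'model': model_name,
        'threshold_version': threshold_version,
    }
    with open(log_path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return record
```

Logging the model name and threshold version alongside each decision is what makes the record auditable: a score of 0.75 is meaningless later unless you know which model and calibration produced it.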

Edge Cases That Will Bite You

  1. Multi-language resume bias: BGE-M3 supports 100+ languages, but embedding quality isn't uniform across languages. Resumes from non-native English speakers may cluster differently, creating a proxy for nationality. Test cross-lingual retrieval quality if your candidate pool is multilingual.

  2. Score calibration drift: When you update the embedding model, all thresholds break. A 0.75 on BGE-M3 means something completely different than 0.75 on MiniLM. Build a calibration step that maps cosine scores to percentiles using a held-out dataset, and re-run it on every model change.

  3. The homogeneous top-of-funnel problem: If your historical hiring data reflects a homogeneous workforce (e.g., 90% male engineers), the model learns that male-signaling features predict "success" — not because they're better, but because that's all it's seen. This isn't detectable by the four-fifths rule alone; it requires causal analysis of the training data.

  4. Token truncation is silent and discriminatory: MiniLM and e5 truncate without warning. Senior candidates with longer resumes lose more information, creating systematic bias against experience. Use 8K-token models or verify that your chunking strategy doesn't penalize length.
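For point 4, a cheap guard is to count tokens with the model's own tokenizer before encoding, so truncation is at least loud. A sketch assuming a sentence-transformers model object, which exposes `max_seq_length` and `tokenizer`:

```python
def check_truncation(model, texts, label='resume'):
    """Flag any text that exceeds the model's max sequence length.

    Uses the model's own tokenizer so the count matches what
    encode() will actually see. Returns (index, token_count) pairs
    for every text that would be silently truncated.
    """
    max_len = model.max_seq_length
    flagged = []
    for i, text in enumerate(texts):
        n_tokens = len(model.tokenizer.tokenize(text))
        if n_tokens > max_len:
            flagged.append((i, n_tokens))
            print(f"WARNING: {label} {i} has {n_tokens} tokens; "
                  f"model truncates at {max_len}, dropping "
                  f"{n_tokens - max_len} tokens")
    return flagged
```

Run this over your full candidate pool before deployment; if the flagged set skews toward senior candidates, you have measured the truncation bias directly rather than discovering it in production.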

Conclusion

Embedding-based screening is powerful, fast, and deployable. It's also one of the highest-risk AI applications in production — legally, ethically, and technically.

Use production-grade models (BGE-M3, not MiniLM). Measure adverse impact ratios with statistical tests, not just the four-fifths rule. Maintain human oversight on every rejection. Document everything for the EU AI Act deadline in August 2026.

The technology works. The question is whether you've built the guardrails to use it responsibly.

