# Building an AI Screening Pipeline With Embeddings
Automated resume screening is one of the most impactful — and most dangerous — applications of embeddings in production. Get it right and you process thousands of candidates in minutes. Get it wrong and you deploy a system that systematically discriminates, with fines up to EUR 35 million starting August 2026.
This article walks through the technical architecture of an embedding-based screening pipeline: model selection, scoring, bias detection, and automated decision-making. Every code example uses current libraries and models. Every legal citation is specific. We built this for [Skillety](https://www.techsaas.cloud/blog/how-we-built-ai-recruitment-matching-skillety-embeddings-bias/), and we're sharing what we learned.
## The Embedding Pipeline: From Resume to Vector
The core idea is straightforward: encode job descriptions and resumes into the same vector space, then measure similarity. The devil is in the model choice.
### Model Selection: Stop Using MiniLM for Production
The most common tutorial recommendation — all-MiniLM-L6-v2 — has a critical flaw for resume screening: a 256-token context window. A typical resume is 400-800 tokens. MiniLM silently truncates everything past token 256. No warning, no error. The second half of every resume simply disappears.
This isn't a minor issue. Candidates with extensive experience (more text) lose more information to truncation — creating a systematic bias against senior candidates.
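One way to quantify this before you ship: count tokens per resume with the model's own tokenizer and measure what truncation would discard. A minimal sketch (the `truncation_report` helper and the token counts are illustrative; in practice, feed it `len(tokenizer(text)['input_ids'])` from your model's tokenizer):

```python
def truncation_report(token_counts, max_tokens=256):
    """Report what fraction of documents exceed the model's context
    window, and the average share of tokens lost per truncated doc."""
    if not token_counts:
        return {'truncated_frac': 0.0, 'avg_tokens_lost_frac': 0.0}
    truncated = [n for n in token_counts if n > max_tokens]
    lost = [(n - max_tokens) / n for n in truncated]
    return {
        'truncated_frac': len(truncated) / len(token_counts),
        'avg_tokens_lost_frac': sum(lost) / len(lost) if lost else 0.0,
    }

# Hypothetical corpus: resume lengths in tokens
counts = [180, 420, 650, 800, 240]
print(truncation_report(counts, max_tokens=256))
# 3 of 5 resumes truncated; the longest loses 68% of its content
```

If `avg_tokens_lost_frac` correlates with seniority in your corpus, the truncation bias described above is already in your pipeline.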
Here's where the model landscape stands in March 2026:
BGE-M3 is our production recommendation. 8192-token context means you can embed entire resumes without chunking. MIT license. Supports 100+ languages. Published by BAAI with active maintenance.
The 8K context window changes the architecture fundamentally. With MiniLM, you'd need to chunk resumes into sections, embed each chunk, then aggregate scores — introducing complexity and losing cross-section context. With BGE-M3, you embed the whole document:
```python
from sentence_transformers import SentenceTransformer

# sentence-transformers v5.3.0 (March 2026)
model = SentenceTransformer('BAAI/bge-m3')

# BGE models require instruction prefixes for optimal performance
# Omitting these drops retrieval accuracy by 5-15%
jd_embedding = model.encode(
    "Represent this job description for retrieval: " + jd_text,
    normalize_embeddings=True
)
resume_embedding = model.encode(
    "Represent this document for retrieval: " + resume_text,
    normalize_embeddings=True
)
```

**Instruction prefix trap:** BGE and e5 models use specific prefixes ("Represent this..." or "query: "/"passage: ") for asymmetric retrieval. Forgetting these drops accuracy by 5-15% on retrieval benchmarks. Always check the model card.
### When You Must Chunk
If you're constrained to shorter-context models (budget, latency requirements), use structure-aware section-level chunking rather than naive recursive splitting. Vecta's 2026 benchmark showed 87% accuracy for section-level chunking vs. 69% for recursive 512-token splitting. Parse resumes into sections (experience, education, skills) and embed each section separately.
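A minimal sketch of the section-parsing step (the header patterns are illustrative; real resumes need a more robust parser, and each section body would then be embedded separately):

```python
import re

# Illustrative section headers; extend for your corpus
SECTION_PATTERN = re.compile(
    r'^(experience|education|skills|projects|certifications)\b.*$',
    re.IGNORECASE | re.MULTILINE,
)

def split_resume_sections(text):
    """Split a resume into (section_name, body) pairs on known headers.
    Text before the first header becomes the 'summary' section."""
    matches = list(SECTION_PATTERN.finditer(text))
    if not matches:
        return [('summary', text.strip())]
    sections = []
    head = text[:matches[0].start()].strip()
    if head:
        sections.append(('summary', head))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).lower(), text[m.end():end].strip()))
    return sections

resume = """Jane Doe, backend engineer.

Experience
5 years Python at Acme.

Skills
Python, PostgreSQL, Docker.
"""
print(split_resume_sections(resume))
```

Each section then gets its own embedding; when aggregating, score each skill cluster against its most relevant section rather than averaging everything, so cross-section dilution stays limited.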
## Scoring: Beyond Raw Cosine Similarity
The simplest approach — cosine similarity between JD and resume vectors — works as a baseline but has known failure modes in high-dimensional spaces. For embedding fundamentals, see our [embedding models explainer: from Word2Vec to text-embedding-3](https://www.techsaas.cloud/blog/embedding-models-explained-word2vec-to-text-embedding-3/).
The hubness problem (Radovanovic et al., 2010): some vectors become "nearest neighbor" to a disproportionate number of others. In resume screening, this means generic/boilerplate resumes score high against everything — they're discriminative of nothing.
Magnitude information loss (You 2025, arXiv:2504.16318): cosine similarity discards vector magnitude, which carries semantic signals like "confidence" and "specificity." A highly specific resume and a vague one can have identical cosine scores.
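One cheap diagnostic for hubness is the k-occurrence count from the Radovanovic et al. literature: how often each resume lands in the top-k results across a batch of job-description queries. A sketch, with synthetic unit-norm vectors standing in for real embeddings:

```python
import numpy as np

def hub_scores(query_embs, doc_embs, k=10):
    """k-occurrence count: how often each document appears in the
    top-k for a batch of queries. Counts far above the expected
    k * n_queries / n_docs flag hub documents, which in screening
    are typically generic/boilerplate resumes."""
    sims = query_embs @ doc_embs.T          # cosine, if rows are unit-norm
    topk = np.argsort(-sims, axis=1)[:, :k]
    return np.bincount(topk.ravel(), minlength=doc_embs.shape[0])

# Synthetic stand-ins: 50 JD queries vs 200 resumes, 32 dims
rng = np.random.default_rng(0)
queries = rng.normal(size=(50, 32))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
docs = rng.normal(size=(200, 32))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

counts = hub_scores(queries, docs, k=10)
expected = 10 * 50 / 200   # 2.5 hits per document if uniform
print(counts.max(), expected)
```

A practical policy: route any resume whose count exceeds a few multiples of the expected value to human review instead of trusting its similarity scores.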
We use weighted skill-cluster scoring to decompose the match:
```python
from sklearn.metrics.pairwise import cosine_similarity

def weighted_skill_score(model, jd_text, resume_text, skill_clusters):
    """
    Score a resume against a JD using weighted skill clusters.
    skill_clusters: [{'description': 'Python backend development',
                      'weight': 0.3}, ...]
    """
    resume_emb = model.encode(
        "Represent this document for retrieval: " + resume_text,
        normalize_embeddings=True
    ).reshape(1, -1)
    cluster_scores = []
    for cluster in skill_clusters:
        cluster_emb = model.encode(
            "Represent this query for retrieval: " + cluster['description'],
            normalize_embeddings=True
        ).reshape(1, -1)
        sim = cosine_similarity(cluster_emb, resume_emb)[0][0]
        cluster_scores.append(sim * cluster['weight'])
    return sum(cluster_scores)
```

**Critical:** embedding similarity scores are not probabilities. A cosine similarity of 0.75 does not mean "75% match." Scores from different models have completely different distributions. If you change models, all your thresholds need recalibration against a held-out dataset.
## Bias Detection and the Four-Fifths Rule
This is where most AI recruitment articles stop. They mention "bias" abstractly and move on. We're going to be specific, because the legal consequences are specific.
### The Evidence
These aren't edge cases. They're the default behavior of embedding spaces trained on internet text.
### The Four-Fifths Rule
The legal standard comes from 29 CFR Part 1607 (Uniform Guidelines on Employee Selection Procedures, 1978), jointly adopted by the EEOC, DOL, DOJ, and Civil Service Commission. The rule: if the selection rate for any demographic group is less than 80% (four-fifths) of the highest group's selection rate, there is evidence of adverse impact.
Important: the four-fifths rule is a rule of thumb, not a safe harbor. The EEOC's May 2023 Technical Assistance explicitly states that smaller disparities may constitute adverse impact if statistically significant. Always pair the ratio with a statistical test.
```python
from scipy.stats import chi2_contingency

def adverse_impact_ratio(selection_rates: dict) -> dict:
    """
    selection_rates: {'white': 0.52, 'black': 0.31,
                      'hispanic': 0.44, 'asian': 0.58}
    Returns AIR for each group vs highest-performing group.
    Four-fifths rule: AIR < 0.80 indicates adverse impact.
    """
    max_rate = max(selection_rates.values())
    max_group = max(selection_rates, key=selection_rates.get)
    results = {}
    for group, rate in selection_rates.items():
        air = rate / max_rate if max_rate > 0 else 0
        results[group] = {
            'selection_rate': rate,
            'air': round(air, 3),
            'adverse_impact': air < 0.80,
            'compared_to': max_group,
        }
    return results

def chi_squared_significance(pass_a, fail_a, pass_b, fail_b):
    """EEOC recommends statistical significance alongside 4/5 rule."""
    table = [[pass_a, fail_a], [pass_b, fail_b]]
    chi2, p_value, _, _ = chi2_contingency(table)
    return {'chi2': round(chi2, 3), 'p_value': round(p_value, 4),
            'significant': p_value < 0.05}

# Example: 200 applicants per group
rates = adverse_impact_ratio({
    'white': 0.52, 'black': 0.31,
    'hispanic': 0.44, 'asian': 0.58
})
print(rates['black'])
# {'selection_rate': 0.31, 'air': 0.534, 'adverse_impact': True,
#  'compared_to': 'asian'}
# AIR 0.534 < 0.80 — adverse impact detected

# Statistical significance check (chi2_contingency applies Yates
# continuity correction to 2x2 tables by default)
sig = chi_squared_significance(
    pass_a=116, fail_a=84,   # asian: 58% selected
    pass_b=62, fail_b=138    # black: 31% selected
)
print(sig)  # {'chi2': 28.434, 'p_value': 0.0, 'significant': True}
```

### Real Enforcement
This isn't theoretical:
### Mitigation Strategies
Stripping names and demographic fields is necessary but insufficient — embedding spaces encode bias through proxies (writing style, university names, zip codes, activity patterns).
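The necessary-but-insufficient stripping step might look like this minimal sketch (the regex patterns are illustrative; production redaction needs NER for names, and none of this touches proxy signals like writing style or university names):

```python
import re

# Illustrative patterns only; a real pipeline pairs these with NER
REDACTIONS = [
    (re.compile(r'[\w.+-]+@[\w-]+\.[\w.]+'), '[EMAIL]'),
    (re.compile(r'\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b'),
     '[PHONE]'),
    (re.compile(r'\b\d{5}(?:-\d{4})?\b'), '[ZIP]'),
]

def redact(text):
    """Replace explicit identifiers before embedding. This removes
    direct fields only; proxy features survive redaction."""
    for pattern, token in REDACTIONS:
        text = pattern.sub(token, text)
    return text

print(redact("Reach me at jane@example.com, 512-555-0199, Austin TX 78701."))
# Reach me at [EMAIL], [PHONE], Austin TX [ZIP].
```

Treat this as step zero: run your AIR checks on the redacted pipeline anyway, because the proxies listed above will still leak demographic signal into the embeddings.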
Approaches that work:
## Automated Screening: From Score to Decision
The scoring function produces a number. Turning that into a hiring decision requires thresholds, human oversight, and feedback loops. We detail explainability patterns in our [AI candidate scoring guide](https://www.techsaas.cloud/blog/ai-candidate-scoring-fair-explainable-hiring/).
### Three-Band Threshold Pattern
```python
THRESHOLDS = {
    'auto_advance': 0.82,  # Top candidates — advance to interview
    'human_review': 0.65,  # Middle band — recruiter reviews
    'auto_reject': 0.65,   # Below this threshold — rejected
}

def screen_candidate(score: float) -> str:
    if score >= THRESHOLDS['auto_advance']:
        return 'advance'
    elif score >= THRESHOLDS['human_review']:
        return 'human_review'
    else:
        return 'reject'

# CRITICAL: Run AIR check on each band before deploying
# auto_advance band must pass four-fifths rule
# auto_reject band must pass four-fifths rule
# If either fails, adjust thresholds or add human review
```

**EU AI Act Art. 14 — Human Oversight:** Fully automated reject decisions without human review may violate the EU AI Act's human oversight requirement for high-risk AI systems. The safest architecture: auto-advance top candidates, but require human review for all rejections.
## The EU AI Act Deadline
Recruitment AI is explicitly classified high-risk under Annex III of the EU AI Act (entered into force August 1, 2024). The core high-risk system requirements become enforceable August 2, 2026 — five months from today.
Specific requirements:
Penalties: up to EUR 35 million or 7% of global annual turnover, whichever is higher.
If you're building a screening pipeline today, compliance isn't optional and the deadline isn't far away.
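A concrete starting point for the record-keeping side is logging every automated decision with enough context to audit it later. This is a sketch with an illustrative schema (field names and the `log_decision` helper are ours), not a compliance guarantee:

```python
import json
import hashlib
import datetime as dt

def log_decision(path, candidate_id, jd_id, model_version,
                 score, band, reviewer=None):
    """Append one screening decision as a JSON line. Hashing the
    candidate ID keeps the log pseudonymous; recording model_version
    keeps scores auditable across model swaps and recalibrations."""
    record = {
        'ts': dt.datetime.now(dt.timezone.utc).isoformat(),
        'candidate': hashlib.sha256(candidate_id.encode()).hexdigest()[:16],
        'jd': jd_id,
        'model': model_version,
        'score': round(score, 4),
        'band': band,
        'human_reviewer': reviewer,  # should be set for any rejection
    }
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')
    return record

rec = log_decision('decisions.jsonl', 'cand-042', 'jd-7',
                   'BAAI/bge-m3@2026-03', 0.71, 'human_review',
                   reviewer='r.lopez')
```

An append-only log like this also gives you the raw material for the periodic AIR reports described earlier: group the records by band and demographic cohort, then run the four-fifths check.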
## Edge Cases That Will Bite You
1. Multi-language resume bias: BGE-M3 supports 100+ languages, but embedding quality isn't uniform across languages. Resumes from non-native English speakers may cluster differently, creating a proxy for nationality. Test cross-lingual retrieval quality if your candidate pool is multilingual.
2. Score calibration drift: When you update the embedding model, all thresholds break. A 0.75 on BGE-M3 means something completely different than 0.75 on MiniLM. Build a calibration step that maps cosine scores to percentiles using a held-out dataset, and re-run it on every model change.
3. The homogeneous top-of-funnel problem: If your historical hiring data reflects a homogeneous workforce (e.g., 90% male engineers), the model learns that male-signaling features predict "success" — not because they're better, but because that's all it's seen. This isn't detectable by the four-fifths rule alone; it requires causal analysis of the training data.
4. Token truncation is silent and discriminatory: MiniLM and e5 truncate without warning. Senior candidates with longer resumes lose more information, creating systematic bias against experience. Use 8K-token models or verify that your chunking strategy doesn't penalize length.
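The calibration step from edge case 2 can be sketched as a percentile map over held-out scores (the score list is illustrative):

```python
import numpy as np

def build_calibrator(holdout_scores):
    """Map a raw cosine score to its percentile in a held-out score
    distribution. Rebuild this on every model change: thresholds then
    live in percentile space, which survives model swaps."""
    sorted_scores = np.sort(np.asarray(holdout_scores))
    def to_percentile(score):
        rank = np.searchsorted(sorted_scores, score, side='right')
        return rank / len(sorted_scores)
    return to_percentile

# Illustrative held-out scores from the current model
cal = build_calibrator(
    [0.41, 0.55, 0.58, 0.63, 0.70, 0.74, 0.77, 0.81, 0.86, 0.90]
)
print(cal(0.75))  # 0.6 — this score beats 60% of the held-out set
```

With this in place, "auto-advance the top 18%" stays meaningful after a model swap, while "auto-advance above cosine 0.82" silently does not.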
## Conclusion
Embedding-based screening is powerful, fast, and deployable. It's also one of the highest-risk AI applications in production — legally, ethically, and technically.
Use production-grade models (BGE-M3, not MiniLM). Measure adverse impact ratios with statistical tests, not just the four-fifths rule. Maintain human oversight on every rejection. Document everything for the EU AI Act deadline in August 2026.
The technology works. The question is whether you've built the guardrails to use it responsibly.
---
Related reading:
Building AI-powered recruitment tools? Explore our [AI and machine learning solutions](https://www.techsaas.cloud/services/ai-machine-learning-solutions/) and [HR recruitment technology services](https://www.techsaas.cloud/services/hr-recruitment-technology/).
Need help with AI & machine learning?
TechSaaS provides expert consulting and managed services for cloud infrastructure, DevOps, and AI/ML operations.