The wrong AI tool wrote your resume — and an AI just rejected it
The first empirical study of AI resume screening bias finds a hire-rate gap of 42 percentage points for the same candidate with identical qualifications, depending solely on which AI tool wrote the resume. A "maybe" verdict in an automated pipeline is, in effect, a rejection.
42 pp | Largest hire-rate gap, same candidate |
1,576 | Valid data points (98.5% of 1,600) |
100 | Candidate profiles across 12 industries |
29 pts | Largest single-evaluator score gap |
Why "maybe" counts as rejection
In real applicant tracking systems, a "maybe" verdict effectively ends a candidate's journey: the resume never reaches a human recruiter. This study therefore applies a binary standard: only an explicit "hire" recommendation counts as a positive outcome. "Maybe" and "reject" are both treated as non-hire. This is not a conservative choice. It reflects reality.
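The binary standard can be written down in a few lines. This is a minimal sketch, not the study's actual pipeline: the names `to_binary` and `hire_rate` are hypothetical, and the sample verdicts are the four evaluator verdicts reported for the Claude-written resume in Case A.

```python
# Only an explicit "hire" counts as a positive outcome (assumption: verdicts
# arrive as the plain strings "hire", "maybe" or "reject").
POSITIVE_VERDICTS = {"hire"}


def to_binary(verdict: str) -> int:
    """Map a screener verdict to 1 (hire) or 0 (maybe or reject)."""
    return 1 if verdict.strip().lower() in POSITIVE_VERDICTS else 0


def hire_rate(verdicts: list[str]) -> float:
    """Share of verdicts that are an explicit 'hire'."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(to_binary(v) for v in verdicts) / len(verdicts)


# Verdicts for the Claude-written resume in Case A (GPT, Claude, xAI, Gemini).
case_a_claude_written = ["hire", "maybe", "hire", "hire"]
print(hire_rate(case_a_claude_written))  # 0.75
```

Under this rule a "maybe" lowers the hire rate exactly as much as a "reject" does.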
Background and methodology
AI-based resume screening is already standard practice at most companies with 500 or more employees. At the same time, candidates increasingly use AI tools such as ChatGPT, Claude and Gemini to write and polish their applications. Until now, no one had systematically tested whether the choice of writing tool affects hiring outcomes.
i10X Research created 100 synthetic but realistic candidate profiles across 12 industries and four career levels. Each profile was matched to a tailored job posting. For every persona, four resume versions were generated by GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro and xAI Grok 4.3 — identical facts, different language and structure. Each of the 400 resumes was evaluated blind by all four models using an identical prompt and standardized scoring guide. Of 1,600 possible data points, 1,576 are valid (98.5%). Study conducted May 2026.
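The size of the evaluation grid follows directly from this design. The sketch below only reproduces the counting; the variable names are hypothetical and nothing here reflects how i10X Research actually stored or ran the evaluations.

```python
from itertools import product

MODELS = ["GPT-5.4", "Claude Sonnet 4.6", "Gemini 3 Pro", "Grok 4.3"]
N_PERSONAS = 100

# One resume per (persona, writing model) pair.
resumes = list(product(range(1, N_PERSONAS + 1), MODELS))

# Every resume is scored by every model, including the one that wrote it.
evaluations = [(persona, writer, evaluator)
               for (persona, writer), evaluator in product(resumes, MODELS)]

valid_points = 1576  # valid data points reported in the study
valid_share = valid_points / len(evaluations)

print(len(resumes), len(evaluations), f"{valid_share:.1%}")  # 400 1600 98.5%
```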
Five key findings
Finding 1: Claude is the strictest evaluator and shows the largest self-bias
For GPT-written resumes, Claude recommends "hire" in only 42% of cases. For its own Claude-style resumes, that figure climbs to 84%. Gemini-style resumes reach 90% under Claude. Same candidate, identical qualifications, a 42-point gap driven solely by which tool wrote the resume.
Finding 2: GPT is not a rubber stamp — and penalizes its own writing
GPT hires in 90.5% of cases and correctly rejects mismatched profiles. But it rates GPT-written resumes worst in its own row (82%), while Gemini resumes reach 97% and Claude resumes 95%, a 15-point negative self-bias.
Finding 3: Gemini-written resumes are the universally preferred format
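Gemini resumes earn the highest hire rate under three of the four evaluators (GPT 97%, xAI 96%, Claude 90%) and miss the top spot by one point under Gemini itself (95%, against 96% for Claude-written resumes). Their average across all four evaluators is 94.5%. The structured, narrative format Gemini produces is rewarded by every screening model, even though the underlying facts are identical across versions.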
Finding 4: A 29-point gap between two evaluators on the exact same document
GPT scores one candidate's Claude-written resume at 74 (maybe). Claude scores the identical document at 45 (reject). That is a 29-point swing, from borderline to clear rejection, caused only by which model is doing the evaluating.
Finding 5: The "maybe" trap disproportionately affects qualified candidates
Claude rates a qualified backend engineer's GPT-written resume at 78 (maybe). The same candidate's Claude, Gemini and xAI resumes all receive clear "hire" from other evaluators. In any pipeline where Claude is the sole screener, this candidate is filtered out, not because of qualifications, but because of formatting.
Score matrix I — average fit scores (0–100)
Rows = evaluating model. Columns = model that wrote the resume. GPT-written resumes score lowest in every row. Gemini-written resumes score highest in every row except Claude's, where Grok-written resumes lead.
Evaluator | GPT writes | Claude writes | Gemini writes | Grok writes |
GPT-5.4 | 88.0 | 93.7 | 94.2 | 89.8 |
Claude 4.6 | 79.0 | 88.2 | 89.7 | 91.2 |
Gemini 3 Pro | 89.2 | 94.2 | 94.7 | 90.6 |
Grok 4.3 | 85.0 | 87.1 | 91.0 | 87.0 |
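The highest and lowest column in each row can be read off programmatically. This is an illustrative sketch using the values from the matrix above; `FIT_SCORES` and `row_extremes` are hypothetical names.

```python
WRITERS = ["GPT", "Claude", "Gemini", "Grok"]

# Average fit scores from Score matrix I (rows = evaluator, columns = writer).
FIT_SCORES = {
    "GPT-5.4": [88.0, 93.7, 94.2, 89.8],
    "Claude 4.6": [79.0, 88.2, 89.7, 91.2],
    "Gemini 3 Pro": [89.2, 94.2, 94.7, 90.6],
    "Grok 4.3": [85.0, 87.1, 91.0, 87.0],
}


def row_extremes(scores: list[float]) -> tuple[str, str]:
    """Return (writer with the highest score, writer with the lowest score)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return WRITERS[best], WRITERS[worst]


for evaluator, scores in FIT_SCORES.items():
    best, worst = row_extremes(scores)
    print(f"{evaluator}: highest = {best}-written, lowest = {worst}-written")
```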
Score matrix II — hire rates ("maybe" = rejection)
Binary standard: hire = 1, maybe or reject = 0. Reflects real-world ATS pipeline behavior.
Evaluator | GPT-written | Claude-written | Gemini-written | Grok-written | n |
GPT-5.4 | 82% | 95% | 97% | 88% | 398 |
Claude 4.6 | 42% | 84% | 90% | 89% | 397 |
Gemini 3 Pro | 81% | 96% | 95% | 85% | 397 |
Grok 4.3 | 79% | 88%* | 96% | 86% | 384 |
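The self-bias figures quoted in the findings can be recomputed from this matrix. The sketch below is illustrative only: `HIRE_RATES` and `self_bias` are hypothetical names, and because the article does not define self-bias formally, the code assumes two readings, namely the own-style rate minus the lowest other-style rate and the own-style rate minus the highest other-style rate.

```python
from statistics import mean

WRITERS = ["GPT", "Claude", "Gemini", "Grok"]

# Hire rates in percent from Score matrix II (rows = evaluator, columns = writer).
# The source marks Grok's 88% with an asterisk but gives no footnote.
HIRE_RATES = {
    "GPT": [82, 95, 97, 88],
    "Claude": [42, 84, 90, 89],
    "Gemini": [81, 96, 95, 85],
    "Grok": [79, 88, 96, 86],
}


def self_bias(evaluator: str) -> tuple[int, int]:
    """Return (own minus lowest other style, own minus highest other style)."""
    row = dict(zip(WRITERS, HIRE_RATES[evaluator]))
    own = row.pop(evaluator)
    return own - min(row.values()), own - max(row.values())


print(self_bias("Claude"))  # (42, -6)
print(self_bias("GPT"))     # (-6, -15)

# Row and column averages quoted in the findings.
gpt_row_average = mean(HIRE_RATES["GPT"])
gemini_column_average = mean(row[WRITERS.index("Gemini")] for row in HIRE_RATES.values())
print(gpt_row_average, gemini_column_average)  # 90.5 94.5
```

The first value for Claude (42) and the second value for GPT (-15) correspond to the figures cited in Findings 1 and 2.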
Three real cases
Case A — Backend engineer, 4.5 years experience (Persona 1)
Same candidate, identical qualifications, dramatically different outcomes by resume style.
Resume style | GPT | Claude | xAI | Gemini |
Claude-written | 95 — hire | 78 — maybe | 92 — hire | 95 — hire |
GPT-written | 75 — maybe | 76 — maybe | 75 — maybe | 72 — maybe |
Case B — Junior candidate, senior role (Persona 22)
All models correctly detect the qualification gap — disproving the idea that any model is an undiscriminating screener.
Resume style | GPT | Claude | xAI | Gemini |
GPT-written | 20 — reject | 5 — reject | 15 — reject | 15 — reject |
Case C — Identical profile, 29-point evaluator gap (Persona 60)
GPT scores this Claude-written resume at 74 (maybe). Claude scores the same document at 45 (reject). The largest single discrepancy in the complete dataset.
Resume style | GPT | Claude | Gemini |
Claude-written | 74 — maybe | 45 — reject | 72 — maybe |
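To make the disagreement concrete, the following sketch computes every pairwise evaluator gap for this one document and, for comparison, a simple averaged panel score. The names `case_c` and `pairwise_gaps` are hypothetical, and plain averaging is only one possible way to combine evaluators.

```python
from itertools import combinations
from statistics import mean

# Scores for the Claude-written resume of Persona 60, keyed by evaluator.
case_c = {"GPT": 74, "Claude": 45, "Gemini": 72}


def pairwise_gaps(scores: dict[str, int]) -> dict[tuple[str, str], int]:
    """Absolute score difference for every pair of evaluators."""
    return {(a, b): abs(scores[a] - scores[b])
            for a, b in combinations(scores, 2)}


gaps = pairwise_gaps(case_c)
largest_pair = max(gaps, key=gaps.get)
panel_average = round(mean(case_c.values()), 1)

print(largest_pair, gaps[largest_pair])  # ('GPT', 'Claude') 29
print(panel_average)                     # 63.7
```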
What this means for HR and talent acquisition
No LLM screening without a bias audit. Before deploying any AI screening tool, test whether it systematically favors certain writing styles using synthetic resumes with identical qualifications.
Never use a single AI model as the sole screener. The 29-point gap between GPT and Claude on an identical profile is not an outlier; it is a structural model difference. Multi-model panels with averaged scores are the minimum viable standard.
Transparency toward applicants is required. Companies using AI in screening should disclose which systems are used, in line with EU AI Act Article 13 for high-risk AI in employment contexts.
Flag statistical outliers. Claude's 42-point self-bias and GPT's negative 15-point self-bias are audit triggers. Any model with a hire-rate spread above 10 percentage points between writing styles should be reviewed before production use.
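A spread check of this kind could look as follows. This is a minimal sketch under stated assumptions: "spread" is taken to mean the highest minus the lowest hire rate in an evaluator's row, the 10-point threshold comes from the recommendation above, and `hire_rate_spread` and `needs_review` are hypothetical names. The input values are the hire rates from Score matrix II.

```python
SPREAD_THRESHOLD_PP = 10  # review threshold in percentage points

# Hire rates in percent per evaluator, in writer order GPT, Claude, Gemini, Grok.
HIRE_RATES = {
    "GPT": [82, 95, 97, 88],
    "Claude": [42, 84, 90, 89],
    "Gemini": [81, 96, 95, 85],
    "Grok": [79, 88, 96, 86],
}


def hire_rate_spread(rates: list[int]) -> int:
    """Highest minus lowest hire rate across writing styles."""
    return max(rates) - min(rates)


def needs_review(rates: list[int], threshold: int = SPREAD_THRESHOLD_PP) -> bool:
    """True if the spread between writing styles exceeds the threshold."""
    return hire_rate_spread(rates) > threshold


spreads = {name: hire_rate_spread(rates) for name, rates in HIRE_RATES.items()}
flagged = [name for name, rates in HIRE_RATES.items() if needs_review(rates)]

print(spreads)  # {'GPT': 15, 'Claude': 48, 'Gemini': 15, 'Grok': 17}
print(flagged)  # ['GPT', 'Claude', 'Gemini', 'Grok']
```

Applied to the published matrix, every evaluator exceeds the 10-point threshold, so under this reading all four models would be reviewed before production use.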
What this means for candidates and career centers
"Use ChatGPT for your CV" is no longer sufficient advice. Each AI writing tool leaves a stylistic fingerprint that other models detect and score differently. The Gemini format currently scores highest across all evaluators (average 94.5%) and is a useful benchmark.
Career coaching must become model-agnostic. Students and job seekers should work with style-diverse revisions and compare multiple AI writing styles before submitting applications.
Statement
"We did not test whether AI evaluates fairly. We tested whether it evaluates consistently. The answer is no. The same person, the same qualifications, the same role, and a hire-rate difference of 42 percentage points depending on which tool was used to write the resume. That is not a technical detail. That is a question of fairness."
— i10X Research Team
Test your own resume
The free side-by-side comparison tool that underpins this study is publicly available. See how GPT, Claude, Gemini and Grok evaluate your resume simultaneously, with full scores.