The wrong AI tool wrote your resume — and an AI just rejected it


The first empirical study of AI resume screening bias finds a hire-rate gap of 42 percentage points for the same candidate with identical qualifications, depending solely on which AI tool wrote the resume. In an automated pipeline, a "maybe" verdict is in effect a rejection.

42 pp: Largest hire-rate gap, same candidate

1,576: Valid data points (98.5% of 1,600)

100: Candidate profiles across 12 industries

29 pts: Largest single-evaluator score gap


Why "maybe" counts as rejection

In real applicant tracking systems, a "maybe" verdict effectively ends a candidate's journey: the resume never reaches a human recruiter. This study therefore applies a binary standard: only an explicit "hire" recommendation counts as a positive outcome. "Maybe" and "reject" are both treated as non-hire. This is not a conservative choice. It reflects reality.
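The binary standard can be written down in a few lines. This is a minimal sketch, not the study's code: the `Verdict` enum and the function names are hypothetical, and the sample list is made-up illustration data.

```python
from enum import Enum


class Verdict(Enum):
    """The three verdicts a screening model can return (names are assumed)."""
    HIRE = "hire"
    MAYBE = "maybe"
    REJECT = "reject"


def is_positive_outcome(verdict: Verdict) -> bool:
    """Binary standard: only an explicit hire counts; maybe and reject do not."""
    return verdict is Verdict.HIRE


def hire_rate(verdicts: list[Verdict]) -> float:
    """Fraction of verdicts that count as a hire under the binary standard."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(is_positive_outcome(v) for v in verdicts) / len(verdicts)


# Illustrative data only: two hires, one maybe, one reject.
sample = [Verdict.HIRE, Verdict.MAYBE, Verdict.REJECT, Verdict.HIRE]
print(hire_rate(sample))  # 0.5
```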


Background and methodology

AI-based resume screening is already standard practice at most companies with 500 or more employees. At the same time, candidates increasingly use AI tools such as ChatGPT, Claude and Gemini to write and polish their applications. Until now, no one had systematically tested whether the choice of writing tool affects hiring outcomes.

i10X Research created 100 synthetic but realistic candidate profiles across 12 industries and four career levels. Each profile was matched to a tailored job posting. For every persona, four resume versions were generated by GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro and xAI Grok 4.3, with identical facts but different language and structure. Each of the 400 resumes was evaluated blind by all four models using an identical prompt and standardized scoring guide. Of 1,600 possible data points, 1,576 are valid (98.5%). The study was conducted in May 2026.
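The counts in this design follow from a simple cross product, which the sketch below reproduces. The variable names are illustrative assumptions; only the model names and the figures 100, 400, 1,600 and 1,576 come from the study description.

```python
from itertools import product

N_PERSONAS = 100
WRITERS = ["GPT-5.4", "Claude Sonnet 4.6", "Gemini 3 Pro", "Grok 4.3"]
EVALUATORS = WRITERS  # the same four models also act as screeners

# One resume per (persona, writing model) pair.
resumes = list(product(range(1, N_PERSONAS + 1), WRITERS))

# Every resume is scored by every evaluator.
evaluations = list(product(resumes, EVALUATORS))

print(len(resumes))      # 400
print(len(evaluations))  # 1600

VALID = 1576  # valid data points reported by the study
valid_share = VALID / len(evaluations)
print(f"{valid_share:.1%}")  # 98.5%
```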


Five key findings

Finding 1: Claude is the strictest evaluator and shows the largest self-bias

For GPT-written resumes, Claude recommends "hire" in only 42% of cases. For resumes written in its own style, that figure climbs to 84%, and Gemini-written resumes reach 90%. Same candidate, identical qualifications: a 42-point gap between GPT-written and Claude-written resumes, driven solely by which tool wrote the resume.

Finding 2: GPT is not a rubber stamp — and penalizes its own writing

GPT hires in 90.5% of cases and correctly rejects mismatched profiles. But it rates GPT-written resumes worst in its own row (82%), while Gemini resumes reach 97% and Claude resumes 95%, a 15-point negative self-bias.

Finding 3: Gemini-written resumes are the universally preferred format

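Gemini resumes earn the highest average hire rate of any writing style: GPT 97%, xAI 96%, Gemini 95%, Claude 90%, an average of 94.5%. They rank first under three of the four evaluators; only under Gemini itself do Claude-written resumes edge ahead (96% vs. 95%). Because the underlying facts are identical across versions, the advantage comes from the structured, narrative format Gemini produces rather than from content.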

Finding 4: A 29-point gap between two evaluators on the exact same document

GPT scores one candidate's Claude-written resume at 74 (maybe). Claude scores the identical document at 45 (reject). That is a 29-point swing, from borderline to clear rejection, caused only by which model is doing the evaluating.

Finding 5: The "maybe" trap disproportionately affects qualified candidates

Claude rates a qualified backend engineer's GPT-written resume at 78 (maybe). The same candidate's Claude, Gemini and xAI resumes all receive clear "hire" from other evaluators. In any pipeline where Claude is the sole screener, this candidate is filtered out, not because of qualifications, but because of formatting.


Score matrix I — average fit scores (0–100)

Rows = evaluating model. Columns = model that wrote the resume. Bold = highest per row. Italic = lowest per row.

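| Evaluator | GPT writes | Claude writes | Gemini writes | Grok writes |
|---|---|---|---|---|
| GPT-5.4 | *88.0* | 93.7 | **94.2** | 89.8 |
| Claude 4.6 | *79.0* | 88.2 | 89.7 | **91.2** |
| Gemini 3 Pro | *89.2* | 94.2 | **94.7** | 90.6 |
| Grok 4.3 | *85.0* | 87.1 | **91.0** | 87.0 |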


Score matrix II — hire rates ("maybe" = rejection)

Binary standard: hire = 1, maybe or reject = 0. Reflects real-world ATS pipeline behavior.

| Evaluator | GPT | Claude | Gemini | Grok | n |
|---|---|---|---|---|---|
| GPT-5.4 | 82% | 95% | 97% | 88% | 398 |
| Claude 4.6 | 42% | 84% | 90% | 89% | 397 |
| Gemini 3 Pro | 81% | 96% | 95% | 85% | 397 |
| Grok 4.3 | 79% | 88%* | 96% | 86% | 314 |
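The averages quoted in the findings can be re-derived from this matrix. The sketch below assumes a plain unweighted mean over the four cells, ignoring the differing n per row; that assumption reproduces the 90.5% and 94.5% figures, but the study does not state how it averaged. The dictionary layout and function names are illustrative.

```python
# Hire rates in percent, copied from the matrix above.
# Outer key = evaluating model, inner key = model that wrote the resume.
HIRE_RATES = {
    "GPT":    {"GPT": 82, "Claude": 95, "Gemini": 97, "Grok": 88},
    "Claude": {"GPT": 42, "Claude": 84, "Gemini": 90, "Grok": 89},
    "Gemini": {"GPT": 81, "Claude": 96, "Gemini": 95, "Grok": 85},
    "Grok":   {"GPT": 79, "Claude": 88, "Gemini": 96, "Grok": 86},
}


def evaluator_average(evaluator: str) -> float:
    """Unweighted mean hire rate an evaluator gives across all writing styles."""
    row = HIRE_RATES[evaluator]
    return sum(row.values()) / len(row)


def writer_average(writer: str) -> float:
    """Unweighted mean hire rate a writing style earns across all evaluators."""
    column = [row[writer] for row in HIRE_RATES.values()]
    return sum(column) / len(column)


print(evaluator_average("GPT"))  # 90.5
print(writer_average("Gemini"))  # 94.5
```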


Three real cases

Case A — Backend engineer, 4.5 years experience (Persona 1)

Same candidate, identical qualifications, dramatically different outcomes by resume style.

| Resume style | GPT | Claude | xAI | Gemini |
|---|---|---|---|---|
| Claude-written | 95 (hire) | 78 (maybe) | 92 (hire) | 95 (hire) |
| GPT-written | 75 (maybe) | 76 (maybe) | 75 (maybe) | 72 (maybe) |

Case B — Junior candidate, senior role (Persona 22)

All models correctly detect the qualification gap — disproving the idea that any model is an undiscriminating screener.

| Resume style | GPT | Claude | xAI | Gemini |
|---|---|---|---|---|
| GPT-written | 20 (reject) | 5 (reject) | 15 (reject) | 15 (reject) |

Case C — Identical profile, 29-point evaluator gap (Persona 60)

GPT scores this Claude-written resume at 74 (maybe). Claude scores the same document at 45 (reject). The largest single discrepancy in the complete dataset.

| Resume style | GPT | Claude | Gemini |
|---|---|---|---|
| Claude-written | 74 (maybe) | 45 (reject) | 72 (maybe) |


What this means for HR and talent acquisition

No LLM screening without a bias audit. Before deploying any AI screening tool, test whether it systematically favors certain writing styles using synthetic resumes with identical qualifications.

Never use a single AI model as the sole screener. The 29-point gap between GPT and Claude on an identical profile is not an outlier; it is a structural model difference. Multi-model panels with averaged scores are the minimum viable standard.

Transparency toward applicants is required. Companies using AI in screening should disclose which systems are used, in line with EU AI Act Article 13 for high-risk AI in employment contexts.

Flag statistical outliers. Claude's 42-point self-bias and GPT's negative 15-point self-bias are audit triggers. Any model with a hire-rate spread above 10 percentage points between writing styles should be reviewed before production use.
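Two of these recommendations, averaging a multi-model panel and flagging wide hire-rate spreads, lend themselves to a short sketch. The function names are hypothetical and the code is an illustration under stated assumptions, not the study's tooling. The inputs are the Case C scores and Claude's row of the hire-rate matrix, and the 10-point threshold is the one proposed above.

```python
from statistics import mean

SPREAD_THRESHOLD_PP = 10  # review threshold proposed in the article


def panel_score(scores: dict[str, float]) -> float:
    """Average fit score across a panel of evaluating models."""
    return mean(scores.values())


def hire_rate_spread(row: dict[str, float]) -> float:
    """Gap in percentage points between the best and worst writing style."""
    return max(row.values()) - min(row.values())


# Case C: the same Claude-written resume, scored by three evaluators.
case_c = {"GPT": 74, "Claude": 45, "Gemini": 72}
print(round(panel_score(case_c), 1))  # 63.7

# Claude's row of the hire-rate matrix (percent, keyed by writing model).
claude_row = {"GPT": 42, "Claude": 84, "Gemini": 90, "Grok": 89}
spread = hire_rate_spread(claude_row)
print(spread, spread > SPREAD_THRESHOLD_PP)  # 48 True
```

Measured this way, Claude's spread is the gap between GPT-written and Gemini-written resumes, so it comes out wider than the 42-point figure, which compares GPT-written with Claude-written resumes. Either number is far above the 10-point review threshold.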


What this means for candidates and career centers

"Use ChatGPT for your CV" is no longer sufficient advice. Each AI writing tool leaves a stylistic fingerprint that other models detect and score differently. The Gemini format currently scores highest across all evaluators (average 94.5%) and is a useful benchmark.

Career coaching must become model-agnostic. Students and job seekers should work with style-diverse revisions and compare multiple AI writing styles before submitting applications.


Statement

"We did not test whether AI evaluates fairly. We tested whether it evaluates consistently. The answer is no. The same person, the same qualifications, the same role, and a hire-rate difference of 42 percentage points depending on which tool was used to write the resume. That is not a technical detail. That is a question of fairness."

— i10X Research Team


Test your own resume

The free side-by-side comparison tool that underpins this study is publicly available. See how GPT, Claude, Gemini and Grok evaluate your resume simultaneously, with full scores.

https://i10x.ai/side-by-side
