The wrong AI tool wrote your resume, and an AI just rejected it

Research · June 2026

The first empirical study of AI resume screening bias finds a 42 percentage-point hire-rate gap for the same candidate with identical qualifications, depending solely on which AI tool wrote the resume. A "maybe" verdict in an automated pipeline is, in effect, a rejection.


42 pp	Largest hire-rate gap, same candidate
1,576	Valid data points (98.5% of 1,600)
100	Candidate profiles across 12 industries
29 pts	Largest single-evaluator score gap

Why "maybe" counts as rejection

In real applicant tracking systems, a "maybe" verdict effectively ends a candidate's journey: the resume never reaches a human recruiter. This study therefore applies a binary standard in which only an explicit "hire" recommendation counts as a positive outcome. "Maybe" and "reject" are both treated as non-hire. This is not a conservative choice. It reflects reality.
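The binary standard can be sketched in a few lines of Python. This is a minimal illustration, not the study's own code: the function names `is_hire` and `hire_rate` and the sample verdict list are hypothetical.

```python
def is_hire(verdict: str) -> int:
    """Map a verdict to the binary standard: only an explicit 'hire' counts."""
    return 1 if verdict.strip().lower() == "hire" else 0


def hire_rate(verdicts: list[str]) -> float:
    """Share of verdicts that are an explicit 'hire' (0.0 to 1.0)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(is_hire(v) for v in verdicts) / len(verdicts)


# Hypothetical sample: 'maybe' and 'reject' both count as non-hire.
sample = ["hire", "maybe", "reject", "hire"]
print(hire_rate(sample))
```

Under this mapping a "maybe" lowers the hire rate exactly as much as a "reject" does.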

Background and methodology

AI-based resume screening is already standard practice at most companies with 500 or more employees. At the same time, candidates increasingly use AI tools such as ChatGPT, Claude and Gemini to write and polish their applications. Until now, no one had systematically tested whether the choice of writing tool affects hiring outcomes.

i10X Research created 100 synthetic but realistic candidate profiles across 12 industries and four career levels. Each profile was matched to a tailored job posting. For every persona, four resume versions were generated by GPT-5.4, Claude Sonnet 4.6, Gemini 3 Pro and xAI Grok 4.3, with identical facts but different language and structure. Each of the 400 resumes was evaluated blind by all four models using an identical prompt and a standardized scoring guide. Of 1,600 possible data points, 1,576 are valid (98.5%). The study was conducted in May 2026.
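The design arithmetic above can be checked with a short sketch. It only enumerates the grid of personas, writing models and evaluating models. The persona numbering from 1 to 100 and all variable names are assumptions made for illustration.

```python
from itertools import product

MODELS = ["GPT-5.4", "Claude Sonnet 4.6", "Gemini 3 Pro", "xAI Grok 4.3"]
PERSONAS = range(1, 101)  # assumed numbering: personas 1..100

# One resume per (persona, writing model) pair.
resumes = list(product(PERSONAS, MODELS))

# One data point per (persona, writing model, evaluating model) triple.
evaluations = list(product(PERSONAS, MODELS, MODELS))

VALID_DATA_POINTS = 1576  # reported in the study
valid_share = VALID_DATA_POINTS / len(evaluations)

print(len(resumes), len(evaluations), f"{valid_share:.1%}")
```

Every resume is scored by every model, including the one that wrote it, which is what makes the self-bias comparisons possible.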

Five key findings

Finding 1: Claude is the strictest evaluator and shows the largest self-bias

For GPT-written resumes, Claude recommends "hire" in only 42% of cases. For its own Claude-style resumes, that figure climbs to 84%. Gemini-style resumes reach 90% under Claude. Same candidate, identical qualifications, and a 42-point gap between GPT-written and Claude-written resumes, driven solely by which tool wrote the resume.

Finding 2: GPT is not a rubber stamp — and penalizes its own writing

GPT recommends "hire" in 90.5% of cases on average and correctly rejects mismatched profiles. But it rates its own GPT-written resumes lowest in its row (82%), while Gemini resumes reach 97% and Claude resumes 95%. That is a 15-point negative self-bias relative to Gemini-written resumes.

Finding 3: Gemini-written resumes are the universally preferred format

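Gemini-written resumes earn the highest average hire rate: GPT 97%, Grok 96%, Gemini 95%, Claude 90%, an average of 94.5%. They rank first under three of the four evaluators; only Gemini itself rates Claude-written resumes marginally higher (96% versus 95%). The structured, narrative format Gemini produces appears to be rewarded by all screening models, even though the underlying facts are identical.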

Finding 4: A 29-point gap between two evaluators on the exact same document

GPT scores one candidate's Claude-written resume at 74 (maybe). Claude scores the identical document at 45 (reject). That is a 29-point swing, from borderline to clear rejection, caused only by which model is doing the evaluating.

Finding 5: The "maybe" trap disproportionately affects qualified candidates

Claude rates a qualified backend engineer's GPT-written resume at 78 (maybe). The same candidate's Claude, Gemini and xAI resumes all receive a clear "hire" from other evaluators. In any pipeline where Claude is the sole screener, this candidate is filtered out, not because of qualifications but because of formatting.

Score matrix I — average fit scores (0–100)

Rows = evaluating model. Columns = model that wrote the resume.

Evaluator	GPT writes	Claude writes	Gemini writes	Grok writes
GPT-5.4	88.0	93.7	94.2	89.8
Claude 4.6	79.0	88.2	89.7	91.2
Gemini 3 Pro	89.2	94.2	94.7	90.6
Grok 4.3	85.0	87.1	91.0	87.0

Score matrix II — hire rates ("maybe" = rejection)

Binary standard: hire = 1, maybe or reject = 0. Reflects real-world ATS pipeline behavior.

Evaluator	GPT	Claude	Gemini	Grok	n
GPT-5.4	82%	95%	97%	88%	398
Claude 4.6	42%	84%	90%	89%	397
Gemini 3 Pro	81%	96%	95%	85%	397
Grok 4.3	79%	88%*	96%	86%	314
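The figures quoted in Findings 1 to 3 can be recomputed directly from the hire-rate matrix. The sketch below is one way to do that. The dictionary layout and the helper names `row_mean`, `column_mean` and `own_vs_other` are illustrative assumptions, not part of the study.

```python
# Hire rates in percent, copied from Score matrix II.
# Outer key = evaluating model, inner key = model that wrote the resume.
HIRE_RATE = {
    "GPT":    {"GPT": 82, "Claude": 95, "Gemini": 97, "Grok": 88},
    "Claude": {"GPT": 42, "Claude": 84, "Gemini": 90, "Grok": 89},
    "Gemini": {"GPT": 81, "Claude": 96, "Gemini": 95, "Grok": 85},
    "Grok":   {"GPT": 79, "Claude": 88, "Gemini": 96, "Grok": 86},
}


def row_mean(evaluator: str) -> float:
    """Average hire rate an evaluator gives across all writing styles."""
    rates = HIRE_RATE[evaluator].values()
    return sum(rates) / len(rates)


def column_mean(writer: str) -> float:
    """Average hire rate a writing style receives across all evaluators."""
    return sum(row[writer] for row in HIRE_RATE.values()) / len(HIRE_RATE)


def own_vs_other(evaluator: str, other_writer: str) -> int:
    """Hire rate for the evaluator's own style minus the rate for another style."""
    row = HIRE_RATE[evaluator]
    return row[evaluator] - row[other_writer]


print(row_mean("GPT"))                 # GPT's average hire rate
print(column_mean("Gemini"))           # average for Gemini-written resumes
print(own_vs_other("Claude", "GPT"))   # Claude: own style vs GPT-written
print(own_vs_other("GPT", "Gemini"))   # GPT: own style vs Gemini-written
```

A positive value from `own_vs_other` means the evaluator favors its own style over the comparison style, and a negative value means it penalizes it.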

Three real cases

Case A — Backend engineer, 4.5 years experience (Persona 1)

Same candidate, identical qualifications, dramatically different outcomes by resume style.

Resume style	GPT	Claude	xAI	Gemini
Claude-written	95 — hire	78 — maybe	92 — hire	95 — hire
GPT-written	75 — maybe	76 — maybe	75 — maybe	72 — maybe

Case B — Junior candidate, senior role (Persona 22)

All models correctly detect the qualification gap — disproving the idea that any model is an undiscriminating screener.

Resume style	GPT	Claude	xAI	Gemini
GPT-written	20 — reject	5 — reject	15 — reject	15 — reject

Case C — Identical profile, 29-point evaluator gap (Persona 60)

GPT scores this Claude-written resume at 74 (maybe). Claude scores the same document at 45 (reject). The largest single discrepancy in the complete dataset.

Resume style	GPT	Claude	Gemini
Claude-written	74 — maybe	45 — reject	72 — maybe

What this means for HR and talent acquisition

No LLM screening without a bias audit. Before deploying any AI screening tool, test whether it systematically favors certain writing styles using synthetic resumes with identical qualifications.

Never use a single AI model as the sole screener. The 29-point gap between GPT and Claude on an identical profile is not an outlier; it is a structural model difference. Multi-model panels with averaged scores are the minimum viable standard.
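A multi-model panel could look roughly like the sketch below, which averages the scores from Case C and reports how far the evaluators disagree. The function name `panel_summary` and the `max_spread` value of 15 points used to route a resume to a human reviewer are hypothetical choices for illustration; the study does not prescribe them.

```python
from statistics import mean


def panel_summary(scores: dict[str, float], max_spread: float = 15) -> dict:
    """Average the panel's scores and flag strong disagreement.

    max_spread is an assumed threshold, not a value from the study.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "mean": mean(values),
        "spread": spread,
        "needs_human_review": spread > max_spread,
    }


# Scores for the Claude-written resume in Case C (Persona 60).
case_c = {"GPT": 74, "Claude": 45, "Gemini": 72}
summary = panel_summary(case_c)
print(summary)
```

Averaging dampens a single strict or lenient evaluator, and the spread makes the disagreement visible instead of hiding it inside one score.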

Transparency toward applicants is required. Companies using AI in screening should disclose which systems are used, in line with EU AI Act Article 13 for high-risk AI in employment contexts.

Flag statistical outliers. Claude's 42-point self-bias and GPT's negative 15-point self-bias are audit triggers. Any model with a hire-rate spread above 10 percentage points between writing styles should be reviewed before production use.
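The 10-point rule can be expressed as a simple check over the hire-rate matrix. The sketch below is a minimal version under stated assumptions: the helper names `style_spread` and `flag_for_review` are hypothetical, and the spread is taken as the highest minus the lowest hire rate in each evaluator's row.

```python
# Hire rates in percent, copied from Score matrix II.
# Outer key = evaluating model, inner key = model that wrote the resume.
HIRE_RATE = {
    "GPT":    {"GPT": 82, "Claude": 95, "Gemini": 97, "Grok": 88},
    "Claude": {"GPT": 42, "Claude": 84, "Gemini": 90, "Grok": 89},
    "Gemini": {"GPT": 81, "Claude": 96, "Gemini": 95, "Grok": 85},
    "Grok":   {"GPT": 79, "Claude": 88, "Gemini": 96, "Grok": 86},
}


def style_spread(rates: dict[str, int]) -> int:
    """Highest minus lowest hire rate across writing styles, in points."""
    return max(rates.values()) - min(rates.values())


def flag_for_review(matrix: dict[str, dict[str, int]],
                    threshold_pp: int = 10) -> dict[str, int]:
    """Return evaluators whose spread exceeds the threshold, with the spread."""
    flagged = {}
    for evaluator, rates in matrix.items():
        spread = style_spread(rates)
        if spread > threshold_pp:
            flagged[evaluator] = spread
    return flagged


flagged = flag_for_review(HIRE_RATE)
print(flagged)
```

Measured this way on the published matrix, all four evaluators exceed the 10-point threshold, with spreads of 15, 48, 15 and 17 points for GPT, Claude, Gemini and Grok.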

What this means for candidates and career centers

"Use ChatGPT for your CV" is no longer sufficient advice. Each AI writing tool leaves a stylistic fingerprint that other models detect and score differently. The Gemini format currently has the highest average hire rate across the four evaluators (94.5%) and is a useful benchmark.

Career coaching must become model-agnostic. Students and job seekers should work with style-diverse revisions and compare multiple AI writing styles before submitting applications.

Research statement

"We did not test whether AI evaluates fairly. We tested whether it evaluates consistently. The answer is no. The same person, the same qualifications, the same role, and a hire-rate difference of 42 percentage points depending on which tool was used to write the resume. That is not a technical detail. That is a question of fairness."

— i10X Research Team

