What assessments predict job performance?
Darren Bush asked whether candidates are gaming your assessments. The deeper question is what predicts job performance at all, and a 2022 re-analysis quietly rewrote the answer.
AI summary
- The methods that best predict job performance are the ones candidates can't game. Sackett et al. (2022) put structured interviews at .42, work samples at .33, and cognitive ability at .31, all observed or scored. Off-the-shelf personality tests sit at .19.
- That .19 is the corrected number for conscientiousness, down from the .31 the industry quoted for 24 years. It ties the unstructured interview and explains about 3.6% of performance, so a trait score should never gate a hire.
- 'Are candidates gaming your assessments?' and 'what predicts job performance?' have one answer: move weight off what candidates say about themselves and onto evidence you watch them produce.
Darren Bush just published a piece asking whether candidates are gaming your assessments. His answer, roughly: probably, and the industry has it coming. He tells a story about gaming an investment-banking aptitude test in university, drilling past papers with classmates until the score climbed.
“Nothing about me had changed,” he writes. “My verbal reasoning hadn’t doubled. My familiarity with the format had.”
He’s right, and the worry is fair. But gaming is the symptom. The question underneath it is the one most hiring teams never ask: what predicts job performance in the first place, and is the assessment you’re worried about gaming even on the list?
For most off-the-shelf assessments, the honest answer is that they predict performance weakly. The methods that predict it best happen to be the hardest to game. So the two questions, what predicts performance and what resists gaming, have largely the same answer.
Start with the number Bush reached for. It does not say what it used to.
The table the industry quoted for 24 years
Bush anchors his case on Schmidt and Hunter. Good instinct. It’s the most cited paper in the field (Schmidt, F. L., & Hunter, J. E., 1998, “The validity and utility of selection methods in personnel psychology,” Psychological Bulletin, 124(2), 262-274). For a generation it was the table everyone quoted: general mental ability around .51, a work sample stacked on top pushing the combined validity to .63, the unstructured interview trailing at .38.
The .63 is the number that built an industry. It’s on vendor decks and in HR certification courses. It’s the implicit promise behind every “science-backed” assessment you’ve ever been pitched.
Then in 2022 the field corrected its own homework.
What changed in 2022
Sackett, Zhang, Berry, and Lievens went back through those estimates and found a statistical adjustment the industry had been applying that often didn’t belong (Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F., 2022, “Revisiting meta-analytic estimates of validity in personnel selection,” Journal of Applied Psychology, 107(11), 2040-2068). The short version: the old numbers corrected for range restriction in a way that assumed you were hiring from the full population.
You aren’t. The correction inflated validities across the board, cognitive ability most of all. When they took it back out, most methods came down, and the ranking reshuffled.
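To see how a range-restriction correction can inflate a validity estimate, here is a minimal sketch using the textbook Thorndike Case II formula for direct range restriction. It is an illustration only, not a reproduction of the procedures in either paper. The observed correlation and the SD ratios are hypothetical values chosen for the example.

```python
import math


def correct_for_range_restriction(r_observed: float, sd_ratio: float) -> float:
    """Thorndike Case II correction for direct range restriction.

    sd_ratio is U = SD(applicant pool) / SD(hired group), assumed >= 1.
    """
    u = sd_ratio
    return (r_observed * u) / math.sqrt(1 + r_observed**2 * (u**2 - 1))


# Hypothetical observed validity in the hired group (not a figure from either paper).
r_obs = 0.25

# Hypothetical SD ratios: 1.0 assumes no restriction; larger values assume the
# hired group is much narrower than the pool it was drawn from.
for u in (1.0, 1.2, 1.5, 2.0):
    corrected = correct_for_range_restriction(r_obs, u)
    print(f"U = {u:.1f} -> corrected r = {corrected:.2f}")
```

The same observed correlation grows as the assumed restriction grows, so a correction applied where little restriction exists overstates validity.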
Here’s the before and after for the common methods:
| Selection method | 1998 estimate (Schmidt & Hunter) | 2022 corrected (Sackett et al.) |
|---|---|---|
| Structured interviews | .51 | .42 |
| Work sample tests | .54 | .33 |
| Cognitive ability tests | .51 | .31 |
| Conscientiousness | .31 | .19 |
| Unstructured interviews | .38 | .19 |
Read the top and the bottom. The structured interview, the same questions for every candidate scored against fixed criteria, now sits at the top at .42. Cognitive ability, the old front-runner, fell to .31. And conscientiousness, the personality trait carrying the entire “science-backed assessment” pitch, landed at .19, level with the improvised phone call these tools were built to replace.
It gets starker when you turn correlation into explained variance. Square .19 and you get about .036. A standalone personality test accounts for roughly 3.6% of the difference in how people go on to perform. The other 96% comes from things a self-report never sees. (Integrity tests, another method the 1998 table ranked near the top, were revised down sharply too.)
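The same squaring can be run down the whole table. This short sketch uses only the corrected validities quoted above; the function name is just a label for the example.

```python
# Corrected validities as quoted in the table above (Sackett et al., 2022).
corrected_validity = {
    "Structured interviews": 0.42,
    "Work sample tests": 0.33,
    "Cognitive ability tests": 0.31,
    "Conscientiousness": 0.19,
    "Unstructured interviews": 0.19,
}


def variance_explained(r: float) -> float:
    """Share of performance variance a predictor accounts for: r squared."""
    return r ** 2


for method, r in corrected_validity.items():
    print(f"{method:<26} r = {r:.2f}  r^2 = {variance_explained(r):.1%}")
```

Even the top method, at .42, accounts for under a fifth of the variance, which is part of why no single signal should carry the decision alone.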
The bottom of the table didn’t change hands. Self-report still trails, and the correction widened the gap between it and the structured interview, from .20 (.51 against .31) to .23 (.42 against .19).
The assessments you can game are the ones that barely predict
Bush’s gaming worry sits right on top of that table.
Two kinds of test get gamed. Self-report inventories, where a candidate under stakes answers as the person the posting described instead of the person they are. And format-coachable tests, the kind you can drill until the score moves but the ability behind it doesn’t.
That’s his investment-banking story: the drilling moved the score and left the ability where it was.
Those are the methods clustered at the bottom of the validity table.
The methods at the top are the ones you watch happen. A work sample, where the candidate produces the actual thing. A structured answer to a real scenario, scored the same way for everyone. You can rehearse a questionnaire about yourself. It’s much harder to fake a task you have to perform while the output gets recorded.
So “are candidates gaming your assessments?” and “what predicts job performance?” collapse into one answer. Move weight off what candidates say about themselves and onto what you watch them do. Validity and resistance to gaming come from the same thing: work you observe instead of claims you take on faith.
This is why Bush can sit on the fence and still be right. He gestures at the validity question without naming it. The 2022 numbers make it unavoidable. The structured interview now carries more than twice the validity of a self-report trait score (.42 against .19), a wider gap than the 1998 table ever showed.
What predicts job performance, and what to fix first
If you’re rebuilding around what the evidence supports, four moves, roughly in order of payoff:
Structure and score the interview you’re already running. It’s the cheapest upgrade available. The unstructured version sits at .19 and the structured version at .42, more than double, for the same half hour of everyone’s time. The only difference is fixed questions and a scorecard you fill in the same way for every candidate. If your question sets are still improvised, that’s the first thing to fix. (Our free interview question generator builds a role-specific set with a scorecard in about a minute.)
Add a work sample. It corrected to .33, and it’s close to un-gameable because the candidate has to produce the work. This is the Steve Yegge point Bush surfaces in his piece: post a real piece of work, let the candidate do it, look at what they actually produced. A scoped version of the job beats a proxy for the job almost every time.
Keep personality and judgment tests, but demote them. One signal, never a gate. A .19 score has no business auto-rejecting anyone. Where it earns its place is upstream of the decision, not in it: a low trait flag is a good prompt for a sharper interview question, not a verdict. That’s the full argument in our post on how Big Five assessments work in hiring, and it holds for branded instruments like Hogan and the Predictive Index too.
Audit like Bush says. Demand validity evidence specific to your role and market, and check for adverse impact. His sharpest line, by way of Jamie Betts: the same teams that run a 40-page review of an AI vendor will run a decade-old, unvalidated psychometric test on every candidate without a second look. Fix that asymmetry.
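The first and third moves can be sketched in code. What follows is a hypothetical illustration, not any vendor's implementation: the criteria names, the 1-to-5 scale, the percentile threshold, and the follow-up question are all assumptions made up for the example. It shows a fixed scorecard applied the same way to every candidate, with a trait score that can only suggest a question and never feeds the interview score.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical criteria for one role; every candidate is rated on the same set.
CRITERIA = ("problem_solving", "communication", "role_knowledge")
SCALE_MIN, SCALE_MAX = 1, 5   # assumed rating scale
TRAIT_FLAG_THRESHOLD = 30     # assumed percentile cutoff for a follow-up prompt


@dataclass
class Scorecard:
    candidate: str
    ratings: Dict[str, int]
    conscientiousness_percentile: Optional[int] = None  # context only

    def __post_init__(self) -> None:
        # Enforce the structure: same criteria and same scale for everyone.
        if set(self.ratings) != set(CRITERIA):
            raise ValueError(f"ratings must cover exactly {CRITERIA}")
        for name, value in self.ratings.items():
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(f"{name} must be between {SCALE_MIN} and {SCALE_MAX}")

    def interview_score(self) -> float:
        """Mean rating across the fixed criteria; trait data is not an input."""
        return sum(self.ratings.values()) / len(CRITERIA)

    def follow_up_prompts(self) -> List[str]:
        """A low trait score suggests a question to ask, not a decision."""
        p = self.conscientiousness_percentile
        if p is not None and p < TRAIT_FLAG_THRESHOLD:
            return ["Ask for a concrete example of seeing a long task through."]
        return []


card = Scorecard(
    "A. Candidate",
    {"problem_solving": 4, "communication": 5, "role_knowledge": 3},
    conscientiousness_percentile=20,
)
print(card.interview_score())
print(card.follow_up_prompts())
```

Keeping the trait score out of `interview_score` is the design point: the self-report can sharpen the conversation but cannot reject anyone on its own.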
None of this is free, and pretending otherwise is how assessments got oversold in the first place. Structured scoring takes setup. Work samples take design and a slice of candidate time. At one or two hires a quarter, the loose old way survives. At a hundred applicants per position, it doesn’t, and that’s where an evidence-first process pays for itself.
Build on evidence you can watch
In a real funnel, that ordering is the design. The signals with the work in them carry the decision. The self-report sits beside them as context, read but never trusted alone. It’s the process we built Truffle around.
Truffle is a candidate screening platform that combines one-way video interviews, resume screening, and talent assessments. You set the criteria for the position. Every candidate answers the same questions on camera, which is the structured-interview method at the top of that table, run at a volume a one- or two-person team can handle.
AI transcribes each response, summarizes it, and scores it against the criteria you set, surfacing match scores and Candidate Shorts, the thirty seconds of an interview worth watching first. You can add a Personality assessment built on validated Big Five research, or a Situational Judgment Test scored against how your team prefers to handle real scenarios. Each one sits in the same view as the recorded answer and the resume, as one signal among several, never the gate.
AI surfaces the evidence. You make the call.
The validity belongs to the method, not the software. Truffle’s job is to make that high-signal method, structured and scored interviewing against criteria you set, cheap enough to run on everyone, not just the final three.
The gap is about to get wider
The 1998-to-2022 correction is less a statistical footnote than a preview. Every time the field rechecks its work, the methods built on watching people do the job hold up, and the methods built on self-description slip.
AI is about to widen that gap. A personality questionnaire was always self-report. Now the resume arrives pre-polished, the cover letter is generated, and the “tell me about a time” answer can be rehearsed with a chatbot the night before. Every signal built on self-description is getting easier to fake by the month.
The ones built on observed work are not. Screening processes that hold up are already being rebuilt around that line.
Bush asked whether candidates are gaming your assessments. The more useful version is the one the validity table answers: are your assessments measuring anything a candidate can’t just perform for you on demand? Score what you can watch. Decide on that.
Frequently asked questions about assessments and job performance
What assessment best predicts job performance?
Structured interviews, in the most recent meta-analysis. Sackett et al. (2022) put them at .42 corrected validity, ahead of work sample tests (.33) and cognitive ability tests (.31). A structured interview just means every candidate answers the same questions, scored against fixed criteria, instead of an improvised conversation that drifts with whoever’s in the room.
Do personality tests predict job performance?
Weakly. Conscientiousness, the strongest single Big Five trait, corrected to .19 in the 2022 re-analysis, which works out to about 3.6% of the variance in performance. That’s a real signal across thousands of hires and close to silent for any one decision, so a personality score shouldn’t carry a hire or reject on its own.
Is cognitive ability still the best predictor of job performance?
Not since 2022. The older Schmidt and Hunter (1998) estimates put general mental ability around .51 and treated it as the foundation of good selection. Sackett et al. corrected it to .31 after removing a range-restriction adjustment that had inflated it, which moved structured interviews to the top of the table.
What’s the hardest assessment for candidates to game?
A work sample, or any task where the candidate produces the actual work while you watch. Self-report tests like personality inventories, along with coachable aptitude tests, are the easiest to game, and self-report also carries the lowest validity in the table. The methods that predict best are also the ones you observe directly.
How many assessments should I use to predict performance?
Think in signals, not a single test. The research favors combining a structured, scored interview with a job-relevant work sample, then treating personality or judgment tests as one input rather than a pass or fail gate. The lift comes from stacking methods that measure different things, not from asking one instrument to carry the whole decision.
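As a rough illustration of why stacking different methods helps, the sketch below applies the standard formula for the multiple correlation of two optimally weighted predictors. The two validities are the corrected figures quoted in this article. The overlap values, meaning the correlation between the two predictors, are hypothetical and do not come from either paper.

```python
import math


def composite_validity(r1: float, r2: float, r12: float) -> float:
    """Multiple correlation of two optimally weighted predictors with the criterion.

    r1, r2: each predictor's validity; r12: correlation between the predictors.
    """
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return math.sqrt(r_squared)


STRUCTURED_INTERVIEW = 0.42  # corrected validity quoted in the article
WORK_SAMPLE = 0.33           # corrected validity quoted in the article

# Hypothetical overlaps between the two methods (assumed, for illustration).
for overlap in (0.0, 0.3, 0.6):
    combined = composite_validity(STRUCTURED_INTERVIEW, WORK_SAMPLE, overlap)
    print(f"predictor overlap {overlap:.1f} -> combined validity {combined:.2f}")
```

Under these assumptions the combined figure stays above .42 at every overlap tried and shrinks as the overlap grows, which is the arithmetic behind stacking methods that measure different things.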