How do Big Five personality assessments work in hiring?
The Big Five is the best science in personality testing, and the strongest trait still ties the unstructured interview at .19. How the scoring works, what the validity numbers mean, and the jobs a trait score does well.
AI summary
- A Big Five assessment turns 50 to 120 self-report statements into five percentile scores against a norm group. No types, no pass marks, and no universally good profile.
- The 2022 Sackett et al. re-analysis put conscientiousness, the strongest trait, at .19 corrected validity. That ties the unstructured interview, explains about 3.6% of performance variance, and sits far below structured interviews at .42.
- Trait scores are valid for populations and nearly mute for one candidate. Use them to design sharper structured questions and smoother onboarding, and decide on role evidence: work samples, structured one-way interviews, and assessments scored against your criteria.
Every page selling Big Five personality assessments in hiring makes the same two claims: the framework is the most scientifically validated in personality psychology, and conscientiousness is the best personality predictor of job performance. Both claims are true. And almost none of those pages print the number that makes them true, because the number does the opposite of selling.
That gap is what this post is about. The Big Five is real science, valid at the population level, and nearly mute at the resolution where hiring happens: one candidate, one position, one decision. “Validated” is a statement about thousands of people. Your decision is about one.
Below: how the assessments actually work, the numbers the vendors leave out, and what a trait score is genuinely good for once you stop asking it to do a job it can’t.
What a Big Five personality assessment actually measures
Strip the branding off any Big Five product (you’ll also see it sold as the OCEAN personality test or the five-factor model) and the machine underneath is the same. You rate somewhere between 50 and 120 short statements, things like “I pay attention to details” or “I feel comfortable around strangers,” on a five-point agreement scale, usually in 10 to 20 minutes.
The commercial standard is the NEO-PI-R, and the public-domain IPIP item pool is what most affordable tools build on. Branded instruments like the Predictive Index measure their own adjacent trait sets, but the items-to-scores machinery is the same. Answers roll up into five continuous dimensions, each with narrower facets underneath (conscientiousness splits into things like orderliness, self-discipline, and deliberation).
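To make the items-to-scores roll-up concrete, here is a minimal sketch in Python. It is not any vendor’s scoring key. The four-item inventory, the trait assignments, and the reverse-keyed statements are illustrative assumptions; real instruments use far more items per trait and publish their own keys.

```python
SCALE_MAX = 5  # five-point agreement scale: 1 = strongly disagree, 5 = strongly agree

# Hypothetical mini-inventory: (statement, trait, reverse_keyed).
# The trait labels and the two reverse-keyed statements are assumptions for illustration.
ITEMS = [
    ("I pay attention to details", "conscientiousness", False),
    ("I leave my belongings around", "conscientiousness", True),
    ("I feel comfortable around strangers", "extraversion", False),
    ("I keep in the background", "extraversion", True),
]


def score_traits(answers):
    """Sum keyed answers per trait, flipping reverse-keyed items first."""
    if len(answers) != len(ITEMS):
        raise ValueError("one answer per item is required")
    totals = {}
    for answer, (_statement, trait, reverse_keyed) in zip(answers, ITEMS):
        if not 1 <= answer <= SCALE_MAX:
            raise ValueError(f"answer {answer} is outside 1..{SCALE_MAX}")
        # On a reverse-keyed item, agreeing signals LESS of the trait: 5 -> 1, 4 -> 2, ...
        keyed = SCALE_MAX + 1 - answer if reverse_keyed else answer
        totals[trait] = totals.get(trait, 0) + keyed
    return totals


raw_scores = score_traits([5, 2, 3, 4])
print(raw_scores)
```

The output is a raw sum per trait. On its own that sum means very little, which is why no report stops there.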
The five dimensions in work terms
- Openness. Appetite for new methods, ideas, and ambiguity. High reads as curious, low as consistent.
- Conscientiousness. Organization, follow-through, impulse control. The trait with the strongest research link to job performance, with a catch we’ll get to.
- Extraversion. Where someone gets energy and how much social bandwidth feels comfortable.
- Agreeableness. The default setting between cooperating and challenging.
- Emotional stability. How fast pressure becomes stress, and how visibly. (Researchers call the inverse neuroticism. Vendors flip it because nobody buys a neuroticism report.)
A percentile locates you in a crowd
Here’s the part most explainers skim, and the part everything else turns on. Nobody receives a raw score. Your summed answers get compared against a norm group, a reference population of earlier test-takers, and what comes back is a percentile. Landing at the 73rd percentile on conscientiousness means 73% of that norm group scored below you.
That’s the whole of Big Five test score interpretation: percentiles against a norm group. No pass marks, no types, no universally good profile. Our free work style profile is a 25-statement version that shows you this logic with your own scores.
Hold on to that mechanic. A percentile is not a private fact about a person. It locates them in a population. Every legitimate claim made for these assessments, and every limit you’re about to read, follows from that one design choice.
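The percentile mechanic fits in a few lines of Python. This is a sketch under stated assumptions: both norm groups are made-up numbers, and `percentile_rank` is a hypothetical helper that counts the share of the norm group scoring strictly below a raw score. Published instruments use much larger norm samples and may handle ties differently.

```python
from bisect import bisect_left


def percentile_rank(raw_score, norm_scores):
    """Percent of the norm group that scored strictly below raw_score."""
    if not norm_scores:
        raise ValueError("norm group must not be empty")
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, raw_score)  # count of scores < raw_score
    return 100 * below / len(ordered)


# Two hypothetical norm groups of raw conscientiousness sums (illustrative only).
general_population = [22, 25, 27, 28, 30, 31, 33, 34, 36, 40]
experienced_accountants = [30, 32, 33, 35, 36, 38, 39, 40, 41, 42]

candidate_raw = 33
print(percentile_rank(candidate_raw, general_population))       # 60.0
print(percentile_rank(candidate_raw, experienced_accountants))  # 20.0
```

Same answers, same raw score, two different percentiles. The number moved because the crowd changed, not the person.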
The validity numbers assessment vendors never print
The case for using trait scores to pick people rests on a validity coefficient: the correlation between scores and later job performance. So it’s fair to ask what that number is.
The corrected estimates from the 2022 re-analysis
In 2022, the Journal of Applied Psychology published the most consequential re-check of selection research in two decades. Sackett, Zhang, Berry, and Lievens went back through the meta-analytic estimates the industry had quoted since the 1990s and corrected a statistical assumption that had inflated them (Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F., 2022, “Revisiting meta-analytic estimates of validity in personnel selection,” Journal of Applied Psychology, 107(11), 2040-2068). Their corrected validities for predicting job performance:
| Selection method | Corrected validity (Sackett et al., 2022) |
|---|---|
| Structured interviews | .42 |
| Work sample tests | .33 |
| Cognitive ability tests | .31 |
| Conscientiousness | .19 |
| Unstructured interviews | .19 |
Conscientiousness, the headline trait, the one carrying the entire “science-backed hiring” pitch, came in at .19. Down from the old .31.
Conscientiousness ties the unstructured interview
Now read the bottom two rows together. The unstructured interview, the improvised, unscored, just-get-them-on-the-phone conversation, also landed at .19. That’s the method every assessment vendor’s homepage exists to mock, and the framework’s best trait just tied it.
It gets less flattering when you convert correlation to explained variance. Square .19 and you get about .036, meaning conscientiousness on its own accounts for roughly 3.6% of the differences in how people go on to perform. The other 96% lives somewhere the trait report isn’t looking.
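The same squaring can be run down the whole table. A short Python sketch, using only the corrected validities listed above; `variance_explained` is a hypothetical helper name, and each figure treats a method on its own, so the percentages are not meant to be added together.

```python
# Corrected validities from the table above (Sackett et al., 2022).
VALIDITIES = {
    "Structured interviews": 0.42,
    "Work sample tests": 0.33,
    "Cognitive ability tests": 0.31,
    "Conscientiousness": 0.19,
    "Unstructured interviews": 0.19,
}


def variance_explained(r):
    """Share of performance variance a single predictor accounts for (r squared)."""
    return r ** 2


for method, r in VALIDITIES.items():
    print(f"{method:<26} r = {r:.2f}   r^2 = {variance_explained(r):.1%}")
```

Squaring stretches the gaps: a validity a little over twice as high works out to several times the explained variance.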
Go read the pages ranking for this topic right now. “Scientifically validated” appears on every one of them, attached to no number at all. (The same silence holds across our guides to Hogan and the other branded instruments.) The validity coefficient is the spec sheet of an assessment, and the industry selling assessments has decided you’re better off not seeing it.
Valid for populations, mute for the one candidate in front of you
So is .19 worthless? No. And this is where both sides of the personality-test fight talk past each other.
Climate versus Tuesday’s weather
Score ten thousand warehouse hires on conscientiousness and track their performance for a year, and the high scorers will do modestly better on average, reliably, replication after replication. That’s real science, and it has earned the word “valid.” But you aren’t hiring ten thousand people. You’re hiring one.
A population-level correlation is climate. A hiring decision is weather on one specific Tuesday. Knowing your city gets 40 inches of rain a year is genuine knowledge, and it still can’t tell you whether to cancel Tuesday’s barbecue. A .19 correlation is that kind of knowledge: true about the crowd, nearly silent about the individual in front of you.
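One way to put a number on “nearly silent” is to ask how often the higher scorer of two random candidates turns out to be the better performer. The Python sketch below is a textbook approximation, not a result from the Sackett paper. It assumes trait scores and performance are jointly normal with correlation r, and `pairwise_pick_accuracy` is a hypothetical helper name.

```python
from math import asin, pi


def pairwise_pick_accuracy(r):
    """Chance that the higher scorer of two random candidates also performs better.

    Assumes score and performance follow a bivariate normal distribution with
    correlation r, where the probability of a concordant pair is 1/2 + asin(r)/pi.
    """
    if not -1 <= r <= 1:
        raise ValueError("a correlation must lie between -1 and 1")
    return 0.5 + asin(r) / pi


for label, r in [("coin flip", 0.0), ("conscientiousness", 0.19), ("structured interview", 0.42)]:
    print(f"{label:<22} r = {r:.2f}   picks the better performer {pairwise_pick_accuracy(r):.0%} of the time")
```

Under those assumptions a .19 signal picks the better of two candidates about 56% of the time, against 50% for a coin. Across ten thousand hires that edge adds up. For one shortlist of two, it is close to a toss-up.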
Why both camps are half right
This is why the argument never settles. Recruiters keep telling us personality tests are pseudoscience, usually right after a trait report contradicted everything a work sample had just shown them. They’re wrong about the science and right about the experience: at the scale of one candidate, the score genuinely didn’t tell them much.
Vendors are the mirror image. Right about the science, wrong about the use, marketing a crowd-level signal as if it said something decisive about the person on your shortlist. (The resolution problem follows trait measurement into newer formats too, including games-based instruments like Pymetrics.)
The best case for using it anyway, taken seriously
The strongest version of the vendors’ case deserves a fair hearing. A pre-employment personality test is cheap: 10 to 20 minutes of candidate time and a few dollars a head at volume. It’s standardized, so every candidate answers the same items and gets scored the same way, which is more consistency than most resume reviews can claim. And conscientiousness adds signal a cognitive ability test doesn’t already capture (cognitive ability itself corrected to .31), so in a large stack it isn’t redundant.
Small edges compound, too. Across five thousand hires a year, 3.6% is real money, and at that scale, with proper norming and legal review, the population math can work.
Where the case comes apart at normal volume
Faking. A Big Five test is self-report, and screening is the most incentivized self-report environment there is. Candidates under stakes tend to answer as the person the posting described, and you can’t reliably tell the polished self-report from the accurate one. That holds for a refined instrument like Caliper as much as for a free quiz. Evidence you watch happen, a work sample or a structured answer to a real scenario, is far harder to fake than a self-description.
Range restriction. Validity estimates come from samples with wide trait spreads, and your applicant pool isn’t one. The people applying for your SDR position already skew social and driven, and all of them are presenting their best selves, which compresses the differences the test needs in order to separate anyone.
Legal exposure. In the US, personality testing is generally permitted when it’s job-relevant and doesn’t function as a medical exam, and some jurisdictions and public-sector contexts restrict it further. None of this is legal advice. The practical read: a trait score can be one signal among several, and the moment it becomes an automatic gate, you’re carrying risk a .19 signal can’t pay for.
Candidate goodwill. Attention is a budget. Candidates will give you twenty focused minutes when the task visibly belongs to the role. What burns them is the generic hoop, and twenty minutes on an off-the-shelf trait inventory is twenty minutes not spent on the structured interview at .42 or the work sample at .33, whose validity coefficients run roughly twice as high as conscientiousness at .19.
As a filter, then, a Big Five assessment brings the weakest signal in the stack. Whatever value it has lives somewhere else.
What Big Five output is actually good for in a hiring process
Three places, in our experience.
Trait language turns vibes into named behaviors
Sit in enough debriefs and you notice the worst ones run on adjectives. “I just didn’t click with him.” “She seemed intense.” The pattern across hiring teams we work with is simple: descriptors invite bias, behaviors invite evidence.
Trait vocabulary is the bridge. “Low agreeableness” is still a descriptor, but it points at a behavior you can actually check: how did she handle the pushback scenario, and what did she say, exactly?
Turn a low score into a structured question
A flag in a trait report makes a better question than a verdict. A finalist at the 20th percentile on emotional stability hasn’t earned a rejection. They’ve earned a specific question about the last time a week fell apart on them, asked of every finalist in the same words and scored against the same criteria. (If your question sets are still improvised, fix that first: our free interview question generator builds a role-specific set with a scorecard in about a minute.)
Scores also age well after the offer, because the incentive to fake disappears on day one. A manager who knows a new hire runs low on openness introduces change with more runway.
One signal inside a process built on role evidence
Run the resolution argument through a real funnel and the design follows: the trait score enters as one signal you read, surrounded by evidence that’s actually about the work. That’s the shape we built Truffle around.
Truffle is a candidate screening platform that combines talent assessments with resume screening and one-way video interviews. You set the criteria for the position. Candidates answer your screening questions on camera, and you can add a Personality assessment built on validated Big Five research, a Situational Judgment Test scored against how your team prefers to handle real scenarios, or an Environment Fit assessment that surfaces whether a candidate’s preferences match the reality of the role.
AI transcribes each response, summarizes it, and scores it against the criteria you set. Candidate Shorts surface the thirty seconds of an interview worth watching first. The trait profile sits in the same view as the resume and the recorded answer to your working-under-pressure question, so no score gets read alone.
AI surfaces the evidence. You make the call.
Used this way, the 3.6% finally gets a seat sized to its contribution. The other 96% of your confidence comes from signals with the work in them.
Use traits to ask better questions, not to answer them
The instrument was honest the whole time. Five distributions, a norm group, a percentile: every piece of that design announces it describes populations. The overreach arrives in the marketing, when a population description gets dressed up as an individual verdict. You now have the number the vendor pages skip, and the question to bring to every assessment pitch from here on: validated against what, at what strength, for which decision?
Score climates. Decide on evidence.
That split is about to matter more, because AI has made polished self-description nearly free. Resumes arrive pre-polished and self-reports arrive rehearsed, which erodes every signal built on how candidates describe themselves and leaves standing the ones built on what you watch them do. Screening processes that hold up are already being rebuilt around that second kind.
A trait percentile will never tell you whether to hire the person in front of you. Used at the right resolution, it makes every question you ask them sharper.
Frequently asked questions about Big Five assessments in hiring
Can the Big Five predict job performance?
At the population level, modestly. The strongest trait, conscientiousness, carries a corrected validity of .19 in the 2022 Sackett et al. re-analysis, which works out to about 3.6% of performance variance. That’s a real signal across thousands of hires and a faint one for any single decision, so it should never carry a hire or reject on its own.
Can candidates fake a Big Five personality test?
Yes. The format is self-report, and candidates under stakes tend to answer as the person the posting describes, a pattern researchers call social desirability bias. Evidence you observe directly, like work samples and structured interview answers, is much harder to fake than a self-description.
Are personality tests legal in hiring?
In the US, generally yes, as long as the test is job-relevant and doesn’t function as a medical or clinical exam. Some jurisdictions and public-sector contexts restrict them further, and automatic pass/fail use raises the risk. None of this is legal advice. If an assessment carries real weight in your process, have counsel review it.
Is the Big Five better than MBTI for hiring?
As measurement, yes. The Big Five reports continuous traits that stay reasonably stable over time, while MBTI sorts people into 16 types and test-retest studies regularly hand people a different type. But beating MBTI is a low bar. Cleaner measurement doesn’t fix the population-versus-individual problem, so neither belongs in the decision seat.
Can you fail a Big Five assessment?
No, there’s no pass mark and no universally good profile. Scores are percentiles against a norm group, and the same profile can read strong for one position and weak for another. An employer using trait percentiles as a pass/fail gate is misusing the instrument.
Are free Big Five tests accurate enough for hiring?
Free short-form tests are fine for self-awareness, and a good way to learn the format. Our free work style profile takes a few minutes and shows you the percentile logic with your own scores. Selection-grade use needs a properly normed instrument, and even then the population-versus-individual limit stands.