Posted on March 15, 2020

The Science Behind Making Software Engineering Interviews Truly Predictive of Job Performance

I’ve spent much of the past year researching a single topic—how companies can design an interview process for software engineers that’s truly predictive of their ability to deliver on the job. What I’ve discovered I’m still struggling to comprehend—there’s an enormous amount of research and empirical evidence out there that details exactly how you can build a more predictive hiring process for software engineers. But the tech industry and hiring managers at large have all but completely ignored this guidance, relying instead on antiquated hiring processes that have actually proven to be anything but effective.

Tikhon Jelvis, Principal AI Scientist at Target, seems similarly flabbergasted by our industry’s inexplicable ignorance. “The fact that more companies aren’t moving in this direction is an indictment of an industry that spends a remarkable amount of time and money on recruiting and selecting candidates,” Jelvis writes.

This paper will build a case for revamping your hiring process around a methodology that’s been proven to deliver better software engineers with the skills required to be successful on the job.

The business case for building a predictive hiring process

In the coming pages I’ll present concrete, research-based evidence for how your company can build a more predictive hiring process for software engineers. But first, here’s why you should listen up.

“Employers must make hiring decisions; they have no choice about that,” write Frank Schmidt and John Hunter in their study The Validity and Utility of Selection Methods In Personnel Psychology. “But they can choose which methods to use in making those decisions. The research evidence summarized in this article shows that different methods and combinations of methods have very different validities for predicting future job performance.”

Schmidt and Hunter’s findings are so comprehensive and compelling that amongst all of the sources cited in the article, we’ll circle back to their findings most often.

“In economic terms, the gains from increasing the validity of hiring methods can amount over time to literally millions of dollars,” writes Schmidt regarding the importance of valid assessment methods. “However, this can be viewed from the opposite point of view: By using selection methods with low validity, an organization can lose millions of dollars in reduced production. In a competitive world, these organizations are unnecessarily creating a competitive disadvantage for themselves. By adopting more valid hiring procedures, they could turn this competitive disadvantage into a competitive advantage.”

While the economic and business benefits of making better hires seem obvious, it’s altogether shocking that so few companies in the tech industry have taken a deliberate approach to designing an interviewing process that’s truly valid and predictive.

A predictive hiring process must be deliberately designed

While the topic of assessment validity may feel academic and a long way from the day-to-day work of writing code and hiring software engineers, Erik Bernhardsson, CTO of Better.com, frames the challenge in a more human light.

“I have done roughly 2,000 interviews in my life. When I started recruiting, I had so much confidence in my ability to assess people,” writes Bernhardsson. “Let me just throw a couple of algorithm questions at a candidate and then I’ll tell you if they are good or not! Over time I’ve come to the (slightly disappointing) realization that knowing who’s going to be good at their job is an extremely hard problem.”

“The correlation between who did really well in the interview process and who performs really well at work is really weak.”

“Confronted by this observation I’ve started thinking about this process as an inherent noise reduction problem. Interviewing should be thought of as information gathering. You should consciously design the process to be the most predictive of future job performance.”

The problem, then, becomes identifying a clear methodology for assessing the future job performance of a candidate.

“A critical precursor to measuring future job performance is defining what successful performance looks like,” says Dr. Ferguson. “It may be different things to different people, depending on who the decision makers are and how they relate to the candidate (e.g., fellow engineer, manager, team leader, etc.). Unless there is agreement on what successful performance is, meaningful attempts to measure it will be impossible.”

Is job tenure a good measure? It’s problematic at best—we’ve all seen companies that hang on to underperforming software developers for far too long. Or consider great developers who get fired for consistently showing up to work late—the candidates clearly possessed the technical skills to excel in the role, but their tenure was cut short by circumstances beyond their ability to do the job well.

In psychometrics, the true output performance we wish to measure is called the criterion. In the case of recruiting, whatever the company uses to measure the success of a hire is the criterion. For example, using peer performance reviews to measure developer productivity might be how a company wishes to measure true performance. Pre-hire you can’t measure the criterion directly, but you can utilize hiring practices that identify underlying traits and abilities which, when measured, should result in meaningful predictions. For example, coding work samples can be utilized to test if a developer can write code, an activity critical to how productive a developer will be at developing new functionality.
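Once a criterion is chosen, the predictive validity of a hiring measure is usually expressed as the correlation between pre-hire scores and the criterion measured later. Here is a minimal sketch of that calculation in Python. The scores, the ratings, and the `pearson_r` helper are all made up for illustration and do not come from any study cited here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: each hire's pre-hire work sample score (0-100) and the
# peer review rating (1-5) they received after a year on the job.
work_sample_scores = [55, 62, 70, 74, 81, 90]
peer_review_ratings = [2.5, 3.5, 3.0, 4.0, 3.5, 4.5]

validity = pearson_r(work_sample_scores, peer_review_ratings)
print(f"Estimated validity of the work sample: {validity:.2f}")
```

A real validation would need far more than six hires, and it only sees the people who were actually hired, so treat a number like this as a rough signal rather than a measurement.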

Schmidt and Hunter’s study uses the criterion of supervisor performance ratings when assessing the validity of various hiring activities. The duo also effectively showcases the opportunity presented by building a hiring process with greater predictive validity by tying employee performance to the overall output of the employee and the dollar value of that output.

“The variability of employee job performance can be measured in a number of ways, but two scales have typically been used: dollar value of output and output as a percentage of mean output. The standard deviation across individuals of the dollar value of output has been found to be at minimum 40% of the mean salary of the job. The 40% figure is a lower bound value; actual values are typically considerably higher. Thus, if the average salary for a job is $40,000, then the standard deviation is at least $16,000. If performance has a normal distribution, then workers at the 84th percentile produce $16,000 more per year than average workers (those at the 50th percentile). And the difference between workers at the 16th percentile (‘below average’ workers) and those at the 84th percentile (superior workers) is twice that: $32,000 per year. Such differences are large enough to be important to the economic health of an organization.”
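The arithmetic in that passage is easy to rerun for your own salary bands. The sketch below uses only the two figures given in the quote (a $40,000 mean salary and the 40% lower bound); the variable names are my own, and the normal distribution is the same assumption the quote makes.

```python
from statistics import NormalDist

mean_salary = 40_000   # example salary from the quote
sd_fraction = 0.40     # lower-bound ratio from the quote

# Standard deviation of the dollar value of output.
sd_output = sd_fraction * mean_salary

# Under a normal distribution, one standard deviation above the mean is
# roughly the 84th percentile and one below is roughly the 16th.
pct_above = NormalDist().cdf(1.0)
pct_below = NormalDist().cdf(-1.0)

superior_vs_average = sd_output      # 84th percentile vs 50th percentile
superior_vs_below = 2 * sd_output    # 84th percentile vs 16th percentile

print(f"SD of output: ${sd_output:,.0f}")
print(f"+1 SD percentile: {pct_above:.0%}, -1 SD percentile: {pct_below:.0%}")
print(f"Superior vs average: ${superior_vs_average:,.0f} per year")
print(f"Superior vs below average: ${superior_vs_below:,.0f} per year")
```

Swapping in a typical software engineering salary for `mean_salary` shows how quickly the gap between a superior and a below-average hire grows.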

Before we look at which hiring activities are truly predictive, let’s look at some of the common pitfalls that the vast majority of the tech industry has succumbed to with their current hiring strategies.

Unstructured Interviews, Resumes, and Reference Checks Simply Don’t Work

Chances are your process for hiring software engineers follows a familiar pattern: you collect resumes, conduct a series of interviews, and check references before making a job offer. There’s no easy way to say this, but this is exactly what doesn’t work. These activities have been shown to be almost completely useless when it comes to predicting the future success of a hire. Let’s look at them one by one.

Unstructured interviews are plagued by our own biases

Interviews have long sat at the epicenter of the hiring process, and it makes sense that you’d want to meet and chat with a job candidate prior to hiring them. But while that’s the case, interviews have consistently been shown—from research studies to write-ups in The New York Times—to be extremely ineffective in predicting the future performance of a new hire.

Jason Dana writes in his article The Utter Uselessness of Job Interviews that the reliance on interviews is a widespread problem.

“Employers like to use free-form, unstructured interviews in an attempt to “get to know” a job candidate,” writes Dana. “But…interviewers typically form strong but unwarranted impressions about interviewees, often revealing more about themselves than the candidates.”

Worse yet, hiring managers show a strong tendency to believe that interviews are the most predictive of all hiring activities, while IQ tests are amongst the least. But the truth is actually the opposite.

Scott Highhouse of Bowling Green State University found in his article Stubborn Reliance on Intuition and Subjectivity in Employee Selection that hiring managers are far too reliant on their own intuition throughout the interview process despite overwhelming evidence that there are far more predictive means of assessing candidates.

“People have an inherent resistance to analytical approaches to selection because they fail to view selection as probabilistic and subject to error,” writes Highhouse. “Another is the implicit belief that prediction of human behavior is improved through experience. This myth of expertise results in an overreliance on intuition and a reluctance to undermine one’s own credibility by using a selection decision aid.”

Said another way, hiring managers are reluctant to use more effective assessment tools because they feel that doing so undermines their own self-worth and importance in the hiring process.

It’s worth noting that interviews do become slightly more predictive of future performance if they’re highly structured. Schmidt and Hunter define the difference between structured and unstructured interviews in this way:

Unstructured interviews

Unstructured interviews have no fixed format or set of questions to be answered. In fact, the same interviewer often asks different applicants different questions. Nor is there a fixed procedure for scoring responses; in fact, responses to individual questions are usually not scored, and only an overall evaluation (or rating) is given to each applicant based on summary impressions and judgments.

Sound familiar? Structured interviews are exactly the opposite.

Structured interviews

Questions to be asked are usually determined by a careful analysis of the job in question. Every candidate is then asked the same questions and their answers are scored objectively using the same rubric.

As a result, structured interviews are more costly to construct and use but are also more valid. In contrast, the much more commonly employed, somewhat free-formed nature of an unstructured interview quickly becomes highly prone to confirmation bias.

“We subconsciously form an opinion about things, and let that influence our decision making. This is dangerous!” writes Bernhardsson. “You start to like a particular candidate a lot for whatever superficial reason, you drop your guard, and start giving them more hints or give them the benefit of the doubt in a way that some other candidate wouldn’t get.”

A highly structured interview is the main remedy, as the structure of the interview helps the hiring manager maintain a higher degree of objectivity. “This has been shown to make interviews more reliable and modestly more predictive of job success,” writes Dana. “Alternatively, you can use interviews to test job-related skills, rather than idly chatting or asking personal questions.”
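To make the idea concrete, here is a minimal sketch of what "same questions, same rubric" can look like in code. The question wording, the 0-4 scale, and the `Question` and `score_candidate` names are hypothetical; the only point is that every candidate is scored on every question against a fixed scale.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Question:
    prompt: str
    max_points: int  # top of the rubric scale for this question

# Hypothetical question set. Every candidate is asked exactly these.
QUESTIONS = [
    Question("Walk through this code segment and point out any bugs.", 4),
    Question("Describe a past architectural decision and its trade-offs.", 4),
    Question("How do you define good quality code?", 4),
]

def score_candidate(points_awarded):
    """Return the candidate's total as a fraction of the available points.

    points_awarded holds one rubric score per question, in question order.
    """
    if len(points_awarded) != len(QUESTIONS):
        raise ValueError("every question must be scored for every candidate")
    for question, points in zip(QUESTIONS, points_awarded):
        if not 0 <= points <= question.max_points:
            raise ValueError(f"score out of range for: {question.prompt}")
    available = sum(question.max_points for question in QUESTIONS)
    return sum(points_awarded) / available

print(score_candidate([3, 2, 4]))  # 9 of 12 available points
```

Refusing to produce a score when a question was skipped is deliberate: it stops an interviewer from quietly swapping in a different question for a candidate they already like.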

Resumes hold little value beyond introducing a candidate's experience

Resumes usually represent the gateway to the hiring process, the very first piece of collateral that job candidates provide to hiring managers as a means of assessing their fit for a given role. The fact that a non-technical person is often first to assess a software engineer’s resume is often cited by rejected developer candidates as a significant problem, but the reasons for this practice are easy to explain.

“Why has pedigree become such a big deal in an industry that’s supposed to be a meritocracy?” writes Aline Lerner in her article Lessons From A Year’s Worth Of Hiring Data. “At the heart of the matter is scarcity of resources. When a company gets to be a certain size, hiring managers don’t have the bandwidth to look over every resume and treat every applicant like a unique and beautiful snowflake. As a result, the people doing initial resume filtering are not engineers.”

“Engineers are expensive and have better things to do than read resumes all day,” continues Lerner. “Enter recruiters or HR people. As soon as you get someone who’s never been an engineer making hiring decisions, you need to set up proxies for aptitude. Because these proxies need to be easily detectable, things like a CS (computer science) degree from a top school become paramount.”

“Bemoaning that non-technical people are the first to filter resumes is silly because it’s not going to change. What can change, however, is how they do the filtering.”

The problem is the information contained on a resume—years of job experience, degrees earned, and titles held in previous roles all perform poorly when it comes to predicting future job performance. As Lerner says we need to change our filtering processes, but doing so needs to include capturing information outside of what’s typically reported on a resume.

Reference checks are anything but objective

Reference checks have low validity, which makes perfect sense—you’re essentially asking the job candidate to present you with someone that has a favorable opinion of them. While it’s certainly a good indication that the candidate has such a relationship, reference checks are anything but objective and have little predictive power.

It’s worth noting that while we’re railing against unstructured interviews, resumes, and reference checks (very likely the core elements of your existing hiring process), these activities aren’t entirely without purpose. A resume is a way for a candidate to introduce their experience, just as an interview gives you an opportunity to get to know a job candidate, and there’s some value in both.

But the takeaway here is straightforward—the overarching objective of your hiring process should be to predict as accurately as possible which candidates will be successful once hired. Unstructured interviews, resumes, and reference checks are simply of little use in this regard.

The Secret Sauce: Aptitude Tests, Work Samples, and Highly Structured Interviews

Now that we’ve detailed the common hiring practices that hold little predictive power, it’s time to share what actually works. Tests of general mental ability (like IQ or other aptitude tests), work samples, and highly structured interviews have all been proven to have truly predictive power when it comes to making great hires.

The table below details the predictive validity of 19 different hiring activities that Schmidt and Hunter assessed candidates against.

The Predictive Validity of Hiring Activities

As you can see, work sample tests are the single most predictive activity in the hiring process, followed closely by general mental ability (GMA) tests and highly structured interviews. It’s also important to note that the table above looks at the validity of each of these personnel measures in isolation. Most hiring processes will employ two or more of these activities, and each can contribute to increasing the predictive validity of your hiring process.

In isolation all of the above measures are far from perfect, but generally using more measures—and specifically the more predictive measures (assuming they aren’t highly correlated to one another)—will improve the predictive validity of your hiring practices. Let’s take a look at the most predictive measures one by one.
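The caveat about correlated measures can be illustrated with the standard formula for the multiple correlation of two predictors with one criterion. This is a sketch under stated assumptions: the validities of 0.5 and 0.4 are illustrative values rather than figures from Schmidt and Hunter's table, `combined_validity` is my own helper name, and the formula assumes the two scores are combined with optimal (regression) weights.

```python
from math import sqrt

def combined_validity(r1, r2, r12):
    """Multiple correlation R of two predictors with a criterion.

    r1, r2: validity of each measure on its own.
    r12:    correlation between the two measures.
    """
    if not -1 < r12 < 1:
        raise ValueError("r12 must be strictly between -1 and 1")
    r_squared = (r1**2 + r2**2 - 2 * r1 * r2 * r12) / (1 - r12**2)
    return sqrt(r_squared)

# Illustrative validities for two measures, at three levels of overlap.
for r12 in (0.0, 0.3, 0.9):
    print(f"r12={r12:.1f} -> combined validity {combined_validity(0.5, 0.4, r12):.2f}")
```

With these inputs the combined validity is about 0.64 when the measures are uncorrelated, 0.56 at a correlation of 0.3, and 0.51 at 0.9, barely better than the stronger measure alone. A second measure earns its cost only to the extent that it captures something the first one misses.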

General Mental Ability (GMA) tests are indicative of ability to learn job skills

While standardized aptitude tests like the IQ test often get a bad rap, the truth of the matter is the usefulness of these tests has been the subject of more studies than just about any other aspect of the hiring process. The evidence is overwhelming—general intelligence shows a very strong correlation with job performance and career success.

GMA tests have not only been shown to be highly predictive of job performance, but as Schmidt finds, “GMA is also an excellent predictor of job-related learning.” One of the major findings around the usefulness of GMA tests is that people who score highly on them perform better on the job because they learn the job-related skills they need more rapidly. It’s this competency in knowledge acquisition that drives their job performance: people with higher intelligence learn faster, which in turn leads to increased job performance.

While GMA tests are highly predictive, they’re most likely not a great fit for companies in the United States when it comes to hiring software engineers.

“In the United States, GMA tests are a poor option for legal reasons: they open you up for discrimination lawsuits under the theory of ‘disparate impact,’” writes Tikhon Jelvis. “Companies can protect themselves by performing an (expensive) study to validate the impact of IQ tests for their specific positions, but this is not worth the expense for most companies. IQ tests are also fraught culturally: I suspect many engineering candidates would be turned off by needing to go through an IQ test as part of hiring.”

Work sample tests are directly relevant to real world work

All of which brings us to work sample tests, the single most predictive measure of a job candidate’s future on-the-job performance. While software engineers may immediately think of algorithmic puzzles when they hear “work sample test,” that’s by no means a fair description of what we’re talking about, so let’s start by defining the term.

“Work sample tests are hands-on simulations of part or all of the job that must be performed by applicants,” write Schmidt and Hunter.

For example, a restaurant owner who’s hiring a chef might ask the candidate to cook one of the restaurant’s menu items, and someone hiring an architect might ask the candidate to design a small building structure to be analyzed for correctness.

Frederick Smith offers a second definition in his paper Work Samples As Measures of Performance writing, “Work samples measure job skills by requiring an individual to demonstrate competency in a situation parallel to that at work, under realistic and standardized conditions. Their primary purpose is to evaluate what one can do rather than what one knows.”

The latter is a key point. While whiteboard coding interviews are useful in learning how a developer thinks through a problem (indicative of what one can do), they’re not at all useful when it comes to assessing their ability to write code (indicative of what code they know in the moment). As a result, whiteboard interviews aren’t a good means of assessing coding ability because they just aren’t terribly relevant to real world programming.

“We do this by designing a comprehensive programming task ahead of time that faithfully represents the actual work somebody will be doing,” writes Tikhon Jelvis. “The name gives it away: we evaluate candidates by looking at a representative sample of their actual work instead of trying to proxy this with undergraduate-style exam questions on a whiteboard.”

“The point of a work sample is, after all, to reduce the inferential leap that must be made between performance in a standardized testing situation and actual job performance,” continues Jelvis. “There is less of a leap needed between behavior in a work sample and behavior in the actual job situation than between performance or problem solving on a paper-and-pencil test and actual job behavior. Because of their close tie to actual work behaviors, work samples also allow an interaction of abilities and skills to occur, an interaction that is often artificially eliminated by rating forms with generalized dimensions of work behavior.”

It doesn’t take a PhD or a room full of smart software engineers to make this connection—it seems very intuitive that a work sample truly reflective of real world work would be a great predictor of how an engineer would actually do once hired. So the question then becomes:

“Why aren’t more companies focused on capturing relevant work samples from software engineers during the hiring process?”

I think there are two primary reasons for this phenomenon. First, algorithmic puzzles have become unpopular with many developers. Erik Bernhardsson captures this challenge perfectly.

“I’ve read about 1,000 Hacker News comments complaining that interview questions about turning a binary tree upside down (or whatever) are stupid because no one would ever do that, or there’s already a library for it, or something else. I think that’s completely besides the point! The real question is: does solving a problem about turning a binary tree upside down predict future job performance?”

Second, wary of creating a poor impression with developer candidates, companies often overcorrect and inject a lengthy, homegrown coding exercise into their hiring process. Oftentimes this activity is designed to capture a relevant work sample, but administering and scoring this style of exercise quickly becomes a hugely time-consuming chore.

A great example of a company that’s effectively walked this tightrope is Slack, which designed a work sample test that assessed developers by giving them a code review activity—something they’d have to do on a nearly daily basis once hired. The result? Slack cut their time-to-hire for software engineers from over 200 days down to 83.

Structured interviews increase predictive validity and objectivity

Including highly structured interviews in your recruiting process is also likely a good way to increase your ability to predict future job performance. The structured nature of the interview helps interviewers retain a higher degree of objectivity, while also presenting an opportunity to make interviews more useful.

In a structured interview—and in fact this is true of all aspects of your hiring process—you have a limited amount of time during which to assess the candidate. Rather than spending a full hour working through a single problem with the candidate or talking freely, instead design a consistent process and a series of questions that probes into as many different areas of software development as are relevant to the role.

Bernhardsson calls this the signal to noise ratio—how much can you learn about a candidate relatively quickly?

“While I don’t like long problems that rely on knowing a certain trick, I think it’s great to have many short interview questions that rely on knowing particular things,” writes Bernhardsson. “If you can go through 20 such problems in one single interview, you increase the signal-to-noise ratio a lot!”
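Bernhardsson’s signal-to-noise point has a close counterpart in psychometrics: the Spearman-Brown prediction formula, which estimates how the reliability of a total score grows as you add comparable questions. The sketch below is an illustration, not a claim about any real interview. The single-question reliability of 0.10 is a made-up value, and the formula assumes the questions are roughly interchangeable and that their errors are independent.

```python
def spearman_brown(single_reliability, n_questions):
    """Predicted reliability of a total score built from n comparable questions."""
    if n_questions < 1:
        raise ValueError("n_questions must be at least 1")
    return (n_questions * single_reliability) / (
        1 + (n_questions - 1) * single_reliability
    )

# Hypothetical: each short question on its own is a very noisy signal.
single_question_reliability = 0.10

for n in (1, 5, 20):
    reliability = spearman_brown(single_question_reliability, n)
    print(f"{n:>2} questions -> reliability {reliability:.2f}")
```

Under those assumptions, one question gives a reliability of 0.10, five give about 0.36, and twenty give about 0.69, which is the intuition behind asking many short questions instead of one long one.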

In the context of a software engineering interview, this might include asking questions about:

Specific segments of code
How they have handled past architectural decisions
What sort of recent technologies have excited them and why
Areas they want to improve as a developer
How they define “good quality” code
The key differences between object-oriented programming and functional programming paradigms
What the MVC paradigm is and an example of some frameworks that use it

Use your time with candidates wisely—it’s worth the upfront effort to design a structured interview process that efficiently surfaces insights on developers across as many relevant dimensions as possible.

A final tip on making your interviews more predictive—interviews by committee (where more than one person is interviewing the applicant at a time) have been shown to make interviews more valid and predictive of future job performance. This remains true even if one interviewer is steering the questioning while another takes notes—the mere presence of multiple interviewers improves both objectivity and the validity of the interview.

Of course, the cost of the interview in terms of resources also increases in tandem, so this will need to be weighed as you design your interviewing approach. Generally speaking, though, having two or more interviewers present will increase the predictive ability of your interviews.

Hiring Cheat Codes

Designing a process for interviewing software engineers that includes a highly structured interview and focuses on capturing a relevant work sample will give you a huge leg up on the competition, who likely aren’t using such statistically predictive measures. So far we’ve presented a mass of evidence showing that this is the recipe for building a hiring process that’s truly predictive of future job performance.

While that’s the case, there are two other “cheat codes” worth mentioning that can be effectively folded into your hiring process.

Harder interviews result in more satisfied hires

Dr. Andrew Chamberlain conducted a study in conjunction with Glassdoor that looks at the relationship between interview difficulty and employee job satisfaction. His findings below present evidence that harder interviews result in more satisfied workers.

Glassdoor Hard Interviews, Happy Workers

This is not a direct measure of future job performance, but it’s at least worthy of consideration as you design your deliberately predictive interview process. The relationship between employee satisfaction and job performance is well documented—satisfied employees put in a greater level of effort, most often leading to a higher level of organizational performance.

Actively seek out signals the market undervalues when hiring

It’s also worth considering—especially in a hiring environment as competitive as the current landscape for software engineers—which signals or criteria you’re using to evaluate candidates and how much the hiring market at large values those same signals.

Bernhardsson covers this topic well in his article How To Hire Better Than The Market when he writes, “What I’m saying is that if you’re hiring, then you will be more successful going after candidates that the market undervalues. It turns out your preference versus the market’s preference matters more than your preference in itself.”

For example, consider the stock market—if everybody considers a stock to be particularly hot and buys shares of the stock, the opportunity that that stock presents erodes in value. Likewise in a hiring context, if the majority of the hiring market is focused on only hiring from Ivy League schools, the relative cost of those hires goes up and your likelihood of actually landing graduates with an Ivy League education goes down.

As a result, hiring managers should actively look for opportunities to hire based on attributes that the market at large undervalues—this is how you can hire better than the market.

Conclusion

Why the tech industry has largely ignored the plethora of evidence on this topic is up for debate, but it’s clear that highly structured interviews and an emphasis on capturing truly relevant coding samples are not the norm. Tech leaders are often quick to cite the importance of “team,” but their relative effort in building a truly predictive hiring process almost always pales in comparison to the effort they put forth in, say, optimizing their customer acquisition process.

“You can see a rough progression in tech interviewing from brain teasers and other nonsense towards work-sample tests, but it hasn’t come nearly far enough,” says Jelvis.

We’ll let our Advisory Board Member, Frank Schmidt, have the final word.

“Use of hiring methods with increased predictive validity leads to substantial increases in employee performance as measured in percentage increases in output, increased monetary value of output, and increased learning of job-related skills.”

That’s enough to get my attention—is it enough to get yours?

Summary of Key Points

The most commonly employed hiring practices—unstructured interviews, resumes, and reference checks—are all but useless when it comes to predicting the future performance of a hire.
General Mental Ability (GMA) tests (like the IQ test) are very predictive of future job performance, largely because people with high intelligence can learn the skills needed to be successful on the job more rapidly. However, due to legal concerns they’re not recommended for companies hiring in the US.
Work sample tests are the single most valid hiring activity and are truly predictive of on-the-job performance.
Interviews become more predictive of future performance when they’re highly structured, allowing the interviewer to maintain a higher degree of objectivity.
Any single personnel measure or hiring activity used in isolation is far from perfect. Companies must deliberately design a hiring process that utilizes multiple measures—doing so will increase the predictive validity of the process.
Oftentimes deciding how you will measure the future performance of a hire is more difficult and time consuming than deciding which personnel measures and hiring activities you’ll employ.
Harder, more rigorous interviews result in more satisfied employees who are more motivated, leading to higher organizational performance.
You can beat the market when hiring by looking for personnel measures and signals that the market at large undervalues.
