Personal Reflections from Hearing the Oral Arguments in the Ricci v. DeStefano Supreme Court Case
By Dan A. Biddle, Ph.D.
CEO, Biddle Consulting Group, Inc.
On April 22, 2009, the U.S. Supreme Court (“USSC”) heard oral arguments from each side in the Ricci v. DeStefano testing case. In this case, 18 candidates (17 Whites and 1 Hispanic) who passed two exams for promotion to Lieutenant and Captain positions are suing the City of New Haven, Connecticut, for refusing to certify the exams and make the promotions. The City refused because the tests had adverse impact (Whites scored higher than African Americans) and, in its view, could not be justified as valid. The plaintiffs argued that their rights under Title VII and the 14th Amendment Equal Protection Clause were violated.
One of the major questions being evaluated in this case is: Should the City be allowed to pull the exam results because it believed the tests themselves were flawed and knew of alternatives with lower adverse impact?
From a technical standpoint, most personnel tests can be grouped into three categories: (1) solid and defensible, (2) defensible but with some limitations, and (3) fatally flawed. A “solid and defensible” test is one that addresses (at a minimum) the requirements of the federal Uniform Guidelines. The test can be shown to be job-related in the sense that it measures critical knowledge, skills, and abilities that are required for job success. The cutoff score methodology should be empirically driven and not be arbitrary in nature. A sample of job experts should be involved in the process of ensuring that the test is valid for the target positions.
Tests that are “defensible but with some limitations” are supported with only some degree of validity evidence. However, they also have one or more flaws that in the end call into question the validity of the scores. To the extent that the scores may not be valid, the decisions made from these scores will not be valid. This places employers in precarious legal situations where they could lose challenges to their tests.
“Fatally flawed” tests are the least likely to survive a legal challenge. Fatal flaws can typically be divided into three categories: design omissions, design flaws, and inappropriate use of the results. In this case, the test had flaws in all three categories. Regarding design omissions, the City’s combined process (a written test and an interview) left out some of the most important skills needed for the fire supervisory positions (such as “Command Presence”). Regarding design flaws, there was a lack of input from local job experts regarding the job-relatedness of the exam. In fact, the written test included some content that was irrelevant to the New Haven Fire Department, such as test questions borrowed from a New York Fire Department that asked the test-takers whether fire equipment should be parked “uptown, downtown or underground when arriving at a fire.” The City of New Haven has no “uptown” or “downtown.” Regarding the use of the test results, an arbitrary 70% cutoff score was used, the tests were combined using an arbitrary 60%-40% weighting scheme, and there was inadequate justification for using the results in a ranked fashion.
In the world of high-stakes testing that is governed by Title VII, an employer that uses a test with such “fatal flaws” would have little hope of mustering a successful defense, and a plaintiff team with even a mediocre attorney and testifying expert could easily take the victory. Indeed, legal databases are filled with examples where fire departments have lost validity cases based on flaws relevant to any one of these categories.
The flaws that were present in the validation of these tests were in fact serious, but each can be easily remedied. The arbitrary 70% cutoff score can be replaced by a cutoff score developed using a job-related process in which a panel of 7-10 subject-matter experts assigns minimum passing scores to the various parts of the exam. This process, called the modified Angoff process, has been supported in dozens of litigation settings, and even by the USSC in US v. South Carolina (434 U.S. 1026, 1978). The arbitrary 60%-40% weighting scheme can easily be replaced by weights derived from a survey given to the same subject-matter expert panel. Critical job skills like “Command Presence” can be included either in the interview process or folded into an Assessment Center (as even some of the USSC Justices suggested).
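The two remedies described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure any particular consultant or court has endorsed: it uses one common form of the modified Angoff process (each expert estimates, item by item, the probability that a minimally qualified candidate answers correctly, and the cutoff is the average of those estimates), and it assumes the weighting survey asks each expert to rate the importance of each exam component on a numeric scale. The function names, the three-expert panel, and every number below are hypothetical; a real panel would have the 7-10 experts mentioned above.

```python
from statistics import mean


def angoff_cutoff(ratings):
    """Return the cutoff as a fraction of items answered correctly.

    ratings[e][i] is expert e's estimate (0.0 to 1.0) of the probability
    that a minimally qualified candidate answers item i correctly.
    """
    # Average across experts for each item, then across items.
    item_means = [mean(item) for item in zip(*ratings)]
    return mean(item_means)


def survey_weights(importance):
    """Turn expert importance ratings into component weights that sum to 1.

    importance maps a component name to the list of ratings it received.
    """
    averages = {name: mean(vals) for name, vals in importance.items()}
    total = sum(averages.values())
    return {name: avg / total for name, avg in averages.items()}


def composite_score(scores, weights):
    """Weighted sum of a candidate's component scores."""
    return sum(scores[name] * weights[name] for name in weights)


# Hypothetical data: 3 experts rating 4 written-test items.
example_ratings = [
    [0.8, 0.6, 0.7, 0.9],
    [0.7, 0.5, 0.8, 0.8],
    [0.9, 0.7, 0.6, 0.7],
]
# Hypothetical importance ratings (1-5 scale) for each exam component.
example_importance = {"written": [3, 4, 3], "oral": [5, 4, 5]}

cutoff = angoff_cutoff(example_ratings)
weights = survey_weights(example_importance)
score = composite_score({"written": 80, "oral": 70}, weights)
print(f"cutoff={cutoff:.3f} weights={weights} composite={score:.2f}")
```

The point of the sketch is that both the passing score and the component weights come out of documented expert judgments rather than being fixed in advance at figures such as 70% or 60%-40%.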
The process of properly validating a test ensures that the most qualified applicants will rise to the top of the list. This can only occur when the test properly represents the key success factors needed for the job (for instance, in this situation, “Command Presence”); measures these factors in a balanced and accurately weighted fashion (the weighting in this case was arbitrary); and produces results that accurately reflect the skill levels actually needed for the position (a validated cutoff score rather than an arbitrary one like 70%). Tests that don’t possess these characteristics can only provide hit-and-miss accuracy when it comes to sifting applicant qualification levels.
Discrimination is obviously a very serious issue. No City board member wants to be blamed for discriminating against any group, whether Whites or African Americans. The trouble in this case, and one that is not often discussed in the news and on Internet blogs, is that adverse impact without validity justification is also discrimination. In this case, the City withdrew the list because it did not believe the test would hold up to a validation challenge, and the test would thus have been shown to be discriminatory against African Americans. The high road in this situation is to institute the “do-over” that Chief Justice Roberts mentioned, not with a motivation to reshuffle the deck, but rather to properly develop and use the tests and then let the applicants score wherever they may. Faced with possible discrimination on either side, a “do-over” may be the “only best option” given the alternatives. But after a new exam process is given (one that addresses the fatal flaw issues), the results need to stand “as is.”
We have yet to see how the USSC will decide on these complex and politically loaded issues. As I see it, however, there may be no clear way out other than supporting the City’s choice to throw out the faulty exam results. While it’s safe to say that most Americans want the hardest-working and most qualified applicants to take their new promotional positions, there would be nationwide ramifications if the USSC ruled that the test results must stand “as is” and, in doing so, indirectly endorsed an exam process that had serious flaws.
Does anyone know if an amicus brief was filed that outlines what type of pre-administration validation assessments were conducted? From my quick review of the Majority's opinion, they appeared to be convinced that ample validity evidence was present. If there was indeed adequate validity, I think the implications for the I-O field are: 1) hopefully practitioners will be more interested in using well-developed tests and decision aids; 2) I-O practitioners need to make the evaluation of various combinations of test components common practice and develop 'best-practice' guidelines for weighing adverse impact (AI) and validity results for these various combinations; and 3) we need to put more focus on the recruitment process in order to increase the number of 'quality' applicants from various backgrounds in the pool.
Posted by: Taylor P. | July 02, 2009 at 11:20 AM
Now that the Supreme Court has ruled against the city's decision to throw out the test, what insight do you think this brings to I/O psychologists who develop tests in similar circumstances?
As you pointed out, the test contained fatal flaws from all three categories. This ruling seems to suggest that a test does not have to be flawless, and possibly that alternatives with lower adverse impact aren't necessarily required. Do you agree or disagree?
Posted by: Matt Smith | June 29, 2009 at 03:25 PM
Well-reasoned analysis. I can only hope that the USSC's decision is similarly sophisticated. I'm concerned that instead we will get a very narrow decision that won't be enough to shed light on what to actually do in this situation, other than develop good tests!
Posted by: BryanB | June 24, 2009 at 03:12 PM