Test Fairness in the New Generation of Large-Scale Assessment
reviewed by Hongli Li & Jacquelyn Bialo
Author(s): Hong Jiao & Robert W. Lissitz (Eds.)
Publisher: Information Age Publishing, Charlotte
Year: 2017
The concept of fairness has been a central issue in the field of educational measurement. Fairness is a fundamental validity issue and requires attention throughout all stages of test development and use (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 2014, p. 49). According to the 2014 Standards, there are four general views of fairness: fair treatment during the testing process, lack of measurement bias in scores, fairness in access to the construct, and validity of interpreting individual scores for intended uses.
Edited by Hong Jiao and Robert W. Lissitz, Test Fairness in the New Generation of Large-Scale Assessment presents a collection of nine chapters covering issues related to test fairness in the context of technology and large-scale assessment. All chapters are based on invited presentations at the fifth Annual Maryland Assessment Research Center Conference in October 2015. The authors include prominent scholars and researchers from major testing companies, universities, and other organizations.
In Chapter One, Mislevy discusses the paradox of performance tasks: the rich conditions and situations built into such tasks can strengthen the inferences drawn from them, but the same conditions can also introduce construct-irrelevant variance. Using real cases, Mislevy demonstrates that performance tasks serve learning and assessment well when inferences are contextualized with regard to students' learning targets and backgrounds, and poorly when they are not.
In Chapter Two, Obregon and Yan report results from a simulation study about the effect of item preknowledge on classification accuracy in credentialing exams. Overall, they find that classification accuracy is not strongly affected by item preknowledge when the true pass rate is high (such as greater than 60%) and the proportion of candidates who benefit from item preknowledge is not excessive (such as 25% or less).
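The logic of such a simulation can be conveyed with a deliberately crude sketch (uniform true scores, a fixed score boost for candidates with preknowledge, and hypothetical parameter values; this is not Obregon and Yan's actual design):

```python
import random

def classification_accuracy(n=10000, cut=0.35, boost=0.08,
                            frac_preknowledge=0.25, seed=1):
    """Share of candidates whose observed pass/fail decision matches
    the decision their true score would produce."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        true_score = rng.random()              # true proportion-correct
        observed = true_score
        if rng.random() < frac_preknowledge:   # this candidate saw items in advance
            observed = min(1.0, observed + boost)
        correct += (true_score >= cut) == (observed >= cut)
    return correct / n
```

With a low cut score (i.e., a high true pass rate) and a modest preknowledge fraction, only candidates whose true scores fall just below the cut can be misclassified, so accuracy stays high, which mirrors the chapter's overall finding.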
In Chapter Three, Zimmerman provides an overview of the factors and challenges that need to be considered when designing accessible assessments. She offers many practical tips and suggests that accessibility "needs to be inherent in the design of the content and not bolted on at the end" (p. 83). She also provides appendices outlining relevant legislation, as well as challenges associated with particular disabilities, accessibility, and possible assistive technology solutions.
In Chapter Four, Miller, Walker, and Letukas report how the College Board used principles of fairness and equity to redesign the SAT. They discuss how the College Board determined what knowledge and skills to measure on the new SAT, as well as the test development process. They also describe the statistical approaches used to demonstrate and ensure test fairness, such as differential item functioning (DIF), differential test functioning (DTF), and differential prediction analysis (DPA). In addition, the authors identify and dispel four often-cited test fairness rumors or misconceptions about the SAT (p. 107).
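Of the procedures named here, DIF screening is the most widely used; one common statistic is the Mantel-Haenszel common odds ratio, sketched below with made-up item counts (a generic illustration of the statistic, not the College Board's operational implementation):

```python
import math

def mantel_haenszel_odds_ratio(strata):
    """strata: one 2x2 table (a, b, c, d) per total-score level, where
    a = reference-group correct, b = reference-group incorrect,
    c = focal-group correct,     d = focal-group incorrect."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Toy data: three score strata with nearly identical group performance.
strata = [(40, 10, 38, 12), (30, 20, 29, 21), (15, 35, 14, 36)]
or_mh = mantel_haenszel_odds_ratio(strata)
mh_delta = -2.35 * math.log(or_mh)  # ETS delta scale; |delta| < 1 is negligible DIF
```

Because examinees are matched on total score before groups are compared, an odds ratio near 1.0 (delta near 0) suggests the item behaves similarly for both groups at the same ability level.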
In Chapter Five, Oliveri and von Davier evaluate whether item parameter estimates used to establish the score scales of the Program for International Student Assessment (PISA) 2009 were invariant for all participating countries and across test administration cycles (e.g., years 2006 and 2009). They suggest that the illustrated quasi-international scale procedure should be used to improve the current operational processes in international large-scale assessments.
In Chapter Six, Sato reviews the role of culture in fair assessment practices. She presents a conceptual framework describing four culture-based, construct-relevant experiential factors and/or cognitive operations that influence students' learning and assessment performance. Sato also lays out general principles and a heuristic model for designing culturally sensitive assessments. Such guidance is intended to help ease the tension between reliability (i.e., standardizing assessments) and validity (i.e., cultural sensitivity).
In Chapter Seven, Bolt, Dowling, Shih, and Loh argue that an item that demonstrates differential item functioning (DIF) is not necessarily unfair. They introduce the idea of using the Blinder-Oaxaca decomposition method to explore and explain the underlying causes of DIF in test items. Results from simulated analyses indicate that the Blinder-Oaxaca decomposition effectively identifies DIF effects as well as the contributions of the explanatory variables behind them. The authors also apply this procedure to examine gender differences in PISA reading scores.
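The Blinder-Oaxaca method, originally from labor economics, splits a mean outcome gap between two groups into a part explained by differences in covariate means and an unexplained part attributed to differences in regression coefficients. A minimal two-fold version on synthetic data (a generic sketch of the method, not the authors' exact analysis) can be written as:

```python
import numpy as np

def oaxaca(X_a, y_a, X_b, y_b):
    """Two-fold Blinder-Oaxaca decomposition of mean(y_a) - mean(y_b)."""
    Xa = np.column_stack([np.ones(len(y_a)), X_a])  # add intercepts
    Xb = np.column_stack([np.ones(len(y_b)), X_b])
    beta_a, *_ = np.linalg.lstsq(Xa, y_a, rcond=None)  # OLS within each group
    beta_b, *_ = np.linalg.lstsq(Xb, y_b, rcond=None)
    xbar_a, xbar_b = Xa.mean(axis=0), Xb.mean(axis=0)
    explained = (xbar_a - xbar_b) @ beta_b    # gap due to covariate differences
    unexplained = xbar_a @ (beta_a - beta_b)  # gap due to coefficient differences
    return explained, unexplained

# Synthetic groups: A has both higher covariate means and a steeper slope.
rng = np.random.default_rng(0)
x_a = rng.normal(1.0, 1.0, 500); y_a = 2 + 1.5 * x_a + rng.normal(0, 0.1, 500)
x_b = rng.normal(0.0, 1.0, 500); y_b = 2 + 1.0 * x_b + rng.normal(0, 0.1, 500)
explained, unexplained = oaxaca(x_a.reshape(-1, 1), y_a,
                                x_b.reshape(-1, 1), y_b)
gap = y_a.mean() - y_b.mean()  # explained + unexplained reproduces this gap exactly
```

In the DIF setting the chapter describes, the "unexplained" component is the part of a group difference that the covariates cannot account for, which is closer to a genuine fairness concern than the raw gap itself.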
In Chapter Eight, Zhang, Dorans, Li, and Rupp investigate fairness issues in automated essay scoring. They first describe e-rater®, an automated scoring system developed by the Educational Testing Service (ETS), and then demonstrate how they evaluated gender fairness in automated scoring using a differential feature functioning (DFF) method. They found that, among examinees with comparable overall scores, females tended to outperform males on some essay features, such as usage, mechanics, grammar, and collocation-preposition.
In Chapter Nine, Schneider, Egan, and Gong describe the challenges of designing fair tests for students with dyslexia. They give an overview of dyslexia and describe the types of accommodations that are often offered to students with dyslexia. The authors also report findings from interviews with four adults with dyslexia about their experiences growing up with dyslexia and how they use technology to help overcome challenges. They conclude the chapter by discussing how fair testing for people with dyslexia may require "non-standardization and [understanding] what limits the standardization places on the ultimate intended inferences" (p. 226).
In summary, this edited book addresses significant and timely issues related to test fairness, particularly as technology plays an increasingly important role in assessment. This book involves rather complicated measurement and statistical procedures, such as differential item functioning, item invariance, and item response theory, and is therefore better suited for readers who already have considerable knowledge in educational assessment and psychometrics. It can be used as a reference book for graduate students in advanced educational measurement courses or professionals who are interested in test development, test validation, and scoring procedures.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Cite This Article as: Teachers College Record, 2018
http://www.tcrecord.org ID Number: 22558