Summary:

In general, newer educational assessments are considered more demanding than the challenges students are currently prepared to face. Two types of factors may contribute to test scores: (1) factors or dimensions that are of primary interest to the construct or test domain; and (2) factors or dimensions that are irrelevant to the construct, causing residual covariance that may impede the assessment of psychometric characteristics and jeopardize the validity of the test scores, their interpretations, and their intended uses. To date, researchers conducting item response theory (IRT)-based simulation research in educational measurement have not been able to generate data that mirror the complexity of real testing data, owing to the difficulty of separating different types of errors from multiple sources and to comparability issues across different psychometric models, estimators, and scaling choices.

In the context of the next-generation K-12 assessments, I employed a computer simulation to generate test data under six test configurations. Specifically, I generated tests that varied in examinee sample size, the degree of correlation among four primary dimensions, the number of items per dimension, and the discrimination levels of the primary dimensions. In addition to the four primary dimensions of interest, I explicitly modeled potential nuisance dimensions; when two nuisance dimensions were modeled, I also varied the degree of correlation between them (an illustrative sketch of such a data-generating setup appears at the end of this summary). I used this approach for two purposes. First, I aimed to explore the effects that two calibration strategies have on the structure of residuals of such complex assessments when the nuisance dimensions are not explicitly modeled during calibration and when tests differ in configuration. The two calibration models were a unidimensional IRT (UIRT) model and a multidimensional IRT (MIRT) model; both considered only the four primary dimensions of interest. Second, I wanted to examine how the residual covariance structures change as the six test configurations vary. The residual covariance in this case would indicate statistical dependencies due to unintended dimensionality.

I employed Luecht and Ackerman's (2017) expected response function (ERF)-based residuals approach to evaluate the performance of the two calibration models and to separate the bias-induced residuals from the other measurement errors. Their approach provides four types of residuals that are comparable across different psychometric models and estimation methods and hence are 'metric-neutral'. The four residuals are: (1) e0, the total residuals or total errors; (2) e1, the bias-induced residuals; (3) e2, the parameter-estimation residuals; and (4) e3, the estimated model-data fit residuals (a schematic reading of how these layers relate is sketched at the end of this summary).

With regard to my first purpose, I found that the MIRT model tends to produce less estimation error than the UIRT model on average (e2 for the MIRT model is less than e2 for the UIRT model) and tends to fit the data better than the UIRT model on average (e3 for the MIRT model is less than e3 for the UIRT model). With regard to my second purpose, my analyses of the correlations of the bias-induced residuals provide evidence of the large impact of the presence of nuisance dimensions, regardless of their number.
On average, I found that the residual correlations increase when at least one nuisance dimension is present but tend to decrease when item discriminations are high.

My findings shed light on the need to consider the choice of calibration model carefully, especially when an assessment shows both intended and unintended indications of multidimensionality. Essentially, I applied a cutting-edge technique, the ERF-based residuals approach (Luecht & Ackerman, 2017), that permits measurement errors (systematic or random) to be cleanly partitioned, understood, examined, and interpreted, in context and relative to difference-that-matters criteria, regardless of the choice of scaling, calibration model, and estimation method. I conducted this work in the context of the complex reality of the next-generation K-12 assessments and sought to maintain adherence to established educational measurement standards (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 1999, 2014; International Test Commission [ITC], 2005a, 2005b, 2013a, 2013b, 2014, 2015).
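To make the simulation design above more concrete, the following is a minimal sketch of one way such data could be generated: a compensatory multidimensional two-parameter logistic (2PL) model with four correlated primary dimensions and one weakly loading nuisance dimension. All numeric settings (sample size, correlation, item counts, loading ranges) and variable names are illustrative assumptions, not the configurations actually studied.

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative configuration (values are assumptions, not the studied conditions).
n_examinees = 2000      # examinee sample size
n_primary = 4           # primary dimensions of interest
n_nuisance = 1          # unintended (nuisance) dimensions
items_per_dim = 12      # items per primary dimension
rho_primary = 0.6       # correlation among primary dimensions

n_dims = n_primary + n_nuisance
n_items = n_primary * items_per_dim

# Latent abilities: primary dimensions correlated at rho_primary; the nuisance
# dimension is left uncorrelated with the primary dimensions in this sketch.
cov = np.zeros((n_dims, n_dims))
cov[:n_primary, :n_primary] = rho_primary
np.fill_diagonal(cov, 1.0)
theta = rng.multivariate_normal(np.zeros(n_dims), cov, size=n_examinees)

# Item parameters: simple structure on the primary dimensions plus a small
# loading on the nuisance dimension to induce construct-irrelevant covariance.
a = np.zeros((n_items, n_dims))
for j in range(n_items):
    a[j, j // items_per_dim] = rng.uniform(0.8, 2.0)           # primary discrimination
    a[j, n_primary:] = rng.uniform(0.1, 0.4, size=n_nuisance)  # nuisance loadings
d = rng.normal(0.0, 1.0, size=n_items)                         # item intercepts

# Compensatory multidimensional 2PL: P(u = 1 | theta) = logistic(a'theta + d).
p_true = 1.0 / (1.0 + np.exp(-(theta @ a.T + d)))
u = (rng.uniform(size=(n_examinees, n_items)) < p_true).astype(int)  # scored 0/1 responses
```

Calibrating the simulated responses with a UIRT model, or with a four-dimensional MIRT model that ignores the nuisance loading, would then set up the kind of residual comparison described above.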
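Read only as a schematic identity, and not as a restatement of Luecht and Ackerman's (2017) exact definitions, the four residual layers can be thought of as telescoping around the true expected response function $P$, the idealized ERF of the (misspecified) calibration model $\tilde{P}$, and the ERF evaluated at the estimated parameters $\hat{P}$, for an observed response $u$:

\[
\underbrace{u - P}_{e_0\ (\text{total})}
  \;=\; \underbrace{\tilde{P} - P}_{e_1\ (\text{bias-induced})}
  \;+\; \underbrace{\hat{P} - \tilde{P}}_{e_2\ (\text{parameter estimation})}
  \;+\; \underbrace{u - \hat{P}}_{e_3\ (\text{model-data fit})}.
\]

Under this reading, the layers sum to the total residual by construction, and it is the correlation of the bias-induced layer across items that surfaces the statistical dependence attributable to unmodeled nuisance dimensions.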