A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
The purpose of this dissertation is to investigate how using different IRT calibration methods may affect student achievement growth pattern recovery. In this study, 96 vertical scales (4 × 2 × 2 × 2 ×3) are constructed using different combinations of IRT calibration methods (separate, pair-wise con...
Main Author: | |
---|---|
Other Authors: | |
Format: | Others |
Language: | English |
Published: |
University of Iowa
2007
|
Subjects: | |
Online Access: | https://ir.uiowa.edu/etd/338 https://ir.uiowa.edu/cgi/viewcontent.cgi?article=1523&context=etd |
Summary: | The purpose of this dissertation is to investigate how using different IRT calibration methods may affect student achievement growth pattern recovery. In this study, 96 vertical scales (4 × 2 × 2 × 2 ×3) are constructed using different combinations of IRT calibration methods (separate, pair-wise concurrent, semi-concurrent, & concurrent), lengths of common-item set (10 vs. 20 common items), types of common item set (dichotomous only vs. dichotomous and polytomous), and numbers of polytomous item (6 vs. 12) for 3 simulated datasets which differ in sample size (500, 1000, 5000 per grade). Three criteria (RMSE, SE and bias) are used to evaluate the performance of these calibration methods on proficiency score distribution recovery over 40 replications. The results suggest that for data used in this study, when parameters of interest are related to measuring students' growth (i.e., proficiency score mean and effect size), pair-wise concurrent calibration overall produced the most accurate results. When parameters of interest are related to performance variability (i.e., standard deviation), concurrent calibration in general produced the most stable and accurate estimates. When the emphasis is to classify students' performance accurately, with the increase of sample size, taken collectively, pair-wise concurrent and semi-concurrent calibration outperformed concurrent and separate calibration. Overall, pair-wise concurrent was more effective than the other methods in constructing a vertical scale and use of either separate or concurrent calibration to create a vertical scale seems least warranted.
In addition, it is observed that (1) Larger sample size stabilized estimation results and reduced error; (2) Compared to tests containing 10 common items, errors and biases were in general smaller for tests with 20 common items; (3) Compared to tests containing a mixed-format common-item set, errors and biases were usually smaller for tests containing a dichotomous-only common-item set; (4) For tests containing a mixed-format common-item set, errors and biases were in general smaller for tests containing more polytomous items; and (5) For tests containing a dichotomous-only common-item set, increasing the number of polytomous items did not necessarily either reduce or increase errors and biases. |
---|