A comparison study of IRT calibration methods for mixed-format tests in vertical scaling

The purpose of this dissertation is to investigate how the choice of IRT calibration method affects the recovery of student achievement growth patterns. In this study, 96 vertical scales (4 × 2 × 2 × 2 × 3) are constructed using different combinations of IRT calibration methods (separate, pair-wise con...

Bibliographic Details
Main Author: Meng, Huijuan
Other Authors: Vispoel, Walter P.
Format: Others
Language: English
Published: University of Iowa 2007
Subjects: Education
Online Access: https://ir.uiowa.edu/etd/338
https://ir.uiowa.edu/cgi/viewcontent.cgi?article=1523&context=etd
id ndltd-uiowa.edu-oai-ir.uiowa.edu-etd-1523
record_format oai_dc
spelling ndltd-uiowa.edu-oai-ir.uiowa.edu-etd-1523 2019-10-13T04:55:24Z A comparison study of IRT calibration methods for mixed-format tests in vertical scaling Meng, Huijuan The purpose of this dissertation is to investigate how the choice of IRT calibration method affects the recovery of student achievement growth patterns. In this study, 96 vertical scales (4 × 2 × 2 × 2 × 3) are constructed using different combinations of IRT calibration methods (separate, pair-wise concurrent, semi-concurrent, and concurrent), lengths of common-item set (10 vs. 20 common items), types of common-item set (dichotomous only vs. dichotomous and polytomous), and numbers of polytomous items (6 vs. 12), for 3 simulated datasets that differ in sample size (500, 1000, or 5000 per grade). Three criteria (RMSE, SE, and bias) are used to evaluate how well these calibration methods recover the proficiency score distribution over 40 replications. The results suggest that, for the data used in this study, when the parameters of interest relate to measuring students' growth (i.e., proficiency score mean and effect size), pair-wise concurrent calibration overall produced the most accurate results. When the parameters of interest relate to performance variability (i.e., standard deviation), concurrent calibration in general produced the most stable and accurate estimates. When the emphasis is on classifying students' performance accurately, pair-wise concurrent and semi-concurrent calibration, taken collectively, outperformed concurrent and separate calibration as sample size increased. Overall, pair-wise concurrent calibration was more effective than the other methods in constructing a vertical scale, and the use of either separate or concurrent calibration to create a vertical scale seems least warranted.
In addition, it is observed that (1) larger sample sizes stabilized estimation results and reduced error; (2) compared to tests containing 10 common items, errors and biases were in general smaller for tests with 20 common items; (3) compared to tests containing a mixed-format common-item set, errors and biases were usually smaller for tests containing a dichotomous-only common-item set; (4) for tests containing a mixed-format common-item set, errors and biases were in general smaller for tests containing more polytomous items; and (5) for tests containing a dichotomous-only common-item set, increasing the number of polytomous items did not consistently reduce or increase errors and biases. 2007-12-01T08:00:00Z dissertation application/pdf https://ir.uiowa.edu/etd/338 https://ir.uiowa.edu/cgi/viewcontent.cgi?article=1523&context=etd Copyright 2007 Huijuan Meng Theses and Dissertations eng University of Iowa Vispoel, Walter P. Lee, Won-Chan Education
collection NDLTD
language English
format Others
sources NDLTD
topic Education
spellingShingle Education
Meng, Huijuan
A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
description The purpose of this dissertation is to investigate how the choice of IRT calibration method affects the recovery of student achievement growth patterns. In this study, 96 vertical scales (4 × 2 × 2 × 2 × 3) are constructed using different combinations of IRT calibration methods (separate, pair-wise concurrent, semi-concurrent, and concurrent), lengths of common-item set (10 vs. 20 common items), types of common-item set (dichotomous only vs. dichotomous and polytomous), and numbers of polytomous items (6 vs. 12), for 3 simulated datasets that differ in sample size (500, 1000, or 5000 per grade). Three criteria (RMSE, SE, and bias) are used to evaluate how well these calibration methods recover the proficiency score distribution over 40 replications. The results suggest that, for the data used in this study, when the parameters of interest relate to measuring students' growth (i.e., proficiency score mean and effect size), pair-wise concurrent calibration overall produced the most accurate results. When the parameters of interest relate to performance variability (i.e., standard deviation), concurrent calibration in general produced the most stable and accurate estimates. When the emphasis is on classifying students' performance accurately, pair-wise concurrent and semi-concurrent calibration, taken collectively, outperformed concurrent and separate calibration as sample size increased. Overall, pair-wise concurrent calibration was more effective than the other methods in constructing a vertical scale, and the use of either separate or concurrent calibration to create a vertical scale seems least warranted.
In addition, it is observed that (1) larger sample sizes stabilized estimation results and reduced error; (2) compared to tests containing 10 common items, errors and biases were in general smaller for tests with 20 common items; (3) compared to tests containing a mixed-format common-item set, errors and biases were usually smaller for tests containing a dichotomous-only common-item set; (4) for tests containing a mixed-format common-item set, errors and biases were in general smaller for tests containing more polytomous items; and (5) for tests containing a dichotomous-only common-item set, increasing the number of polytomous items did not consistently reduce or increase errors and biases.
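The design arithmetic and the three evaluation criteria named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions below are the conventional ones and are an assumption: bias is the mean estimate minus the true value, SE is the standard deviation of the estimates across replications, and RMSE is the root mean squared deviation from the true value. The factor labels, the function name `recovery_criteria`, and the simulated estimates are hypothetical and are not taken from the dissertation.

```python
import itertools
import math
import random

# Factor levels as listed in the abstract; the variable names are illustrative.
CALIBRATION = ["separate", "pair-wise concurrent", "semi-concurrent", "concurrent"]
COMMON_ITEMS = [10, 20]
COMMON_ITEM_TYPE = ["dichotomous only", "dichotomous and polytomous"]
POLYTOMOUS_ITEMS = [6, 12]
SAMPLE_SIZE = [500, 1000, 5000]

# Full crossing of the five factors: 4 x 2 x 2 x 2 x 3 conditions.
conditions = list(itertools.product(
    CALIBRATION, COMMON_ITEMS, COMMON_ITEM_TYPE, POLYTOMOUS_ITEMS, SAMPLE_SIZE
))


def recovery_criteria(estimates, true_value):
    """Return (bias, se, rmse) for one parameter over replications.

    Assumed conventional definitions, not confirmed by the source.
    SE uses the population form (divide by n), so that
    rmse**2 == bias**2 + se**2 holds exactly.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n)
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, se, rmse


# Hypothetical example: 40 replications of an estimated grade-level mean
# whose true value is 0.5, generated with made-up bias and noise.
rng = random.Random(0)
true_mean = 0.5
estimates = [true_mean + 0.1 + rng.gauss(0, 0.05) for _ in range(40)]
bias, se, rmse = recovery_criteria(estimates, true_mean)

print(len(conditions))
print(f"bias={bias:.3f} se={se:.3f} rmse={rmse:.3f}")
```

Under these assumed definitions, a method with small SE but large bias can still have a large RMSE, which is one reason to report the three criteria together.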
author2 Vispoel, Walter P.
author_facet Vispoel, Walter P.
Meng, Huijuan
author Meng, Huijuan
author_sort Meng, Huijuan
title A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
title_short A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
title_full A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
title_fullStr A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
title_full_unstemmed A comparison study of IRT calibration methods for mixed-format tests in vertical scaling
title_sort comparison study of irt calibration methods for mixed-format tests in vertical scaling
publisher University of Iowa
publishDate 2007
url https://ir.uiowa.edu/etd/338
https://ir.uiowa.edu/cgi/viewcontent.cgi?article=1523&context=etd
work_keys_str_mv AT menghuijuan acomparisonstudyofirtcalibrationmethodsformixedformattestsinverticalscaling
AT menghuijuan comparisonstudyofirtcalibrationmethodsformixedformattestsinverticalscaling
_version_ 1719265024803864576