Comparison of different item types in terms of latent trait in mathematics assessment

Bibliographic Details
Main Author: Wang, Zhen
Language: English
Published: 2009
Online Access: http://hdl.handle.net/2429/12796
Description
Summary: The question of whether multiple-choice (MC) and constructed-response (CR) items measure the same construct has not been satisfactorily addressed after many years of investigation. Previous studies comparing different formats in the domains of writing, reading, mathematics, and science have provided inconsistent results. The purpose of the present study was to find out if format differences lead to multidimensionality in mathematics assessments. The study also added further to our knowledge about dimensionality. In addition, the study examined whether different formats assessed similar latent constructs based on Bloom's learning taxonomy. Because format differences may occur in one ability group of examinees and not in another, the effects of format differences on the performance of students with different ability levels were also examined. The four analyses reported in this investigation focused on mathematics assessments across different grade levels and time points of the school year. The four data sets were: (1) The Third International Mathematics and Science Survey (Grades 3 and 4, 1995); (2) The Third International Mathematics and Science Survey (Grades 7 and 8, 1995); (3) British Columbia Grade 12 Provincial Mathematics Examination (April 1998); and (4) British Columbia Grade 12 Provincial Mathematics Examination (August 1998). Two different psychometric models were applied to address the research questions. First, Full-Information Item Factor Analysis (FIFA) (Bock & Aitkin, 1981), a combination of factor analysis and item response model analysis, was applied as an exploratory approach to detect the dimensionality of the test structures comprising different formats. Second, the Multidimensional Random Coefficient Multinomial Logit Model (MRCMLM) (Adams, Wilson, & Wang, 1997), a type of multidimensional latent trait model, was used as a confirmatory approach.
In addition, analyses of cognitive demand and content were used to assist with the interpretation of the factor analyses. Two IRT-based models (FIFA and MRCMLM), as well as examination of factor loadings, provided richer and stronger evidence than the single method applied in previous studies of format differences. The findings indicated that one dominant trait was present in each data set. The existence of local item dependence and different cognitive demands (e.g., computation skill) were the most reasonable explanations for the existence of the non-significant minor dimensions. It appeared that the differences between MC and CR items did not affect the test structures. Most MC and CR items had high loadings on the same factor in the one-factor solution. In addition, MC and CR items correlated highly in the four sets of analyses. Therefore, the hypothesis that MC and CR items measured different mathematical proficiency was rejected. Additionally, MC and CR items did not differ in assessing students' cognitive ability beyond knowledge level. Such findings confirmed Hancock's (1996) conclusion that MC and CR items measured the same ability at each level of Bloom's cognitive framework. High and low ability students differed in dealing with MC and CR item types. It appeared that the data were two-dimensional for low ability students and unidimensional for high ability students in three of the data sets (Data One, Data Three, Data Four). For the test (Data Two) that appeared to be more difficult than the other three tests, the data were two-dimensional for high ability students and unidimensional for low ability students. In addition, low ability students performed better on MC items than on CR items, whereas high ability students performed similarly or better on CR items than on MC items in the three data sets. It appeared that a unidimensional IRT model can be used to calibrate those mathematics tests incorporating both MC and CR item types.
However, the test structure may be two-dimensional (MC vs. CR) for the subgroups (high and low ability students). Therefore, reporting two scores (MC vs. CR) may sometimes be useful for teachers and parents in diagnosing students' weaknesses in the two formats. Teachers can thus find ways to help certain groups of students improve their performance. The findings might be useful to teachers, test developers, and assessment specialists in making decisions about item format selection, model selection, test development, scoring, and score reporting.