Summary: | The question of whether multiple-choice (MC) and constructed-response (CR) items
measure the same construct has not been satisfactorily addressed after many years of
investigation. Previous studies comparing different formats in the domains of writing, reading,
mathematics, and science have provided inconsistent results.
The purpose of the present study was to find out if format differences lead to
multidimensionality in mathematics assessments. The study also added further to our knowledge
about dimensionality. In addition, the study examined whether different formats assessed similar
latent constructs based on Bloom's learning taxonomy. Because it is possible that the format
differences may occur in one ability group and not in another ability group of examinees, the
effects of format differences on the performances of students with different ability levels were
also examined.
The four analyses reported in this investigation focused on mathematics assessments
across different grade levels and time points of the school year. The four data sets were: (1) The
Third International Mathematics and Science Study (Grades 3 and 4, 1995); (2) The Third
International Mathematics and Science Study (Grades 7 and 8, 1995); (3) British Columbia
Grade 12 Provincial Mathematics Examination (April 1998); and (4) British Columbia Grade 12
Provincial Mathematics Examination (August 1998).
Two different psychometric models were applied to address the research questions. First,
Full-Information Item Factor Analysis (FIFA) (Bock & Aitkin, 1981), a combination of factor
analysis and item response model analysis, was applied as an exploratory approach to detect the
dimensionality of the test structures comprising different formats. Second, the Multidimensional
Random Coefficient Multinomial Logit Model (MRCMLM) (Adams, Wilson, & Wang, 1997), a
type of multidimensional latent trait model, was used as a confirmatory approach. In addition,
analyses of cognitive demand and content were used to assist with the interpretation of the factor
analyses. Two IRT-based models (FIFA and MRCMLM), together with an examination of factor
loadings, provided richer and stronger evidence than the single methods applied in previous
studies of format differences.
The findings indicated that one dominant trait was present in each data set. Local item
dependence and differing cognitive demands (e.g., computation skill) were the most
reasonable explanations for the non-significant minor dimensions. It appeared
that the differences between MC and CR items did not affect the test structures. Most MC and
CR items had high loadings on the same factor in the one-factor solution. In addition, MC and
CR items correlated highly in the four sets of analyses. Therefore, the hypothesis that MC and
CR items measured different mathematical proficiencies was rejected. Additionally, MC and CR
items did not differ in assessing students' cognitive ability beyond knowledge level. Such
findings confirmed Hancock's (1996) conclusion that MC and CR items measured the same
ability at each level of Bloom's cognitive framework.
High and low ability students differed in dealing with MC and CR item types. It
appeared that the data set was two-dimensional for low ability students and unidimensional for
high ability students in three of the data sets (Data One, Data Three, Data Four). For the test (Data
Two) that appeared to be more difficult than the other three tests, the data set was two-dimensional
for high ability students and unidimensional for low ability students. In addition,
low ability students performed better on MC items than on CR items, whereas high ability
students performed similarly or better on CR items than on MC items in the three data sets.
It appeared that a unidimensional IRT model can be used to calibrate mathematics
tests incorporating both MC and CR item types. However, the test structure may be two-dimensional
(MC vs. CR) for the subgroups (high and low ability students). Therefore, reporting
of two scores (MC vs. CR) may sometimes be useful for teachers and parents to diagnose
students' weaknesses in the two formats. Teachers can thus find ways to help certain groups of
students improve their performance. The findings might be useful to
teachers, test developers, and assessment specialists in making decisions in terms of item format
selection, model selection, test development, scoring, and score reporting.
|