RUNNING HEAD: IRT and Classical Polytomous and Dichotomous Methods

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Christine E. DeMars

James Madison University

Author Note

Correspondence concerning this manuscript should be addressed to Christine DeMars, Center for Assessment and Research Studies, MSC 6806, James Madison University, Harrisonburg VA 22807. E-mail: demarsce@jmu.edu

Abstract

Four methods of scoring multiple-choice items were compared: dichotomous classical (number-correct), polytomous classical (classical optimal scaling, COS), dichotomous IRT (three-parameter logistic, 3PL), and polytomous IRT (nominal response, NR). Data were generated to follow either a nominal response model or a non-parametric model, based on empirical data. The polytomous models, which weighted the distractors differentially, yielded small increases in reliability compared to their dichotomous counterparts. The polytomous IRT estimates were less biased than the dichotomous IRT estimates for lower scores. The classical polytomous scores were as reliable as, and sometimes more reliable than, the IRT polytomous scores. This was encouraging because the classical scores are easier to calculate and explain to users.

Scoring Multiple Choice Items: A Comparison of IRT and Classical Polytomous and Dichotomous Methods

Multiple choice items are often scored dichotomously by treating one option choice as correct and treating the distractors as equally wrong. In item response theory (IRT), the test can then be scored using the one-, two-, or three-parameter logistic model (1PL, 2PL, or 3PL). In classical test theory, one point can be given for each correct answer and the test score can be the sum or mean of these points. These approaches do not take into account which incorrect distractor was selected by an examinee who failed to choose the most correct answer. Item response theory approaches to modeling each distractor individually include the nominal response (NR) model (Bock, 1972) and multiple-choice (MC) modifications of the NR model which take guessing into account (Samejima, 1981, Section X.7; Thissen & Steinberg, 1984). The NR and MC models require large samples and can be difficult to estimate for response categories with few respondents (De Ayala & Sava-Bolesta, 1999; DeMars, 2003; Thissen, Steinberg, & Fitzpatrick, 1989). Sympson and Haladyna (1988) developed a simple, non-parametric method of determining scoring weights for each category, which they termed polyweighting. Polyweighting is similar to Guttman's (1941) method of weighting categories to maximize internal consistency, which is a special case of generalized optimal scaling (McDonald, 1983). The purpose of this study is to compare the accuracy and precision of dichotomous and polytomous scoring using both IRT and classical models.
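The iterative logic of polyweighting can be illustrated in code. The following NumPy sketch is not from the cited papers; the function names (cos_scores, percentile_ranks), the convergence tolerance, and the averaging of tied ranks are illustrative assumptions.

```python
import numpy as np

def percentile_ranks(x):
    """Percentile ranks (0-100) with tied values assigned their mean rank.
    (Tie handling is an assumption; the original algorithm leaves it open.)"""
    n = len(x)
    order = x.argsort(kind="stable")
    ranks = np.empty(n)
    i = 0
    while i < n:
        j = i
        while j + 1 < n and x[order[j + 1]] == x[order[i]]:
            j += 1  # extend the run of tied values
        ranks[order[i:j + 1]] = (i + j) / 2.0  # mean of tied positions
        i = j + 1
    return 100.0 * (ranks + 0.5) / n

def cos_scores(responses, key, n_iter=50, tol=1e-8):
    """Iterative distractor-weighted scoring in the spirit of polyweighting.

    responses: (n_examinees, n_items) array of chosen option indices
    key:       (n_items,) array of correct option indices
    """
    n_examinees, n_items = responses.shape
    # Initial examinee scores are number-correct.
    scores = (responses == key).sum(axis=1).astype(float)
    for _ in range(n_iter):
        pct = percentile_ranks(scores)
        new_scores = np.zeros(n_examinees)
        for j in range(n_items):
            for k in np.unique(responses[:, j]):
                chose = responses[:, j] == k
                # Option weight = mean percentile rank of its choosers,
                # applied to every option, correct or not.
                new_scores[chose] += pct[chose].mean()
        new_scores /= n_items  # examinee score = mean of selected option weights
        if np.max(np.abs(new_scores - scores)) < tol:
            scores = new_scores
            break
        scores = new_scores
    return scores
```

On toy data this reproduces the defining feature of distractor weighting: two examinees with the same number-correct score can receive different scores depending on which distractors they selected.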

Scoring Models

The most common method of scoring items in classical test theory is number-correct (or percent-correct). This procedure gives all incorrect responses a score of zero. Alternatively, when the option score is the mean score of the examinees who chose that option, and the examinee score is the mean of the option scores selected, item-total correlations (Guttman, 1941) and coefficient alpha reliability (Haladyna & Kramer, 2005; Lord, 1958) are maximized. This scoring method has been called classical optimal scaling (COS) or dual scaling (McDonald, 1983; Warrens, de Gruijter, & Heiser, 2007), as well as polyweighting (Sympson & Haladyna, 1988) or polyscoring (Haladyna, 1990). The label COS will be used in what follows because polyweighting and polyscoring could easily be confused with other types of polytomous scoring such as the polytomous IRT models.

Sympson and Haladyna (1988) detailed a simple procedure to obtain the response options' different weights. In Sympson and Haladyna's algorithm, each examinee's initial score is the number-correct score. Based on these scores, the response option score (including the correct option) is calculated as the mean percentile rank of the examinees who chose that option. Total scores can then be re-computed with these new weights, followed by re-computation of the option scores, continuing until there is little change in the examinee or option scores. This procedure is often followed using the percent-correct scores or z-scores in place of the percentile ranks (Haladyna, 1990; Haladyna & Kramer, 2005; Hendrickson, 1971; Warrens et al., 2007), in which case the option and examinee scores are equivalent to Guttman's (1941) procedure. Crehan and Haladyna (1994) advocated using the percentile rank because the category weight depends on the difficulty of the other test items when the total score is used in estimating the weights. Within a given test form, both methods should produce similar weights because of the monotonic relationship between total score and percentile rank.

In IRT, the 1PL, 2PL, and 3PL models treat all incorrect options as a single category. In Bock's (1972) nominal response (NR) model and Thissen and Steinberg's (1984) multiple choice models, the probability of each option is modeled separately, without imposing an a priori ordering on the options as the graded response (Samejima, 1969) and partial credit (Masters, 1982) models do. The NR model is: