All Psychometric Models Are Wrong

The British statistician George Box is credited with the quote, “All models are wrong but some are useful.” As psychometricians, we must never forget this perspective. We cannot be so haughty as to think that our psychometric models actually represent the true underlying phenomena and that any data that does not fit nicely is just noise. We need to remember that everything we do is an approximation, and respect the balance between parsimony and parameterization.

Really… all psychometric models are wrong?

Yeah, there is no TRUE model that perfectly describes the interaction between an examinee and a test item. Obviously the probability of a correct response is primarily due to important factors such as examinee ability, item difficulty, item quality, the presence of guessing, and the scoring function of the item. There are also additional factors, such as student motivation, timing factors, lighting in the room, screen size, whether they broke up with their girlfriend/boyfriend the previous day, whether their mom made their favorite breakfast that morning… you get the picture. Attempting to model all those factors is certainly overparameterization.

Wikipedia has a lengthier quote from Box on that aspect:

Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so overelaboration and overparameterization is often the mark of mediocrity.

Most psychometricians, if not all, would agree that my earlier description of overparameterization is valid. The controversy in the field of psychometrics is which of those “important factors” I mentioned qualify as overparameterization. The Rasch model famously boils down the interaction to a single item parameter (difficulty) and a single person parameter (ability). Many psychometricians consider this to be underparameterization since, for example, items are known to differ widely in their quality (discrimination). The Rasch cohort would consider the two- and three-parameter item response theory (IRT) models to be overparameterization, especially since they necessitated the development of new parameter estimation algorithms in the 1970s. There are some practitioners in each camp who would claim that the other is the “mark of mediocrity.”
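For readers who think better in code, the models in this debate can be sketched as one logistic function. This is a minimal Python illustration, not an estimation routine. The function name `prob_correct` and the example parameter values are assumptions made up for the sketch, and it uses the plain logistic form without the 1.7 scaling constant that some texts include.

```python
import math

def prob_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response for an examinee with ability theta.

    b: item difficulty
    a: item discrimination (fixed at 1 in the Rasch form)
    c: lower asymptote, i.e. guessing (fixed at 0 in the Rasch and 2PL forms)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Same examinee (theta = 0.5) and same difficulty (b = 0.0) under three models.
# All parameter values here are hypothetical.
rasch = prob_correct(0.5, b=0.0)                      # difficulty only
two_pl = prob_correct(0.5, b=0.0, a=2.0)              # adds discrimination
three_pl = prob_correct(0.5, b=0.0, a=2.0, c=0.25)    # adds guessing

print(round(rasch, 3), round(two_pl, 3), round(three_pl, 3))
```

Each extra keyword argument is one more item parameter that has to be estimated from data, which is exactly what the two camps disagree about.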

IRT continues to add more and more parameters, such as multidimensionality, response time, and upper asymptote. For the most part, these are only academic curiosities, existing only to publish papers on new research, even though most assessments in the world still struggle to apply the Rasch model from 1960.

On the other end of the spectrum is classical test theory, which is based on simple mathematics like averages, proportions, and correlations. This greatly underparameterizes what is actually going on. The point-biserial coefficient, for example, assumes that the relation of ability to getting an item correct is linear, which is blatantly false since the probability cannot go above 1.0 or below 0.0.
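The linearity problem is easy to see with a toy example. The sketch below computes a point-biserial correlation by hand and then the straight line that goes with it. The ten scores and responses are made-up illustration data, not real results, and the variable names are my own.

```python
import math

# Made-up illustration data: total scores for ten examinees and their
# 0/1 responses to a single item.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
item = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(item) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, item))
sxx = sum((x - mean_x) ** 2 for x in scores)
syy = sum((y - mean_y) ** 2 for y in item)

# The point-biserial is the Pearson correlation between a 0/1 item and a score.
r_pbis = sxy / math.sqrt(sxx * syy)

# Least-squares straight line predicting the item score from the total score.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def linear_prediction(x):
    """Predicted 'probability' of a correct response under the straight line."""
    return intercept + slope * x

print("point-biserial:", round(r_pbis, 3))
print("line at lowest score:", linear_prediction(1))
print("line at highest score:", linear_prediction(10))
```

With this data the line dips below 0 for the lowest scorer and rises above 1 for the highest, which no probability can do. A logistic curve, as used in IRT, stays inside those bounds by construction.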

Sooo… How do I select a psychometric model?

Well, try to be cognizant of the tradeoff between parsimony and parameterization, which is one of several tradeoffs when selecting an IRT model. There is no answer that is right all the time; it is more a matter of whether your data fits a model and whether the model satisfies your requirements for a particular situation. That is, whether it is truly useful, which is Box’s original point. But don’t forget that all the models are wrong!

Nathan Thompson, PhD

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College with a triple major of Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, which has provided extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/ .