It is claimed that the Rasch model does not fit very well, and that it has to be improved for this reason. Improvement is typically seen as adding parameters to the equation of the trace line for each item. The 2PL model gives each trace line a different slope for extra flexibility, the 3PL model tries to be more realistic by introducing a non-zero lower asymptote for random guessing, the 4PL model adds an upper asymptote for random slipping, and so on. The price to pay is dubious mathematical properties and highly unacceptable scoring rules. Already the 3PL model is known to be unidentified: estimation of the asymptotes often needs to be helped with tricks that largely predetermine the outcomes, and the scoring rule punishes everyone for guessing, whether they actually guessed or not. I won't even consider the 4PL model, but I will try to demonstrate that the 2PL model is possibly the worst choice for high-stakes assessment because it fares particularly badly in terms of both fit and fairness.
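To fix notation before going any further: with $\theta$ for ability, $b_i$ for the difficulty of item $i$, $a_i$ for its slope, and $c_i$ and $d_i$ for the lower and upper asymptotes, the trace lines of these models are

$$
\begin{aligned}
\text{1PL (Rasch):}\quad & P(X_i=1\mid\theta) = \frac{e^{\theta-b_i}}{1+e^{\theta-b_i}},\\
\text{2PL:}\quad & P(X_i=1\mid\theta) = \frac{e^{a_i(\theta-b_i)}}{1+e^{a_i(\theta-b_i)}},\\
\text{3PL:}\quad & P(X_i=1\mid\theta) = c_i + (1-c_i)\,\frac{e^{a_i(\theta-b_i)}}{1+e^{a_i(\theta-b_i)}},\\
\text{4PL:}\quad & P(X_i=1\mid\theta) = c_i + (d_i-c_i)\,\frac{e^{a_i(\theta-b_i)}}{1+e^{a_i(\theta-b_i)}}.
\end{aligned}
$$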
First we need to agree on what we mean by fit and what we mean by fairness. If we give up as many degrees of freedom as there are items, goodness of fit as measured by some chi-squared statistic is bound to improve, sometimes trivially, sometimes not. I would like to know more. Item by item, where does the model fit: all over the ability continuum, in the lower part, in the upper part, or perhaps in the middle, where the vast majority of examinees are to be found? These are all questions well answered by the item-total regressions produced by dexter. A good way to approach fairness is by drawing parallels with some other areas where it is of paramount importance, such as sports. Every sports federation has a thick book of rules that try to cover all imaginable situations before any contest has even started. They might say, for example, that athletes may compete barefoot or wearing one or two shoes, and then devote a few pages to the definition of what constitutes acceptable shoes.
Testing does have rules of this kind, but they apply mostly to the testing situation: may examinees use pocket calculators? What if they need to go to the toilet? How do we detect and handle cheating? But if we take the 2PL model seriously, the most important rule of all, namely how much credit is given for answering any particular item correctly, is only determined after the test is over and the data has been collected and processed. Even worse, we cannot give a clear answer as to why one item should earn three times more credit than some other item, because we don't know ourselves. We can't tell a highly discriminating item from a less discriminating one by its content alone, the way we can distinguish between easy and hard items, and item writers cannot produce items with low or high discrimination if asked to. Among the several explanations (or guesses) as to why the estimated discrimination parameters in a 2PL model differ, arguably the most popular one is that the test is not perfectly unidimensional. However, we usually have no clear idea about the dimensions, and trying to read them off from the item parameter estimates is a bit like predicting the future from coffee grounds.
I will propose my own guess about what can make the estimated discrimination parameters in a 2PL model different, and I will test it with simulated data. My theory is quite simple and, I believe, realistic:
* the test is unidimensional, because we made every effort to have it that way;
* the item difficulties vary, because we always try to cover the ability continuum uniformly in order to achieve a uniform error of measurement;
* the item discriminations are all the same, because we aim to have them this way and, even if we didn't, we would not know how to make them different;
* because it is a high-stakes test, at least some examinees will try to guess the correct answer when an item is too difficult for them, but, of course, they will not do this when they find the item easy.
Starting from these simple assumptions, I spaced 20 difficulties uniformly over the [-2, +2] continuum and sampled 500 abilities from the standard normal distribution. For each person-item combination I simulated two responses, one based on random guessing with a success probability of .25 and the other based on the Rasch probability, and retained the response that had the higher probability of success, as a rational person can be expected to do. The 1PL, 2PL, and 3PL models will be fitted to these data with the popular package mirt, but first let us examine the item-total regressions for four items spaced evenly from the easiest (Item 1) to the most difficult but two (Item 18):
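A minimal sketch of this setup in R; the seed, the item labels, and the exact items shown are my own guesses, and I assume dexter's interaction model (`fit_inter`) and its plot method for the item-total regressions:

```r
library(dexter)

set.seed(42)                                    # arbitrary seed, not from the post
n_items   <- 20
n_persons <- 500
delta <- seq(-2, 2, length.out = n_items)       # difficulties, uniform over [-2, +2]
theta <- rnorm(n_persons)                       # abilities, standard normal

# Rasch success probability for every person-item combination
p_rasch <- plogis(outer(theta, delta, `-`))

# a rational examinee guesses (p = .25) only when that is the better bet
p <- pmax(p_rasch, 0.25)
x <- matrix(rbinom(n_persons * n_items, 1, p), n_persons, n_items)
colnames(x) <- sprintf("item%02d", 1:n_items)

# item-total regressions from dexter
rules <- data.frame(item_id    = rep(colnames(x), each = 2),
                    response   = rep(0:1, n_items),
                    item_score = rep(0:1, n_items))
db <- start_new_project(rules, ":memory:")
add_booklet(db, as.data.frame(x), booklet_id = "sim")
m <- fit_inter(db)
plot(m, items = c("item01", "item07", "item12", "item18"), show.observed = TRUE)
```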
We can see where the data is, and it looks as expected; the Rasch model fits quite well, particularly in the white area where the central 90% of all observations are concentrated, and throughout the upper part. Only in the very difficult Item 18 does it struggle a bit in the lower part. This is not surprising, because I did not put different slopes into the simulated data. The surprises are yet to come as we fit the 1PL, 2PL, and 3PL models.
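The model fitting itself is a few lines of mirt (a sketch, reusing the simulated matrix `x` from above; `verbose = FALSE` only silences the EM output):

```r
library(mirt)

m1 <- mirt(as.data.frame(x), 1, itemtype = "Rasch", verbose = FALSE)
m2 <- mirt(as.data.frame(x), 1, itemtype = "2PL",   verbose = FALSE)
m3 <- mirt(as.data.frame(x), 1, itemtype = "3PL",   verbose = FALSE)

plot(m1, type = "trace")   # 1PL trace lines
```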
The trace lines for the 1PL are not particularly interesting, especially as, unlike with the item-total regressions, we cannot see where the data is situated. It is not even easy to see that, in spite of all the guessing taking place, the items are correctly ordered by difficulty (although this is evident in the estimates themselves). Will the 2PL do better?
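Its trace lines come from the same kind of call, reusing `m2` from the sketch above:

```r
plot(m2, type = "trace")   # 2PL trace lines
```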
So here is the problem: the 2PL model detects the guessing on the more difficult items and tries to accommodate it by making their trace lines progressively flatter. I would not call this good fit, and it is disastrous for test fairness: the more difficult the item, the less credit will be given for answering it correctly. The problem shows even more clearly in a plot of the estimated slopes against the true difficulties:
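In mirt, the estimated slopes can be extracted with `coef`; a sketch, reusing `m2` and the true difficulties `delta` from above:

```r
a2 <- coef(m2, simplify = TRUE, IRTpars = TRUE)$items[, "a"]
plot(delta, a2,
     xlab = "true item difficulty", ylab = "estimated 2PL slope")
```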
Finally, the 3PL model:
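Again a one-liner, reusing `m3` from above:

```r
plot(m3, type = "trace")   # 3PL trace lines
```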
The 3PL seems to handle the situation better: because it has a more appropriate way to accommodate guessing, it correctly finds the slopes to be (approximately) equal. However, it comes with its own problems:
* While the 1PL model converged in 15 iterations and the 2PL model needed 14, the 3PL model took 500 iterations. This is usually seen as an indication of instability.
* The lower asymptotes of the easy items are badly estimated: few persons will guess on an easy item, so there is not much data for a precise estimate. The problem is often 'solved' by using a prior, and, in the absence of data, the prior dominates the estimate (see the sketch after this list).
* If the model is used for grading, it will impose a penalty for guessing. The appropriate instructions to the examinees would then be: "Dear students, this test is important for your life, so think hard, and if you don't know the correct answer, by all means guess, because you will be punished for guessing anyhow."
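The second point is easy to inspect on the fitted model; a sketch, reusing `m3` from above:

```r
# estimated lower asymptotes: for the easy items hardly anyone guesses,
# so these values are driven largely by the prior rather than by the data
round(coef(m3, simplify = TRUE, IRTpars = TRUE)$items[, "g"], 2)
```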
The assumptions under which I simulated the data set are close to the so-called 1PL model with a guessing parameter (1PL-G), which may explain reality well but unfortunately shares all the disadvantages of the 3PL model, from problematic identification to scores penalized for guessing.
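For reference, the 1PL-G keeps the common slope of the Rasch model but gives each item its own lower asymptote:

$$P(X_{pi}=1\mid\theta_p) = c_i + (1-c_i)\,\frac{e^{\theta_p-b_i}}{1+e^{\theta_p-b_i}}$$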
Under realistic conditions, it seems that we cannot do much better than rely on the Rasch model. It retains the same scoring rule as CTT: simple, known a priori, and widely seen as fair. Even in the presence of massive guessing, it fits well for most of the data, especially in the upper ability range where the important selection and placement decisions tend to be taken. The most promising route to improvement is to physically reduce the incentive to guess through a well-considered multi-stage test of the kind implemented in dexterMST.
Is there nothing good to be said about the 2PL model, then? Of course there is. A better fitting model may be very useful in research situations (to which high-stakes testing does not belong). Modern software typically allows the slopes to be negative, overcoming the old and naive practice of constraining them to be positive "on theoretical grounds". This is particularly important when analyzing opinion data. People do not simply like Biden better than Trump (or vice versa): if they like one, they typically dislike the other. A 2PL where the slopes can be negative acts like a poor man's unfolding model, while a 2PL with slopes constrained to be positive would at best tell us only half of the story. In testing, a negative slope would identify an item where the answer key is wrong; in spite of maximum care, this happens in about 2% of all items.
Our real problem in high-stakes assessment is guessing, not different slopes. It defeats our models because it defeats us as practitioners, not the other way round. Unlike cheating proper (copying, consulting smuggled books or notes), it cannot be controlled, and there is no perfect way to account for it. Among the three options mentioned here: letting low-ability guessers walk away with a bonus, punishing high-ability examinees for answering the most difficult items correctly, or punishing everyone for other people's guessing, I certainly prefer the first.
PS When I said that I don't know of any way to produce items with a predictable discrimination, I was not 100% honest. Items with low discrimination are easily obtained by sloppy writing: for example, when two of the three distractors are so obviously wrong that nobody in their right mind would ever choose them. The recipe for highly discriminating items is: place them at the end, speed the test, and score the not-reached items as wrong. But I doubt that these pieces of advice will be appreciated in practice.