vignettes/blog/2020-11-23-hot-on-the-trail-of-warm.Rmd
2020-11-23-hot-on-the-trail-of-warm.Rmd
Back in 1989, Thomas A. Warm (1937–2019) published a paper in Psychometrika that was to have an important influence on practical testing. It described a way to reduce the inherent bias in the maximum likelihood estimate of the person’s ability, given the responses and the item parameters. All testing is ultimately about estimating ability, so the paper naturally got a lot of attention.
We are not aware of any subsequent publications by T.A., although we did find a highly readable IRT primer on the Internet. Sadly, we also found an obituary. It says that he was in the army and that he enjoyed Japanese drumming (who doesn’t?), but there is no mention of psychometrics at all. Looking further, we found this picture on LinkedIn:
We don’t know whether this elusiveness was due to the extreme modesty of a private person, or to the military status of some of his employers, such as the U.S. Coastal Guard Institute. But we feel that T.A. deserves some kind of tribute by our community, and Timo proposed to write one. Predictably, the formulae started pouring out immediately, but what nicer tribute for a scientist?
Starting with the simplest non-trivial case, we assume that the test consists of equally difficult Rasch items with the probability to find the correct answer on any of the items. The ML estimate (MLE) of is where is the number of items answered correctly. From the invariance property of the MLE, it follows that is the MLE of ability.
Unfortunately, is infinite for extreme scores; that is, when or when . One could say that is extremely biased and (under) over-estimates the ability corresponding to () . To reduce bias, Haldane (1956) suggested that we replace by If we do so, we find which we will call Haldane’s estimate. As Haldene’s estimate is finite even for the extreme scores, it is obviously less biased than the MLE.
The Warm estimate was similarly developed to prevent small-sample
bias but, apparently, using a very different approach. Specifically, the
Warm estimator is a weighted maximum likelihood estimate (WMLE) that
maximizes a weighted likelihood function
.
We obtain the Warm estimate when
,
where
is the test information function. Here,
As the right-hand side is the original
likelihood function with
replaced by
and
replaced by
,
it follows that the Warm estimate of ability equals Haldane’s estimate;
see also Verhelst, Verstralen, and Jansen
(1997).
This blog intends to explain the Warm estimate. We assume that the
IRT model is a nominal response model. That is, a person with ability
answers independently to different items and has a probability
to earn an item score of
on item
where
is an item category parameter with
,
and
.
We focus on the estimation of
and we assume both the
and
to be fixed. The IRT model is then a member of the exponential family
with canonical parameter
where the test score is a sufficient statistic for
.
This includes the Rasch model and the OPLM as special case. If the
are fixed, it also includes the 2PL.
We have a scalar parameter and a statistic . When estimating , is called an estimator and its observed value an estimate. For simplicity, we use the single term estimate and a single symbol; i.e., for the MLE and for the Warm estimate.
The MLE is the value that maximizes the likelihood function or, equivalently, its natural logarithm the log-likelihood function . It is defined as the root of the score function:
The WMLE is the root of the weighted score function where is the log-weight function. The Warm estimate is a special kind of WMLE where . Note that .
Bias is the difference between the expected value of the estimate and the true value where the expectation is taken with respect to the distribution of the data given the true value of ability. In the example with equivalent Rasch items, the MLE, , is an unbiased estimate of . However, the estimate , although it is the MLE of , was not unbiased. This is not surprising given that An estimate of would only be unbiased if is a linear function of . To obtain the Warm estimate, we essentially ‘biased’ the MLE of to obtain an estimate of that was less biased.
In general, the MLE is not unbiased but the bias is of order : So the bias vanishes as . Unfortunately, the number of items is typically small which is why we have to worry about bias. The term is called the first-order bias.
Note that the mean-squared error, is also of order with its main contribution the variance since is of order . Thus, a biased estimator may be better than an unbiased one if it has smaller variance and hence smaller mean-squared error.
Using a Taylor-series expansion to approximate around the true value , Lord (1983) found: where . Lord’s result is an instance of a general result derived, for example, in Cox and Hinkley (1974) or, slightly easier, in Mardia, Southworth, and Taylor (1999).
Note that with the item information function. If we write for the average information per item, we find that Thus, the first-order bias is zero at the value of where the (average) information function reaches its maximum value. It follows that first-order bias is smallest for those persons whose ability matches the difficulty of the test. For example, in the case of equally difficult Rasch items, which is zero when and the ability of the test taker matches exactly the difficulty of the items.
Lord suggests to correct the MLE for first-order bias. That is, to use whose bias . Efron et al. (1975) shows that this still holds if we substitute for the true value. The problem is that this ‘bias-corrected’ estimate would be undefined when the ML estimates are infinite.
Like Haldane, Warm suggests instead a method to prevent bias. Namely, to weight (or bias) the score function to reduce the bias in the MLE. He found that: such that if we define .
Warm also succeeded in showing that the asymptotic distribution of the WMLE equals that of the MLE. As grows to infinity, both estimates are normally distributed around the true value with variance .
If is a prior density function of , the weighted likelihood is (proportional to) a posterior density function. It follows that is a Bayesian modal estimate of when the prior . This prior is is known as Jeffreys prior.
Jeffreys prior was named after the statistician who introduced it as a prior distribution that results in a posterior that is invariant under reparametrization Jeffreys (1946). Warm notes the connection but doesn’t seem to believe that Jeffreys prior makes sense. He writes:
Suppose the same test was given to two groups of examinees, whose distributions of are different and known. Bayesians will be obliged to use different priors for the two groups. For WLE the same would be used for both groups, because is a function of the test only. Similarly, if two different tests of the same ability are given to a single group, Bayesians would use the same prior for both tests, whereas for WLE a different would be used for each test.
Personally, we see no objection to being ignorant of group differences and base our ability estimate on the test only.
The Warm estimate is not easy to calculate.
For the Rasch model it is relatively simple. The log-likelihood function is The corresponding score function is As a property of the exponential family, this has the form . Hence, finding the MLE of ability means finding the ability such that the expected test score equals the observed one. The information function is where . Finally, $$ J(\theta) = \sum_i P'_i(\theta)\left[1-2P_i(\theta)\right]\\ $$ The log-weighted score is thus: This leads to a very simple function.
Warm_Rasch = function(test_score, delta)
{
Sw = function(theta)
{
Pi = 1/(exp(delta - theta)+1)
dPi = Pi*(1 - Pi)
I = sum(dPi)
J = sum(dPi*(1 - 2*Pi))
return(test_score - sum(Pi) + J/(2*I))
}
return(uniroot(Sw, interval=c(-10,10))$root)
}
It is more complex for the full NRM allowing for polytomous items. We have already discussed the calculation of the information function in an earlier blog entry. As before, let be a dummy coded response where if the response to item was scored in category was picked. Specifically, we derived:
We now need where Noting that we find, after some tedious but straightforward algebra, that we can write:
All we need can be written in terms of sums , for . which inspired the following function to calculate the Warm estimate using a parms object or data.frame of item parameters.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Warm = function(test_score, parms)
{
if(!inherits(parms,'data.frame'))
parms = coef(parms)
weighted_score = function(theta)
{
s = parms |>
rename(a = 'item_score') |>
group_by(item_id) |>
mutate(P = exp(a*theta - cumsum(beta*(a-lag(a,default=0))))) |>
mutate(P = P/(sum(P)+1)) |> #normalize
summarise(E = sum(a*P),
I = sum(a^2*P) - E^2,
J = sum(a^3*P) - 3 * E * sum(a^2*P) + 2 * E^3) |>
summarise(E=sum(E), I=sum(I), J=sum(J))
test_score - s$E + s$J/(2*s$I)
}
uniroot(weighted_score, interval=c(-10,10))$root
}
The user-function ability in dexter will of course do the job using a dedicated method to locate the root of the weighted (log-) likelihood.
In closing, we mention that Firth (1993), independently of Warm, developed the idea beyond the exponential family. In these very capable hands, Warm’s legacy was taken up and generalized by Kosmidis and Firth (2009) and implemented in an R package called brgl.
We derive an expression for the first-order bias term . To this effect, we use a Taylor-series expansion to approximate around the true ability . By definition, . Thus, if we take expectations on both sides we find: Using:
we find that Rearranging shows that: Since , it follows that .