It is claimed that the Rasch model does not fit very well, and that it has to be improved for this reason. Improvement is typically seen as adding parameters to the equation of the trace line for each item. The 2PL model gives each trace line a different slope for extra flexibility, the 3PL model tries to be more realistic by introducing a non-zero lower asymptote for random guessing, the 4PL model adds an upper asymptote for random slipping, and so on. The price to pay is dubious mathematical properties of the models and highly unacceptable scoring rules. Already the 3PL model is known to be unidentified, estimation of the asymptotes often needs to be helped with tricks that largely predetermine the outcomes, and the scoring rule punishes everyone for guessing, regardless of whether they guessed or not. I won’t even consider the 4PL model but I will try to demonstrate that the 2PL model is possibly the worst choice for high stakes assessment because it fares particularly badly in terms of both fit and fairness. First we need to agree on what we mean by fit and what we mean by fairness. If we give up as many degrees of freedom as there are items, goodness of fit as measured by some chi squared statistic is bound to increase – sometimes trivially, sometimes not. I would like to know more. Item for item, where does the model fit: all over the ability continuum, in the lower part, in the upper part, or perhaps in the middle where the vast majority of examinees are to be found? These are all questions well answered by the item-total regressions produced by **dexter**. A good way to approach fairness is by drawing parallels with some other areas where it is of paramount importance, such as sports. Every sports federation has a thick book of rules that try to cover all imaginable situations before any contest has even started. 
They might say, for example, that athletes may compete barefoot or wearing one or two shoes, and then devote a few pages to the definition of what constitutes acceptable shoes. Testing does have rules of this kind but they apply mostly to the testing situation: may examinees use pocket calculators? What if they need to go to the toilet? How do we detect and handle cheating? But if we take the 2PL model seriously, the most important rule: how much credit is given for answering any particular item correctly, is only determined after the test is over and the data has been collected and processed. Even worse, we cannot give a clear answer why one item should give three times more credit than some other item – because we don’t know ourselves. We can’t tell a highly discriminating item from a less discriminating one by its content alone, the way we can distinguish between easy and hard items. Item writers cannot produce items with low or high discrimination if asked to. Among several explanations (or guesses) as to why the estimated discrimination parameters in a 2PL model differ, arguably the most popular one is that the test is not perfectly unidimensional. However, we usually have no clear idea about the dimensions, and trying to read them off from the item parameter estimates is a bit like predicting the future from coffee sediments. I will propose my own guess about what can make the estimated discrimination parameters in a 2PL model different, and I will test it with simulated data. My theory is quite simple and, I believe, realistic:

I am having a bad day, dear readers. People seem to all run into me, or try to knock me down with their bicycles, or stand in the way, or whatever. A million times. So I will rant on some Linux-related topics for a while, and at the end I will say something about installing **dexter** and **dextergui** on Ubuntu Linux for the first time. Those in a hurry please scroll down. Good or bad, it is a great day for me because I finally got rid of a Linux distribution called Elementary OS. Hallelujah! I installed it because it promised to look and feel like MacOS – not that I am so partial to Apple but my taste in design seems to have much in common with that of Steve Jobs and, of course, of Dieter Rams. But Elementary OS! As it is based on Ubuntu but not an official clone, it lags behind the latest versions. Various things did not work properly – most recently, the sound broke. But the most horrid thing about it is the “installation process”. Everyone who has updated Ubuntu recently knows that it is a breeze. Your home folder is preserved, and so is the software that you have installed. With that sort of expectation, I was disappointed to discover that Elementary OS insists on wiping any previous version entirely and starting from scratch. So far so good, but then it failed to recognize the main disc of a brand new computer and happily installed itself on a huge external USB hard drive without the hint of a warning, killing all data on it in the process! I am now running Ubuntu Budgie, which is 22.04 LTS, runs beautifully, and it is a pleasure to look at. Don’t tell me that it is not important – otherwise we would not decorate our homes and would still live in cages.

In a recent post, Jesse has given a somewhat involved example of simulating Rasch / PCM data with **dexter**. In spite of the rather complex design involving planned missingness and adaptivity, the models at the heart of it are well-known. In this post, we show how to simulate from Haberman’s interaction model (Haberman (2007)), which is very interesting, quite useful in practice, and to our knowledge only available in **dexter**. And, as an Easter promotion, we give you two methods for the price of one: one based on rejection sampling (**dexter**’s ‘official’ version), and one based on sampling without rejection. But we start with a short discussion of this fascinating IRT model.

## The interaction model

While not even mentioned at the career award session dedicated to Shelby Haberman at NCME 2019, his interaction model (IM) must be one of the brightest, practically most salient ideas in psychometrics since 2000. It plays a strategic role in **dexter**, and we have posted several times about it and will continue to do so in the future. The IM can be represented in several ways, each of them highlighting different aspects of interest and applicability:

* as a generalization of the Rasch model that relaxes conditional independence while retaining sufficient statistics and the immediate connection to the observable

Recently, we published a new package, dexterMML, to the dexter github site; it will probably also be available from CRAN at some time in the future. What follows is the package vignette as it is at the time of writing. In dexter we use an extended (polytomous) Rasch model to equate test forms, and classical test analysis, distractor plots and the interaction model for quality control of items. For calibration of the NRM and interaction model we can use conditional maximum likelihood (CML). CML calibration is not possible for fully adaptive tests or for Multi Stage tests with probabilistic routing rules. CML is currently also impractical for random tests (i.e. tests with items randomly selected from a bank). Classical analysis is also not feasible in these situations. So in these cases we must turn to Marginal Maximum Likelihood (MML) where we can use a marginal Rasch model (1PL) for equating and a 2PL for quality control of the items. As the name implies, in MML estimation the likelihood that is maximized is marginal, over an assumed distribution of ability. The likelihood for a response vector \(x\) for a single person is defined as: \[ P(x) = \int P(x|\theta, \omega)f(\theta|\phi,Y)d\theta \]
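To make the integral a little more tangible, here is a minimal sketch in plain R. It is not how dexterMML does things internally; it simply assumes a Rasch model for \(P(x|\theta,\omega)\), a normal ability distribution without covariates for \(f(\theta|\phi)\), and made-up item difficulties. The function name `marginal_lik` and all values are our own illustration.

```r
# Marginal likelihood of one response vector x under a Rasch model,
# assuming theta ~ N(mu, sigma^2). Illustrative only; no covariates Y.
marginal_lik <- function(x, beta, mu = 0, sigma = 1) {
  integrand <- function(theta) {
    sapply(theta, function(t) {
      p <- plogis(t - beta)                        # P(X_i = 1 | theta)
      prod(p^x * (1 - p)^(1 - x)) * dnorm(t, mu, sigma)
    })
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

beta <- c(-1, 0, 1)                                # made-up item difficulties
patterns <- as.matrix(expand.grid(0:1, 0:1, 0:1))  # all 8 response patterns
probs <- apply(patterns, 1, marginal_lik, beta = beta)
sum(probs)                                         # should be 1 up to numerical error
```

Summing over all possible response patterns is a cheap sanity check: the marginal probabilities must add up to one, whatever the item parameters and the assumed ability distribution.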

We are busy people. We spend most of our time analyzing the large-scale national and international assessments for which **dexter** was developed. So far, we have not had the time to discuss at length the facilities for simulating item response data with **dexter**. Of course they are there, since simulation plays a vital role in our approach to generating plausible values and modelling in general – see Marsman et al. (2017). But they are somewhat hidden among several ‘functions of theta’ – anyway, here is the overdue introduction to data simulation with **dexter**. Simulating IRT data for a single booklet of Rasch items is quite easy – it can be done with a single line in plain R:
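A minimal sketch of what such a line might look like, assuming standard normal abilities and made-up item difficulties (the object names and the numbers are ours, chosen for illustration only):

```r
set.seed(1)
n_persons <- 500
theta <- rnorm(n_persons)               # abilities, assumed standard normal
beta  <- seq(-2, 2, length.out = 20)    # made-up item difficulties

# the single line: compare uniform draws with the Rasch probabilities
x <- 1 * (matrix(runif(n_persons * length(beta)), n_persons) < plogis(outer(theta, beta, "-")))

dim(x)                                  # persons in rows, items in columns
```

`outer(theta, beta, "-")` gives the matrix of all differences between abilities and difficulties, and `plogis` turns them into probabilities of a correct response.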

For a survey statistician, there can hardly be anything more helpful than Thomas Lumley’s `survey` (Lumley 2004). Not surprisingly, it has been downloaded over 3.1 million times at the time of writing, which makes it one of the more popular packages for R (R Development Core Team 2005). Unfortunately, `survey` does not have functions to estimate two popular statistics, the standard deviation and the Pearson correlation, and their standard errors. Packages like `srvyr` (Freedman Ellis and Schneider 2021) or `jtools` (Long 2020) have attempted to fill this gap, but they only provide the estimates, not the standard errors. Consider an object, `est`, containing `survey` estimates for the variances of three variables. The variables happen to be plausible values (Marsman et al. 2016) from a large educational survey, and the estimates have been produced by a replication method (BRR, see Lumley (2010), Ch.2.3). Note that I could have had any variables instead of plausible values (remember petal length? sepal width?), and that we are discussing the standard error under repeated sampling from a population. The question about the meaning of the correlation between two sets of plausible values, or the variability seen in a whole bunch of them, is psychometrically relevant, but it will be discussed in a different post. Applying the generic print method, I get:

There is a long-standing interest in combining item response theory (IRT) and classical test theory (CTT) rather than treat them as mere alternatives (Bechger et al. 2003). The **theta functions** in dexter are particularly helpful in this approach:

- `expected_score()`: The expected score given theta.
- `information()`: The Fisher information about theta in the test score.
- `p_score()`: The distribution of test scores given theta.

Merry Christmas and a happy, healthy and prosperous 2022 to all! There have not been many changes to **dexter** in 2021, but overall it has been a happy year, professionally, and we have many ideas for the near future. **dexter** is known in 145 countries now. It has taken 5 months to get from 140 to 145, so it may be a while until the next increase. Prize question: which density function approximates best the unique shape of Dexter’s ear?

Just a few lines to boast that (i) **dexter** has been downloaded in 140 countries now, and (ii) that I have learned how to make nice maps with `tmap` and `leaflet`.

*Our most distinguished colleague, Norman D. Verhelst, had his birthday a couple of days ago. As a tiny tribute, I offer a lecture I gave a couple of months ago to a bunch of students. They had moderate exposure to statistics, and I have edited it somewhat hastily, so the level is not particularly uniform – sorry, folks.* The RaschSampler (Verhelst, Hatzinger, and Mair 2007) is an R package that allows the user to test a vast array of statistical hypotheses about the Rasch model. It is also a good example of how the statistical testing of hypotheses works, especially in its nonparametric variety. The theory behind the RaschSampler seems to go back to Georg Rasch himself. In a nutshell: Since the row sums and the column sums in a 0/1 matrix are sufficient statistics for the person parameters and the item parameters in the Rasch model, respectively, any other matrix having the same marginal sums will be just as compatible with the Rasch model in terms of measurement. If we compute a statistic of interest on all matrices having the same marginal sums as our observed matrix, then we can place the statistic for the observed matrix in the distribution of all statistics. The statistic itself can be anything: if you compute the matrix of tetrachoric correlations between all items, divide its seventh eigenvalue by the first, take the log and multiply it by Boltzmann’s constant, it would still work (now, try and derive the asymptotic distribution of this statistic). Of course, we would like to use sensible statistics – Ponocny (2001) gives many useful examples. And we wouldn’t like to work with *all* matrices having the same marginal sums, because there are so many of them. But, if we can have a program that produces, efficiently, a representative sample of them, the practical interest is obvious. The key to the problem was given by Verhelst (2008).
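To fix ideas, here is a naive sketch of the general principle in plain R. It is emphatically not the algorithm implemented in the RaschSampler, which is far more efficient; it is a simple chain of random 2×2 'checkerboard' swaps, which leave all row sums and column sums intact. The data, the statistic, the function name `swap_once` and the amount of thinning are all made up for illustration.

```r
# One attempted swap: pick two rows and two columns at random; if the 2x2
# submatrix is a checkerboard (1 0 / 0 1 or 0 1 / 1 0), flip it.
# Flipping leaves every row sum and every column sum unchanged.
swap_once <- function(m) {
  r <- sample(nrow(m), 2)
  k <- sample(ncol(m), 2)
  sub <- m[r, k]
  if (sub[1, 1] == sub[2, 2] && sub[1, 2] == sub[2, 1] && sub[1, 1] != sub[1, 2]) {
    m[r, k] <- 1 - sub
  }
  m
}

set.seed(42)
obs  <- matrix(rbinom(20 * 6, 1, 0.5), nrow = 20)  # toy 0/1 data, not real responses
stat <- function(m) sum(m[, 1] * m[, 2])           # persons solving both item 1 and item 2

m <- obs
sims <- numeric(500)
for (s in seq_along(sims)) {
  for (j in 1:50) m <- swap_once(m)                # thinning between retained matrices
  sims[s] <- stat(m)
}

# Monte Carlo p-value: share of sampled statistics at least as large as the observed one
mean(sims >= stat(obs))
```

Any other function of the matrix can be plugged in for `stat`; the marginal sums stay fixed throughout, which is the whole point.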

When we apply IRT to score tests, we most often use a model to map patterns of TRUE/FALSE responses onto a real-valued latent variable, the great advantage being that responses to different test forms can be represented on the same latent variable relatively easily. When all observed responses are continuous, most psychometricians will probably think first of factor analysis, structural equations and friends. Yet the world of IRT is more densely populated with continuous responses than anticipated. To start with, computer-administered testing systems such as MathGarden readily supply us with pairs of responses and response times. It is a bit awkward to model them simultaneously, since responses are usually discrete (with few categories) and response times continuous – yet there exist a myriad of models, as the overview in De Boeck and Jeon (2019) shows. To apply the same type of model to both responses and response times, Partchev and De Boeck (2012) have chosen to split the latter at the median, either person- or item-wise. Second, item responses increasingly get scored by machine learning algorithms (e.g. Settles, LaFlair, and Hagiwara (2020)), whose output is typically a class membership probability. If classification is to be into two classes, the boundary is usually drawn at 0.5. One of the questions we ask ourselves in this blog is whether there can be a better way. Within more traditional testing, there have been attempts to avoid modeling guessing behavior with *ad hoc* models by asking test takers about their perceived certainty that the chosen response is correct (de Finetti (1965), Dirkzwager (2003)). These approaches also lead to responses that can be considered as continuous.

*Back in 1989, Thomas A. Warm (1937–2019) published a paper in Psychometrika that was to have an important influence on practical testing. It described a way to reduce the inherent bias in the maximum likelihood estimate of the person’s ability, given the responses and the item parameters. All testing is ultimately about estimating ability, so the paper naturally got a lot of attention.* *We are not aware of any subsequent publications by T.A., although we did find a highly readable IRT primer on the Internet. Sadly, we also found an obituary. It says that he was in the army and that he enjoyed Japanese drumming (who doesn’t?), but there is no mention of psychometrics at all. Looking further, we found this picture on LinkedIn:* *We don’t know whether this elusiveness was due to the extreme modesty of a private person, or to the military status of some of his employers, such as the U.S. Coast Guard Institute. But we feel that T.A. deserves some kind of tribute by our community, and Timo proposed to write one. Predictably, the formulae started pouring out immediately, but what nicer tribute for a scientist?*

A typical task in adaptive testing is to select, out of a precalibrated item pool, the most appropriate item to ask, given an interim estimate of ability, \(\theta_0\). A popular approach is to select the item having the largest value of the item information function (IIF) at \(\theta_0\). When the IRT model is the Rasch model, the item response function (IRF) is \[P(X=1|\theta,b)=\frac{1}{1+\exp(b-\theta)}\] where \(b\) is the item difficulty, and the IIF can be computed as \(P(X=1|\theta,b)[1-P(X=1|\theta,b)]\). For all items, this function reaches a maximum of 0.25, encountered exactly where \(\theta=b\) and \(P(X=1|\theta,b)=0.5\). So picking the item with the maximum IIF is the same as picking the item for which a person of ability \(\theta_0\) has a probability of 0.5 to produce the correct answer. If we use the two-parameter logistic (2PL) model instead, the IRF is \[P(X=1|\theta,a,b)=\frac{1}{1+\exp[a(b-\theta)]}\] where the new parameter \(a\) is called the discrimination parameter, and the IIF can be computed as \(a^2P(X=1|\theta,a,b)[1-P(X=1|\theta,a,b)]\). Note that while the contribution of the product, \(P(1-P)\), remains bounded between 0 and 0.25, the influence of \(a^2\) can become quite large – for example, if \(a=5\), \(P(1-P)\) gets multiplied by 25, so the IIF can reach 6.25, which is 25 times the maximum value under the Rasch model. Selecting such an informative item gives us the happy feeling that our error of measurement will decrease a lot, but what happens to the person’s probability to give the correct response? In our game, we have one person and two items. The person’s ability is 0, represented with a vertical gray line. One of the items is fixed to be a Rasch item with \(b=0\); its IRF is shown as a solid black curve, and its IIF as a dotted black curve. The second item, shown in red, is 2PL, and you can control its two parameters with the sliders. Initially, \(a=1\) and \(b=1\).
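For readers without access to the sliders, the two formulae are easily tried out in plain R. This is a small sketch of our own, not the code behind the interactive game; the function names `irf` and `iif` and the parameter values in the last example are chosen for illustration.

```r
# IRF and IIF of the 2PL model; with a = 1 this is the Rasch model
irf <- function(theta, a = 1, b = 0) plogis(a * (theta - b))
iif <- function(theta, a = 1, b = 0) {
  p <- irf(theta, a, b)
  a^2 * p * (1 - p)
}

theta0 <- 0
iif(theta0, a = 1, b = 0)     # Rasch item at its peak: 0.25
iif(theta0, a = 5, b = 0)     # 2PL item with a = 5 at its peak: 6.25

# a steep item that is somewhat too hard for the person
iif(theta0, a = 5, b = 0.5)   # still far more 'informative' than the Rasch item
irf(theta0, a = 5, b = 0.5)   # but the probability of success is well below 0.5
```

The last two lines illustrate the question asked above: an item selected for its information may leave the person with only a small chance of answering it correctly.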

**[A muggle preface by Ivailo]** *As every fantasy lover knows, IRT people belong to two towers, or is it schools of magic. The little wizards and witches of one school learn to condition on their sufficient statistics before they can even fly or play quidditch, while at the other school they will integrate out anything thrown at them.
The two approaches have their subtle differences. To predict the score distribution conditional on ability, wizards at the first school apply the same powerful curse, elementary symmetric functions, that they use for all kinds of magic – from catching dragons over estimating parameters, up to item-total and item-rest regressions, to mention but a few. At the second school, computing the score distribution used to be a frustrating task until Lord and Wingersky (1984) came up with a seemingly unrelated jinx that did the trick. In what follows, Timo explains how the two kinds of magic relate to each other.* Consider the Rasch model. Let \(X_{i}\) denote the binary-coded response to item \(i\) and assume that:

**dexter** has a function, `fit_domains`, that deals with subtests. Domains are subsets of the items in the test defined as a nominal variable: an item property. The items in each subset are transformed into a large partial credit item whose item score is the subtest score; the two models, ENORM and the interaction model, are then fit on the new items. To illustrate, let us go back to our standard example, the verbal aggression data. Each of the 24 items pertains to one of four frustrating situations. Treating the four situations as subtests or domains, we obtain for one of them: To the left we see how the two models compare in predicting the category probabilities; to the right are the item-total regressions for the item (i.e. domain) score. Everything looks nice for this high-quality data set, but note in particular how closely the two models agree for the domain score. For all four situations:

Your psychometric model should fit the data, they keep telling me, or else you are in trouble. I find the idea bizarre. I am certainly not out there to *explain* the exam with a model, I just want to *grade* it. My (Rasch) model fits the scoring rule, and it is there basically to serve as an equating tool for multiple test forms, an alternative to equipercentile or kernel equating. If I am modeling anything, it is not the data but a particular social situation: mutual agreement that the sum score is a reasonable, acceptable, sensible, optimal way to grade the exam. Some theory but, ultimately, decades of collective experience keep me convinced that the model will also generally fit the data, as there is precious little useful information that is not already in the sum score. After all, the test was *made* this way. Hence, in the realm of practical testing (as opposed to research, which is a different story altogether), item fit is primarily a quality control thing. If an item has no correlation with the sum score, it is badly written. If the correlation is negative, the scoring key is wrong. Such situations can usually be identified and corrected quite easily. When the items are decently written and correctly scored, the Rasch model will fit the data *grosso modo*, and differences in discrimination will cancel – certainly at test level but possibly even when we put together a small number of items as a subscale.

## How to measure

As a quality control measure or otherwise, it is of course a good idea to look at item fit. In **dexter**, our preferred approach is a visual inspection of the item-total regressions. These provide a detailed picture of fit over the whole ability range, involving the observed data, the calibration model, and the interaction model. An overall number might be useful, if anything, to sort the plots for the individual items, such that we look at the worst (or best) fitting items first.

I have just received an email labelled *fyi* from our new maintainer. From it I deduce that a new version of **dexter** is on CRAN. I count three new features, two improvements, and three bug fixes. You may want to check it out, as I most certainly will.

Back in February 2018, we wrote about the first birthday of **dexter**. Our little dear has now reached yet another stage of maturity: **dexter 1.0.0** is available on CRAN. A new version of **dextergui** to match is on its way. There are no dramatic changes or additions, but the whole package has been revised, and all non-R parts rewritten in C++ for improved consistency and yet more speed! And even the remotest corners have been cleaned by the new maintainer, Jesse Koops, in the most ~~obsessive~~ thorough way. Same package, totally new experience.

I have just put this nice picture on one of the social networks. **dexter**, our R package for psychometric analysis of tests, has been downloaded in 120 different countries now. And each of the companion packages, **dexterGUI** and **dexterMST**, has been downloaded in 100 countries! Thanks to authors and users alike!

I have found out that, regardless of the immediate topic, it is usually easiest and most helpful to start my talks with the same slide. It shows a primitive roadmap of our discipline, like this: Nothing that you didn’t know. I knew it too, but it took me quite a pilgrimage to develop a feel for how important it is, to realize that the criteria for what is desirable, appropriate or even admissible are largely local. Now, that matters a lot. In assessment, we are focused on the individual. Regardless of how much statistical thinking is involved, it remains essentially idiographic. In producing a score for the individual, we try to remove as much uncertainty as possible – typically, we go for the central tendency of the person’s ability distribution, not for a random sample from it. Research has an extra statistical layer, being interested primarily in populations rather than individuals. Substantive research, which I see mostly in the shape of large-scale studies, is deeply concerned with two sampling processes: reproducing a finite population of individuals, which involves sampling methods and designs, variance estimation, sampling weights and the like; and sampling from the individual ability distributions with a realistic representation of their variability – in other words, reliance on plausible values and scores.

*Predicates* are a useful and powerful feature in **dexter** allowing users to filter the data passed to almost any function on arbitrarily complex combinations of the variables in the database. Depending on the package infrastructure, some users may have experienced problems with predicates recently. This is caused by changes in the most recent version of a dependency, **dbplyr 1.4**. We are talking to the developers of dbplyr. In the meantime, problems can be avoided by downgrading dbplyr to version 1.3. Start a new R session and execute:

```
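# remove the installed dbplyr (the package name must be quoted)
remove.packages("dbplyr")
library(devtools)
# install the older release from the CRAN archive
install_version("dbplyr", version = "1.3.0")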
```

In my post on DIPF from yesterday, I had plots with an arbitrary aspect ratio chosen automatically by R. Rather than change them behind the scenes, I revisit them because I feel the issue deserves attention. To start with the MDS plot: it is supposed to reconstruct a map from a distance matrix, so there is no discussion that the aspect ratio should be set to 1 by adding `asp=1` to the arguments of the `plot` function. Like this:

Bechger and Maris (2015) pointed out that, the way DIF is defined in psychometrics, it can be more sensibly related to pairs of items than to the individual item. Starting from the idea that all relevant information can be captured in the (group-specific) distance matrix between item difficulties, I try to visualise the subtle differences between a relatively large number of such distance matrices. This is an improvised meal using products already in the fridge, instant dinner, kitchen express. I am sure it can be vastly improved. First, how do we measure the similarity between two distance matrices? A look at the cuisine of other peoples, notably ecologists, reminds us of the Mantel test. This should not be confused with the Mantel-Haenszel statistic prominent in traditional DIF methodology: a chi-squared test testing the hypothesis that an odds-ratio between two dichotomous variables estimated across the levels of a third discrete variable is significantly different from 1. No, the Mantel test computes, simply, Pearson’s correlation between the two distance matrices, taken as vectors. What is not so simple is to establish the statistical significance of the result. Distances are not independent (changing even one would distort the map), so Mantel devised a permutation-based method. I happily ignore that because I am searching for a proximity matrix, not tests of significance.

Ingredients:

* A PISA data set – I used 2012 Mathematics, available here
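The Mantel correlation described above takes one line in R. The sketch below uses made-up item difficulties for two hypothetical groups rather than the PISA data, and it skips the permutation test, just as the post does.

```r
set.seed(7)
b1 <- rnorm(10)                   # made-up item difficulties, group 1
b2 <- b1 + rnorm(10, sd = 0.2)    # same items in group 2, slightly perturbed

d1 <- dist(b1)                    # pairwise distances between item difficulties
d2 <- dist(b2)

# Mantel statistic: Pearson correlation between the two distance matrices,
# each taken as the vector of its lower triangle
mantel_r <- cor(as.vector(d1), as.vector(d2))
mantel_r
```

Computing this for every pair of groups gives a proximity matrix between groups, which is what the visualisation needs.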

The title says it all, and here are some details from the NEWS file:

* new function `design_info()` returns extensive information about incomplete test designs. Functions `design_as_network()` and `design_is_connected()` are deprecated.
* correction for a bug which caused NA’s in plausible values for booklets with 1 respondent and nPV>1

I have just uploaded version 0.8.4 of **dexter**. It has the same functionality as version 0.8.3 except for some minor changes to accommodate the upcoming version of **tibble**.

On behalf of the **dexter** team, best wishes for a happy, healthy and prosperous New Year to all our users! They seem to be in 113 countries for **dexter**, 87 countries for **dextergui**, and 80 countries for **dexterMST**, according to the current Rstudio stats. I look at these numbers with awe, and I am sure we’ll all keep giving our best!

In an enormously influential short paper, Embretson (1996) sums up five most important differences between classical test theory (CTT) and item response theory (IRT). The first among them is that, in CTT, the standard error of measurement applies to all scores in a particular population, while in IRT the standard error of measurement differs across scores, but generalizes across populations. To see how this works, we first take a leisurely, informal look at some simple examples with the Rasch model; we then examine the information functions more formally, and we explain how they are implemented in **dexter**. The Rasch model predicts that, given an item and its difficulty parameter, \(\beta\), the probability of a correct response is a logistic function of ability, \(\theta\), namely, \[P(\theta;\beta)=\frac{\exp(\theta-\beta)}{1+\exp(\theta-\beta)}.\] The information function for the same item happens to be \(P(\theta;\beta)[1-P(\theta;\beta)]\), so it is, again, a function of \(\theta\). Below we show the item response function (IRF) for the item, i.e., the function \(P(\theta;\beta)\) along with the corresponding item information function (IIF), shown in red. When \(\theta=\beta\), \(P(\theta;\beta)=0.5\); this is also the point where the IIF peaks, and the maximum is of course equal to 0.25.
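To see the first of these differences in numbers, here is a small sketch in plain R rather than with the **dexter** functions. It sums the item information functions of a few made-up Rasch items into a test information function and uses the standard large-sample result that the standard error of the ML estimate of ability is one over the square root of the information. The function name `test_info` and the item difficulties are ours.

```r
beta <- c(-1.5, -0.5, 0, 0.5, 1.5)          # made-up item difficulties

# test information: sum of the item information functions P(1 - P)
test_info <- function(theta) {
  p <- plogis(outer(theta, beta, "-"))      # rows: theta values, columns: items
  rowSums(p * (1 - p))
}

theta <- c(-3, 0, 3)
sem <- 1 / sqrt(test_info(theta))           # asymptotic SE of the ML ability estimate
round(sem, 2)
```

The standard error is smallest in the middle, where the items are located, and grows towards both extremes, so it differs across scores, unlike the single standard error of measurement in CTT.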

We have just published a new version of **dexter**. Among the new features are functions to compute the test and information functions, and expected scores. Of course, these have always been computed internally, but we had not thought of adding user-level functions. Information functions are quite interesting – expect a special entry on them very soon. Please ignore the short-lived version 0.8.2. It is the same as 0.8.3 except that we found a bug – a very small one, a mere buglet, but we didn’t like its sharp little teeth so it had to go.

Profile plots are a novel graphical display in dexter designed to visualize a certain kind of measurement invariance. A profile plot is useful when there are:

* a population classified in two or more groups,
* a test with items classified into two groups – let us call them domains, and
* a working hypothesis that groups differ in their response to domains, given the same total score on the test.

Meet Tommy Redd and Jimmy Greene, two school friends who are inseparable and only argue about tests. When they take a test, Jimmy Greene always gets the two easiest items correct, while Tommy Redd somehow manages to answer the two hardest ones correctly but falters on the easy ones. Under classical test theory, both get the same total score of 2, and the Rasch model also gives them the same theta value, as shown on the plot in its initial state. This makes Tommy a bit unhappy because he feels that he deserves more credit for solving more difficult items than Jimmy. With the discovery of the 2PL, 3PL, 4PL … models we can fulfil Tommy’s wish or, indeed, any wish. Play with the controls for the slope parameters of the five items to give:

* more credit to Tommy
* more credit to Jimmy
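If the sliders are not at hand, the same experiment can be sketched in a few lines of plain R. The item difficulties, the slopes and the helper `mle` below are made up for illustration and are not those behind the plot; the ML estimate of theta is found by solving the likelihood equation of the 2PL, \(\sum_i a_i(x_i - P_i) = 0\).

```r
beta  <- c(-2, -1, 0, 1, 2)          # five items, easiest to hardest (made up)
jimmy <- c(1, 1, 0, 0, 0)            # the two easiest items correct
tommy <- c(0, 0, 0, 1, 1)            # the two hardest items correct

# ML estimate of theta under a 2PL with slopes a and difficulties b
mle <- function(x, a, b) {
  score_fun <- function(theta) sum(a * (x - plogis(a * (theta - b))))
  uniroot(score_fun, c(-10, 10))$root
}

rasch <- rep(1, 5)                   # all slopes equal: the Rasch model
c(jimmy = mle(jimmy, rasch, beta), tommy = mle(tommy, rasch, beta))

steep_hard <- c(0.5, 0.5, 1, 2, 2)   # hypothetical slopes favouring the hard items
c(jimmy = mle(jimmy, steep_hard, beta), tommy = mle(tommy, steep_hard, beta))
```

With equal slopes the two friends get the same estimate because only the sum score matters; with unequal slopes the estimate depends on the weighted score \(\sum_i a_i x_i\), so reversing the vector of slopes would favour Jimmy instead.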

Perhaps not important, but on the other hand rather overwhelming – according to the Rstudio logs, **dexter** has been downloaded in 100 different countries. Makes one wonder…

In this part, we show how to use **dexter** to apply the market basket approach (Mislevy 1998) to the mathematics domain in PISA 2012, following the logic in Zwitser, Glaser, and Maris (2017). Once we have read in the data and created the **dexter** database, as shown here, this takes only several lines of code. Still, the article will be a bit longer as the market basket approach deserves some attention on its own. The market basket approach has been borrowed from economics – more precisely, from the study of inflation and purchasing power. To confuse matters further, the machine learning field has appropriated the term to denote something completely different – we can ignore this except that their version is the more likely to show up on top in a Google search. A market basket contains a mix of goods and services in given proportions that, by social consensus, mirrors the prevalent consumption patterns; the cost of obtaining it, in money or work time, can be compared across countries or over time. The idea is simple but the implementation is not trivial. In comparisons over time, the contents of the basket cannot remain constant forever. In the 1970s, your teenager vitally needed a cassette recorder, today it is a smartphone. The commodity that you buy – domestic peace – is the same, but the physical carrier has changed. In comparisons across countries or social groups, one may ask: whose basket? For example, the basket of older people contains mostly food and health care; minimal changes in the prices of these may be barely perceptible for richer, active people, but they may have a large impact on the welfare of the specific group. Because of this, many national statistical services maintain separate price indices for the retired.

We could have issued a **WARNING: This bag package is not a toy!** But we can do better. In a series of posts, we will discuss how to use

We have just submitted dexter 0.8.1 to CRAN. The main difference with regard to 0.8.0 is speed in estimating person abilities. Thanks to C, ML estimation of ability is about 40 times faster than before. Even more important, in absolute gain of time, is the 15x speed-up in the computation of plausible values and plausible scores. This could be achieved by implementing the clever recycling algorithms discussed by Maarten Marsman in his PhD dissertation, Plausible Values in Statistical Inference (2014). The speed-up factors cited above are approximate. We have not done a state-of-the-art measurement. In fact, we strongly believe that speed considerations should not be allowed to prevail when discussing methodology. However, when a method has been shown to be optimal in the context of *large scale* surveys, as is the case with plausible values, and when that method turns out to be computationally intensive, a considerable acceleration does have some relevance. We will return to this in one of the future posts. Other novelties in 0.8.1 are a new plot method for item parameters, and the possibility to read in response data in a format described variously as long, tidy, or normalized.

In a previous post, I wrote about three kinds of item-total regressions available in **dexter**: the empirical one, and the smoothed versions under the Rasch model and the interaction model. In fact, there is one more item-total regression, available through the `distractor_plot` command and the Shiny interfaces in **dexter** and **dextergui**. This will be the topic today. Unlike the latter two regressions, this one does not involve a global model for the data (Rasch or interaction model): it is local. We use the `density` function in R (R Development Core Team 2005) to estimate the density of the total scores twice over the same support: for all persons, and for the persons who have given a certain response to the item. Together with the marginal frequency of the response, this is all we need to apply Bayes' rule and compute the probability of the response given the total score. This is the item-total regression we need. We call this a distractor plot because we apply it to all possible responses to the item, including non-response, and not just to the (modelled) correct response. This provides valuable insights into the quality of item writing, including trivial annoyances such as a wrong key. We don’t have to believe that multiple choice items are the pinnacle of creation but, if we do use them, we must make sure that they are written well and graded correctly. Good writing means, among other things, that along with the correct response(s) the item must contain a sufficient number of sufficiently plausible wrong alternatives (‘distractors’). Moses (2017) gives a nice historical overview of the use of similar graphics, from the first item-total regressions drawn by Thurstone in 1925 to the graphs used routinely at ETS. He also provides examples (drawn from Livingston and Dorans 2004) of items that are too easy, too difficult, or simply not appropriate for a given group of examinees. Let us also mention the computer program TestGraf98 (Ramsay 2000), whose functionality has been reproduced in the R package **KernSmoothIRT** (Mazza, Punzo, and McGuire 2014).
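The Bayes step described above can be sketched in a few lines of Python. This is an illustration of the principle only, not the code behind `distractor_plot`: the hand-written Gaussian kernel with a fixed bandwidth stands in for R's `density` (which chooses its bandwidth from the data), and the function names and toy data are hypothetical.

```python
import math

def gaussian_kde(sample, grid, bandwidth):
    """Gaussian kernel density estimate of `sample`, evaluated on `grid`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in sample)
        for g in grid
    ]

def response_given_score(scores, responses, target, grid, bandwidth=1.0):
    """P(response == target | total score) on `grid`, via Bayes' rule."""
    chosen = [s for s, r in zip(scores, responses) if r == target]
    marginal = len(chosen) / len(scores)                 # P(response)
    dens_all = gaussian_kde(scores, grid, bandwidth)     # f(score)
    dens_chosen = gaussian_kde(chosen, grid, bandwidth)  # f(score | response)
    return [marginal * dc / da for dc, da in zip(dens_chosen, dens_all)]

# Toy data: total scores and the response each person gave to one item.
scores = [2, 3, 3, 5, 6, 7, 8, 8, 9, 10]
responses = ["B", "B", "C", "B", "A", "C", "A", "A", "A", "A"]
grid = list(range(0, 11))

curves = {r: response_given_score(scores, responses, r, grid) for r in "ABC"}
```

Because both densities use the same support and the same bandwidth, the curves for all responses add up to one at every score, which is what one expects of conditional probabilities. In the toy data, response A behaves like a key: its curve rises with the total score.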

**dexterMST** is a new R package acting as a companion to **dexter** (Maris et al. 2018) and adding facilities to manage and analyze data from **multistage tests (MST)**. It includes functions for importing and managing test data, assessing and improving the quality of data through basic test and item analysis, and fitting an IRT model, all adapted to the peculiarities of MST designs. It is currently the only package that offers the possibility to calibrate item parameters from MST using Conditional Maximum Likelihood (CML) estimation (Zwitser and Maris 2015). **dexterMST** will accept designs with any number of stages and modules, including combinations of linear and MST. The only limitation is that routing rules must be score-based and known before test administration.

## What does it do?

Multistage tests (MST) must be, historically, the earliest attempt to achieve *adaptivity* in testing. In a traditional, non-adaptive test, the items that will be given to the examinee are completely known before testing has started, and no items are added until it is over. In adaptive testing, the items asked are, at least to some degree, contingent on the responses given, so the exact contents of the test only become known at the end. Bejar (2014) gives a nice overview of early attempts at adaptive testing in the 1950s. Other names for adaptive testing used in those days were tailored testing or response-contingent testing. Note that MST can be done without any computers at all, and that computer-assisted testing does not necessarily have to be adaptive.

When computers became ubiquitous, full-scale computerized adaptive testing (CAT) emerged as a realistic option. In CAT, the subject’s ability is typically reevaluated after each item and the next item is selected out of a pool, based on the interim ability estimate. In MST, adaptivity is not so fine-grained: items are selected for administration not separately but in bunches, usually called *modules*. In the first stage of an MST, all respondents take a *routing test*. In subsequent stages, the modules they are given depend on their success in previous modules: test takers with high scores are given more difficult modules, and those with low scores are given easier ones – see e.g., Zenisky, Hambleton, and Luecht (2009), Hendrickson (2007), or Yan, Lewis, and von Davier (2014).
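A score-based routing rule of the kind described here is easy to write down. The Python sketch below is only an illustration: the function `route`, the cut-off scores, and the module names are hypothetical and have nothing to do with the **dexterMST** interface.

```python
def route(routing_score, cutoffs, modules):
    """Choose the next module from the score on the routing test.

    `cutoffs` are ascending score thresholds; `modules` holds one more
    entry than `cutoffs`, ordered from easiest to hardest.
    """
    for cutoff, module in zip(cutoffs, modules):
        if routing_score < cutoff:
            return module
    return modules[-1]

# Hypothetical two-stage design: a 10-item routing test, three follow-up modules.
CUTOFFS = [4, 8]
MODULES = ["easy", "medium", "hard"]

for score in (3, 4, 7, 8):
    print(score, route(score, CUTOFFS, MODULES))
```

The rule is fixed before anybody takes the test and depends on nothing but the observed score, which are the two conditions named above.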

A critical user of **dexter** might simulate data and then plot estimates of the item parameters against the true values to check whether **dexter** works correctly. Her results might look like this: After a moment of thought, the researcher finds that she is looking at item easiness, while **dexter** reports item difficulties. After the sign has been reversed, the results look better but still not quite as expected:

Educational tests are often equipped with a threshold to turn the test score into a pass-fail decision. When a new test of the same kind is developed, we need a threshold for it that will be, in some sense, equivalent to the threshold for the old — let us call it *reference* — test. This is a special case of *test equating*, and it is similar to some well-studied problems in epidemiology; for the comfort of our predominantly psychometric audience, we start with an outline of those. Consider a test for pregnancy. The state of nature is a binary variable (positive / negative). So is the outcome of the test, although the decision is possibly produced by dichotomizing the quantitative measurement of some hormone or other substance, just like we want to do with our educational test scores. There are four possible outcomes in all, which can be represented in a two-by-two table

| Prediction | State of nature: positive | State of nature: negative |
|---|---|---|
| Positive | True positive (TP) | False positive (FP) |
| Negative | False negative (FN) | True negative (TN) |
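To make the table concrete, here is a small Python sketch that dichotomizes a quantitative measurement at a threshold and counts the four outcomes. The measurements, the true states, and the function name `confusion_table` are invented for illustration.

```python
from collections import Counter

def confusion_table(measurements, truth, threshold):
    """Count TP, FP, FN, TN when predicting 'positive' for values >= threshold."""
    counts = Counter()
    for value, is_positive in zip(measurements, truth):
        predicted_positive = value >= threshold
        if predicted_positive and is_positive:
            counts["TP"] += 1
        elif predicted_positive and not is_positive:
            counts["FP"] += 1
        elif not predicted_positive and is_positive:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}

# Hypothetical hormone levels and the true state of nature for six persons.
measurements = [0.2, 0.9, 1.4, 2.1, 2.8, 3.5]
truth = [False, False, True, False, True, True]

print(confusion_table(measurements, truth, threshold=2.0))
print(confusion_table(measurements, truth, threshold=1.0))
```

Moving the threshold trades one kind of error for the other, and the same trade-off applies when an educational test score is turned into a pass-fail decision.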

Regression is the conditional expectation of a variable given the value of another variable. It can be estimated from data, plotted, and modeled (smoothed) with suitable functions. An example is shown in the figure below. This regression is completely based on observable quantities. On the x-axis we have the possible sum scores from a test of 18 dichotomous items. The expected item score on one of the items (actually, the third), given the sum score, is shown on the y-axis. It is an understatement to say that item response theory (IRT) is *interested* in such regressions: in fact, it is *made* of them – except that it regresses the expected item score on a latent variable rather than the observed sum score. Assuming that each person’s item scores are conditionally independent given the person’s value on the latent variable, and that examinees work independently, we multiply what needs to be multiplied, and we end up with a likelihood function to optimize. Having found estimates for the model parameters, we would like to see whether the model fits the data. In a long tradition going back to Andersen’s test for the fit of the Rasch model, model fit is evaluated by comparing observed and expected item-total regression functions. The item-fit statistics in the OPLM software package (corporate software at Cito developed by Norman Verhelst and Cees Glas) are also of this nature. But the comparison is not necessarily easy – at least, not for every kind of model.
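The empirical item-total regression needs nothing but the data: for each observed sum score, take the mean score on the item. A minimal Python sketch, with a made-up data matrix of three dichotomous items and a hypothetical function name:

```python
from collections import defaultdict

def item_total_regression(data, item):
    """Mean score on column `item` for each observed sum score."""
    totals = defaultdict(lambda: [0, 0])  # sum score -> [item score total, count]
    for row in data:
        sum_score = sum(row)
        totals[sum_score][0] += row[item]
        totals[sum_score][1] += 1
    return {s: hits / n for s, (hits, n) in sorted(totals.items())}

# Toy data: one row per person, one column per item.
data = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
]

print(item_total_regression(data, item=2))  # regression for the third item
```

The smoothed versions replace these raw conditional means with the values expected under a model, which is what makes the comparison of observed and expected curves possible.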

Today is exactly a year since we presented our R package, dexter, for the first time at the 2017 Psychoco workshop in Vienna. Having partaken of the birthday cake, I (Ivailo) now concentrate on this year’s edition of Psychoco, where I will present a talk on item-total regressions in dexter. The slides, or a short informal paper, will be posted here after the workshop. I wish to close, once and for all, the topic of the package’s name. **dexter** is named after my ex-neighbours’ dog. I was searching for an acronym, preferably not mirt (there are at least three programs under that name already), I looked out of the window, and the name was found. Unfortunately, I could not take a better picture before they moved out, but yes, this is the Real Dexter. We have absolutely nothing to do with that other character. We meant **dexter** as in dexterous and dexterity, and the Urban Dictionary is also full of praise:

> A sweet, caring, out going guy that is a good friend. Dexters are good boyfriends.

Cito is a Dutch educational testing company founded about 50 years ago and situated in Arnhem. Cito is responsible for a large part of the educational assessments in the Netherlands, participates in numerous international projects such as PISA or ESLC, and offers consultancy around the globe. In a somewhat unprecedented move, Cito have decided to publish their latest psychometric software as open source, even before it is completely finished. The R package, **dexter**, has been available on CRAN since February 2017, and is currently at version 0.5.4. In this blog, we would like to share news about the software, tutorials, and further information useful to the users. We might also discuss broader topics in educational assessment and its social impact. The views expressed in the blog are our own. **dexter** is developed at Cito, the Netherlands, with subsidy from the Dutch Ministry of Education, Culture, and Science. Timo Bechger

Bechger, Timo M., and Gunter Maris. 2015. “A Statistical Test for Differential Item Pair Functioning.” *Psychometrika* 80 (2): 317–40.

Bechger, Timo M., Gunter Maris, Huub H. F. M. Verstralen, and Anton A. Béguin. 2003. “Using Classical Test Theory in Combination with Item Response Theory.” *Applied Psychological Measurement* 27 (5): 319–34.

Bejar, Isaac I. 2014. “Past and Future of Multistage Testing in Educational Reform.” *Computerized Multistage Testing: Theory and Applications. New York: Chapman & Hall*.

Boeck, P. De, and Minjeong Jeon. 2019. “An Overview of Models for Response Times and Processes in Cognitive Tests.” *Frontiers in Psychology* 10. https://doi.org/10.3389/fpsyg.2019.00102.

Dirkzwager, Arie. 2003. “Multiple Evaluation: A New Testing Paradigm That Exorcizes Guessing.” *International Journal of Testing* 3 (4): 333–52.

Embretson, Susan. 1996. “New Rules for Measurement.” *Psychological Assessment* 8 (4): 341–49.

Finetti, Bruno de. 1965. “Methods for Discriminating Levels of Partial Knowledge Concerning a Test Item.” *British Journal of Mathematical and Statistical Psychology* 18 (1): 87–123.

Freedman Ellis, Greg, and Ben Schneider. 2021. *Srvyr: ’Dplyr’-Like Syntax for Summary Statistics of Survey Data*. https://CRAN.R-project.org/package=srvyr.

Haberman, Shelby J. 2007. “The Interaction Model.” In *Multivariate and Mixture Distribution Rasch Models: Extensions and Applications*, edited by M. von Davier and C. H. Carstensen, 201–16. New York: Springer.

Hendrickson, Amy. 2007. “An NCME Instructional Module on Multistage Testing.” *Educational Measurement: Issues and Practice* 26 (2): 44–52.

Livingston, Samuel A., and Neil J. Dorans. 2004. “A Graphical Approach to Item Analysis.” *ETS Research Report Series*, no. 1: i–17. https://doi.org/10.1002/j.2333-8504.2004.tb01937.x.

Long, Jacob A. 2020. *Jtools: Analysis and Presentation of Social Scientific Data*. https://cran.r-project.org/package=jtools.

Lord, Frederic M, and Marilyn S Wingersky. 1984. “Comparison of IRT True-Score and Equipercentile Observed-Score Equatings.” *Applied Psychological Measurement* 8 (4): 453–61.

Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” *Journal of Statistical Software* 9 (1): 1–19.

———. 2010. *Complex Surveys: A Guide to Analysis Using R*. John Wiley & Sons.

Maris, Gunter, Timo Bechger, Jesse Koops, and Ivailo Partchev. 2018. *Dexter: Data Management and Analysis of Tests*. https://CRAN.R-project.org/package=dexter.

Marsman, Maarten, Gunter Maris, Timo Bechger, and Cees Glas. 2017. “Turning Simulation into Estimation: Generalized Exchange Algorithms for Exponential Family Models.” *PloS One* 12 (1): e0169787.

Marsman, M., G. Maris, T. M. Bechger, and C. A. W. Glas. 2016. “What can we learn from plausible values?” *Psychometrika*, 1–16.

Mazza, Angelo, Antonio Punzo, and Brian McGuire. 2014. “KernSmoothIRT: An R Package for Kernel Smoothing in Item Response Theory.” *Journal of Statistical Software* 58 (6): 1–34. http://www.jstatsoft.org/v58/i06/.

Mislevy, R. J. 1998. “Implications of Market-Basket Reporting for Achievement-Level Setting.” *Applied Measurement in Education* 11 (1): 49–63.

Moses, Tim. 2017. “A Review of Developments and Applications in Item Analysis.” In *Advancing Human Assessment: The Methodological, Psychological and Policy Contributions of ETS*, edited by M. von Davier and C. H. Carstensen, 19–46. New York: Springer.

OECD. 2014. “PISA 2012: Technical Report.” http://www.oecd.org/pisa/pisaproducts/pisa2012technicalreport.htm.

Partchev, Ivailo, and Paul De Boeck. 2012. “Can Fast and Slow Intelligence Be Differentiated?” *Intelligence* 40 (1): 23–32. https://doi.org/10.1016/j.intell.2011.11.002.

Ponocny, Ivo. 2001. “Nonparametric Goodness-of-Fit Tests for the Rasch Model.” *Psychometrika* 66: 437–59.

Ramsay, J. O. 2000. *TestGraf: A Program for the Graphical Analysis of Multiple Choice Test and Questionnaire Data*. McGill University.

R Development Core Team. 2005. *R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Settles, Burr, Geoffrey T. LaFlair, and Masato Hagiwara. 2020. “Machine Learning–Driven Language Assessment.” *Transactions of the Association for Computational Linguistics* 8: 247–63.

Verhelst, Norman D. 2008. “An Efficient MCMC Algorithm to Sample Binary Matrices with Fixed Marginals.” *Psychometrika* 73: 705–28.

Verhelst, Norman D., Reinhold Hatzinger, and Patrick Mair. 2007. “The Rasch Sampler.” *Journal of Statistical Software* 20.

Yan, Duanli, Charles Lewis, and Alina A von Davier. 2014. “Overview of Computerized Multistage Tests.” *Computerized Multistage Testing: Theory and Applications. New York: Chapman & Hall*.

Zenisky, April, Ronald K Hambleton, and Richard M Luecht. 2009. “Multistage Testing: Issues, Designs, and Research.” In *Elements of Adaptive Testing*, 355–72. Springer.

Zwitser, R., and G. Maris. 2015. “Conditional Statistical Inference with Multistage Testing Designs.” *Psychometrika* 80 (1): 65–84. https://doi.org/10.1007/s11336-013-9369-6.

Zwitser, Robert J., S. Sjoerd F. Glaser, and Gunter Maris. 2017. “Monitoring Countries in a Changing World: A New Look at DIF in International Surveys.” *Psychometrika* 82 (1): 210–32. https://doi.org/10.1007/s11336-016-9543-8.