`vignettes/blog/2022-03-26-simulate_with_dexter.Rmd`

`2022-03-26-simulate_with_dexter.Rmd`

We are busy people. We spend most of our time analyzing the
large-scale national and international assessments for which
**dexter** was developed. So far, we have not had the time
to discuss at length the facilities for simulating item response data
with **dexter**. Of course they are there, since simulation
plays a vital role in our approach to generating plausible values and
modelling in general – see Marsman et al.
(2017). But they are somewhat hidden among several ‘functions of
theta’ – anyway, here is the overdue introduction to data simulation
with **dexter**.

Simulating IRT data for a single booklet of Rasch items is quite easy – it can be done with a single line in plain R:

The **dexter** way would be:

```
library(dexter)
library(dplyr)
items = tibble(item_id = paste0('item', 1:20),
item_score = 1,
beta = runif(20, -2, 2))
Rasch_data = r_score(items)(rnorm(500))
```

We just construct a tibble with the item parameters in the expected format, after which it is a single line again. We now discuss a more interesting and complicated case. The example is by Jesse and the comments by me, so the perception of a certain age difference is inevitable even if not deliberate.

```
library(mvtnorm)
set.seed(42)
domains = c('geometry','algebra','measurement')
nit = 50
pop_theta = tribble(
~gender, ~school_type, ~geometry, ~algebra, ~measurement, ~sd, ~n,
'boys', 'sbo', -.3, -.5, -.5 , .6, 500,
'girls', 'sbo', -.5, -.5, -.2 , .6, 400,
'boys', 'bao', .7, .5, .2 , 1, 1200,
'girls', 'bao', .5, .5, .5 , 1, 1100
)
```

The `tribble`

, ot transposed tibble, defines the
combinations of gender, two Dutch school types, and three domains that
will be treated as three highly correlated but distinct abilities. The
hypothesized abilities will have a multivariate normal distribution with
means varying across domains, sexes, and school types, and standard
deviations differing just across school types. The last column has the
number of fictional students in each cell.

```
items = tibble(domain = sample(domains, nit, TRUE),
item_id = sprintf('item%03i-%s', 1:nit, substring(domain,1,3)),
item_score = 1,
beta = runif(nit, -2, 2))
```

Here we constructed the tibble that defines the items, just as in the
initial example. `sprintf`

is a much better choice than
`paste0`

because it produces sortable names: we don’t get
items 10 through 19 sorted before item 2, for example. Next, define the
correlation matrix for the three domains and use the
**mvtnorm** package to generate the abilities:

```
rho = matrix(c(1, .8, .8,
.8, 1, .8,
.8, .8, 1),
ncol=3, byrow=TRUE)
theta = pop_theta |>
group_by(gender, school_type) |>
do({
res = rmvnorm(.$n,
mean=as.double(.[,domains]),
sigma=rho * .$sd^2)
colnames(res) = domains
as_tibble(res)
})
```

This happens in a **dplyr** pipe that groups by gender
and school type, effectively calling `rmvnorm`

four times and
combining the results in a tibble. Now that we have the items and the
thetas, call `r_score`

three times to generate the simulated
item responses, and stack the output:

```
dat = cbind(
r_score(filter(items, domain=='geometry'))(theta$geometry),
r_score(filter(items, domain=='algebra'))(theta$algebra),
r_score(filter(items, domain=='measurement'))(theta$measurement)
)
```

Add the two person properties:

```
dat = as_tibble(dat)
dat$gender = rep(pop_theta$gender, pop_theta$n)
dat$school_type = rep(pop_theta$school_type, pop_theta$n)
```

The simulated data is ready now, but we are going to put it in a
**dexter** database and have some more fun with generating
an incomplete design. The creation of a **dexter** database
(a.k.a. project) always starts with a set of scoring rules. We shall
make some trivial rules, where responses can be only 0 or 1, and will be
scored as 0 or 1:

`dummy_rules = tibble(item_id=rep(items$item_id,2), item_score=rep(0:1,each=nit), response=item_score)`

From the rules, we generate an empty database, announcing the two person properties included with the data:

```
db = start_new_project(dummy_rules,
':memory:',
person_properties=list(gender='<unknown>', school_type='<unknown>'))
```

Now, let us have an incomplete design, 30 items per booklet. SBO students will get the easier booklet, and BAO students will get the more difficult one. In real life, this would be determined by a pretest with a small sample of students, but we will simply sort items on their true difficulty parameters. The easy booklet will contain all items except the 20 most difficult ones, and the difficult booklets all items except the 20 easiest ones; thus, the two booklets have an overlap of 10 items of average difficulty:

```
items_SBO = arrange(items,beta) |> slice(1:20) |> pull(item_id)
items_BAO = arrange(items,beta) |> slice(31:50) |> pull(item_id)
add_booklet(db,
select(dat, !any_of(items_BAO)) |> filter(school_type=='sbo'),
'sbo_booklet')
add_booklet(db,
select(dat, !any_of(items_SBO)) |> filter(school_type=='bao'),
'bao_booklet')
```

Finally, we add the domains as an item property:

`add_item_properties(db, select(items, item_id, domain))`

We are now ready to check how the models available in
**dexter** will cope with this simulated data set, or how
the diagnostic facilities might help us investigate its particular
properties.