Simulating data with dexter: a commented example

We are busy people. We spend most of our time analyzing the large-scale national and international assessments for which dexter was developed. So far, we have not had the time to discuss at length the facilities for simulating item response data with dexter. Of course they are there, since simulation plays a vital role in our approach to generating plausible values and modelling in general – see Marsman et al. (2017). But they are somewhat hidden among several ‘functions of theta’ – anyway, here is the overdue introduction to data simulation with dexter.

Simulating IRT data for a single booklet of Rasch items is quite easy – it can be done with a single line in plain R:

Rasch_data = matrix(as.integer(rlogis(outer(rnorm(500), runif(20,-2,2) ,'-'))>0), 500, 20)

The dexter way would be:

library(dexter)
library(dplyr)
items = tibble(item_id = paste0('item', 1:20),
               item_score = 1,
               beta = runif(20, -2, 2))
Rasch_data = r_score(items)(rnorm(500))

We just construct a tibble with the item parameters in the expected format, after which it is a single line again. We now discuss a more interesting and complicated case. The example is by Jesse and the comments by me, so the perception of a certain age difference is inevitable even if not deliberate.

library(mvtnorm)
set.seed(42)

domains = c('geometry','algebra','measurement')
nit = 50

pop_theta = tribble(
  ~gender, ~school_type, ~geometry, ~algebra, ~measurement, ~sd, ~n,
  'boys',  'sbo',        -.3,       -.5,      -.5         ,  .6,  500,    
  'girls', 'sbo',        -.5,       -.5,      -.2         ,  .6,  400,  
  'boys',  'bao',         .7,        .5,       .2         , 1,   1200,
  'girls', 'bao',         .5,        .5,       .5         , 1,   1100
)

The tribble, ot transposed tibble, defines the combinations of gender, two Dutch school types, and three domains that will be treated as three highly correlated but distinct abilities. The hypothesized abilities will have a multivariate normal distribution with means varying across domains, sexes, and school types, and standard deviations differing just across school types. The last column has the number of fictional students in each cell.

items = tibble(domain = sample(domains, nit, TRUE),
               item_id = sprintf('item%03i-%s', 1:nit, substring(domain,1,3)),
               item_score = 1,
               beta = runif(nit, -2, 2))

Here we constructed the tibble that defines the items, just as in the initial example. sprintf is a much better choice than paste0 because it produces sortable names: we don’t get items 10 through 19 sorted before item 2, for example. Next, define the correlation matrix for the three domains and use the mvtnorm package to generate the abilities:

rho = matrix(c(1, .8, .8,
              .8, 1,  .8,
              .8, .8, 1), 
             ncol=3, byrow=TRUE)

theta = pop_theta |>
  group_by(gender, school_type) |>
  do({
    res = rmvnorm(.$n, 
                  mean=as.double(.[,domains]),
                  sigma=rho * .$sd^2)
    colnames(res) = domains
    as_tibble(res)
  })

This happens in a dplyr pipe that groups by gender and school type, effectively calling rmvnorm four times and combining the results in a tibble. Now that we have the items and the thetas, call r_score three times to generate the simulated item responses, and stack the output:

dat = cbind(
  r_score(filter(items, domain=='geometry'))(theta$geometry),
  r_score(filter(items, domain=='algebra'))(theta$algebra),
  r_score(filter(items, domain=='measurement'))(theta$measurement)
)

Add the two person properties:

dat = as_tibble(dat)
dat$gender = theta$gender
dat$school_type = theta$school_type

The simulated data is ready now, but we are going to put it in a dexter database and have some more fun with generating an incomplete design. The creation of a dexter database (a.k.a. project) always starts with a set of scoring rules. We shall make some trivial rules, where responses can be only 0 or 1, and will be scored as 0 or 1:

dummy_rules = tibble(item_id=rep(items$item_id,2), item_score=rep(0:1,each=nit), response=item_score)

From the rules, we generate an empty database, announcing the two person properties included with the data:

db = start_new_project(dummy_rules, 
                       ':memory:', 
                       person_properties=list(gender='<unknown>', school_type='<unknown>'))

Now, let us have an incomplete design, 30 items per booklet. SBO students will get the easier booklet, and BAO students will get the more difficult one. In real life, this would be determined by a pretest with a small sample of students, but we will simply sort items on their true difficulty parameters. The easy booklet will contain all items except the 20 most difficult ones, and the difficult booklets all items except the 20 easiest ones; thus, the two booklets have an overlap of 10 items of average difficulty:

items_SBO = arrange(items,beta) |> slice(1:20)  |> pull(item_id)
items_BAO = arrange(items,beta) |> slice(31:50) |> pull(item_id)

add_booklet(db, 
            select(dat, !any_of(items_BAO)) |> filter(school_type=='sbo'),
            'sbo_booklet')

add_booklet(db, 
            select(dat, !any_of(items_SBO)) |> filter(school_type=='bao'),
            'bao_booklet')

Finally, we add the domains as an item property:

add_item_properties(db, select(items, item_id, domain))

We are now ready to check how the models available in dexter will cope with this simulated data set, or how the diagnostic facilities might help us investigate its particular properties.

References

Marsman, Maarten, Gunter Maris, Timo Bechger, and Cees Glas. 2017. “Turning Simulation into Estimation: Generalized Exchange Algorithms for Exponential Family Models.” PloS One 12 (1): e0169787.

Jesse Koops and Ivailo Partchev

3/26/2022

References