vignettes/blog/2018-08-21-dexter-meets-pisa-1.Rmd
2018-08-21-dexter-meets-pisa-1.Rmd
We could have issued a WARNING: This bag package
is not a toy! But we can do better. In a series of posts, we
will discuss how to use dexter to analyze data from
large scale educational assessments, such as PISA. The task is a rather
imposing one, not only in the amount of data to be crunched, but mostly
in the number of issues to discuss.
In this first part, we will analyze the mathematics domain in PISA 2012 (OECD 2014). With 65 participating countries, 21 booklets and over half a million of students, this is anything but a toy example. We will focus on the basics: how to obtain the data, get it into dexter, estimate the parameters of the IRT model, compute plausible values, and apply survey weights and variance estimation. With these technical aspect under the belt, we can concentrate on more conceptual issues in further blogs.
As of the time of writing, the following works. We believe that the OECD will keep the data available forever; however, minute changes in the exact location cannot be excluded, and that would necessitate changes in the code.
We need two files: the scored cognitive item response data file, and the student questionnaire data file. We will first focus on the responses to the cognitive items. The student questionnaire file will be needed later for the survey weights and plausible values it contains.
The first step is to download the scored cognitive item response data file from the PISA website. This is a compressed text file; to unzip it, we follow the steps suggested by Koohafkan (2014).
library(dexter)
library(dplyr)
library(tidyr)
library(readr)
library(SAScii)
url1 = "http://www.oecd.org/pisa/pisaproducts/INT_COG12_S_DEC03.zip"
url2 = "http://www.oecd.org/pisa/pisaproducts/PISA2012_SAS_scored_cognitive_item.sas"
zipfile = tempfile(fileext='.zip')
download.file(url1, zipfile)
fname = unzip(zipfile, list=TRUE)$Name[1]
unzip(zipfile, files = fname, overwrite=TRUE)
unlink(zipfile)
Next, parse the data into a data frame (a tibble, to be more precise)
by using the SAS control file and the parse.SAScii()
function from the SAScii
package. For the sake of
simplicity, we recode both codes for missing values (i.e., 7 for N/A,
and 8 for ‘not reached’) to NA
, although we admit that it
could be worth to pay more attention to the structure of the missing
values. This is, however, beyond the scope of this paper.
dict_scored <- parse.SAScii(sas_ri = url2)
data_scored <- read_fwf(
file = fname,
col_positions = fwf_widths(dict_scored$width, col_names = dict_scored$varname)) %>%
select(CNT, SCHOOLID, STIDSTD, BOOKID, starts_with('PM'))
unlink(fname)
data_scored$BOOKID = sprintf('B%02d', data_scored$BOOKID)
data_scored[data_scored==7] = NA
data_scored[data_scored==8] = NA
Note that we selected a subset of the data (all math items have
variable names starting with PM). Now, start a new dexter project and
add the data. First we need to declare all items and their admissible
scores in a rules
object. We pass that to the
start_new_project
function.
rules = gather(data_scored, key='item_id', value='response', starts_with('PM')) %>%
distinct(item_id, response) %>%
mutate(item_score = ifelse(is.na(response), 0, response))
head(rules)
## # A tibble: 6 x 3
## item_id response item_score
## <chr> <dbl> <dbl>
## 1 PM00FQ01 NA 0
## 2 PM00FQ01 0 0
## 3 PM00FQ01 1 1
## 4 PM00GQ01 NA 0
## 5 PM00GQ01 0 0
## 6 PM00GQ01 1 1
db = start_new_project(rules, "/Rdatasets/pisa2012.db", person_properties=list(
cnt = '<unknown country>',
schoolid = '<unknown country>',
stidstd = '<unknown student>'
)
)
Now we can add the data, booklet by booklet:
for(bkdata in split(data_scored, data_scored$BOOKID))
{
# remove columns tahta only have NA values
bkrsp = bkdata[,apply(bkdata,2,function(x) !all(is.na(x)))]
add_booklet(db, bkrsp, booklet_id = bkdata$BOOKID[1])
}
rm(data_scored)
At some later stage, we will discuss a market basket approach to the definition and estimation of competence (Mislevy 1998; Zwitser, Glaser, and Maris 2017). For our purposes, the market basket is the subset of items that is administered in every country (though not to every student, as we have a multi-booklet design within each country). We are not dealing with the market basket approach now, but we will compute an item property, ‘item belongs / does not belong to the basket’, and we will add it to the data base for later use:
item_by_cnt = get_responses(db, columns=c('item_id', 'cnt')) %>%
distinct()
market_basket = Reduce(intersect, split(item_by_cnt$item_id, item_by_cnt$cnt))
add_item_properties(db,
tibble(item_id = market_basket, in_basket = 1),
default_values = list(in_basket = 0L))
## 1 new item_properties defined
## 1 item properties for 59 items added or updated
## [1] 0
Our data base is complete now (took only a couple of minutes), and we
can, and should, do some exploratory analysis to assess the quality of
the items and identify possible problems. In a PISA study, this step is
usually done at national level, and involves the computation of various
statistics derived from classical test theory (CTT).
dexter has excellent facilities for this kind of
analysis, but they are best employed in interactive mode. The two
interactive functions, iTIA
and iModels
, are
set up such that all vital computations are performed at the beginning.
This means that, with a survey of this size, one must wait for about a
minute for a CTT analysis of all booklets, including test and item level
statistics and distractor plots; after that, results can be browsed
without noticeable delay. The waiting time for a deeper analysis of a
separate booklet, including the three item-total regressions, is about a
couple of seconds.
Unfortunately, interactive analysis is difficult to demonstrate in a static paper. The interested readers can try them out by themselves. Alternatively, they could use the new package, dextergui, which makes interactive analysis even easier.
The IRT model in dexter is not exactly the same as the one used in the original PISA study of 2012. PISA used the Mixed Coefficients Multinomial Logit Model (Adams, Wilson, and Wang (1997)), which is a multidimensional model fit by marginal maximum likelihood (MML). The extended nominal response model (ENORM) in dexter is a unidimensional model fit by conditional maximum likelihood (CML). Still, there is an essential structural similarity: both models default to the Rasch model for the dichotomous items, and to the partial credit model (PCM) items for the polytomous items.
CML estimation tends to be fast:
system.time({item_parms = fit_enorm(db)})
## user system elapsed
## 12.14 0.31 12.13
Plausible values (PV) are random draws from the posterior distribution of a person’s ability, given the item parameters and the person’s responses (Maarten Marsman (2014)). Nowadays, they are the standard method of ability estimation when studying populations (when testing individuals in a high stakes exam, we typically use the expectation of the posterior rather than a random draw).
The plausible values in the original PISA study are drawn with a prior distribution that reflects a large number of background variables measured on the person. We don’t do this. A wonderful property of PV values is that they eventually reproduce the true ability distribution, even when the prior is misspecified (M. Marsman et al. (2016)). Using the wrong prior comes with a penalty: one must pay with more data, meaning more items. When the functional form of the posterior resembles the functional form of the prior, this is a sign that convergence is near.
The algorithm used in dexter is based on composition (sample a proposal value, simulate data from it, accept if the simulated sum score matches the observed one), and uses recycling for higher speed (rejected values are saved for some other persons). See Maarten Marsman (2014) for details.
pv = plausible_values(db, parms=item_parms, nPV=5)
head(pv)
## booklet_id person_id booklet_score PV1 PV2 PV3
## 1 B01 dx_0000001 7 -0.823141408 -1.0854864 -1.6082933
## 2 B01 dx_0000002 15 0.375734533 -0.1744456 0.1022719
## 3 B01 dx_0000003 12 -0.008198676 -0.5795975 -0.2381185
## 4 B01 dx_0000004 22 1.258887508 2.0439863 1.2562507
## 5 B01 dx_0000005 2 -2.568729624 -3.1756783 -2.3094269
## 6 B01 dx_0000006 2 -4.263408956 -3.1428478 -2.3403892
## PV4 PV5
## 1 -1.3724004 -1.7113120
## 2 0.7187336 -0.6527584
## 3 -0.3802600 -0.9113997
## 4 1.3050071 1.5130896
## 5 -3.3043147 -1.8926860
## 6 -2.5261844 -3.5770779
Note that we could have omitted the parms
argument, in
which case the IRT model will be estimated automatically. Once we have
set up the dexter data base, PISA’s original approach,
or a good approximation, takes one line of code!
Large-scale studies like PISA make every effort to represent faithfully particular populations. They use scientific sampling, which, for efficiency reasons, does not necessarily give every student in the population the same probability to be selected. As long the sampling probabilities are known, the potential bias can be neutralized by weighting the individuals’ data with (essentially) the inverse of these probabilities. The weights are also adjusted to reduce the biasing effect of nonresponse.
Actually, we don’t need the weights to compare our PV to PISA’s original ones. But, for didactic purposes, and to get done with all technicalities before proceeding to more conceptual issues, we will do three comparisons: unweighted, weighted, and exactly as in PISA.
First, download the student questionnaire file, which contains, along with a huge amount of background data, PISA’s plausible values, and the sampling and replicate weights necessary to compute the estimates and their standard errors. The operations are similar to those we used to get the cognitive response data:
url3 = 'http://www.oecd.org/pisa/pisaproducts/INT_STU12_DEC03.zip'
url4 = 'http://www.oecd.org/pisa/pisaproducts/PISA2012_SAS_student.sas'
zipfile = tempfile(fileext='.zip')
options(timeout = max(300, getOption("timeout")))
download.file(url3, zipfile)
fname = unzip(zipfile, list=TRUE)$Name[1]
unzip(zipfile, files = fname, overwrite=TRUE)
dict_quest = parse.SAScii(sas_ri = url4)
dict_quest = parse.SAScii(sas_ri = url4) %>%
mutate(end = cumsum(width),
beg = end - width + 1) %>%
filter(grepl('CNT|SCHOOLID|STIDSTD|PV.MATH|^W_FST', varname))
data_quest = read_fwf(file = fname, fwf_positions(dict_quest$beg, dict_quest$end, dict_quest$varname))
unlink(zipfile)
unlink(fname)
We have selected the variables containing the country (CNT), the
school and the student IDs, the five plausible values, and a large
number of variables whose name starts in W_FST
; the latter
contain sampling and replicate weights.
From the dexter data base, extract student identification variables and add them to our freshly generated PV, so we can match these with PISA’s PV:
scores = get_persons(db) %>%
inner_join(pv, by='person_id') %>%
setNames(toupper(names(.))) %>%
inner_join(data_quest, by = c("CNT", "SCHOOLID", "STIDSTD"))
Using just the first PV out of five, compare unweighted and weighted country means:
means = scores %>%
group_by(CNT) %>%
summarise(
uD = mean(PV1),
uP = mean(PV1MATH),
wD = weighted.mean(PV1, w=W_FSTUWT),
wP = weighted.mean(PV1MATH, w=W_FSTUWT)
)
plot(means$uP, means$uD, main='Unweighted country means', xlab='PISA', ylab='dexter')
plot(means$wP, means$wD, main='Weighted country means', xlab='PISA', ylab='dexter')
To estimate country means and their standard estimates as in PISA, we
use package survey (Lumley
(2004)). First, we make a replication design with
W_FSTUWT
as sampling weights, and the many variables whose
names start in W_FSTR
as replicate weights:
library(survey)
ds = svrepdesign(repweights = '^W_FSTR',
weights = ~W_FSTUWT,
combined.weights = TRUE,
data = scores)
The following function will do all computations explained in the PISA Technical Manual. Essentially, we compute, for each country and for each of the five PV, a large number of differently weighted means. The replicate weights are chosen such that, in each replication, the data for half of the schools is ignored while the data for the other half is doubled; the replications differ in which schools fall into the two halves. The uncertainty due to having a sample of schools and students rather than the whole population is reflected in the differences between replications, while the uncertainty due to psychometric measurement is reflected in differences between the five PV. The two kinds of error, sampling variance and imputation variance, are then combined with a simple formula (see the arrow in the code):
pvby = function(formula, by, design, FUN, ...) {
npv = length(attr(terms(formula),"term.labels"))
m = 1 + 1 / npv
est = svyby(formula, by, design, FUN, ...)
nv = attr(est,"svyby")$nstats
e = est[,2:(nv+1)]
v = est[,(nv+2):(nv+nv+1)]^2
estimate = apply(e, 1, mean)
sampvar = apply(v, 1, mean)
impuvar = apply(e, 1, var)
stderr = sqrt(sampvar + m * impuvar) # <======
res = cbind(estimate, stderr)
res
}
We use this function twice to get estimates and standard errors for the plausible values computes with dexter and for PISA’s original estimates:
res_pv = pvby(~PV1+PV2+PV3+PV4+PV5,~CNT,ds,svymean)
res_pisa = pvby(~PV1MATH+PV2MATH+PV3MATH+PV4MATH+PV5MATH,~CNT,ds,svymean)
plot(res_pisa[,1], res_pv[,1], main='Computed with survey', xlab='PISA', ylab='dexter')
Obviously, and in spite of some minor technical differences, the plausible values estimated with dexter produce almost the same results as the original PISA survey of 2012. This is nice, but it is neither surprising nor sensational. Having explained, step by step, how to compute with a large data set we are now ready to discuss what to compute, and why: a more entertaining discussion that will be taken up in the next posts.