vignettes/blog/2022-02-02-standard-errors-in-survey.Rmd
2022-02-02-standard-errors-in-survey.Rmd
For a survey statistician, there can hardly be anything more helpful
than Thomas Lumley’s survey
Lumley
(2010). Not surprisingly, it has been downloaded over 3.1 million
times at the time of writing, which makes it one of the more popular
packages for R (R Development Core Team
2005).
Unfortunately, survey
does not have functions to
estimate two popular statistics, the standard deviation and the Pearson
correlation, and their standard errors. Packages like srvyr
(Freedman Ellis and Schneider 2021) or
jtools
(Long 2020) have
attempted to fill this lacune, but they only provide the estimates, not
the standard errors.
Consider an object, est
, containing survey
estimates for the variances of three variables. The variables happen to
be plausible values (Marsman et al. 2016)
from a large educational survey, and the estimates have been produced by
a replication method (BRR, see Lumley
(2010), Ch.2.3). Note that I could have had any variables instead
of plausible values (remember petal length? sepal width?), and that we
are discussing the standard error under repeated sampling form a
population. The question about the meaning of the correlation between
two sets of plausible values, or the variability seen in a whole bunch
of them, is psychometrically relevant, but it will be discussed in a
different post.
Applying the generic print method, I get:
## variance SE
## PV1 1.2476 0.0268
## PV2 1.2638 0.0252
## PV3 1.2558 0.0255
I may be misled about the standard errors if I type
sqrt(est)
## variance SE
## PV1 1.1169 0.0268
## PV2 1.1242 0.0252
## PV3 1.1206 0.0255
On the other hand, if I type
str(est)
## 'svrepvar' num [1:3, 1:3] 1.25 1.15 1.15 1.15 1.26 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3] "PV1" "PV2" "PV3"
## ..$ : chr [1:3] "PV1" "PV2" "PV3"
## - attr(*, "var")= num [1:9, 1:9] 0.00072 0.000594 0.000611 0.000594 0.000521 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
## .. ..$ : chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
## ..- attr(*, "means")= Named num [1:9] 1.25 1.15 1.15 1.15 1.26 ...
## .. ..- attr(*, "names")= chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
## - attr(*, "statistic")= chr "variance"
I get to see the entire contents of this object. It is a
matrix, let us call it
,
containing the estimated covariance matrix of my plausible values. It
has several attributes, one of which, var
, is the
covariance matrix of
,
which we may call
.
Because the standard deviation is a simple function of the statistic
that survey
does provide, we can derive its approximate
standard error by applying the delta theorem . If
is the estimate of the variance and
is the variance of that estimate, the variance for the standard
deviation is estimated as
multiplied by the square of the first derivative of
,
i.e.~,
at
.
In other words, the standard error for
obtains as
.
The correlation, , is a function of three variables, the covariance and the two variances. We have to apply the multivariate delta theorem: the original covariance matrix is pre- and post-multiplied with the vector of the partial derivatives of the function with respect to each of the three variables. For simplicity, let be the submatrix of containing just the rows and columns that have to do with , , and . Then, the standard error of the correlation will be estimated as , where
The code is provided below. There is just a short function,
deltify
, that takes as a single argument the object
returned by survey
and returns a list of two matrices:
estimates
has the standard deviations along the diagonal
and the correlations in the off-diagonal cells, and stderr
shows the standard errors similarly arranged:
deltify = function(est) {
v = attr(est,"var")
n = nrow(est)
est2 = var2 = matrix(0, n, n)
pos = function(i,j,n){(i-1)*n+j}
for (i in 2:n) {
for (j in 1 : (i-1)) {
est2[i,j] = est[i,j] / (sqrt(est[i,i])*sqrt(est[j,j]))
jac = rep(0, n*n)
jac[pos(i,j,n)] = 1.0 / (sqrt(est[i,i])*sqrt(est[j,j]))
jac[pos(i,i,n)] = - est[i,j] / (2.0 * est[i,i]^1.5 * sqrt(est[j,j]))
jac[pos(j,j,n)] = - est[i,j] / (2.0 * est[j,j]^1.5 * sqrt(est[i,i]))
var2[i,j] = jac %*% v %*% jac
}
}
var2 = var2 + t(var2)
est2 = est2 + t(est2)
for (i in 1:n) {
est2[i,i] = sqrt(est[i,i])
var2[i,i] = v[pos(i,i,n)] / (4*est[i,i])
}
dimnames(est2) = dimnames(var2) = dimnames(est)
list(estimates=est2, stderr=sqrt(var2))
}
deltify(est)
## $estimates
## PV1 PV2 PV3
## PV1 1.1169480 0.9161572 0.9183138
## PV2 0.9161572 1.1241876 0.9174176
## PV3 0.9183138 0.9174176 1.1206254
##
## $stderr
## PV1 PV2 PV3
## PV1 0.012011386 0.001819093 0.001922409
## PV2 0.001819093 0.010153011 0.001963528
## PV3 0.001922409 0.001963528 0.010355547