Approximate survey errors for the correlation and the standard deviation: with R package survey

For a survey statistician, there can hardly be anything more helpful than Thomas Lumley’s survey Lumley (2010). Not surprisingly, it has been downloaded over 3.1 million times at the time of writing, which makes it one of the more popular packages for R (R Development Core Team 2005).

Unfortunately, survey does not have functions to estimate two popular statistics, the standard deviation and the Pearson correlation, and their standard errors. Packages like srvyr (Freedman Ellis and Schneider 2021) or jtools (Long 2020) have attempted to fill this lacune, but they only provide the estimates, not the standard errors.

Consider an object, est, containing survey estimates for the variances of three variables. The variables happen to be plausible values (Marsman et al. 2016) from a large educational survey, and the estimates have been produced by a replication method (BRR, see Lumley (2010), Ch.2.3). Note that I could have had any variables instead of plausible values (remember petal length? sepal width?), and that we are discussing the standard error under repeated sampling form a population. The question about the meaning of the correlation between two sets of plausible values, or the variability seen in a whole bunch of them, is psychometrically relevant, but it will be discussed in a different post.

Applying the generic print method, I get:

library(survey)
load('/Rdatasets/est.Rdata')
print(est)

##     variance     SE
## PV1   1.2476 0.0268
## PV2   1.2638 0.0252
## PV3   1.2558 0.0255

I may be misled about the standard errors if I type

sqrt(est)

##     variance     SE
## PV1   1.1169 0.0268
## PV2   1.1242 0.0252
## PV3   1.1206 0.0255

On the other hand, if I type

str(est)

##  'svrepvar' num [1:3, 1:3] 1.25 1.15 1.15 1.15 1.26 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:3] "PV1" "PV2" "PV3"
##   ..$ : chr [1:3] "PV1" "PV2" "PV3"
##  - attr(*, "var")= num [1:9, 1:9] 0.00072 0.000594 0.000611 0.000594 0.000521 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
##   .. ..$ : chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
##   ..- attr(*, "means")= Named num [1:9] 1.25 1.15 1.15 1.15 1.26 ...
##   .. ..- attr(*, "names")= chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
##  - attr(*, "statistic")= chr "variance"

I get to see the entire contents of this object. It is a $3\times 3$ matrix, let us call it $\hat S$ , containing the estimated covariance matrix of my plausible values. It has several attributes, one of which, var, is the $9\times 9$ covariance matrix of $\hat S$ , which we may call $\hat V$ .

Because the standard deviation is a simple function of the statistic that survey does provide, we can derive its approximate standard error by applying the delta theorem . If $\hat{s^2}$ is the estimate of the variance and $\hat{v^2}$ is the variance of that estimate, the variance for the standard deviation is estimated as $\hat{v^2}$ multiplied by the square of the first derivative of $\sqrt{x}$ , i.e.~ $1/(2\sqrt{x})$ , at $\hat{s^2}$ . In other words, the standard error for $\sqrt{\hat s^2}$ obtains as $\sqrt{\hat{v^2}/4\hat{s^2}}=0.5\sqrt{\hat{v^2}/\hat{s^2}}$ .

The correlation, $\hat r= \hat{s_{xy}}/\sqrt{\hat{s^2_{x}}\hat{s^2_{y}}}$ , is a function of three variables, the covariance and the two variances. We have to apply the multivariate delta theorem: the original covariance matrix is pre- and post-multiplied with the vector of the partial derivatives of the function with respect to each of the three variables. For simplicity, let $M$ be the $3\times 3$ submatrix of $\hat{V}$ containing just the rows and columns that have to do with $\hat{s_{xy}}$ , $\hat{s^2_{x}}$ , and $\hat{s^2_{y}}$ . Then, the standard error of the correlation will be estimated as $J'MJ$ , where

$J = \left( \begin{array}{c} \displaystyle{\frac{ \partial \hat{r} } { \partial \hat{s_{xy}} }}\\ \displaystyle{\frac{ \partial \hat r}{\partial \hat{s^2_{x}}}} \\ \displaystyle{\frac{ \partial \hat r}{\partial \hat{s^2_{y}}}} \end{array} \right) = \left( \begin{array}{c} \displaystyle{\frac{ 1 } { \sqrt{ \hat{s^2_{x}}\hat{s^2_{y}}} }}\\ - \displaystyle{\frac{ \hat{s_{xy}}}{2\hat{s^2_{x}}^{3/2}\sqrt{\hat{s^2_{y}}}}} \\ - \displaystyle{\frac{ \hat{s_{xy}}}{2\sqrt{\hat{s^2_{x}}}\hat{s^2_{y}}^{3/2}}} \end{array} \right)$

The code is provided below. There is just a short function, deltify, that takes as a single argument the object returned by survey and returns a list of two matrices: estimates has the standard deviations along the diagonal and the correlations in the off-diagonal cells, and stderr shows the standard errors similarly arranged:

deltify = function(est) {
  v = attr(est,"var")
  n = nrow(est)
  est2 = var2 = matrix(0, n, n)
  pos = function(i,j,n){(i-1)*n+j}
  
  for (i in 2:n) {
    for (j in 1 : (i-1)) {
      est2[i,j] = est[i,j] / (sqrt(est[i,i])*sqrt(est[j,j]))
      jac = rep(0, n*n)
      jac[pos(i,j,n)] = 1.0 / (sqrt(est[i,i])*sqrt(est[j,j]))
      jac[pos(i,i,n)] = - est[i,j] / (2.0 * est[i,i]^1.5 * sqrt(est[j,j]))
      jac[pos(j,j,n)] = - est[i,j] / (2.0 * est[j,j]^1.5 * sqrt(est[i,i]))
      var2[i,j] = jac %*% v %*% jac
    }
  }

  var2 = var2 + t(var2)
  est2 = est2 + t(est2)

  for (i in 1:n) {
    est2[i,i] = sqrt(est[i,i])
    var2[i,i] = v[pos(i,i,n)] / (4*est[i,i])
  }
  dimnames(est2) = dimnames(var2) = dimnames(est)
  list(estimates=est2, stderr=sqrt(var2))
}

deltify(est)

## $estimates
##           PV1       PV2       PV3
## PV1 1.1169480 0.9161572 0.9183138
## PV2 0.9161572 1.1241876 0.9174176
## PV3 0.9183138 0.9174176 1.1206254
## 
## $stderr
##             PV1         PV2         PV3
## PV1 0.012011386 0.001819093 0.001922409
## PV2 0.001819093 0.010153011 0.001963528
## PV3 0.001922409 0.001963528 0.010355547

References

Freedman Ellis, Greg, and Ben Schneider. 2021. Srvyr: ’Dplyr’-Like Syntax for Summary Statistics of Survey Data. https://CRAN.R-project.org/package=srvyr.

Long, Jacob A. 2020. Jtools: Analysis and Presentation of Social Scientific Data. https://cran.r-project.org/package=jtools.

Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” Journal of Statistical Software 9 (1): 1–19.

———. 2010. Complex Surveys: A Guide to Analysis Using r: A Guide to Analysis Using r. John Wiley; Sons.

Marsman, M., G. Maris, T. M. Bechger, and C. A. W. Glas. 2016. “What can we learn from plausible values?” Psychometrika, 1–16.

R Development Core Team. 2005. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Ivailo Partchev

2022-02-04

References