`vignettes/blog/2022-02-02-standard-errors-in-survey.Rmd`

`2022-02-02-standard-errors-in-survey.Rmd`

For a survey statistician, there can hardly be anything more helpful than Thomas Lumley’s `survey`

(Lumley 2004, @lumley). Not surprisingly, it has been downloaded over 3.1 million times at the time of writing, which makes it one of the more popular packages for R (R Development Core Team 2005).

Unfortunately, `survey`

does not have functions to estimate two popular statistics, the standard deviation and the Pearson correlation, and their standard errors. Packages like `srvyr`

(Freedman Ellis and Schneider 2021) or `jtools`

(Long 2020) have attempted to fill this lacune, but they only provide the estimates, not the standard errors.

Consider an object, `est`

, containing `survey`

estimates for the variances of three variables. The variables happen to be plausible values (Marsman et al. 2016) from a large educational survey, and the estimates have been produced by a replication method (BRR, see Lumley (2010), Ch.2.3). Note that I could have had any variables instead of plausible values (remember petal length? sepal width?), and that we are discussing the standard error under repeated sampling form a population. The question about the meaning of the correlation between two sets of plausible values, or the variability seen in a whole bunch of them, is psychometrically relevant, but it will be discussed in a different post.

Applying the generic print method, I get:

```
## variance SE
## PV1 1.2476 0.0268
## PV2 1.2638 0.0252
## PV3 1.2558 0.0255
```

I may be misled about the standard errors if I type

`sqrt(est)`

```
## variance SE
## PV1 1.1169 0.0268
## PV2 1.1242 0.0252
## PV3 1.1206 0.0255
```

On the other hand, if I type

`str(est)`

```
## 'svrepvar' num [1:3, 1:3] 1.25 1.15 1.15 1.15 1.26 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:3] "PV1" "PV2" "PV3"
## ..$ : chr [1:3] "PV1" "PV2" "PV3"
## - attr(*, "var")= num [1:9, 1:9] 0.00072 0.000594 0.000611 0.000594 0.000521 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
## .. ..$ : chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
## ..- attr(*, "means")= Named num [1:9] 1.25 1.15 1.15 1.15 1.26 ...
## .. ..- attr(*, "names")= chr [1:9] "PV1" "PV1" "PV1" "PV2" ...
## - attr(*, "statistic")= chr "variance"
```

I get to see the entire contents of this object. It is a \(3\times 3\) matrix, let us call it \(\hat S\), containing the estimated covariance matrix of my plausible values. It has several attributes, one of which, `var`

, is the \(9\times 9\) covariance matrix of \(\hat S\), which we may call \(\hat V\).

Because the standard deviation is a simple function of the statistic that `survey`

does provide, we can derive its approximate standard error by applying the delta theorem . If \(\hat{s^2}\) is the estimate of the variance and \(\hat{v^2}\) is the variance of that estimate, the variance for the standard deviation is estimated as \(\hat{v^2}\) multiplied by the square of the first derivative of \(\sqrt{x}\), i.e.~\(1/(2\sqrt{x})\), at \(\hat{s^2}\). In other words, the standard error for \(\sqrt{\hat s^2}\) obtains as \(\sqrt{\hat{v^2}/4\hat{s^2}}=0.5\sqrt{\hat{v^2}/\hat{s^2}}\).

The correlation, \(\hat r= \hat{s_{xy}}/\sqrt{\hat{s^2_{x}}\hat{s^2_{y}}}\), is a function of three variables, the covariance and the two variances. We have to apply the multivariate delta theorem: the original covariance matrix is pre- and post-multiplied with the vector of the partial derivatives of the function with respect to each of the three variables. For simplicity, let \(M\) be the \(3\times 3\) submatrix of \(\hat{V}\) containing just the rows and columns that have to do with \(\hat{s_{xy}}\), \(\hat{s^2_{x}}\), and \(\hat{s^2_{y}}\). Then, the standard error of the correlation will be estimated as \(J'MJ\), where

\[ J = \left( \begin{array}{c} \displaystyle{\frac{ \partial \hat{r} } { \partial \hat{s_{xy}} }}\\ \displaystyle{\frac{ \partial \hat r}{\partial \hat{s^2_{x}}}} \\ \displaystyle{\frac{ \partial \hat r}{\partial \hat{s^2_{y}}}} \end{array} \right) = \left( \begin{array}{c} \displaystyle{\frac{ 1 } { \sqrt{ \hat{s^2_{x}}\hat{s^2_{y}}} }}\\ - \displaystyle{\frac{ \hat{s_{xy}}}{2\hat{s^2_{x}}^{3/2}\sqrt{\hat{s^2_{y}}}}} \\ - \displaystyle{\frac{ \hat{s_{xy}}}{2\sqrt{\hat{s^2_{x}}}\hat{s^2_{y}}^{3/2}}} \end{array} \right) \]

The code is provided below. There is just a short function, `deltify`

, that takes as a single argument the object returned by `survey`

and returns a list of two matrices: `estimates`

has the standard deviations along the diagonal and the correlations in the off-diagonal cells, and `stderr`

shows the standard errors similarly arranged:

```
deltify = function(est) {
v = attr(est,"var")
n = nrow(est)
est2 = var2 = matrix(0, n, n)
pos = function(i,j,n){(i-1)*n+j}
for (i in 2:n) {
for (j in 1 : (i-1)) {
est2[i,j] = est[i,j] / (sqrt(est[i,i])*sqrt(est[j,j]))
jac = rep(0, n*n)
jac[pos(i,j,n)] = 1.0 / (sqrt(est[i,i])*sqrt(est[j,j]))
jac[pos(i,i,n)] = - est[i,j] / (2.0 * est[i,i]^1.5 * sqrt(est[j,j]))
jac[pos(j,j,n)] = - est[i,j] / (2.0 * est[j,j]^1.5 * sqrt(est[i,i]))
var2[i,j] = jac %*% v %*% jac
}
}
var2 = var2 + t(var2)
est2 = est2 + t(est2)
for (i in 1:n) {
est2[i,i] = sqrt(est[i,i])
var2[i,i] = v[pos(i,i,n)] / (4*est[i,i])
}
dimnames(est2) = dimnames(var2) = dimnames(est)
list(estimates=est2, stderr=sqrt(var2))
}
deltify(est)
```

```
## $estimates
## PV1 PV2 PV3
## PV1 1.1169480 0.9161572 0.9183138
## PV2 0.9161572 1.1241876 0.9174176
## PV3 0.9183138 0.9174176 1.1206254
##
## $stderr
## PV1 PV2 PV3
## PV1 0.012011386 0.001819093 0.001922409
## PV2 0.001819093 0.010153011 0.001963528
## PV3 0.001922409 0.001963528 0.010355547
```

Freedman Ellis, Greg, and Ben Schneider. 2021. *Srvyr: ’Dplyr’-Like Syntax for Summary Statistics of Survey Data*. https://CRAN.R-project.org/package=srvyr.

Long, Jacob A. 2020. *Jtools: Analysis and Presentation of Social Scientific Data*. https://cran.r-project.org/package=jtools.

Lumley, Thomas. 2004. “Analysis of Complex Survey Samples.” *Journal of Statistical Software* 9 (1): 1–19.

———. 2010. *Complex Surveys: A Guide to Analysis Using R: A Guide to Analysis Using R*. John Wiley; Sons.

Marsman, M., G. Maris, T. M. Bechger, and C. A. W. Glas. 2016. “What can we learn from plausible values?” *Psychometrika*, 1–16.

R Development Core Team. 2005. *R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.