# Survey Calibration

## Introduction

Calibration is a widely used technique in survey sampling. Suppose \(m\) sampling units in a survey have been assigned initial weights \(d_i\) for \(i = 1,\ldots,m\), and furthermore, there are \(n\) auxiliary variables whose values in the sample are known. Calibration seeks to improve the initial weights \(d_i\) by finding new weights \(w_i\) that incorporate this auxiliary information while perturbing the initial weights as little as possible; that is, the ratio \(g_i = w_i/d_i\) must stay close to one. Such reweighting improves the precision of estimates (Chapter 7, Lumley (2010)).

Let \(X \in {\mathbf R}^{m \times n}\) be the matrix of survey samples, with each column corresponding to an auxiliary variable. Reweighting can be expressed as the optimization problem (see Davies, Gillard, and Zhigljavsky (2016)):

\[ \begin{array}{ll} \mbox{minimize} & \sum_{i=1}^m d_i\phi(g_i) \\ \mbox{subject to} & A^Tg = r \end{array} \]

with respect to \(g \in {\mathbf R}^m\), where \(\phi:{\mathbf R} \rightarrow {\mathbf R}\) is a strictly convex function with \(\phi(1) = 0\), \(r \in {\mathbf R}^n\) are the known population totals of the auxiliary variables, and \(A \in {\mathbf R}^{m \times n}\) is related to \(X\) by \(A_{ij} = d_iX_{ij}\) for \(i = 1,\ldots,m\) and \(j = 1,\ldots,n\).

## Raking

A common calibration technique is *raking*, which uses the penalty
function \(\phi(g_i) = g_i\log(g_i) - g_i + 1\) as the calibration
metric.
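As a quick base-R sanity check (no extra packages, and not part of the original analysis), the raking penalty vanishes at \(g = 1\) and is strictly positive elsewhere, so any departure of the weight ratio from one is penalized:

```r
## The raking penalty phi(g) = g*log(g) - g + 1 is zero at g = 1 and
## strictly positive for any other g > 0.
phi_raking <- function(g) g * log(g) - g + 1

phi_raking(1)                  # exactly 0 at g = 1
gvals <- c(0.5, 0.9, 1.1, 2)
phi_raking(gvals)              # strictly positive away from 1
```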

We illustrate with the California Academic Performance Index data in the `survey` package (Lumley (2018)), which also supplies facilities for calibration via the function `calibrate`. Both the population dataset (`apipop`) and a simple random sample of \(m = 200\) observations (`apisrs`) are provided. Suppose that we wish to reweight the observations in the sample using known totals for two variables from the population: `stype`, the school type (elementary, middle or high), and `sch.wide`, whether or not the school met the yearly target. This reweighting would make the sample more representative of the general population.

The code below estimates the weights using `survey::calibrate`.

```
library(survey)
data(api)
design_api <- svydesign(id = ~dnum, weights = ~pw, data = apisrs)
formula <- ~stype + sch.wide
T <- apply(model.matrix(object = formula, data = apipop), 2, sum)
cal_api <- calibrate(design_api, formula, population = T, calfun = cal.raking)
w_survey <- weights(cal_api)
```

The `CVXR` formulation follows.

```
library(CVXR)
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
m <- nrow(apisrs)
g <- Variable(m)
constraints <- list(t(A) %*% g == T)
## Raking penalty: g * log(g) - g + 1, with -entr(g) = g * log(g)
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
```

We compare the results below in a table, which shows them to be identical.

```
## Using functions in the *unechoed* preamble of this document...
build_table(d1 = build_df(apisrs, "Survey", w_survey),
            d2 = build_df(apisrs, "CVXR", w_cvxr),
            title = "Calibration weights from Raking")
```

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.911 | 15 |
E | Yes | 31.396 | 127 | 31.396 | 127 |
H | No | 29.003 | 13 | 29.003 | 13 |
H | Yes | 31.497 | 12 | 31.497 | 12 |
M | No | 29.033 | 9 | 29.033 | 9 |
M | Yes | 31.529 | 24 | 31.529 | 24 |
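Raking also admits a solver-free cross-check. The KKT conditions for the raking penalty give \(g = \exp(X\lambda)\) for a dual vector \(\lambda \in {\mathbf R}^n\), and \(\lambda\) solves \(A^T\exp(X\lambda) = r\), which Newton's method dispatches in a few iterations. A minimal base-R sketch on synthetic data (all objects are suffixed "s" and are illustrative only, not the API example):

```r
## Dual form of raking: g = exp(X %*% lambda), with lambda chosen so that
## t(A) %*% exp(X %*% lambda) = r.  Newton's method on this residual.
set.seed(123)
ms <- 50
Xs <- cbind(1, rbinom(ms, 1, 0.4), rbinom(ms, 1, 0.6))  # intercept + 2 dummies
ds <- runif(ms, 5, 15)                                  # initial weights
As <- ds * Xs
lam_true <- c(0.05, -0.1, 0.08)
rs <- colSums(As * c(exp(Xs %*% lam_true)))             # feasible totals

lam <- rep(0, ncol(Xs))
for (it in 1:25) {
  e <- c(exp(Xs %*% lam))
  f <- colSums(As * e) - rs             # constraint residual
  if (max(abs(f)) < 1e-10) break
  J <- t(As * e) %*% Xs                 # Jacobian of the residual
  lam <- lam - solve(J, f)
}
gs <- c(exp(Xs %*% lam))
max(abs(colSums(As * gs) - rs))         # ~ 0: calibration constraints hold
```

Since the problem is strictly convex, the recovered ratios coincide with the \(g\) used to construct the totals.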

## Other Calibration Metrics

Two other commonly used penalty functions are:

**Quadratic**
\[ \phi^{Q}(g) = \frac{1}{2}(g-1)^2; \]

**Logit**
\[ \phi^{L}(g; l, u) = \frac{1}{C}\biggl[ (g-l)\log\left(\frac{g-l}{1-l}\right) + (u-g)\log\left(\frac{u-g}{u-1}\right) \biggr] \mbox{ for } C = \frac{u-l}{(1-l)(u-1)}. \]
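The constant \(C\) is what makes the logit metric comparable to the others: it ensures \(\phi^{L}(1) = 0\) and a curvature of one at \(g = 1\), matching \((\phi^{Q})''(1) = 1\). A small base-R check (the bounds \(l = 0.9\), \(u = 1.1\) here are illustrative, matching those used later):

```r
## The normalizing constant C gives phi_L(1) = 0 and phi_L''(1) = 1,
## the same curvature at g = 1 as the quadratic metric.
l <- 0.9; u <- 1.1
C <- (u - l) / ((1 - l) * (u - 1))
phi_L <- function(g) ((g - l) * log((g - l) / (1 - l)) +
                      (u - g) * log((u - g) / (u - 1))) / C

phi_L(1)                                            # 0 at g = 1
h <- 1e-5
(phi_L(1 + h) - 2 * phi_L(1) + phi_L(1 - h)) / h^2  # ~ 1: curvature matches
```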

It is again easy to incorporate these into our example and compare with the `survey` results.

### Quadratic

The `survey` function for this calibration is invoked as `cal.linear`.

```
## Quadratic
Phi_Q <- Minimize(sum_squares(g - 1) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p, solver = "SCS")
w_cvxr_q <- di * res$getValue(g)
w_survey_q <- weights(calibrate(design_api, formula, population = T,
                                calfun = cal.linear))
```

Note the use of the `SCS` solver above; the default `ECOS` solver produces a different number of unique weights, for reasons we have not fully investigated yet. (*Such differences are not unheard of among solvers!*)

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.907 | 15 |
E | Yes | 31.397 | 127 | 31.397 | 127 |
H | No | 29.005 | 13 | 29.005 | 13 |
H | Yes | 31.495 | 12 | 31.495 | 12 |
M | No | 29.037 | 9 | 29.037 | 9 |
M | Yes | 31.528 | 24 | 31.528 | 24 |
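The quadratic metric is also the one case with a closed-form answer, since minimizing \(\frac{1}{2}\|g - 1\|^2\) subject to \(A^Tg = r\) is just a projection of the all-ones vector onto the constraint set, so \(g = 1 + A(A^TA)^{-1}(r - A^T\mathbf{1})\). This makes a handy cross-check on solver output; a base-R sketch on synthetic data (objects suffixed "q", illustrative only):

```r
## Closed form for  minimize ||g - 1||^2 / 2  subject to  t(A) %*% g == r :
## g = 1 + A %*% solve(t(A) %*% A, r - t(A) %*% rep(1, m)).
set.seed(42)
mq <- 60
Xq <- cbind(1, rbinom(mq, 1, 0.5))      # intercept + one dummy variable
dq <- runif(mq, 10, 20)                 # initial weights
Aq <- dq * Xq
rq <- colSums(Aq) * c(1.02, 0.97)       # perturbed totals (illustrative)

gq <- 1 + c(Aq %*% solve(t(Aq) %*% Aq, rq - colSums(Aq)))
max(abs(colSums(Aq * gq) - rq))         # ~ 0: constraints satisfied
```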

### Logistic

Finally, the logit metric, which requires bounds \(l\) and \(u\) on the weight ratios \(g_i\); we use \(l = 0.9\) and \(u = 1.1\).

```
u <- 1.10; l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = T,
                                calfun = cal.logit, bounds = c(l, u)))
## -entr(x) = x * log(x); with these symmetric bounds the objective below
## matches phi^L up to constants that do not change the minimizer.
Phi_L <- Minimize(sum(-entr((g - l) / (u - l)) -
                      entr((u - g) / (u - l))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
```

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.929 | 15 |
E | Yes | 31.397 | 127 | 31.394 | 127 |
H | No | 29.005 | 13 | 28.995 | 13 |
H | Yes | 31.495 | 12 | 31.505 | 12 |
M | No | 29.037 | 9 | 29.014 | 9 |
M | Yes | 31.528 | 24 | 31.536 | 24 |

## Further Metrics

Following examples in `survey::calibrate`, we can try a few other metrics.

First, the Hellinger distance.

```
hellinger <- make.calfun(Fm1 = function(u, bounds) (1 - u / 2)^-2 - 1,
                         dF = function(u, bounds) (1 - u / 2)^-3,
                         name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = T,
                                calfun = hellinger))
Phi_h <- Minimize(sum((1 - g / 2)^(-2)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
```
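`make.calfun` expects `dF` to be the derivative of `Fm1`, and for the Hellinger metric \(\frac{d}{du}\bigl[(1 - u/2)^{-2} - 1\bigr] = (1 - u/2)^{-3}\). A base-R finite-difference check of that relationship (no `survey` needed; the unused `bounds` argument can be omitted because R evaluates arguments lazily):

```r
## Central finite difference confirming dF is the derivative of Fm1
## for the Hellinger calibration functions above.
Fm1 <- function(u, bounds) (1 - u / 2)^-2 - 1
dF  <- function(u, bounds) (1 - u / 2)^-3

ugrid <- seq(-0.5, 0.5, by = 0.1)
h <- 1e-6
num_deriv <- (Fm1(ugrid + h) - Fm1(ugrid - h)) / (2 * h)
max(abs(num_deriv - dF(ugrid)))        # ~ 0: dF matches the derivative of Fm1
```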

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.913 | 15 | 28.890 | 15 |
E | Yes | 31.396 | 127 | 31.399 | 127 |
H | No | 29.002 | 13 | 29.011 | 13 |
H | Yes | 31.498 | 12 | 31.488 | 12 |
M | No | 29.031 | 9 | 29.056 | 9 |
M | Yes | 31.530 | 24 | 31.521 | 24 |

Next, the derivative of the inverse hyperbolic sine.

```
w_survey_s <- weights(calibrate(design_api, formula, population = T,
                                calfun = cal.sinh, bounds = c(l, u)))
Phi_s <- Minimize(sum(0.5 * (exp(g) + exp(-g))))   # cosh(g)
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
```

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.904 | 15 |
E | Yes | 31.396 | 127 | 31.397 | 127 |
H | No | 29.003 | 13 | 29.006 | 13 |
H | Yes | 31.497 | 12 | 31.494 | 12 |
M | No | 29.033 | 9 | 29.041 | 9 |
M | Yes | 31.529 | 24 | 31.526 | 24 |

## Session Info

`sessionInfo()`

```
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin18.5.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] survey_3.36 survival_2.44-1.1 Matrix_1.2-17 dplyr_0.8.1
## [5] kableExtra_1.1.0 CVXR_0.99-6
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.7 purrr_0.3.2
## [4] mitools_2.4 splines_3.6.0 lattice_0.20-38
## [7] colorspace_1.4-1 htmltools_0.3.6 viridisLite_0.3.0
## [10] yaml_2.2.0 gmp_0.5-13.5 rlang_0.3.4
## [13] R.oo_1.22.0 pillar_1.4.1 glue_1.3.1
## [16] Rmpfr_0.7-2 DBI_1.0.0 R.utils_2.8.0
## [19] bit64_0.9-7 scs_1.2-3 stringr_1.4.0
## [22] munsell_0.5.0 blogdown_0.12.1 rvest_0.3.4
## [25] R.methodsS3_1.7.1 evaluate_0.14 knitr_1.23
## [28] highr_0.8 Rcpp_1.0.1 readr_1.3.1
## [31] scales_1.0.0 webshot_0.5.1 bit_1.1-14
## [34] hms_0.4.2 digest_0.6.19 stringi_1.4.3
## [37] bookdown_0.11 ECOSolveR_0.5.2 tools_3.6.0
## [40] magrittr_1.5 tibble_2.1.2 crayon_1.3.4
## [43] pkgconfig_2.0.2 MASS_7.3-51.4 xml2_1.2.0
## [46] assertthat_0.2.1 rmarkdown_1.13 httr_1.4.0
## [49] rstudioapi_0.10 R6_2.4.0 compiler_3.6.0
```

## References

Davies, G., J. Gillard, and A. Zhigljavsky. 2016. “Comparative Study of Different Penalty Functions and Algorithms in Survey Calibration.” In *Advances in Stochastic and Deterministic Global Optimization*, 87–127. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-29975-4_6.

Lumley, Thomas. 2018. “Survey: Analysis of Complex Survey Samples.”

Lumley, Thomas S. 2010. *Complex Surveys: A Guide to Analysis Using R*. Wiley Publishing.