# Survey Calibration

## Introduction

Calibration is a widely used technique in survey sampling. Suppose \(m\) sampling units in a survey have been assigned initial weights \(d_i\) for \(i = 1,\ldots,m\), and furthermore, there are \(n\) auxiliary variables whose values in the sample are known. Calibration seeks to improve the initial weights \(d_i\) by finding new weights \(w_i\) that incorporate this auxiliary information while perturbing the initial weights as little as possible, , the ratio \(g_i = w_i/d_i\) must be close to one. Such reweighting improves precision of estimates (Chapter 7, Lumley (2010)).

Let \(X \in {\mathbf R}^{m \times n}\) be the matrix of survey samples, with each column corresponding to an auxiliary variable. Reweighting can be expressed as the optimization problem (see Davies, Gillard, and Zhigljavsky (2016)):

\[ \begin{array}{ll} \mbox{minimize} & \sum_{i=1}^m d_i\phi(g_i) \\ \mbox{subject to} & A^Tg = r \end{array} \]

with respect to \(g \in {\mathbf R}^m\), where \(\phi:{\mathbf R} \rightarrow {\mathbf R}\) is a strictly convex function with \(\phi(1) = 0\), \(r \in {\mathbf R}^n\) are the known population totals of the auxiliary variables, and \(A \in {\mathbf R}^{m \times n}\) is related to \(X\) by \(A_{ij} = d_iX_{ij}\) for \(i = 1,\ldots,m\) and \(j = 1,\ldots,n\).

## Raking

A common calibration technique is *raking*, which uses the penalty
function \(\phi(g_i) = g_i\log(g_i) - g_i + 1\) as the calibration
metric.

We illustrate with the California Academic Performance Index data in
the `survey`

package (Lumley (2018)) which also supplies facilities for
calibration via the function `calibrate`

. Both the population dataset
(`apipop`

) and a simple random sample of \(m = 200\) (`apisrs`

) are
provided. Suppose that we wish to reweight the observations in the
sample using known totals for two variables from the population:
`stype`

, the school type (elementary, middle or high) and `sch.wide`

,
whether the school met the yearly target or not. This reweighting
would make the sample more representative of the general population.

The code below estimates the weights using `survey::calibrate`

.

```
data(api)
design_api <- svydesign(id = ~dnum, weights = ~pw, data = apisrs)
formula <- ~stype + sch.wide
T <- apply(model.matrix(object = formula, data = apipop),
2,
sum)
cal_api <- calibrate(design_api, formula, population = T, calfun = cal.raking)
w_survey <- weights(cal_api)
```

The `CVXR`

formulation follows.

```
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
n <- nrow(apisrs)
g <- Variable(n)
constraints <- list(t(A) %*% g == T)
## Raking
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
```

We compare the results below in a table which show them to be identical.

```
## Using functions in the *un echoed* preamble of this document...
build_table(d1 = build_df(apisrs, "Survey", w_survey),
d2 = build_df(apisrs, "CVXR", w_cvxr),
title = "Calibration weights from Raking")
```

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|

E | No | 28.911 | 15 | 28.911 | 15 |

E | Yes | 31.396 | 127 | 31.396 | 127 |

H | No | 29.003 | 13 | 29.003 | 13 |

H | Yes | 31.497 | 12 | 31.497 | 12 |

M | No | 29.033 | 9 | 29.033 | 9 |

M | Yes | 31.529 | 24 | 31.529 | 24 |

## Other Calibration Metrics

Two other penalty functions commonly used are:

**Quadratic**\[ \phi^{Q}(g) = \frac{1}{2}(g-1)^2; \]**Logit**\[ \phi^{L}(g; l, u) = \frac{1}{C}\biggl[ (g-l)\log\left(\frac{g-l}{1-l}\right) + (u-g)\log\left(\frac{u-g}{u-1}\right) \biggr] \mbox{ for } C = \frac{u-l}{(1-l)(u-1)}. \]

It is again easy to incorporate these in our example and compare to
`survey`

results.

### Quadratic

The `survey`

function for this calibration is invoked as `cal.linear`

.

```
## Quadratic
Phi_Q <- Minimize(sum_squares(g - 1) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p)
w_cvxr_q <- di * res$getValue(g)
w_survey_q <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear))
```

**N.B.** This example used the `SCS`

solver (before `CVXR v0.99-7)`

as
the default `ECOS`

solver produced a different number of unique
weights. With updates to both `ECOSolveR`

and `scs`

, the situation is
now reversed. *Such differences are not unheard of among solvers!*
However, one can obtain the same results by using additional arguments
`solver = "SCS"`

and `acceleration_lookback = 5L`

in the call to
`CVXR::solve()`

above.

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|

E | No | 28.907 | 15 | 28.907 | 15 |

E | Yes | 31.397 | 127 | 31.397 | 127 |

H | No | 29.005 | 13 | 29.005 | 13 |

H | Yes | 31.495 | 12 | 31.495 | 12 |

M | No | 29.037 | 9 | 29.037 | 9 |

M | Yes | 31.528 | 24 | 31.528 | 24 |

### Logistic

Finally, the logistic, which requires bounds \(l\) and \(u\) on the coefficients; we use \(l=0.9\) and \(u=1.1\).

```
u <- 1.10; l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear,
bounds = c(l, u)))
Phi_L <- Minimize(sum(-entr((g - l) / (u - l)) -
entr((u - g) / (u - l))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
```

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|

E | No | 28.907 | 15 | 28.929 | 15 |

E | Yes | 31.397 | 127 | 31.394 | 127 |

H | No | 29.005 | 13 | 28.995 | 13 |

H | Yes | 31.495 | 12 | 31.505 | 12 |

M | No | 29.037 | 9 | 29.014 | 9 |

M | Yes | 31.528 | 24 | 31.536 | 24 |

## Further Metrics

Following examples in `survey::calibrate`

, we can try a few other metrics.

First, the hellinger distance.

```
hellinger <- make.calfun(Fm1 = function(u, bounds) ((1 - u / 2)^-2) - 1,
dF= function(u, bounds) (1 -u / 2)^-3 ,
name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = T, calfun = hellinger))
Phi_h <- Minimize(sum((1 - g / 2)^(-2)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
```

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|

E | No | 28.913 | 15 | 28.890 | 15 |

E | Yes | 31.396 | 127 | 31.399 | 127 |

H | No | 29.002 | 13 | 29.011 | 13 |

H | Yes | 31.498 | 12 | 31.488 | 12 |

M | No | 29.031 | 9 | 29.056 | 9 |

M | Yes | 31.530 | 24 | 31.521 | 24 |

Next, the derivative of the inverse hyperbolic sine.

```
w_survey_s <- weights(calibrate(design_api, formula, population = T, calfun = cal.sinh,
bounds = c(l, u)))
Phi_s <- Minimize(sum( 0.5 * (exp(g) + exp(-g))))
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
```

stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|

E | No | 28.911 | 15 | 28.904 | 15 |

E | Yes | 31.396 | 127 | 31.397 | 127 |

H | No | 29.003 | 13 | 29.006 | 13 |

H | Yes | 31.497 | 12 | 31.494 | 12 |

M | No | 29.033 | 9 | 29.041 | 9 |

M | Yes | 31.529 | 24 | 31.526 | 24 |

## Session Info

`sessionInfo()`

```
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-apple-darwin19.5.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.10_1/lib/libopenblasp-r0.3.10.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] survey_4.0 survival_3.1-12 Matrix_1.2-18 dplyr_1.0.0
## [5] kableExtra_1.1.0 CVXR_1.0-9
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.15 slam_0.1-47 purrr_0.3.4
## [5] mitools_2.4 splines_4.0.2 lattice_0.20-41 Rmosek_9.2.3
## [9] colorspace_1.4-1 vctrs_0.3.2 generics_0.0.2 htmltools_0.5.0
## [13] viridisLite_0.3.0 yaml_2.2.1 gmp_0.6-0 rlang_0.4.7
## [17] pillar_1.4.6 glue_1.4.1 Rmpfr_0.8-1 DBI_1.1.0
## [21] Rcplex_0.3-3 bit64_0.9-7 lifecycle_0.2.0 stringr_1.4.0
## [25] munsell_0.5.0 blogdown_0.19 gurobi_9.0.3.1 rvest_0.3.5
## [29] codetools_0.2-16 evaluate_0.14 knitr_1.28 cccp_0.2-4
## [33] highr_0.8 Rcpp_1.0.5 readr_1.3.1 scales_1.1.1
## [37] osqp_0.6.0.3 webshot_0.5.2 bit_1.1-15.2 hms_0.5.3
## [41] digest_0.6.25 stringi_1.4.6 bookdown_0.19 Rglpk_0.6-4
## [45] ECOSolveR_0.5.3 tools_4.0.2 magrittr_1.5 tibble_3.0.3
## [49] crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 MASS_7.3-51.6
## [53] rcbc_0.1.0.9001 xml2_1.3.2 assertthat_0.2.1 rmarkdown_2.3
## [57] httr_1.4.2 rstudioapi_0.11 R6_2.4.1 compiler_4.0.2
```

## Source

## References

Davies, G., J. Gillard, and A. Zhigljavsky. 2016. “Comparative Study of Different Penalty Functions and Algorithms in Survey Calibration.” In *Advances in Stochastic and Deterministic Global Optimization*, 87–127. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-29975-4_6.

Lumley, Thomas. 2018. “Survey: Analysis of Complex Survey Samples.”

Lumley, Thomas S. 2010. *Complex Surveys: A Guide to Analysis Using R*. Wiley Publishing.