# Survey Calibration

## Introduction

Calibration is a widely used technique in survey sampling. Suppose \(m\) sampling units in a survey have been assigned initial weights \(d_i\) for \(i = 1,\ldots,m\), and furthermore, there are \(n\) auxiliary variables whose values in the sample are known. Calibration seeks to improve the initial weights \(d_i\) by finding new weights \(w_i\) that incorporate this auxiliary information while perturbing the initial weights as little as possible, i.e., the ratio \(g_i = w_i/d_i\) must be close to one. Such reweighting improves the precision of estimates (T. S. Lumley (2010), Chapter 7).

Let \(X \in {\mathbf R}^{m \times n}\) be the matrix of survey samples, with each column corresponding to an auxiliary variable. Reweighting can be expressed as the optimization problem (see Davies, Gillard, and Zhigljavsky (2016)):

\[ \begin{array}{ll} \mbox{minimize} & \sum_{i=1}^m d_i\phi(g_i) \\ \mbox{subject to} & A^Tg = r \end{array} \]

with respect to \(g \in {\mathbf R}^m\), where \(\phi:{\mathbf R} \rightarrow {\mathbf R}\) is a strictly convex function with \(\phi(1) = 0\), \(r \in {\mathbf R}^n\) are the known population totals of the auxiliary variables, and \(A \in {\mathbf R}^{m \times n}\) is related to \(X\) by \(A_{ij} = d_iX_{ij}\) for \(i = 1,\ldots,m\) and \(j = 1,\ldots,n\).
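As a sanity check on this setup, note that with \(w_i = d_ig_i\) the constraint \(A^Tg = r\) is exactly the requirement that the calibrated weights reproduce the auxiliary totals, \(X^Tw = r\). A quick base-R illustration on made-up data (not the API survey used below):

```r
## Toy check (made-up data): with w_i = d_i * g_i, the constraint
## t(A) %*% g == r is the same as t(X) %*% w == r.
set.seed(1)
m <- 5; n <- 2
d <- runif(m, 1, 3)            # initial design weights
X <- matrix(rnorm(m * n), m)   # auxiliary variables (columns)
g <- runif(m, 0.9, 1.1)        # candidate weight adjustments
A <- d * X                     # scales row i of X by d_i
w <- d * g                     # calibrated weights
all.equal(drop(t(A) %*% g), drop(t(X) %*% w))  # TRUE
```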

## Raking

A common calibration technique is *raking*, which uses the penalty
function \(\phi(g_i) = g_i\log(g_i) - g_i + 1\) as the calibration
metric.
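Since \(\phi\) must be strictly convex with \(\phi(1) = 0\), deviations of \(g_i\) from one are always penalized. A quick base-R check of these properties for the raking penalty (illustrative only):

```r
## Raking penalty: strictly convex, zero at g = 1, nonnegative elsewhere.
phi <- function(g) g * log(g) - g + 1
gs <- seq(0.1, 3, by = 0.01)
phi(1)                  # 0
all(phi(gs) >= -1e-12)  # TRUE: deviations from g = 1 are always penalized
```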

We illustrate with the California Academic Performance Index data in the `survey` package (T. Lumley (2018)), which also supplies facilities for calibration via the function `calibrate`. Both the population dataset (`apipop`) and a simple random sample of \(m = 200\) (`apisrs`) are provided. Suppose that we wish to reweight the observations in the sample using known totals for two variables from the population: `stype`, the school type (elementary, middle or high), and `sch.wide`, whether the school met the yearly target or not. This reweighting would make the sample more representative of the general population.

The code below estimates the weights using `survey::calibrate`.

```
data(api)
design_api <- svydesign(ids = ~dnum, weights = ~pw, data = apisrs)
```

```
formula <- ~stype + sch.wide
## Known population totals of the auxiliary variables
## (note: T masks the built-in TRUE here)
T <- apply(model.matrix(object = formula, data = apipop), 2, sum)
cal_api <- calibrate(design_api, formula, population = T, calfun = cal.raking)
w_survey <- weights(cal_api)
```

The `CVXR` formulation follows.

```
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
m <- nrow(apisrs)
g <- Variable(m)
constraints <- list(t(A) %*% g == T)
## Raking: d_i * (g_i log g_i - g_i + 1), with g log g written as -entr(g)
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
```

We compare the results below in a table, which shows them to be identical.

```
## Using functions in the (unechoed) preamble of this document...
build_table(d1 = build_df(apisrs, "Survey", w_survey),
            d2 = build_df(apisrs, "CVXR", w_cvxr),
            title = "Calibration weights from Raking")
```

| stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
|---|---|---|---|---|---|
| E | No | 28.911 | 15 | 28.911 | 15 |
| E | Yes | 31.396 | 127 | 31.396 | 127 |
| H | No | 29.003 | 13 | 29.003 | 13 |
| H | Yes | 31.497 | 12 | 31.497 | 12 |
| M | No | 29.033 | 9 | 29.033 | 9 |
| M | Yes | 31.529 | 24 | 31.529 | 24 |

## Other Calibration Metrics

Two other penalty functions commonly used are:

**Quadratic**
\[ \phi^{Q}(g) = \frac{1}{2}(g-1)^2; \]

**Logit**
\[ \phi^{L}(g; l, u) = \frac{1}{C}\biggl[ (g-l)\log\left(\frac{g-l}{1-l}\right) + (u-g)\log\left(\frac{u-g}{u-1}\right) \biggr], \quad C = \frac{u-l}{(1-l)(u-1)}. \]

It is again easy to incorporate these into our example and compare with the `survey` results.
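Both metrics satisfy \(\phi(1) = 0\), as the framework requires. A quick base-R check, using the bounds \(l = 0.9\) and \(u = 1.1\) adopted later in this post:

```r
## Both penalties vanish at g = 1 (phi(1) = 0).
phi_Q <- function(g) (g - 1)^2 / 2
phi_L <- function(g, l, u) {
  C <- (u - l) / ((1 - l) * (u - 1))
  ((g - l) * log((g - l) / (1 - l)) +
     (u - g) * log((u - g) / (u - 1))) / C
}
phi_Q(1)                    # 0
phi_L(1, l = 0.9, u = 1.1)  # 0
```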

### Quadratic

The `survey` function for this calibration is invoked as `cal.linear`.

```
## Quadratic
Phi_Q <- Minimize(sum_squares(g - 1) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p)
w_cvxr_q <- di * res$getValue(g)
w_survey_q <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear))
```
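Worth noting: with equal initial weights the quadratic metric is a least-norm projection, so the solution is available in closed form, \(g = \mathbf{1} + A(A^TA)^{-1}(r - A^T\mathbf{1})\). A base-R sketch on made-up data (not the API sample):

```r
## Closed-form quadratic calibration on toy data:
## minimize ||g - 1||^2 / 2  subject to  t(A) %*% g == r.
set.seed(2)
m <- 6; n <- 2
A <- matrix(rnorm(m * n), m)   # weighted auxiliary matrix (made up)
r <- c(3, -1)                  # target totals (made up)
ones <- rep(1, m)
g <- ones + A %*% solve(crossprod(A), r - crossprod(A, ones))
all.equal(drop(crossprod(A, g)), r)  # TRUE: constraint met exactly
```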

**N.B.** This example used the `SCS` solver (before `CVXR` v0.99-7) as the default `ECOS` solver produced a different number of unique weights. With updates to both `ECOSolveR` and `scs`, the situation is now reversed. *Such differences are not unheard of among solvers!* However, one can obtain the same results by using the additional arguments `solver = "SCS"` and `acceleration_lookback = 5L` in the call to `CVXR::solve()` above.

| stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
|---|---|---|---|---|---|
| E | No | 28.907 | 15 | 28.907 | 15 |
| E | Yes | 31.397 | 127 | 31.397 | 127 |
| H | No | 29.005 | 13 | 29.005 | 13 |
| H | Yes | 31.495 | 12 | 31.495 | 12 |
| M | No | 29.037 | 9 | 29.037 | 9 |
| M | Yes | 31.528 | 24 | 31.528 | 24 |

### Logit

Finally, the logit metric, which requires bounds \(l\) and \(u\) on the weight ratios \(g_i\); we use \(l = 0.9\) and \(u = 1.1\).

```
u <- 1.10; l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = T,
                                calfun = cal.linear, bounds = c(l, u)))
Phi_L <- Minimize(sum(-entr((g - l) / (u - l)) -
                       entr((u - g) / (u - l))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
```
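Note that the objective coded above drops the factor \(1/C\) and the \(\log(1-l)\), \(\log(u-1)\) terms of \(\phi^L\). The dropped pieces are affine in \(g\), and for this simple random sample (equal \(d_i\), intercept column in \(X\)) affine terms are constant on the feasible set, so the minimizer is unchanged. A numeric check in base R that the two expressions differ only by an affine function of \(g\):

```r
## phi^L differs from the entropy-style objective only by an affine
## function of g: phi_L(g) = a * f(g) + b * g + c for constants a, b, c.
l <- 0.9; u <- 1.1
C <- (u - l) / ((1 - l) * (u - 1))
phi_L <- function(g) ((g - l) * log((g - l) / (1 - l)) +
                        (u - g) * log((u - g) / (u - 1))) / C
f <- function(g) {  # per-unit objective used in the CVXR code (-entr terms)
  p <- (g - l) / (u - l); q <- (u - g) / (u - l)
  p * log(p) + q * log(q)
}
gs <- c(0.95, 1.0, 1.05)  # fit a, b, c at three points ...
abc <- solve(cbind(f(gs), gs, 1), phi_L(gs))
g4 <- 1.02                # ... and the relation then holds everywhere
all.equal(phi_L(g4), sum(abc * c(f(g4), g4, 1)))  # TRUE
```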

| stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
|---|---|---|---|---|---|
| E | No | 28.907 | 15 | 28.929 | 15 |
| E | Yes | 31.397 | 127 | 31.394 | 127 |
| H | No | 29.005 | 13 | 28.995 | 13 |
| H | Yes | 31.495 | 12 | 31.505 | 12 |
| M | No | 29.037 | 9 | 29.014 | 9 |
| M | Yes | 31.528 | 24 | 31.536 | 24 |

## Further Metrics

Following the examples in `survey::calibrate`, we can try a few other metrics.

First, the Hellinger distance.

```
hellinger <- make.calfun(Fm1 = function(u, bounds) (1 - u / 2)^-2 - 1,
                         dF = function(u, bounds) (1 - u / 2)^-3,
                         name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = T, calfun = hellinger))
Phi_h <- Minimize(sum((1 - g / 2)^(-2)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
```
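In `make.calfun`'s convention, `dF` should be the derivative of \(F(u) = \mathtt{Fm1}(u) + 1\); a quick numeric derivative check in base R confirms that the pair supplied above is consistent:

```r
## dF should be the derivative of F(u) = Fm1(u) + 1 = (1 - u/2)^-2.
Fm1 <- function(u) (1 - u / 2)^-2 - 1
dF  <- function(u) (1 - u / 2)^-3
u <- 0.3; h <- 1e-6
num_deriv <- (Fm1(u + h) - Fm1(u - h)) / (2 * h)  # central difference
all.equal(num_deriv, dF(u), tolerance = 1e-6)     # TRUE
```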

| stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
|---|---|---|---|---|---|
| E | No | 28.913 | 15 | 28.890 | 15 |
| E | Yes | 31.396 | 127 | 31.399 | 127 |
| H | No | 29.002 | 13 | 29.011 | 13 |
| H | Yes | 31.498 | 12 | 31.488 | 12 |
| M | No | 29.031 | 9 | 29.056 | 9 |
| M | Yes | 31.530 | 24 | 31.521 | 24 |

Next, the derivative of the inverse hyperbolic sine.

```
w_survey_s <- weights(calibrate(design_api, formula, population = T,
                                calfun = cal.sinh, bounds = c(l, u)))
Phi_s <- Minimize(sum(0.5 * (exp(g) + exp(-g))))
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
```
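The objective here is just the hyperbolic cosine, \(\cosh(g) = (e^g + e^{-g})/2\), whose derivative is \(\sinh(g)\); a quick base-R identity check:

```r
## (exp(g) + exp(-g)) / 2 is cosh(g); its derivative is sinh(g).
g <- seq(0.9, 1.1, by = 0.05)
all.equal((exp(g) + exp(-g)) / 2, cosh(g))  # TRUE
h <- 1e-6
all.equal((cosh(1 + h) - cosh(1 - h)) / (2 * h), sinh(1),
          tolerance = 1e-6)                 # TRUE
```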

| stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
|---|---|---|---|---|---|
| E | No | 28.911 | 15 | 28.904 | 15 |
| E | Yes | 31.396 | 127 | 31.397 | 127 |
| H | No | 29.003 | 13 | 29.006 | 13 |
| H | Yes | 31.497 | 12 | 31.494 | 12 |
| M | No | 29.033 | 9 | 29.041 | 9 |
| M | Yes | 31.529 | 24 | 31.526 | 24 |

## Session Info

`sessionInfo()`

```
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin23.4.0
## Running under: macOS Sonoma 14.5
##
## Matrix products: default
## BLAS: /usr/local/Cellar/openblas/0.3.27/lib/libopenblasp-r0.3.27.dylib
## LAPACK: /usr/local/Cellar/r/4.4.0_1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] survey_4.4-2 survival_3.6-4 Matrix_1.7-0 dplyr_1.1.4
## [5] kableExtra_1.4.0 CVXR_1.0-13
##
## loaded via a namespace (and not attached):
## [1] gmp_0.7-4 sass_0.4.9 utf8_1.2.4 generics_0.1.3
## [5] slam_0.1-50 xml2_1.3.6 blogdown_1.19 stringi_1.8.4
## [9] lattice_0.22-6 digest_0.6.35 magrittr_2.0.3 evaluate_0.23
## [13] bookdown_0.39 fastmap_1.2.0 jsonlite_1.8.8 ECOSolveR_0.5.5
## [17] DBI_1.2.2 Rmosek_10.2.0 fansi_1.0.6 viridisLite_0.4.2
## [21] scales_1.3.0 codetools_0.2-20 jquerylib_0.1.4 cli_3.6.2
## [25] Rmpfr_0.9-5 mitools_2.4 rlang_1.1.3 Rglpk_0.6-5.1
## [29] bit64_4.0.5 munsell_0.5.1 splines_4.4.0 cachem_1.1.0
## [33] yaml_2.3.8 tools_4.4.0 osqp_0.6.3.2 Rcplex_0.3-6
## [37] rcbc_0.1.0.9001 colorspace_2.1-0 gurobi_11.0-0 assertthat_0.2.1
## [41] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 stringr_1.5.1
## [45] bit_4.0.5 MASS_7.3-60.2 cccp_0.3-1 pkgconfig_2.0.3
## [49] bslib_0.7.0 pillar_1.9.0 glue_1.7.0 Rcpp_1.0.12
## [53] systemfonts_1.1.0 highr_0.11 xfun_0.44 tibble_3.2.1
## [57] tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.47 htmltools_0.5.8.1
## [61] rmarkdown_2.27 svglite_2.1.3 compiler_4.4.0
```


## References

Davies, G., J. Gillard, and A. Zhigljavsky. 2016. In *Advances in Stochastic and Deterministic Global Optimization*, 87–127. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-29975-4_6.

Lumley, T. 2010. *Complex Surveys: A Guide to Analysis Using R*. Wiley Publishing.

Lumley, T. 2018. *survey: Analysis of Complex Survey Samples*. R package.