Survey Calibration
Introduction
Calibration is a widely used technique in survey sampling. Suppose \(m\) sampling units in a survey have been assigned initial weights \(d_i\) for \(i = 1,\ldots,m\), and furthermore, there are \(n\) auxiliary variables whose values in the sample are known. Calibration seeks to improve the initial weights \(d_i\) by finding new weights \(w_i\) that incorporate this auxiliary information while perturbing the initial weights as little as possible, , the ratio \(g_i = w_i/d_i\) must be close to one. Such reweighting improves precision of estimates (Chapter 7, T. S. Lumley (2010)).
Let \(X \in {\mathbf R}^{m \times n}\) be the matrix of survey samples, with each column corresponding to an auxiliary variable. Reweighting can be expressed as the optimization problem (see Davies, Gillard, and Zhigljavsky (2016)):
\[ \begin{array}{ll} \mbox{minimize} & \sum_{i=1}^m d_i\phi(g_i) \\ \mbox{subject to} & A^Tg = r \end{array} \]
with respect to \(g \in {\mathbf R}^m\), where \(\phi:{\mathbf R} \rightarrow {\mathbf R}\) is a strictly convex function with \(\phi(1) = 0\), \(r \in {\mathbf R}^n\) are the known population totals of the auxiliary variables, and \(A \in {\mathbf R}^{m \times n}\) is related to \(X\) by \(A_{ij} = d_iX_{ij}\) for \(i = 1,\ldots,m\) and \(j = 1,\ldots,n\).
Raking
A common calibration technique is raking, which uses the penalty function \(\phi(g_i) = g_i\log(g_i) - g_i + 1\) as the calibration metric.
We illustrate with the California Academic Performance Index data in
the survey
package (T. Lumley (2018)) which also supplies facilities for
calibration via the function calibrate
. Both the population dataset
(apipop
) and a simple random sample of \(m = 200\) (apisrs
) are
provided. Suppose that we wish to reweight the observations in the
sample using known totals for two variables from the population:
stype
, the school type (elementary, middle or high) and sch.wide
,
whether the school met the yearly target or not. This reweighting
would make the sample more representative of the general population.
The code below estimates the weights using survey::calibrate
.
data(api)
design_api <- svydesign(id = ~dnum, weights = ~pw, data = apisrs)
## Warning in svydesign(id = ~dnum, weights = ~pw, data = apisrs): partial
## argument match of 'id' to 'ids'
## Warning in svydesign.default(id = ~dnum, weights = ~pw, data = apisrs): partial
## argument match of 'id' to 'ids'
## Warning in match.call(definition, call, expand.dots, envir): partial argument
## match of 'id' to 'ids'
formula <- ~stype + sch.wide
T <- apply(model.matrix(object = formula, data = apipop),
2,
sum)
cal_api <- calibrate(design_api, formula, population = T, calfun = cal.raking)
w_survey <- weights(cal_api)
The CVXR
formulation follows.
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
n <- nrow(apisrs)
g <- Variable(n)
constraints <- list(t(A) %*% g == T)
## Raking
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
We compare the results below in a table which show them to be identical.
## Using functions in the *un echoed* preamble of this document...
build_table(d1 = build_df(apisrs, "Survey", w_survey),
d2 = build_df(apisrs, "CVXR", w_cvxr),
title = "Calibration weights from Raking")
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.911 | 15 |
E | Yes | 31.396 | 127 | 31.396 | 127 |
H | No | 29.003 | 13 | 29.003 | 13 |
H | Yes | 31.497 | 12 | 31.497 | 12 |
M | No | 29.033 | 9 | 29.033 | 9 |
M | Yes | 31.529 | 24 | 31.529 | 24 |
Other Calibration Metrics
Two other penalty functions commonly used are:
- Quadratic \[ \phi^{Q}(g) = \frac{1}{2}(g-1)^2; \]
- Logit \[ \phi^{L}(g; l, u) = \frac{1}{C}\biggl[ (g-l)\log\left(\frac{g-l}{1-l}\right) + (u-g)\log\left(\frac{u-g}{u-1}\right) \biggr] \mbox{ for } C = \frac{u-l}{(1-l)(u-1)}. \]
It is again easy to incorporate these in our example and compare to
survey
results.
Quadratic
The survey
function for this calibration is invoked as cal.linear
.
## Quadratic
Phi_Q <- Minimize(sum_squares(g - 1) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p)
w_cvxr_q <- di * res$getValue(g)
w_survey_q <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear))
N.B. This example used the SCS
solver (before CVXR v0.99-7)
as
the default ECOS
solver produced a different number of unique
weights. With updates to both ECOSolveR
and scs
, the situation is
now reversed. Such differences are not unheard of among solvers!
However, one can obtain the same results by using additional arguments
solver = "SCS"
and acceleration_lookback = 5L
in the call to
CVXR::solve()
above.
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.907 | 15 |
E | Yes | 31.397 | 127 | 31.397 | 127 |
H | No | 29.005 | 13 | 29.005 | 13 |
H | Yes | 31.495 | 12 | 31.495 | 12 |
M | No | 29.037 | 9 | 29.037 | 9 |
M | Yes | 31.528 | 24 | 31.528 | 24 |
Logistic
Finally, the logistic, which requires bounds \(l\) and \(u\) on the coefficients; we use \(l=0.9\) and \(u=1.1\).
u <- 1.10; l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear,
bounds = c(l, u)))
Phi_L <- Minimize(sum(-entr((g - l) / (u - l)) -
entr((u - g) / (u - l))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.929 | 15 |
E | Yes | 31.397 | 127 | 31.394 | 127 |
H | No | 29.005 | 13 | 28.995 | 13 |
H | Yes | 31.495 | 12 | 31.505 | 12 |
M | No | 29.037 | 9 | 29.014 | 9 |
M | Yes | 31.528 | 24 | 31.536 | 24 |
Further Metrics
Following examples in survey::calibrate
, we can try a few other metrics.
First, the hellinger distance.
hellinger <- make.calfun(Fm1 = function(u, bounds) ((1 - u / 2)^-2) - 1,
dF= function(u, bounds) (1 -u / 2)^-3 ,
name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = T, calfun = hellinger))
Phi_h <- Minimize(sum((1 - g / 2)^(-2)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.913 | 15 | 28.890 | 15 |
E | Yes | 31.396 | 127 | 31.399 | 127 |
H | No | 29.002 | 13 | 29.011 | 13 |
H | Yes | 31.498 | 12 | 31.488 | 12 |
M | No | 29.031 | 9 | 29.056 | 9 |
M | Yes | 31.530 | 24 | 31.521 | 24 |
Next, the derivative of the inverse hyperbolic sine.
w_survey_s <- weights(calibrate(design_api, formula, population = T, calfun = cal.sinh,
bounds = c(l, u)))
Phi_s <- Minimize(sum( 0.5 * (exp(g) + exp(-g))))
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.904 | 15 |
E | Yes | 31.396 | 127 | 31.397 | 127 |
H | No | 29.003 | 13 | 29.006 | 13 |
H | Yes | 31.497 | 12 | 31.494 | 12 |
M | No | 29.033 | 9 | 29.041 | 9 |
M | Yes | 31.529 | 24 | 31.526 | 24 |
## Testthat Results: No output is good
Session Info
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] survey_4.4-2 survival_3.7-0 Matrix_1.7-1 dplyr_1.1.4
## [5] kableExtra_1.4.0 CVXR_1.0-15 testthat_3.2.1.1 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] xfun_0.49 bslib_0.8.0 lattice_0.22-6 Rmosek_10.2.0
## [5] vctrs_0.6.5 tools_4.4.2 generics_0.1.3 tibble_3.2.1
## [9] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3 desc_1.4.3
## [13] assertthat_0.2.1 lifecycle_1.0.4 compiler_4.4.2 stringr_1.5.1
## [17] brio_1.1.5 munsell_0.5.1 gurobi_11.0-0 mitools_2.4
## [21] codetools_0.2-20 ECOSolveR_0.5.5 htmltools_0.5.8.1 sass_0.4.9
## [25] cccp_0.3-1 yaml_2.3.10 gmp_0.7-5 pillar_1.9.0
## [29] jquerylib_0.1.4 MASS_7.3-61 rcbc_0.1.0.9001 clarabel_0.9.0.1
## [33] Rcplex_0.3-6 cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37
## [37] stringi_1.8.4 slam_0.1-54 bookdown_0.41 splines_4.4.2
## [41] rprojroot_2.0.4 fastmap_1.2.0 colorspace_2.1-1 cli_3.6.3
## [45] magrittr_2.0.3 utf8_1.2.4 Rmpfr_0.9-5 scales_1.3.0
## [49] bit64_4.5.2 rmarkdown_2.29 bit_4.5.0 blogdown_1.19
## [53] evaluate_1.0.1 knitr_1.48 Rglpk_0.6-5.1 viridisLite_0.4.2
## [57] rlang_1.1.4 Rcpp_1.0.13-1 glue_1.8.0 DBI_1.2.3
## [61] xml2_1.3.6 pkgload_1.4.0 osqp_0.6.3.3 svglite_2.1.3
## [65] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1 systemfonts_1.1.0