Survey Calibration
Introduction
Calibration is a widely used technique in survey sampling. Suppose \(m\) sampling units in a survey have been assigned initial weights \(d_i\) for \(i = 1,\ldots,m\), and furthermore, there are \(n\) auxiliary variables whose values in the sample are known. Calibration seeks to improve the initial weights \(d_i\) by finding new weights \(w_i\) that incorporate this auxiliary information while perturbing the initial weights as little as possible, , the ratio \(g_i = w_i/d_i\) must be close to one. Such reweighting improves precision of estimates (Chapter 7, T. S. Lumley (2010)).
Let \(X \in {\mathbf R}^{m \times n}\) be the matrix of survey samples, with each column corresponding to an auxiliary variable. Reweighting can be expressed as the optimization problem (see Davies, Gillard, and Zhigljavsky (2016)):
\[ \begin{array}{ll} \mbox{minimize} & \sum_{i=1}^m d_i\phi(g_i) \\ \mbox{subject to} & A^Tg = r \end{array} \]
with respect to \(g \in {\mathbf R}^m\), where \(\phi:{\mathbf R} \rightarrow {\mathbf R}\) is a strictly convex function with \(\phi(1) = 0\), \(r \in {\mathbf R}^n\) are the known population totals of the auxiliary variables, and \(A \in {\mathbf R}^{m \times n}\) is related to \(X\) by \(A_{ij} = d_iX_{ij}\) for \(i = 1,\ldots,m\) and \(j = 1,\ldots,n\).
Raking
A common calibration technique is raking, which uses the penalty function \(\phi(g_i) = g_i\log(g_i) - g_i + 1\) as the calibration metric.
We illustrate with the California Academic Performance Index data in
the survey
package (T. Lumley (2018)) which also supplies facilities for
calibration via the function calibrate
. Both the population dataset
(apipop
) and a simple random sample of \(m = 200\) (apisrs
) are
provided. Suppose that we wish to reweight the observations in the
sample using known totals for two variables from the population:
stype
, the school type (elementary, middle or high) and sch.wide
,
whether the school met the yearly target or not. This reweighting
would make the sample more representative of the general population.
The code below estimates the weights using survey::calibrate
.
data(api)
design_api <- svydesign(id = ~dnum, weights = ~pw, data = apisrs)
formula <- ~stype + sch.wide
T <- apply(model.matrix(object = formula, data = apipop),
2,
sum)
cal_api <- calibrate(design_api, formula, population = T, calfun = cal.raking)
w_survey <- weights(cal_api)
The CVXR
formulation follows.
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
n <- nrow(apisrs)
g <- Variable(n)
constraints <- list(t(A) %*% g == T)
## Raking
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
We compare the results below in a table which show them to be identical.
## Using functions in the *un echoed* preamble of this document...
build_table(d1 = build_df(apisrs, "Survey", w_survey),
d2 = build_df(apisrs, "CVXR", w_cvxr),
title = "Calibration weights from Raking")
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.911 | 15 |
E | Yes | 31.396 | 127 | 31.396 | 127 |
H | No | 29.003 | 13 | 29.003 | 13 |
H | Yes | 31.497 | 12 | 31.497 | 12 |
M | No | 29.033 | 9 | 29.033 | 9 |
M | Yes | 31.529 | 24 | 31.529 | 24 |
Other Calibration Metrics
Two other penalty functions commonly used are:
- Quadratic \[ \phi^{Q}(g) = \frac{1}{2}(g-1)^2; \]
- Logit \[ \phi^{L}(g; l, u) = \frac{1}{C}\biggl[ (g-l)\log\left(\frac{g-l}{1-l}\right) + (u-g)\log\left(\frac{u-g}{u-1}\right) \biggr] \mbox{ for } C = \frac{u-l}{(1-l)(u-1)}. \]
It is again easy to incorporate these in our example and compare to
survey
results.
Quadratic
The survey
function for this calibration is invoked as cal.linear
.
## Quadratic
Phi_Q <- Minimize(sum_squares(g - 1) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p)
w_cvxr_q <- di * res$getValue(g)
w_survey_q <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear))
N.B. This example used the SCS
solver (before CVXR v0.99-7)
as
the default ECOS
solver produced a different number of unique
weights. With updates to both ECOSolveR
and scs
, the situation is
now reversed. Such differences are not unheard of among solvers!
However, one can obtain the same results by using additional arguments
solver = "SCS"
and acceleration_lookback = 5L
in the call to
CVXR::solve()
above.
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.907 | 15 |
E | Yes | 31.397 | 127 | 31.397 | 127 |
H | No | 29.005 | 13 | 29.005 | 13 |
H | Yes | 31.495 | 12 | 31.495 | 12 |
M | No | 29.037 | 9 | 29.037 | 9 |
M | Yes | 31.528 | 24 | 31.528 | 24 |
Logistic
Finally, the logistic, which requires bounds \(l\) and \(u\) on the coefficients; we use \(l=0.9\) and \(u=1.1\).
u <- 1.10; l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear,
bounds = c(l, u)))
Phi_L <- Minimize(sum(-entr((g - l) / (u - l)) -
entr((u - g) / (u - l))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.929 | 15 |
E | Yes | 31.397 | 127 | 31.394 | 127 |
H | No | 29.005 | 13 | 28.995 | 13 |
H | Yes | 31.495 | 12 | 31.505 | 12 |
M | No | 29.037 | 9 | 29.014 | 9 |
M | Yes | 31.528 | 24 | 31.536 | 24 |
Further Metrics
Following examples in survey::calibrate
, we can try a few other metrics.
First, the hellinger distance.
hellinger <- make.calfun(Fm1 = function(u, bounds) ((1 - u / 2)^-2) - 1,
dF= function(u, bounds) (1 -u / 2)^-3 ,
name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = T, calfun = hellinger))
Phi_h <- Minimize(sum((1 - g / 2)^(-2)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.913 | 15 | 28.890 | 15 |
E | Yes | 31.396 | 127 | 31.399 | 127 |
H | No | 29.002 | 13 | 29.011 | 13 |
H | Yes | 31.498 | 12 | 31.488 | 12 |
M | No | 29.031 | 9 | 29.056 | 9 |
M | Yes | 31.530 | 24 | 31.521 | 24 |
Next, the derivative of the inverse hyperbolic sine.
w_survey_s <- weights(calibrate(design_api, formula, population = T, calfun = cal.sinh,
bounds = c(l, u)))
Phi_s <- Minimize(sum( 0.5 * (exp(g) + exp(-g))))
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.904 | 15 |
E | Yes | 31.396 | 127 | 31.397 | 127 |
H | No | 29.003 | 13 | 29.006 | 13 |
H | Yes | 31.497 | 12 | 31.494 | 12 |
M | No | 29.033 | 9 | 29.041 | 9 |
M | Yes | 31.529 | 24 | 31.526 | 24 |
Session Info
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin21.6.0 (64-bit)
## Running under: macOS Ventura 13.0
##
## Matrix products: default
## BLAS: /usr/local/Cellar/openblas/0.3.21/lib/libopenblasp-r0.3.21.dylib
## LAPACK: /usr/local/Cellar/r/4.2.1_4/lib/R/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] survey_4.1-1 survival_3.4-0 Matrix_1.5-1 dplyr_1.0.10
## [5] kableExtra_1.3.4 CVXR_1.0-11
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 svglite_2.1.0 gurobi_9.5-2 lattice_0.20-45
## [5] rcbc_0.1.0.9001 Rmosek_10.0.25 assertthat_0.2.1 digest_0.6.30
## [9] utf8_1.2.2 gmp_0.6-6 slam_0.1-50 R6_2.5.1
## [13] evaluate_0.17 highr_0.9 httr_1.4.4 blogdown_1.13
## [17] pillar_1.8.1 rlang_1.0.6 rstudioapi_0.14 jquerylib_0.1.4
## [21] rmarkdown_2.17 splines_4.2.1 osqp_0.6.0.5 webshot_0.5.4
## [25] stringr_1.4.1 bit_4.0.4 munsell_0.5.0 compiler_4.2.1
## [29] xfun_0.34 pkgconfig_2.0.3 systemfonts_1.0.4 Rglpk_0.6-4
## [33] htmltools_0.5.3 mitools_2.4 Rcplex_0.3-5 tidyselect_1.2.0
## [37] tibble_3.1.8 bookdown_0.29 codetools_0.2-18 fansi_1.0.3
## [41] viridisLite_0.4.1 MASS_7.3-58.1 jsonlite_1.8.3 lifecycle_1.0.3
## [45] DBI_1.1.3 magrittr_2.0.3 scales_1.2.1 cli_3.4.1
## [49] stringi_1.7.8 cachem_1.0.6 Rmpfr_0.8-9 xml2_1.3.3
## [53] bslib_0.4.0 generics_0.1.3 vctrs_0.5.0 tools_4.2.1
## [57] bit64_4.0.5 glue_1.6.2 fastmap_1.1.0 yaml_2.3.6
## [61] colorspace_2.0-3 cccp_0.2-9 rvest_1.0.3 ECOSolveR_0.5.4
## [65] knitr_1.40 sass_0.4.2