Survey Calibration
Introduction
Calibration is a widely used technique in survey sampling. Suppose m sampling units in a survey have been assigned initial weights di for i=1,…,m, and furthermore, there are n auxiliary variables whose values in the sample are known. Calibration seeks to improve the initial weights di by finding new weights wi that incorporate this auxiliary information while perturbing the initial weights as little as possible, , the ratio gi=wi/di must be close to one. Such reweighting improves precision of estimates (Chapter 7, T. S. Lumley (2010)).
Let X∈Rm×n be the matrix of survey samples, with each column corresponding to an auxiliary variable. Reweighting can be expressed as the optimization problem (see Davies, Gillard, and Zhigljavsky (2016)):
minimize∑mi=1diϕ(gi)subject toATg=r
with respect to g∈Rm, where ϕ:R→R is a strictly convex function with ϕ(1)=0, r∈Rn are the known population totals of the auxiliary variables, and A∈Rm×n is related to X by Aij=diXij for i=1,…,m and j=1,…,n.
Raking
A common calibration technique is raking, which uses the penalty function ϕ(gi)=gilog(gi)−gi+1 as the calibration metric.
We illustrate with the California Academic Performance Index data in
the survey
package (T. Lumley (2018)) which also supplies facilities for
calibration via the function calibrate
. Both the population dataset
(apipop
) and a simple random sample of m=200 (apisrs
) are
provided. Suppose that we wish to reweight the observations in the
sample using known totals for two variables from the population:
stype
, the school type (elementary, middle or high) and sch.wide
,
whether the school met the yearly target or not. This reweighting
would make the sample more representative of the general population.
The code below estimates the weights using survey::calibrate
.
data(api)
design_api <- svydesign(id = ~dnum, weights = ~pw, data = apisrs)
## Warning in svydesign(id = ~dnum, weights = ~pw, data = apisrs): partial
## argument match of 'id' to 'ids'
## Warning in svydesign.default(id = ~dnum, weights = ~pw, data = apisrs): partial
## argument match of 'id' to 'ids'
## Warning in match.call(definition, call, expand.dots, envir): partial argument
## match of 'id' to 'ids'
formula <- ~stype + sch.wide
T <- apply(model.matrix(object = formula, data = apipop),
2,
sum)
cal_api <- calibrate(design_api, formula, population = T, calfun = cal.raking)
w_survey <- weights(cal_api)
The CVXR
formulation follows.
di <- apisrs$pw
X <- model.matrix(object = formula, data = apisrs)
A <- di * X
n <- nrow(apisrs)
g <- Variable(n)
constraints <- list(t(A) %*% g == T)
## Raking
Phi_R <- Minimize(sum(di * (-entr(g) - g + 1)))
p <- Problem(Phi_R, constraints)
res <- solve(p)
w_cvxr <- di * res$getValue(g)
We compare the results below in a table which show them to be identical.
## Using functions in the *un echoed* preamble of this document...
build_table(d1 = build_df(apisrs, "Survey", w_survey),
d2 = build_df(apisrs, "CVXR", w_cvxr),
title = "Calibration weights from Raking")
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.911 | 15 |
E | Yes | 31.396 | 127 | 31.396 | 127 |
H | No | 29.003 | 13 | 29.003 | 13 |
H | Yes | 31.497 | 12 | 31.497 | 12 |
M | No | 29.033 | 9 | 29.033 | 9 |
M | Yes | 31.529 | 24 | 31.529 | 24 |
Other Calibration Metrics
Two other penalty functions commonly used are:
- Quadratic
ϕQ(g)=12(g−1)2;
- Logit
ϕL(g;l,u)=1C[(g−l)log(g−l1−l)+(u−g)log(u−gu−1)] for C=u−l(1−l)(u−1).
It is again easy to incorporate these in our example and compare to
survey
results.
Quadratic
The survey
function for this calibration is invoked as cal.linear
.
## Quadratic
Phi_Q <- Minimize(sum_squares(g - 1) / 2)
p <- Problem(Phi_Q, constraints)
res <- solve(p)
w_cvxr_q <- di * res$getValue(g)
w_survey_q <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear))
N.B. This example used the SCS
solver (before CVXR v0.99-7)
as
the default ECOS
solver produced a different number of unique
weights. With updates to both ECOSolveR
and scs
, the situation is
now reversed. Such differences are not unheard of among solvers!
However, one can obtain the same results by using additional arguments
solver = "SCS"
and acceleration_lookback = 5L
in the call to
CVXR::solve()
above.
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.907 | 15 |
E | Yes | 31.397 | 127 | 31.397 | 127 |
H | No | 29.005 | 13 | 29.005 | 13 |
H | Yes | 31.495 | 12 | 31.495 | 12 |
M | No | 29.037 | 9 | 29.037 | 9 |
M | Yes | 31.528 | 24 | 31.528 | 24 |
Logistic
Finally, the logistic, which requires bounds l and u on the coefficients; we use l=0.9 and u=1.1.
u <- 1.10; l <- 0.90
w_survey_l <- weights(calibrate(design_api, formula, population = T, calfun = cal.linear,
bounds = c(l, u)))
Phi_L <- Minimize(sum(-entr((g - l) / (u - l)) -
entr((u - g) / (u - l))))
p <- Problem(Phi_L, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_l <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.907 | 15 | 28.929 | 15 |
E | Yes | 31.397 | 127 | 31.394 | 127 |
H | No | 29.005 | 13 | 28.995 | 13 |
H | Yes | 31.495 | 12 | 31.505 | 12 |
M | No | 29.037 | 9 | 29.014 | 9 |
M | Yes | 31.528 | 24 | 31.536 | 24 |
Further Metrics
Following examples in survey::calibrate
, we can try a few other metrics.
First, the hellinger distance.
hellinger <- make.calfun(Fm1 = function(u, bounds) ((1 - u / 2)^-2) - 1,
dF= function(u, bounds) (1 -u / 2)^-3 ,
name = "Hellinger distance")
w_survey_h <- weights(calibrate(design_api, formula, population = T, calfun = hellinger))
Phi_h <- Minimize(sum((1 - g / 2)^(-2)))
p <- Problem(Phi_h, constraints)
res <- solve(p)
w_cvxr_h <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.913 | 15 | 28.890 | 15 |
E | Yes | 31.396 | 127 | 31.399 | 127 |
H | No | 29.002 | 13 | 29.011 | 13 |
H | Yes | 31.498 | 12 | 31.488 | 12 |
M | No | 29.031 | 9 | 29.056 | 9 |
M | Yes | 31.530 | 24 | 31.521 | 24 |
Next, the derivative of the inverse hyperbolic sine.
w_survey_s <- weights(calibrate(design_api, formula, population = T, calfun = cal.sinh,
bounds = c(l, u)))
Phi_s <- Minimize(sum( 0.5 * (exp(g) + exp(-g))))
p <- Problem(Phi_s, c(constraints, list(l <= g, g <= u)))
res <- solve(p)
w_cvxr_s <- di * res$getValue(g)
stype | sch.wide | Survey wts. | Frequency | CVXR wts. | Frequency |
---|---|---|---|---|---|
E | No | 28.911 | 15 | 28.904 | 15 |
E | Yes | 31.396 | 127 | 31.397 | 127 |
H | No | 29.003 | 13 | 29.006 | 13 |
H | Yes | 31.497 | 12 | 31.494 | 12 |
M | No | 29.033 | 9 | 29.041 | 9 |
M | Yes | 31.529 | 24 | 31.526 | 24 |
## Testthat Results: No output is good
Session Info
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sequoia 15.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices datasets utils methods
## [8] base
##
## other attached packages:
## [1] survey_4.4-2 survival_3.7-0 Matrix_1.7-1 dplyr_1.1.4
## [5] kableExtra_1.4.0 CVXR_1.0-15 testthat_3.2.1.1 here_1.0.1
##
## loaded via a namespace (and not attached):
## [1] xfun_0.49 bslib_0.8.0 lattice_0.22-6 Rmosek_10.2.0
## [5] vctrs_0.6.5 tools_4.4.2 generics_0.1.3 tibble_3.2.1
## [9] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3 desc_1.4.3
## [13] assertthat_0.2.1 lifecycle_1.0.4 compiler_4.4.2 stringr_1.5.1
## [17] brio_1.1.5 munsell_0.5.1 gurobi_11.0-0 mitools_2.4
## [21] codetools_0.2-20 ECOSolveR_0.5.5 htmltools_0.5.8.1 sass_0.4.9
## [25] cccp_0.3-1 yaml_2.3.10 gmp_0.7-5 pillar_1.9.0
## [29] jquerylib_0.1.4 MASS_7.3-61 rcbc_0.1.0.9001 clarabel_0.9.0.1
## [33] Rcplex_0.3-6 cachem_1.1.0 tidyselect_1.2.1 digest_0.6.37
## [37] stringi_1.8.4 slam_0.1-54 bookdown_0.41 splines_4.4.2
## [41] rprojroot_2.0.4 fastmap_1.2.0 colorspace_2.1-1 cli_3.6.3
## [45] magrittr_2.0.3 utf8_1.2.4 Rmpfr_0.9-5 scales_1.3.0
## [49] bit64_4.5.2 rmarkdown_2.29 bit_4.5.0 blogdown_1.19
## [53] evaluate_1.0.1 knitr_1.48 Rglpk_0.6-5.1 viridisLite_0.4.2
## [57] rlang_1.1.4 Rcpp_1.0.13-1 glue_1.8.0 DBI_1.2.3
## [61] xml2_1.3.6 pkgload_1.4.0 osqp_0.6.3.3 svglite_2.1.3
## [65] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1 systemfonts_1.1.0