Near Isotonic and Near Convex Regression

Given a set of data points \(y \in {\mathbf R}^m\), R. J. Tibshirani, Hoefling, and Tibshirani (2011) fit a nearly-isotonic approximation \(\beta \in {\mathbf R}^m\) by solving

\[\begin{array}{ll} \underset{\beta}{\mbox{minimize}} & \frac{1}{2}\sum_{i=1}^m (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{m-1}(\beta_i - \beta_{i+1})_+, \end{array}\]

where \(\lambda \geq 0\) is a penalty parameter and \(x_+ =\max(x,0)\). This can be directly formulated in CVXR. As an example, we use global warming data from the Carbon Dioxide Information Analysis Center (CDIAC). The data points are the annual temperature anomalies relative to the 1961–1990 mean.

suppressMessages(suppressWarnings(library(CVXR)))
data(cdiac)
str(cdiac)
## 'data.frame':    166 obs. of  14 variables:
##  $ year  : int  1850 1851 1852 1853 1854 1855 1856 1857 1858 1859 ...
##  $ jan   : num  -0.702 -0.303 -0.308 -0.177 -0.36 -0.176 -0.119 -0.512 -0.532 -0.307 ...
##  $ feb   : num  -0.284 -0.362 -0.477 -0.33 -0.28 -0.4 -0.373 -0.344 -0.707 -0.192 ...
##  $ mar   : num  -0.732 -0.485 -0.505 -0.318 -0.284 -0.303 -0.513 -0.434 -0.55 -0.334 ...
##  $ apr   : num  -0.57 -0.445 -0.559 -0.352 -0.349 -0.217 -0.371 -0.646 -0.517 -0.203 ...
##  $ may   : num  -0.325 -0.302 -0.209 -0.268 -0.23 -0.336 -0.119 -0.567 -0.651 -0.31 ...
##  $ jun   : num  -0.213 -0.189 -0.038 -0.179 -0.215 -0.16 -0.288 -0.31 -0.58 -0.25 ...
##  $ jul   : num  -0.128 -0.215 -0.016 -0.059 -0.228 -0.268 -0.297 -0.544 -0.324 -0.285 ...
##  $ aug   : num  -0.233 -0.153 -0.195 -0.148 -0.163 -0.159 -0.305 -0.327 -0.28 -0.104 ...
##  $ sep   : num  -0.444 -0.108 -0.125 -0.409 -0.115 -0.339 -0.459 -0.393 -0.339 -0.575 ...
##  $ oct   : num  -0.452 -0.063 -0.216 -0.359 -0.188 -0.211 -0.384 -0.467 -0.2 -0.255 ...
##  $ nov   : num  -0.19 -0.03 -0.187 -0.256 -0.369 -0.212 -0.608 -0.665 -0.644 -0.316 ...
##  $ dec   : num  -0.268 -0.067 0.083 -0.444 -0.232 -0.51 -0.44 -0.356 -0.3 -0.363 ...
##  $ annual: num  -0.375 -0.223 -0.224 -0.271 -0.246 -0.271 -0.352 -0.46 -0.466 -0.286 ...

Since we plan to fit the regression and also get some idea of the standard errors, we write a function that computes the fit for use in bootstrapping.

neariso_fit <- function(y, lambda) {
    m <- length(y)
    beta <- Variable(m)
    # diff(beta)[i] is beta[i+1] - beta[i]; negate it to penalize decreases, as in the formula
    obj <- 0.5 * sum((y - beta)^2) + lambda * sum(pos(-diff(beta)))
    prob <- Problem(Minimize(obj))
    solve(prob)$getValue(beta)
}

The pos atom evaluates \(x_+\) elementwise on the input expression.
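
To see what the penalty measures, the objective can be evaluated by hand for a fixed candidate fit. The sketch below uses only base R, with pmax(x, 0) standing in for the pos atom on plain numbers. The helper name neariso_objective and the toy vector are illustrative assumptions, not part of CVXR or of the original analysis.

```r
# Evaluate the nearly-isotonic objective for a fixed candidate beta.
# Base R only: pmax(x, 0) plays the role of the pos atom on plain numbers.
neariso_objective <- function(beta, y, lambda) {
    fit_term <- 0.5 * sum((y - beta)^2)
    # diff(beta)[i] is beta[i + 1] - beta[i], so -diff(beta)[i] is beta[i] - beta[i + 1]
    penalty <- sum(pmax(-diff(beta), 0))
    fit_term + lambda * penalty
}

y_toy <- c(1, 2, 1.5, 3, 2)   # hypothetical data containing two decreases

# Candidate equal to the data: zero fit error, but both decreases are penalized
obj_data <- neariso_objective(y_toy, y_toy, lambda = 1)

# Nondecreasing candidate: zero penalty, but a nonzero fit error
obj_sorted <- neariso_objective(sort(y_toy), y_toy, lambda = 1)

print(c(data = obj_data, sorted = obj_sorted))
```

Any nondecreasing candidate incurs no penalty, so a larger \(\lambda\) pushes the minimizer toward an isotonic fit, while \(\lambda = 0\) simply returns the data.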

The boot package provides all the tools for bootstrapping but requires a statistic function with a particular signature: the data (here a data frame), followed by the bootstrap indices and any further arguments (here \(\lambda\)). The function below follows that signature.

NOTE: In what follows, we use a very small number of bootstrap samples because the fits are time-consuming.

neariso_fit_stat <- function(data, index, lambda) {
    sample <- data[index,]                  # Bootstrap sample of rows
    sample <- sample[order(sample$year),]   # Order ascending by year
    neariso_fit(sample$annual, lambda)
}
library(boot)
set.seed(123)
boot.neariso <- boot(data = cdiac, statistic = neariso_fit_stat, R = 10, lambda = 0.44)

ci.neariso <- t(sapply(seq_len(nrow(cdiac)),
                       function(i) boot.ci(boot.out = boot.neariso, conf = 0.95,
                                           type = "norm", index = i)$normal[-1]))
data.neariso <- data.frame(year = cdiac$year, annual = cdiac$annual, est = boot.neariso$t0,
                           lower = ci.neariso[, 1], upper = ci.neariso[, 2])
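
The same calling convention can be tried on a much cheaper statistic. The following sketch is a toy illustration, not part of the original analysis: the data frame toy, the function trimmed_mean_stat, and the choice of a trimmed mean are all assumptions made for the example. It shows how boot passes the extra argument through to the statistic and what $normal[-1] extracts from the boot.ci result.

```r
library(boot)

# Statistic with the signature boot() expects: data, resampling indices, extras.
# Here the extra argument is a trimming fraction (hypothetical, for illustration).
trimmed_mean_stat <- function(data, index, trim) {
    resampled <- data[index, ]          # bootstrap sample of rows
    mean(resampled$annual, trim = trim)
}

set.seed(1)
toy <- data.frame(year = 1:30, annual = rnorm(30))   # made-up data

# Arguments other than data, statistic and R are forwarded to the statistic
toy_boot <- boot(data = toy, statistic = trimmed_mean_stat, R = 200, trim = 0.1)

# $normal is a one-row matrix holding (conf level, lower, upper);
# dropping the first entry leaves the two interval endpoints
toy_normal <- boot.ci(boot.out = toy_boot, conf = 0.95, type = "norm")$normal
toy_bounds <- toy_normal[-1]
print(toy_bounds)
```

In the near-isotonic code the statistic returns a whole vector of fitted values, which is why boot.ci is called once per component through its index argument.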

We can now plot the fit and confidence bands for the near isotonic fit.

library(ggplot2)
(plot.neariso <- ggplot(data = data.neariso) +
     geom_point(mapping = aes(year, annual), color = "red") +
     geom_line(mapping = aes(year, est), color = "blue") +
     geom_ribbon(mapping = aes(x = year, ymin = lower, ymax = upper), alpha = 0.3) +
     labs(x = "Year", y = "Temperature Anomalies")
)

The curve follows the data well, but exhibits some choppiness in regions with a steep trend.

For a smoother curve, we can solve for the nearly-convex fit described in the same paper:

\[ \begin{array}{ll} \underset{\beta}{\mbox{minimize}} & \frac{1}{2}\sum_{i=1}^m (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{m-2}(\beta_i - 2\beta_{i+1} + \beta_{i+2})_+ \end{array} \]

This replaces the first difference term with an approximation to the second derivative at \(\beta_{i+1}\). In CVXR, the only change necessary is the penalty term, where the first difference diff(beta) is replaced by the second difference diff(diff(beta)).
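
As a quick numerical check that nesting diff reproduces the second-difference term in the formula, the sketch below works on plain numeric vectors in base R. The helper names second_diff and nearconvex_penalty and the test vectors are illustrative assumptions; pmax(x, 0) again stands in for the pos atom.

```r
# Second differences on plain vectors: diff(diff(b))[i] = b[i] - 2 b[i+1] + b[i+2]
second_diff <- function(b) diff(diff(b))

# Penalty term of the nearly-convex problem, evaluated for a fixed vector
nearconvex_penalty <- function(b) sum(pmax(second_diff(b), 0))

b <- c(3, 1, 4, 1, 5)                       # arbitrary numbers
i <- 1:(length(b) - 2)
manual <- b[i] - 2 * b[i + 1] + b[i + 2]    # the term inside (.)_+ in the formula
print(all.equal(second_diff(b), manual))

# The penalty only counts positive second differences
pen_linear    <- nearconvex_penalty(1:6)        # all second differences are 0
pen_square    <- nearconvex_penalty((1:6)^2)    # all second differences are 2
pen_negsquare <- nearconvex_penalty(-(1:6)^2)   # all second differences are -2
print(c(pen_linear, pen_square, pen_negsquare))
```

A sequence with no positive second differences pays nothing, so the penalty acts only on the stretches where the fitted curve bends in the penalized direction.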

nearconvex_fit <- function(y, lambda) {
    m <- length(y)
    beta <- Variable(m)
    obj <- 0.5 * sum((y - beta)^2) + lambda * sum(pos(diff(diff(beta))))
    prob <- Problem(Minimize(obj))
    solve(prob)$getValue(beta)
}

nearconvex_fit_stat <- function(data, index, lambda) {
    sample <- data[index,]                  # Bootstrap sample of rows
    sample <- sample[order(sample$year),]   # Order ascending by year
    nearconvex_fit(sample$annual, lambda)
}

set.seed(987)
boot.nearconvex <- boot(data = cdiac, statistic = nearconvex_fit_stat, R = 5, lambda = 0.44)

ci.nearconvex <- t(sapply(seq_len(nrow(cdiac)),
                          function(i) boot.ci(boot.out = boot.nearconvex, conf = 0.95,
                                              type = "norm", index = i)$normal[-1]))
data.nearconvex <- data.frame(year = cdiac$year, annual = cdiac$annual, est = boot.nearconvex$t0,
                              lower = ci.nearconvex[, 1], upper = ci.nearconvex[, 2])

The resulting curve for the near convex fit is depicted below with 95% confidence bands generated from \(R = 5\) samples. Note the jagged staircase pattern has been smoothed out.

(plot.nearconvex <- ggplot(data = data.nearconvex) +
     geom_point(mapping = aes(year, annual), color = "red") +
     geom_line(mapping = aes(year, est), color = "blue") +
     geom_ribbon(mapping = aes(x = year, ymin = lower, ymax = upper), alpha = 0.3) +
     labs(x = "Year", y = "Temperature Anomalies")
)

Notes

This example extends readily to higher-order differences or lags. The function diff takes an argument differences that is 1 by default, so a third-order difference is specified as diff(x, differences = 3).
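
The behaviour of these arguments is easy to inspect on an ordinary numeric vector. The sketch below uses base R's diff only; the vector x is made up for illustration, and whether the CVXR method treats lag exactly as base R does is assumed here rather than checked.

```r
x <- (1:6)^3    # 1, 8, 27, 64, 125, 216

# A third-order difference is the first difference applied three times
third <- diff(x, differences = 3)
print(identical(third, diff(diff(diff(x)))))

# Each higher order shortens the result by one element
print(length(x) - length(third))

# lag = 2 differences elements two positions apart: x[i + 2] - x[i]
lag2 <- diff(x, lag = 2)
print(lag2)
```

For a cubic sequence the third differences are constant, which makes the output easy to verify by hand.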

Session Info

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.1
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices datasets  utils     base     
## 
## other attached packages:
## [1] ggplot2_2.2.1 boot_1.3-20   CVXR_0.94-4  
## 
## loaded via a namespace (and not attached):
##  [1] gmp_0.5-13.1      Rcpp_0.12.13      compiler_3.4.2   
##  [4] plyr_1.8.4        R.methodsS3_1.7.1 R.utils_2.6.0    
##  [7] tools_3.4.2       digest_0.6.12     bit_1.1-12       
## [10] evaluate_0.10.1   tibble_1.3.4      gtable_0.2.0     
## [13] lattice_0.20-35   rlang_0.1.2       Matrix_1.2-11    
## [16] yaml_2.1.14       blogdown_0.1.7    Rmpfr_0.6-1      
## [19] ECOSolveR_0.3-2   stringr_1.2.0     knitr_1.17       
## [22] rprojroot_1.2     bit64_0.9-7       grid_3.4.2       
## [25] R6_2.2.2          rmarkdown_1.6     bookdown_0.5     
## [28] magrittr_1.5      backports_1.1.1   scales_0.5.0     
## [31] htmltools_0.3.6   scs_1.1-1         colorspace_1.3-2 
## [34] labeling_0.3      stringi_1.1.5     lazyeval_0.2.1   
## [37] munsell_0.4.3     R.oo_1.21.0

References

Tibshirani, R. J., H. Hoefling, and R. Tibshirani. 2011. “Nearly-Isotonic Regression.” Technometrics 53 (1): 54–61.
