Using a library of base kernels, **CVEK** learns the generating function from data by directly minimizing the ensemble model’s error, and tests whether the data is generated by the RKHS under the null hypothesis. Part I presents a simple example to conduct Gaussian process regression and hypothesis testing using *cvek* function on simulated data. Part II shows a real-world application where we use **CVEK** to understand whether the per capita crime rate impacts the relationship between the local socioeconomic status and the housing price at Boston, MA, U.S.A. Finally, Part III provides code showing how to visualize the interaction effect from a *cvek* model.

Download the package from CRAN or GitHub and then install and load it.

We generate a simulated dataset using the \("linear"\) kernel, and set the relative interaction strength to be \(0.2\). The outcome \(y_i\) is generated as, \[\begin{align*}
y_i=h_1(\mathbf{x}_{i, 1})+h_2(\mathbf{x}_{i, 2})+0.2 * h_{12}(\mathbf{x}_{i, 1}, \mathbf{x}_{i, 2})+\epsilon_i,
\end{align*}\] where \(h_1\), \(h_2\), \(h_{12}\) are sampled from RKHSs \(\textit{H}_1\), \(\textit{H}_2\), \(\textit{H}_{12}\), generated using the corresponding *linear* kernel. We standardize all sampled functions to have unit form, so that \(0.2\) represents the strength of interaction relative to the main effect.

```
set.seed(0726)
n <- 60 # including training and test
d <- 4
int_effect <- 0.2
data <- matrix(rnorm(n * d), ncol = d)
Z1 <- data[, 1:2]
Z2 <- data[, 3:4]
kern <- generate_kernel(method = "linear")
w <- rnorm(n)
w12 <- rnorm(n)
K1 <- kern(Z1, Z1)
K2 <- kern(Z2, Z2)
K1 <- K1 / sum(diag(K1)) # standardize kernel
K2 <- K2 / sum(diag(K2))
h0 <- K1 %*% w + K2 %*% w
h0 <- h0 / sqrt(sum(h0 ^ 2)) # standardize main effect
h1_prime <- (K1 * K2) %*% w12 # interaction effect
# standardize sampled functions to have unit norm, so that 0.2
# represents the interaction strength relative to main effect
Ks <- svd(K1 + K2)
len <- length(Ks$d[Ks$d / sum(Ks$d) > .001])
U0 <- Ks$u[, 1:len]
h1_prime_hat <- fitted(lm(h1_prime ~ U0))
h1 <- h1_prime - h1_prime_hat
h1 <- h1 / sqrt(sum(h1 ^ 2)) # standardize interaction effect
Y <- h0 + int_effect * h1 + rnorm(1) + rnorm(n, 0, 0.01)
data <- as.data.frame(cbind(Y, Z1, Z2))
colnames(data) <- c("y", paste0("z", 1:d))
data_train <- data[1:40, ]
data_test <- data[41:60, ]
```

The resulting data look as follows.

y | z1 | z2 | z3 | z4 |
---|---|---|---|---|

1.206494 | -0.3540220 | -0.8477765 | -1.9983111 | 1.3628378 |

1.595668 | -1.3521966 | 0.9002141 | 1.0221309 | -0.7187911 |

1.469933 | 0.5276426 | -0.8567797 | 0.0372108 | 0.4386086 |

1.593581 | -1.0576862 | 0.7018883 | 0.9085519 | -0.8034505 |

1.363133 | 0.9926991 | 0.7144279 | -0.9475966 | -0.2037185 |

Now we can apply the *cvek* function to conduct Gaussian process regression. Below table is a detailed list of all the arguments of the function *cvek*.

Suppose we want our kernel library to contain three kernels: \("linear"\), \("polynomial"\) with \(p=2\), and \("rbf"\) with \(l=1\) (the effective parameter for \("polynomial"\) is \(p\) and the effective parameter for \("rbf"\) is \(l\), so we can set anything to \(l\) for \("polynomial"\) kernel and \(p\) for \("rbf"\) kernel). We then first apply *define_library*.

```
kern_par <- data.frame(method = c("linear", "polynomial", "rbf"),
l = rep(1, 3), p = 1:3, stringsAsFactors = FALSE)
# define kernel library
kern_func_list <- define_library(kern_par)
```

The null model is then \(y \sim z1 + z2 + k(z3, z4)\).

With all these parameters specified, we can conduct Gaussian process regression.

```
est_res <- cvek(formula, kern_func_list = kern_func_list,
data = data_train)
est_res$lambda
#> [1] 4.539993e-05
est_res$u_hat
#> [1] 0.994864707 0.000000000 0.005135293
```

We can see that the ensemble weight assigns \(0.99\) to the \("linear"\) kernel, which is the true kernel. This illustrates the accuracy and efficiency of the CVEK method.

We next specify the testing procedure. Note that we can use the same function *cvek* to perform hypothesis testing, as we did for estimation, but we need to provide \(formula\_test\), which is the user-supplied formula indicating the additional alternative effect (e.g., interactions) to test for. Specifically, we will first show how to conduct the classic score test by specifying \(test="asymp"\), followed by a bootstrap test where we specify \(test="boot"\), and the number of bootstrap samples \(B=200\).

```
formula_test <- y ~ k(z1, z2):k(z3, z4)
cvek(formula, kern_func_list = kern_func_list,
data = data_train, formula_test = formula_test,
mode = "loocv", strategy = "stack",
beta_exp = 1, lambda = exp(seq(-10, 5)),
test = "asymp", alt_kernel_type = "ensemble",
verbose = FALSE)$pvalue
#> [,1]
#> [1,] 1.493613e-08
cvek(formula, kern_func_list = kern_func_list,
data = data_train, formula_test = formula_test,
mode = "loocv", strategy = "stack",
beta_exp = 1, lambda = exp(seq(-10, 5)),
test = "boot", alt_kernel_type = "ensemble",
B = 200, verbose = FALSE)$pvalue
#> [1] 0
```

Both tests come to the same conclusion. At the significance level \(0.05\), we reject the null hypothesis that there’s no interaction effect, which matches our data generation mechanism. Additionally, we can prediction new outcomes based on estimation result *est_res*.

y_pred | y | z1 | z2 | z3 | z4 | |
---|---|---|---|---|---|---|

41 | 1.459658 | 1.455218 | 0.4551541 | 0.8838175 | -0.1064603 | -0.7294840 |

42 | 1.522598 | 1.592729 | -0.3551146 | -0.7374137 | -1.0941214 | -2.1262211 |

43 | 1.499522 | 1.519744 | -2.9798328 | -1.0640729 | -0.2713319 | -0.2268063 |

44 | 1.493907 | 1.517606 | 0.0145789 | -0.2485406 | -0.4154637 | -0.9278958 |

45 | 1.486986 | 1.530861 | -0.2603291 | -1.0785118 | -0.8478335 | -0.8378207 |

In this part, we show an example of using *cvek* test to detect nonlinear interactions between socioeconomic factors that contribute to housing price in the city of Boston, Massachusetts, USA. We consider the *Boston* dataset (available in the **MASS** package), which is collected by the U.S Census Service about the median housing price (\(medv\)) in Boston, along with additional variables describing local socioeconomic information such as per capita crime rate, proportion of non-retail business, number of rooms per household, etc. Below table lists the \(14\) variables.

Here we use *cvek* to study whether the per capita crime rate (\(crim\)) impacts the relationship between the local socioeconomic status (\(lstat\)) and the housing price. The null model is, \[\begin{align*}
medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat),
\end{align*}\] where \(\mathbf{x}^\top=(1, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black)\), and \(k()\) is specified as a semi-parametric model with a model library that includes *linear* and *rbf* kernels with \(l=1\). This inclusion of nonlinearity (i.e., the *rbf* kernel) is important, since per classic results in the macroeconmics literature, the crime rates and socioeconomic status of local community are known to have nonlinear association with the local housing price harrison_hedonic_1978.

```
kern_par <- data.frame(method = c("linear", "rbf"),
l = rep(1, 2), p = 1:2, stringsAsFactors = FALSE)
# define kernel library
kern_func_list <- define_library(kern_par)
```

To this end, the hypothesis regarding whether the crime rate (\(crim\)) impacts the association between local socioeconomic status (\(lstat\)) and the housing price (\(medv\)) is equivalent to testing whether there exists a nonlinear interaction between \(crim\) and \(lstat\) in predicting \(medv\), i.e., \[\begin{align*} \mathcal{H}_0: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat), \\ \mathcal{H}_a: &\; medv \sim \mathbf{x}^\top \boldsymbol{\beta} + k(crim) + k(lstat) + k(crim):k(lstat). \end{align*}\]

To test this hypothesis using *cvek*, we specify the null model using *formula*, and specify the additional interaction term (\(k(crim):k(lstat)\)) in the alternative model using *formula_test*, as shown below:

```
formula <- medv ~ zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + k(crim) + k(lstat)
formula_test <- medv ~ k(crim):k(lstat)
fit_bos <- cvek(formula, kern_func_list = kern_func_list, data = Boston,
formula_test = formula_test,
lambda = exp(seq(-3, 5)), test = "asymp")
```

Given the fitted object (*fit_bos*), the p-value of the *cvek* test can be extracted as below:

Since \(p<0.05\), we reject the null hypothesis that there’s no \(crim:lstat\) interaction, and conclude that the data does suggest an impact of the crime rate on the relationship between the local socioeconomic status and the housing price.

A versatile and sometimes the most intepretable method for understanding interaction effects is via plotting. Next we show the *Boston* example of how to visualize the fitted interaction from a *cvek* model. We visualize the interaction effects by creating five datasets: Fix all confounding variables to their means, vary \(lstat\) in a reasonable range (i.e., from \(12.5\) to \(17.5\), since the original range of \(lstat\) in *Boston* dataset is \((1.73, 37.97)\)), and respectively set \(crim\) value to its \(5\%, 25\%, 50\%, 75\%\) and \(95\%\) quantiles.

```
# first fit the alternative model
formula_alt <- medv ~ zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + k(crim):k(lstat)
fit_bos_alt <- cvek(formula = formula_alt, kern_func_list = kern_func_list,
data = Boston, lambda = exp(seq(-3, 5)))
# mean-center all confounding variables not involved in the interaction
# so that the predicted values are more easily interpreted
pred_name <- c("zn", "indus", "chas", "nox", "rm", "age",
"dis", "rad", "tax", "ptratio", "black")
covar_mean <- apply(Boston, 2, mean)
pred_cov <- covar_mean[pred_name]
pred_cov_df <- t(as.data.frame(pred_cov))
lstat_list <- seq(12.5, 17.5, length.out = 100)
crim_quantiles <- quantile(Boston$crim, probs = c(.05, .25, .5, .75, .95))
# crim is set to its 5% quantile
data_test1 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[1])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[1]): row names were found from a short variable and have been
#> discarded
data_test1_pred <- predict(fit_bos_alt, data_test1)
# crim is set to its 25% quantile
data_test2 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[2])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[2]): row names were found from a short variable and have been
#> discarded
data_test2_pred <- predict(fit_bos_alt, data_test2)
# crim is set to its 50% quantile
data_test3 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[3])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[3]): row names were found from a short variable and have been
#> discarded
data_test3_pred <- predict(fit_bos_alt, data_test3)
# crim is set to its 75% quantile
data_test4 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[4])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[4]): row names were found from a short variable and have been
#> discarded
data_test4_pred <- predict(fit_bos_alt, data_test4)
# crim is set to its 95% quantile
data_test5 <- data.frame(pred_cov_df, lstat = lstat_list,
crim = crim_quantiles[5])
#> Warning in data.frame(pred_cov_df, lstat = lstat_list, crim =
#> crim_quantiles[5]): row names were found from a short variable and have been
#> discarded
data_test5_pred <- predict(fit_bos_alt, data_test5)
# combine five sets of prediction data together
medv <- rbind(data_test1_pred, data_test2_pred, data_test3_pred,
data_test4_pred, data_test5_pred)
data_pred <- data.frame(lstat = rep(lstat_list, 5), medv = medv,
crim = rep(c("5% quantile", "25% quantile",
"50% quantile", "75% quantile",
"95% quantile"), each = 100))
data_pred$crim <- factor(data_pred$crim,
levels = c("5% quantile", "25% quantile",
"50% quantile", "75% quantile",
"95% quantile"))
data_label <- data_pred[which(data_pred$lstat == 17.5), ]
data_label$value <- c("0.028%", "0.082%", "0.257%", "3.677%", "15.789%")
data_label$value <- factor(data_label$value, levels =
c("0.028%", "0.082%", "0.257%",
"3.677%", "15.789%"))
ggplot(data = data_pred, aes(x = lstat, y = medv, color = crim)) +
geom_point(size = 0.1) +
geom_text_repel(aes(label = value), data = data_label,
color = "black", size = 3.6) +
scale_colour_manual(values = c("firebrick1", "chocolate2",
"darkolivegreen3", "skyblue2",
"purple2")) +
geom_line() + theme_set(theme_bw()) +
theme(panel.grid = element_blank(),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
legend.title = element_text(size = 12, face = "bold"),
legend.text = element_text(size = 12)) +
labs(x = "percentage of lower status",
y = "median value of owner-occupied homes ($1000)",
col = "per capita crime rate")
```