6  ShapleyVIC for Continuous Outcomes

As introduced in the ShapleyVIC paper, this method can be applied to regression models beyond the logistic regression. This chapter provides a reproducible example to demonstrate its application for continuous outcomes using a simulated dataset from the AutoScore package, which is described in detail in the AutoScore Guidebook.

Specifically, as demonstrated in a recent clinical application, we use ShapleyVIC to analyse the importance of all candidate variables in the simulated dataset, exclude variables that have non-significant contribution to prediction, and apply the stepwise variable selection (starting with all significant contributors) to build sparse regression models for prediction.

Important
  • Although the outcome in this dataset is ordinal with 3 ordered categories, for demonstrative purpose we consider it as a continuous outcome in this chapter.
  • The previous chapter demonstrates how to analyse the same outcome as an ordinal variable.
  • In real-life clinical applications, users should take into consideration the nature of the outcome when choosing an analysis approach.

6.1 [R] Prepare data

This part of the workflow is implemented in R.

6.1.1 Load data

  • Load sample_data_ordinal from the AutoScore package.
  • Variable label is a simulated outcome label with 3 ordered categories.
  • Among the 20 predictor variables, Gender, Util_A and the 5 comorbidity variables (Comorb_A to Comorb_E) are categorical, and the rest are continuous.
library(AutoScore)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(purrr)
data("sample_data_ordinal")

head(sample_data_ordinal)
  label Age Gender Util_A Util_B Util_C    Util_D Comorb_A Comorb_B Comorb_C
1     1  63 FEMALE     P2      0   0.00 3.5933333        0        0        0
2     1  41 FEMALE     P2      0   0.96 3.6288889        0        0        0
3     1  86   MALE     P1      0   0.00 2.6502778        0        0        0
4     1  51   MALE     P2      0   0.00 4.9711111        0        0        0
5     1  23 FEMALE     P1      0   0.00 0.5352778        0        0        0
6     1  32 FEMALE     P2      0   4.13 4.4008333        0        0        0
  Comorb_D Comorb_E Lab_A Lab_B Lab_C Vital_A Vital_B Vital_C Vital_D Vital_E
1        0        0   117   3.9   136      91      19     100      70     152
2        1        0   500   3.6   114      91      16     100      70     147
3        0        0    72   4.1   136     100      18      99      65     126
4        0        0    67   5.0   122      73      17      97      46     100
5        0        0  1036   4.1   138      74      18      98      89     114
6        0        0   806   4.1   136      77      18      98      74     157
  Vital_F
1    25.7
2    22.6
3    25.7
4    24.9
5    25.7
6    25.3
dim(sample_data_ordinal)
[1] 20000    21
summary(sample_data_ordinal)
 label          Age            Gender            Util_A          Util_B       
 1:16360   Min.   : 18.00   FEMALE:10137   P1       : 3750   Min.   : 0.0000  
 2: 2449   1st Qu.: 50.00   MALE  : 9863   P2       :11307   1st Qu.: 0.0000  
 3: 1191   Median : 64.00                  P3 and P4: 4943   Median : 0.0000  
           Mean   : 61.68                                    Mean   : 0.9267  
           3rd Qu.: 76.00                                    3rd Qu.: 1.0000  
           Max.   :109.00                                    Max.   :42.0000  
     Util_C            Util_D         Comorb_A  Comorb_B  Comorb_C  Comorb_D 
 Min.   :  0.000   Min.   : 0.09806   0:18445   0:17401   0:19474   0:18113  
 1st Qu.:  0.000   1st Qu.: 1.52819   1: 1555   1: 2599   1:  526   1: 1887  
 Median :  0.600   Median : 2.46306                                          
 Mean   :  3.535   Mean   : 2.76030                                          
 3rd Qu.:  3.970   3rd Qu.: 3.61472                                          
 Max.   :293.680   Max.   :23.39056                                          
 Comorb_E      Lab_A            Lab_B           Lab_C          Vital_A      
 0:19690   Min.   :  16.0   Min.   :1.500   Min.   :102.0   Min.   :  0.00  
 1:  310   1st Qu.:  66.0   1st Qu.:3.800   1st Qu.:133.0   1st Qu.: 70.00  
           Median :  83.0   Median :4.100   Median :136.0   Median : 81.00  
           Mean   : 146.9   Mean   :4.155   Mean   :135.2   Mean   : 82.67  
           3rd Qu.: 115.0   3rd Qu.:4.400   3rd Qu.:138.0   3rd Qu.: 93.00  
           Max.   :3534.0   Max.   :8.800   Max.   :170.0   Max.   :197.00  
    Vital_B         Vital_C          Vital_D          Vital_E     
 Min.   : 1.00   Min.   :  0.00   Min.   :  5.00   Min.   :  0.0  
 1st Qu.:17.00   1st Qu.: 97.00   1st Qu.: 62.00   1st Qu.:116.0  
 Median :18.00   Median : 98.00   Median : 70.00   Median :131.0  
 Mean   :17.86   Mean   : 97.96   Mean   : 71.23   Mean   :133.5  
 3rd Qu.:18.00   3rd Qu.: 99.00   3rd Qu.: 79.00   3rd Qu.:148.0  
 Max.   :48.00   Max.   :100.00   Max.   :180.00   Max.   :262.0  
    Vital_F     
 Min.   : 2.30  
 1st Qu.:21.10  
 Median :23.00  
 Mean   :22.82  
 3rd Qu.:24.80  
 Max.   :44.30  

6.1.2 Prepare training, validation, and test datasets

  • Given large sample size (n=20000), split the data into training (70%), validation (10%) and test (20%) sets for regression model development.
  • For convenience, we reuse the stratified data split step from the previous chapter, but convert the outcome (“label”) to continuous.
set.seed(4)
out_split <- split_data(data = sample_data_ordinal, ratio = c(0.7, 0.1, 0.2), 
                        strat_by_label = TRUE)
train_set <- out_split$train_set
# For this demo, convert the outcome ("label") to continuous
train_set$label <- as.numeric(train_set$label)
dim(train_set)
[1] 14000    21
validation_set <- out_split$validation_set
# For this demo, convert the outcome ("label") to continuous
validation_set$label <- as.numeric(validation_set$label)
dim(validation_set)
[1] 2000   21
test_set <- out_split$test_set
# For this demo, convert the outcome ("label") to continuous
test_set$label <- as.numeric(test_set$label)
dim(test_set)
[1] 4000   21
  • Prepare cont_output for ShapleyVIC, using train_set as training set and validation_set as the explanation data.
Important
  • As detailed in Chapter 1, check for and handle data issues before applying ShapleyVIC. This demo uses data as-is because it is simulated clean data.
  • In this example the validation_set has 2000 samples, which is a reasonable sample size to be used as the explanation data. In cases with larger sample sizes, users should use a smaller subset as the explanation data (see Chapter 1 for detail).
output_dir <- "cont_output"
if (!dir.exists(output_dir)) dir.create(output_dir)
write.csv(train_set, file = file.path(output_dir, "train_set.csv"), 
          row.names = FALSE)
write.csv(validation_set, 
          file = file.path(output_dir, "validation_set.csv"), 
          row.names = FALSE)

6.2 [Python] Compute ShapleyVIC values

This part of the workflow is implemented in Python.

  • Load data and set up input information.
  • For continuous outcome, specify outcome_type='continuous'.
import os
import pandas as pd
output_dir = "cont_output"
dat_train = pd.read_csv(os.path.join(output_dir, 'train_set.csv'))
dat_expl = pd.read_csv(os.path.join(output_dir, 'validation_set.csv'))

y_name = 'label'
from ShapleyVIC import model
model_object = model.models(
    x=dat_train.drop(columns=[y_name]), y=dat_train[y_name], 
    x_names_cat=['Gender', 'Util_A', 'Comorb_A', 'Comorb_B', 'Comorb_C', 'Comorb_D', 'Comorb_E'],
    outcome_type='continuous', output_dir=output_dir
)
  • Set values for hyper-parameters u1 and u2.
u1, u2 = model_object.init_hyper_params()
(u1, u2)
(0.5, 77.5)
  • Draw 250 nearly optimal models from 500 initial samples.
model_object.draw_models(u1=u1, u2=u2, m=500, n_final=250, random_state=1234)
model_object.models_plot

  • Compute ShapleyVIC values.
from ShapleyVIC import compute
m_svic = compute.compute_shapley_vic(
    model_obj=model_object, 
    x_expl=dat_expl.drop(columns=[y_name]), y_expl=dat_expl[y_name], 
    n_cores=20, # running on a PC with 40 logical processors
    threshold=0.05
)
Note
  • For users’ reference, the command above took approximately 24 hours on a PC (Windows 10 Education; Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz 2.19GHz (2 processors); 128GB RAM).

6.3 [R] Develop prediction model

This part of the workflow is implemented in R.

6.3.1 Overall variable importance from ShapleyVIC

  • Compile ShapleyVIC output.
output_dir <- "cont_output"
library(ShapleyVIC)
model_object <- compile_shapley_vic(
  output_dir = output_dir, outcome_type = "continuous", 
  x_names_cat = c('Gender', 'Util_A', 'Comorb_A', 'Comorb_B', 'Comorb_C', 
                  'Comorb_D', 'Comorb_E')
)
Compiling results for continuous outcome using loss criterion to define neaerly optimal models.
  • Visualize ShapleyVIC values for overall variable importance.
model_plots <- plot(model_object)

6.3.2 ShapleyVIC-assisted backward selection

  • Identify variables with significant overall importance.
vars_svic <- rank_variables(model_object, summarise = TRUE, as_vector = TRUE) %>%
  names()
vars_svic
[1] "Comorb_A" "Age"      "Vital_A"  "Util_B"   "Lab_A"   
  • Starting with a model that include all variables above, develop a sparse regression model using AIC-based stepwise selection (implemented by the MASS package).
# Model with all ShapleyVIC-selected variables:
m_svic_all <- lm(label ~ ., data = train_set[, c("label", vars_svic)])
summary(m_svic_all)

Call:
lm(formula = label ~ ., data = train_set[, c("label", vars_svic)])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.59070 -0.24560 -0.17416 -0.06995  1.98002 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.7564288  0.0268424  28.180  < 2e-16 ***
Comorb_A1   0.4327596  0.0165608  26.131  < 2e-16 ***
Age         0.0036084  0.0002453  14.707  < 2e-16 ***
Vital_A     0.0021720  0.0002606   8.335  < 2e-16 ***
Util_B      0.0370323  0.0020596  17.980  < 2e-16 ***
Lab_A       0.0001014  0.0000225   4.507 6.63e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5258 on 13994 degrees of freedom
Multiple R-squared:  0.0861,    Adjusted R-squared:  0.08577 
F-statistic: 263.7 on 5 and 13994 DF,  p-value: < 2.2e-16
# Backward selection:
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
m_svic <- stepAIC(object = m_svic_all, scope = list(upper = ~ ., lower = ~ 1), 
                  trace = FALSE)
summary(m_svic)

Call:
lm(formula = label ~ Comorb_A + Age + Vital_A + Util_B + Lab_A, 
    data = train_set[, c("label", vars_svic)])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.59070 -0.24560 -0.17416 -0.06995  1.98002 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.7564288  0.0268424  28.180  < 2e-16 ***
Comorb_A1   0.4327596  0.0165608  26.131  < 2e-16 ***
Age         0.0036084  0.0002453  14.707  < 2e-16 ***
Vital_A     0.0021720  0.0002606   8.335  < 2e-16 ***
Util_B      0.0370323  0.0020596  17.980  < 2e-16 ***
Lab_A       0.0001014  0.0000225   4.507 6.63e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5258 on 13994 degrees of freedom
Multiple R-squared:  0.0861,    Adjusted R-squared:  0.08577 
F-statistic: 263.7 on 5 and 13994 DF,  p-value: < 2.2e-16
# Variables selected:
(x_svic <- setdiff(names(m_svic$model), "label"))
[1] "Comorb_A" "Age"      "Vital_A"  "Util_B"   "Lab_A"   

6.3.3 Compare with conventional backward selection

  • Backward selection from all candidate variables.
# Model with all variables:
m_all <- lm(label ~ ., data = train_set)
summary(m_all)

Call:
lm(formula = label ~ ., data = train_set)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.55915 -0.25176 -0.17028 -0.05685  2.12399 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.287e+00  1.977e-01   6.511 7.75e-11 ***
Age              3.596e-03  2.447e-04  14.695  < 2e-16 ***
GenderMALE      -1.563e-02  8.864e-03  -1.763   0.0779 .  
Util_AP2        -1.577e-05  1.178e-02  -0.001   0.9989    
Util_AP3 and P4 -6.534e-04  1.360e-02  -0.048   0.9617    
Util_B           3.718e-02  2.055e-03  18.094  < 2e-16 ***
Util_C           5.271e-05  5.093e-04   0.104   0.9176    
Util_D          -1.450e-02  2.579e-03  -5.623 1.91e-08 ***
Comorb_A1        4.301e-01  1.652e-02  26.037  < 2e-16 ***
Comorb_B1        8.332e-03  1.323e-02   0.630   0.5288    
Comorb_C1        3.536e-02  2.820e-02   1.254   0.2098    
Comorb_D1        3.266e-03  1.521e-02   0.215   0.8300    
Comorb_E1       -3.646e-02  3.598e-02  -1.013   0.3110    
Lab_A            9.818e-05  2.245e-05   4.374 1.23e-05 ***
Lab_B           -3.988e-03  6.498e-03  -0.614   0.5394    
Lab_C           -1.216e-03  9.222e-04  -1.318   0.1874    
Vital_A          2.174e-03  2.599e-04   8.362  < 2e-16 ***
Vital_B          1.323e-04  2.445e-03   0.054   0.9568    
Vital_C         -3.688e-04  1.364e-03  -0.270   0.7868    
Vital_D          7.918e-05  3.276e-04   0.242   0.8090    
Vital_E         -1.130e-03  1.747e-04  -6.467 1.04e-10 ***
Vital_F         -5.429e-03  1.255e-03  -4.327 1.52e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5242 on 13978 degrees of freedom
Multiple R-squared:  0.0927,    Adjusted R-squared:  0.09134 
F-statistic: 68.01 on 21 and 13978 DF,  p-value: < 2.2e-16
# Backward selection:
m_back <- stepAIC(object = m_all, scope = list(upper = ~ ., lower = ~ 1), 
                  trace = FALSE)
summary(m_back)

Call:
lm(formula = label ~ Age + Gender + Util_B + Util_D + Comorb_A + 
    Lab_A + Vital_A + Vital_E + Vital_F, data = train_set)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.55557 -0.25120 -0.17067 -0.05771  2.13009 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.080e+00  4.611e-02  23.419  < 2e-16 ***
Age          3.597e-03  2.446e-04  14.710  < 2e-16 ***
GenderMALE  -1.583e-02  8.859e-03  -1.787   0.0739 .  
Util_B       3.715e-02  2.053e-03  18.097  < 2e-16 ***
Util_D      -1.455e-02  2.578e-03  -5.644 1.69e-08 ***
Comorb_A1    4.302e-01  1.651e-02  26.062  < 2e-16 ***
Lab_A        9.902e-05  2.243e-05   4.415 1.02e-05 ***
Vital_A      2.180e-03  2.597e-04   8.392  < 2e-16 ***
Vital_E     -1.132e-03  1.746e-04  -6.482 9.37e-11 ***
Vital_F     -5.443e-03  1.254e-03  -4.341 1.43e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.524 on 13990 degrees of freedom
Multiple R-squared:  0.09236,   Adjusted R-squared:  0.09178 
F-statistic: 158.2 on 9 and 13990 DF,  p-value: < 2.2e-16
# Variables selected:
(x_back <- setdiff(names(m_back$model), "label"))
[1] "Age"      "Gender"   "Util_B"   "Util_D"   "Comorb_A" "Lab_A"    "Vital_A" 
[8] "Vital_E"  "Vital_F" 
  • ShapleyVIC-assisted backward selection developed a more parsimonious model (with 5 variables) than conventional backward selection (with 9 variables) without significantly impairing performance.
compute_mse <- function(y, y_pred) {
  mean((y - y_pred) ^ 2)
}
# Mean squared error (MSE) of the two models on test set:
c(ShapleyVIC_assisted_backward_selection = compute_mse(
  y = test_set$label, y_pred = predict(m_svic, newdata = test_set)),
  Conventional_backward_selection = compute_mse(
  y = test_set$label, y_pred = predict(m_back, newdata = test_set))
)
ShapleyVIC_assisted_backward_selection        Conventional_backward_selection 
                             0.2743280                              0.2722482