As introduced in the ShapleyVIC paper, this method can be applied to regression models beyond logistic regression. This chapter provides a reproducible example demonstrating its application to continuous outcomes, using a simulated dataset from the AutoScore package that is described in detail in the AutoScore Guidebook.
Specifically, as demonstrated in a recent clinical application, we use ShapleyVIC to analyse the importance of all candidate variables in the simulated dataset, exclude variables with non-significant contributions to prediction, and apply stepwise variable selection (starting from all significant contributors) to build sparse regression models for prediction.
Important
Although the outcome in this dataset is ordinal with 3 ordered categories, for demonstration purposes we treat it as a continuous outcome in this chapter.
The previous chapter demonstrates how to analyse the same outcome as an ordinal variable.
In real-life clinical applications, users should take the nature of the outcome into account when choosing an analysis approach.
6.1 [R] Prepare data
This part of the workflow is implemented in R.
6.1.1 Load data
Load sample_data_ordinal from the AutoScore package.
The variable label is a simulated outcome with 3 ordered categories.
Among the 20 predictor variables, Gender, Util_A and the 5 comorbidity variables (Comorb_A to Comorb_E) are categorical, and the rest are continuous.
library(AutoScore)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
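The variable summary below can be reproduced along these lines (a minimal sketch; the dataset ships with the AutoScore package and is assumed to be loadable via data()):

data("sample_data_ordinal", package = "AutoScore")
summary(sample_data_ordinal)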
label Age Gender Util_A Util_B
1:16360 Min. : 18.00 FEMALE:10137 P1 : 3750 Min. : 0.0000
2: 2449 1st Qu.: 50.00 MALE : 9863 P2 :11307 1st Qu.: 0.0000
3: 1191 Median : 64.00 P3 and P4: 4943 Median : 0.0000
Mean : 61.68 Mean : 0.9267
3rd Qu.: 76.00 3rd Qu.: 1.0000
Max. :109.00 Max. :42.0000
Util_C Util_D Comorb_A Comorb_B Comorb_C Comorb_D
Min. : 0.000 Min. : 0.09806 0:18445 0:17401 0:19474 0:18113
1st Qu.: 0.000 1st Qu.: 1.52819 1: 1555 1: 2599 1: 526 1: 1887
Median : 0.600 Median : 2.46306
Mean : 3.535 Mean : 2.76030
3rd Qu.: 3.970 3rd Qu.: 3.61472
Max. :293.680 Max. :23.39056
Comorb_E Lab_A Lab_B Lab_C Vital_A
0:19690 Min. : 16.0 Min. :1.500 Min. :102.0 Min. : 0.00
1: 310 1st Qu.: 66.0 1st Qu.:3.800 1st Qu.:133.0 1st Qu.: 70.00
Median : 83.0 Median :4.100 Median :136.0 Median : 81.00
Mean : 146.9 Mean :4.155 Mean :135.2 Mean : 82.67
3rd Qu.: 115.0 3rd Qu.:4.400 3rd Qu.:138.0 3rd Qu.: 93.00
Max. :3534.0 Max. :8.800 Max. :170.0 Max. :197.00
Vital_B Vital_C Vital_D Vital_E
Min. : 1.00 Min. : 0.00 Min. : 5.00 Min. : 0.0
1st Qu.:17.00 1st Qu.: 97.00 1st Qu.: 62.00 1st Qu.:116.0
Median :18.00 Median : 98.00 Median : 70.00 Median :131.0
Mean :17.86 Mean : 97.96 Mean : 71.23 Mean :133.5
3rd Qu.:18.00 3rd Qu.: 99.00 3rd Qu.: 79.00 3rd Qu.:148.0
Max. :48.00 Max. :100.00 Max. :180.00 Max. :262.0
Vital_F
Min. : 2.30
1st Qu.:21.10
Median :23.00
Mean :22.82
3rd Qu.:24.80
Max. :44.30
6.1.2 Prepare training, validation, and test datasets
Given the large sample size (n = 20,000), split the data into training (70%), validation (10%), and test (20%) sets for regression model development.
For convenience, we reuse the stratified data split step from the previous chapter, but convert the outcome (“label”) to continuous.
set.seed(4)
out_split <- split_data(
  data = sample_data_ordinal, ratio = c(0.7, 0.1, 0.2),
  strat_by_label = TRUE
)
train_set <- out_split$train_set
# For this demo, convert the outcome ("label") to continuous
train_set$label <- as.numeric(train_set$label)
dim(train_set)
[1] 14000 21
validation_set <- out_split$validation_set
# For this demo, convert the outcome ("label") to continuous
validation_set$label <- as.numeric(validation_set$label)
dim(validation_set)
[1] 2000 21
test_set <- out_split$test_set
# For this demo, convert the outcome ("label") to continuous
test_set$label <- as.numeric(test_set$label)
dim(test_set)
[1] 4000 21
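Since the split was stratified by the outcome, the three sets should have similar outcome distributions. An optional sanity check (a sketch; output not shown):

# Optional check: the stratified split should yield similar
# proportions of outcome values 1, 2, 3 across the three sets
round(rbind(
  train      = prop.table(table(train_set$label)),
  validation = prop.table(table(validation_set$label)),
  test       = prop.table(table(test_set$label))
), 3)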
Prepare cont_output for ShapleyVIC, using train_set as the training set and validation_set as the explanation data.
Important
As detailed in Chapter 1, check for and handle data issues before applying ShapleyVIC. This demo uses the data as-is because they are simulated and free of such issues.
In this example, validation_set has 2000 samples, a reasonable size for the explanation data. When sample sizes are larger, users should use a smaller subset as the explanation data (see Chapter 1 for details).
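For instance, if the validation set were much larger, a random subset could serve as the explanation data instead. A hypothetical sketch (the subset size and seed below are illustrative):

# Hypothetical: if validation_set were much larger, draw a random
# subset of, say, 2000 rows to use as the explanation data
set.seed(1234)
expl_subset <- validation_set[sample(nrow(validation_set), 2000), ]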
from ShapleyVIC import compute

m_svic = compute.compute_shapley_vic(
    model_obj=model_object,
    x_expl=dat_expl.drop(columns=[y_name]),
    y_expl=dat_expl[y_name],
    n_cores=20,  # running on a PC with 40 logical processors
    threshold=0.05
)
Note
For users’ reference, the command above took approximately 24 hours on a PC (Windows 10 Education; Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz 2.19GHz (2 processors); 128GB RAM).
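In this example, ShapleyVIC identified five variables as significant contributors. The model-building code below assumes they have been collected into a character vector; a sketch, with variable names taken from the fitted model shown below:

# Sketch: variables with significant ShapleyVIC importance,
# collected for model building (names match the model below)
vars_svic <- c("Comorb_A", "Age", "Vital_A", "Util_B", "Lab_A")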
Starting with a model that includes all the variables above, develop a sparse regression model using AIC-based stepwise selection (implemented in the MASS package).
# Model with all ShapleyVIC-selected variables:
m_svic_all <- lm(label ~ ., data = train_set[, c("label", vars_svic)])
summary(m_svic_all)
Call:
lm(formula = label ~ ., data = train_set[, c("label", vars_svic)])
Residuals:
Min 1Q Median 3Q Max
-1.59070 -0.24560 -0.17416 -0.06995 1.98002
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7564288 0.0268424 28.180 < 2e-16 ***
Comorb_A1 0.4327596 0.0165608 26.131 < 2e-16 ***
Age 0.0036084 0.0002453 14.707 < 2e-16 ***
Vital_A 0.0021720 0.0002606 8.335 < 2e-16 ***
Util_B 0.0370323 0.0020596 17.980 < 2e-16 ***
Lab_A 0.0001014 0.0000225 4.507 6.63e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5258 on 13994 degrees of freedom
Multiple R-squared: 0.0861, Adjusted R-squared: 0.08577
F-statistic: 263.7 on 5 and 13994 DF, p-value: < 2.2e-16
# Backward selection:
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
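The two backward selections compared below can be run with MASS::stepAIC. A minimal sketch, assuming m_back denotes conventional backward selection starting from all 20 candidate variables:

# ShapleyVIC-assisted: backward selection starting from the
# significant contributors only (m_svic_all fitted above)
m_svic <- stepAIC(m_svic_all, direction = "backward", trace = FALSE)
# Conventional: backward selection starting from all candidates
m_back <- stepAIC(lm(label ~ ., data = train_set),
                  direction = "backward", trace = FALSE)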
ShapleyVIC-assisted backward selection yielded a more parsimonious model (with 5 variables) than conventional backward selection (with 9 variables) without significantly impairing performance.
compute_mse <- function(y, y_pred) {
  mean((y - y_pred) ^ 2)
}
# Mean squared error (MSE) of the two models on test set:
c(
  ShapleyVIC_assisted_backward_selection = compute_mse(
    y = test_set$label, y_pred = predict(m_svic, newdata = test_set)
  ),
  Conventional_backward_selection = compute_mse(
    y = test_set$label, y_pred = predict(m_back, newdata = test_set)
  )
)