5  ShapleyVIC for Ordinal Outcomes

As introduced in the ShapleyVIC paper, the method can be applied to regression models beyond logistic regression. This chapter provides a reproducible example of its application to ordinal outcomes, using a simulated dataset with an ordinal outcome from the AutoScore package. The data is described in detail in the AutoScore Guidebook.

Specifically, as demonstrated in a recent clinical application, we use ShapleyVIC to analyse the importance of all candidate variables in the simulated dataset, exclude variables that do not contribute significantly to prediction, and apply stepwise variable selection (starting from all significant contributors) to build a sparse regression model for prediction.

5.1 [R] Prepare data

This part of the workflow is implemented in R.

5.1.1 Load data

  • Load sample_data_ordinal from the AutoScore package.
  • The variable label is a simulated outcome with 3 ordered categories.
  • Among the 20 predictor variables, Gender, Util_A and the 5 comorbidity variables (Comorb_A to Comorb_E) are categorical, and the rest are continuous.
library(AutoScore)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(purrr)
data("sample_data_ordinal")
head(sample_data_ordinal)
  label Age Gender Util_A Util_B Util_C    Util_D Comorb_A Comorb_B Comorb_C
1     1  63 FEMALE     P2      0   0.00 3.5933333        0        0        0
2     1  41 FEMALE     P2      0   0.96 3.6288889        0        0        0
3     1  86   MALE     P1      0   0.00 2.6502778        0        0        0
4     1  51   MALE     P2      0   0.00 4.9711111        0        0        0
5     1  23 FEMALE     P1      0   0.00 0.5352778        0        0        0
6     1  32 FEMALE     P2      0   4.13 4.4008333        0        0        0
  Comorb_D Comorb_E Lab_A Lab_B Lab_C Vital_A Vital_B Vital_C Vital_D Vital_E
1        0        0   117   3.9   136      91      19     100      70     152
2        1        0   500   3.6   114      91      16     100      70     147
3        0        0    72   4.1   136     100      18      99      65     126
4        0        0    67   5.0   122      73      17      97      46     100
5        0        0  1036   4.1   138      74      18      98      89     114
6        0        0   806   4.1   136      77      18      98      74     157
  Vital_F
1    25.7
2    22.6
3    25.7
4    24.9
5    25.7
6    25.3
dim(sample_data_ordinal)
[1] 20000    21
summary(sample_data_ordinal)
 label          Age            Gender            Util_A          Util_B       
 1:16360   Min.   : 18.00   FEMALE:10137   P1       : 3750   Min.   : 0.0000  
 2: 2449   1st Qu.: 50.00   MALE  : 9863   P2       :11307   1st Qu.: 0.0000  
 3: 1191   Median : 64.00                  P3 and P4: 4943   Median : 0.0000  
           Mean   : 61.68                                    Mean   : 0.9267  
           3rd Qu.: 76.00                                    3rd Qu.: 1.0000  
           Max.   :109.00                                    Max.   :42.0000  
     Util_C            Util_D         Comorb_A  Comorb_B  Comorb_C  Comorb_D 
 Min.   :  0.000   Min.   : 0.09806   0:18445   0:17401   0:19474   0:18113  
 1st Qu.:  0.000   1st Qu.: 1.52819   1: 1555   1: 2599   1:  526   1: 1887  
 Median :  0.600   Median : 2.46306                                          
 Mean   :  3.535   Mean   : 2.76030                                          
 3rd Qu.:  3.970   3rd Qu.: 3.61472                                          
 Max.   :293.680   Max.   :23.39056                                          
 Comorb_E      Lab_A            Lab_B           Lab_C          Vital_A      
 0:19690   Min.   :  16.0   Min.   :1.500   Min.   :102.0   Min.   :  0.00  
 1:  310   1st Qu.:  66.0   1st Qu.:3.800   1st Qu.:133.0   1st Qu.: 70.00  
           Median :  83.0   Median :4.100   Median :136.0   Median : 81.00  
           Mean   : 146.9   Mean   :4.155   Mean   :135.2   Mean   : 82.67  
           3rd Qu.: 115.0   3rd Qu.:4.400   3rd Qu.:138.0   3rd Qu.: 93.00  
           Max.   :3534.0   Max.   :8.800   Max.   :170.0   Max.   :197.00  
    Vital_B         Vital_C          Vital_D          Vital_E     
 Min.   : 1.00   Min.   :  0.00   Min.   :  5.00   Min.   :  0.0  
 1st Qu.:17.00   1st Qu.: 97.00   1st Qu.: 62.00   1st Qu.:116.0  
 Median :18.00   Median : 98.00   Median : 70.00   Median :131.0  
 Mean   :17.86   Mean   : 97.96   Mean   : 71.23   Mean   :133.5  
 3rd Qu.:18.00   3rd Qu.: 99.00   3rd Qu.: 79.00   3rd Qu.:148.0  
 Max.   :48.00   Max.   :100.00   Max.   :180.00   Max.   :262.0  
    Vital_F     
 Min.   : 2.30  
 1st Qu.:21.10  
 Median :23.00  
 Mean   :22.82  
 3rd Qu.:24.80  
 Max.   :44.30  
  • Recode the outcome labels to start from 0, which is required by ShapleyVIC.
sample_data_ordinal$label <- as.ordered(as.numeric(sample_data_ordinal$label) - 1)
table(sample_data_ordinal$label)

    0     1     2 
16360  2449  1191 

5.1.2 Prepare training, validation, and test datasets

  • Given the large sample size (n=20000), split the data into training (70%), validation (10%) and test (20%) sets for regression model development.
  • Stratify by the outcome variable (label) when splitting the data; a quick check of the resulting outcome distributions is sketched after the code below.
set.seed(4)
out_split <- split_data(data = sample_data_ordinal, ratio = c(0.7, 0.1, 0.2), 
                        strat_by_label = TRUE)
train_set <- out_split$train_set
dim(train_set)
[1] 14000    21
validation_set <- out_split$validation_set
dim(validation_set)
[1] 2000   21
test_set <- out_split$test_set
dim(test_set)
[1] 4000   21
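  • As a quick sanity check of the stratified split, compare the outcome distribution across the three sets; a minimal sketch using base R only:
sapply(list(train = train_set, validation = validation_set, test = test_set),
       function(d) prop.table(table(d$label)))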
  • Prepare the ord_output directory for ShapleyVIC, using train_set as the training set and validation_set as the explanation data.
Important
  • As detailed in Chapter 1, check for and handle data issues before applying ShapleyVIC. This demo uses the data as-is because it is clean simulated data; a minimal sketch of such checks follows this callout.
  • In this example the validation_set has 2000 samples, which is a reasonable sample size for the explanation data. With larger sample sizes, users should use a smaller subset as the explanation data (see Chapter 1 for details).
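  • A minimal sketch of such pre-checks (illustrative, not exhaustive; the commented subsampling line shows one way to downsize a much larger explanation set):
anyNA(train_set)                            # any missing values?
lapply(Filter(is.factor, train_set), table) # any rare factor categories?
# With a much larger validation set, a random subset could be used as the
# explanation data instead, e.g.:
# expl_set <- validation_set[sample(nrow(validation_set), 2000), ]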
output_dir <- "ord_output"
if (!dir.exists(output_dir)) dir.create(output_dir)
write.csv(train_set, file = file.path(output_dir, "train_set.csv"), 
          row.names = FALSE)
write.csv(validation_set, 
          file = file.path(output_dir, "validation_set.csv"), 
          row.names = FALSE)

5.2 [Python] Compute ShapleyVIC values

This part of the workflow is implemented in Python.

  • Load data and set up input information.
  • For an ordinal outcome, specify outcome_type='ordinal'.
import os
import pandas as pd
output_dir = "ord_output"
dat_train = pd.read_csv(os.path.join(output_dir, 'train_set.csv'))
dat_expl = pd.read_csv(os.path.join(output_dir, 'validation_set.csv'))

y_name = 'label'
from ShapleyVIC import model
model_object = model.models(
    x=dat_train.drop(columns=[y_name]), y=dat_train[y_name], 
    x_names_cat=['Gender', 'Util_A', 'Comorb_A', 'Comorb_B', 'Comorb_C', 'Comorb_D', 'Comorb_E'],
    outcome_type='ordinal', output_dir=output_dir
)
  • Set values for hyper-parameters u1 and u2.
u1, u2 = model_object.init_hyper_params()
(u1, u2)
(0.5, 43.75)
  • Draw 250 nearly optimal models from 500 initial samples.
model_object.draw_models(u1=u1, u2=u2, m=500, n_final=250, random_state=1234)
model_object.models_plot

  • Compute ShapleyVIC values.
from ShapleyVIC import compute
m_svic = compute.compute_shapley_vic(
    model_obj=model_object, 
    x_expl=dat_expl.drop(columns=[y_name]), y_expl=dat_expl[y_name], 
    n_cores=20, # running on a PC with 40 logical processors
    threshold=0.05
)
Note
  • For users’ reference, the command above took approximately 18 hours on a PC (Windows 10 Education; Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz (2 processors); 128GB RAM).

5.3 [R] Develop prediction model

This part of the workflow is implemented in R.

5.3.1 Overall variable importance from ShapleyVIC

  • Compile ShapleyVIC output.
output_dir <- "ord_output"
library(ShapleyVIC)
model_object <- compile_shapley_vic(
  output_dir = output_dir, outcome_type = "ordinal", 
  x_names_cat = c('Gender', 'Util_A', 'Comorb_A', 'Comorb_B', 'Comorb_C', 
                  'Comorb_D', 'Comorb_E')
)
Compiling results for ordinal outcome using loss criterion to define nearly optimal models.
  • Visualize ShapleyVIC values for overall variable importance.
model_plots <- plot(model_object)
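  • The plots returned above can be saved to file for reporting; a minimal sketch, assuming model_plots can be printed like a regular R plot object (the file name is illustrative):
pdf(file.path(output_dir, "shapleyvic_importance_plots.pdf"), width = 8, height = 6)
print(model_plots)
dev.off()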

5.3.2 ShapleyVIC-assisted backward selection

  • Identify variables with significant overall importance.
vars_svic <- rank_variables(model_object, summarise = TRUE, as_vector = TRUE) %>%
  names()
vars_svic
[1] "Comorb_A" "Age"      "Vital_A"  "Util_D"   "Util_B"   "Vital_E"  "Vital_B" 
  • Starting with a model that includes all variables above, develop a sparse regression model using AIC-based stepwise selection (implemented in the MASS package).
  • Use the ordinal package to develop ordinal regression models, specifically a cumulative link model (CLM) with the logit link; the parameterisation is illustrated after the first model fit below. See the ordinal package documentation for detailed usage.
# Model with all ShapleyVIC-selected variables:
library(ordinal)

Attaching package: 'ordinal'
The following object is masked from 'package:dplyr':

    slice
m_svic_all <- clm(label ~ ., data = train_set[, c("label", vars_svic)])
summary(m_svic_all)
formula: label ~ Comorb_A + Age + Vital_A + Util_D + Util_B + Vital_E + Vital_B
data:    train_set[, c("label", vars_svic)]

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7726.66 15471.32 5(0)  7.24e-07 1.0e+07

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
Comorb_A1  1.5184021  0.0659862  23.011  < 2e-16 ***
Age        0.0195374  0.0013333  14.654  < 2e-16 ***
Vital_A    0.0105223  0.0012776   8.236  < 2e-16 ***
Util_D    -0.0768266  0.0140619  -5.463 4.67e-08 ***
Util_B     0.1349605  0.0087143  15.487  < 2e-16 ***
Vital_E   -0.0062609  0.0009181  -6.819 9.14e-12 ***
Vital_B   -0.0039738  0.0127336  -0.312    0.755    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1   2.8254     0.2961   9.543
1|2   4.1694     0.2980  13.993
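  • To make the coefficient signs and the threshold coefficients above concrete, the cumulative probabilities for one observation can be reconstructed by hand; a minimal sketch, assuming the usual clm parameterisation logit(P(label <= j | x)) = theta_j - x'beta, with thresholds stored in m_svic_all$alpha:
# Linear predictor for the first training observation, built with the same
# model.matrix construction used for the test-set scores later on:
x1 <- model.matrix(~ ., data = train_set[1, vars_svic])[, -1, drop = FALSE]
eta1 <- drop(x1 %*% m_svic_all$beta)
# Cumulative probabilities P(label <= 0) and P(label <= 1); P(label <= 2) = 1:
plogis(m_svic_all$alpha - eta1)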
# Backward selection:
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
m_svic <- stepAIC(object = m_svic_all, scope = list(upper = ~ ., lower = ~ 1), 
                  trace = FALSE)
summary(m_svic)
formula: label ~ Comorb_A + Age + Vital_A + Util_D + Util_B + Vital_E
data:    train_set[, c("label", vars_svic)]

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7726.71 15469.42 5(0)  7.24e-07 4.1e+06

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
Comorb_A1  1.5186641  0.0659814  23.017  < 2e-16 ***
Age        0.0195405  0.0013332  14.657  < 2e-16 ***
Vital_A    0.0105274  0.0012775   8.240  < 2e-16 ***
Util_D    -0.0768318  0.0140616  -5.464 4.66e-08 ***
Util_B     0.1349614  0.0087136  15.489  < 2e-16 ***
Vital_E   -0.0062629  0.0009181  -6.822 8.99e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1   2.8967     0.1885   15.37
1|2   4.2407     0.1914   22.15
# Variables selected:
(x_svic <- setdiff(names(m_svic$model), "label"))
[1] "Comorb_A" "Age"      "Vital_A"  "Util_D"   "Util_B"   "Vital_E" 
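  • Predicted class probabilities on the test set can also be obtained directly from the fitted CLM; a minimal sketch, assuming predict() for clm objects returns per-class probabilities in its fit component when the outcome column is omitted from newdata:
# Per-class predicted probabilities from the ShapleyVIC-assisted model:
pred_svic <- predict(m_svic, newdata = test_set[, x_svic], type = "prob")
head(pred_svic$fit)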

5.3.3 Compare with conventional backward selection

  • Backward selection from all candidate variables.
# Model with all variables:
m_all <- clm(formula = label ~ ., data = train_set)
summary(m_all)
formula: 
label ~ Age + Gender + Util_A + Util_B + Util_C + Util_D + Comorb_A + Comorb_B + Comorb_C + Comorb_D + Comorb_E + Lab_A + Lab_B + Lab_C + Vital_A + Vital_B + Vital_C + Vital_D + Vital_E + Vital_F
data:    train_set

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7706.19 15458.38 5(0)  8.70e-07 4.2e+08

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
Age              0.0196646  0.0013361  14.718  < 2e-16 ***
GenderMALE      -0.0890106  0.0452970  -1.965 0.049409 *  
Util_AP2         0.0036111  0.0601693   0.060 0.952144    
Util_AP3 and P4 -0.0231994  0.0696219  -0.333 0.738969    
Util_B           0.1360881  0.0087351  15.579  < 2e-16 ***
Util_C          -0.0003512  0.0024950  -0.141 0.888066    
Util_D          -0.0769431  0.0140759  -5.466 4.60e-08 ***
Comorb_A1        1.5267179  0.0661465  23.081  < 2e-16 ***
Comorb_B1        0.0322990  0.0664355   0.486 0.626846    
Comorb_C1        0.2021264  0.1351520   1.496 0.134771    
Comorb_D1        0.0261648  0.0771122   0.339 0.734377    
Comorb_E1       -0.1684798  0.1917073  -0.879 0.379489    
Lab_A            0.0004328  0.0001057   4.092 4.27e-05 ***
Lab_B           -0.0069865  0.0332327  -0.210 0.833489    
Lab_C           -0.0067104  0.0046660  -1.438 0.150389    
Vital_A          0.0106528  0.0012804   8.320  < 2e-16 ***
Vital_B         -0.0040444  0.0127466  -0.317 0.751019    
Vital_C         -0.0013990  0.0069806  -0.200 0.841155    
Vital_D          0.0006639  0.0016693   0.398 0.690843    
Vital_E         -0.0062020  0.0009198  -6.743 1.55e-11 ***
Vital_F         -0.0243927  0.0063573  -3.837 0.000125 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1    1.299      1.010   1.287
1|2    2.647      1.010   2.620
# Backward selection:
m_back <- stepAIC(object = m_all, scope = list(upper = ~ ., lower = ~ 1), 
                  trace = FALSE)
summary(m_back)
formula: 
label ~ Age + Gender + Util_B + Util_D + Comorb_A + Comorb_C + Lab_A + Vital_A + Vital_E + Vital_F
data:    train_set

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7708.06 15440.13 5(0)  8.70e-07 1.8e+07

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
Age         0.0196518  0.0013353  14.717  < 2e-16 ***
GenderMALE -0.0906919  0.0452646  -2.004  0.04511 *  
Util_B      0.1358021  0.0087232  15.568  < 2e-16 ***
Util_D     -0.0771132  0.0140754  -5.479 4.29e-08 ***
Comorb_A1   1.5262185  0.0660905  23.093  < 2e-16 ***
Comorb_C1   0.2026533  0.1350978   1.500  0.13360    
Lab_A       0.0004345  0.0001057   4.111 3.94e-05 ***
Vital_A     0.0106409  0.0012794   8.317  < 2e-16 ***
Vital_E    -0.0062111  0.0009193  -6.756 1.42e-11 ***
Vital_F    -0.0243158  0.0063537  -3.827  0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1   2.3946     0.2376   10.08
1|2   3.7417     0.2397   15.61
# Variables selected:
(x_back <- setdiff(names(m_back$model), "label"))
 [1] "Age"      "Gender"   "Util_B"   "Util_D"   "Comorb_A" "Comorb_C"
 [7] "Lab_A"    "Vital_A"  "Vital_E"  "Vital_F" 
  • ShapleyVIC-assisted backward selection produced a more parsimonious model (6 variables) than conventional backward selection (10 variables) without significantly impairing performance, as suggested by the overlapping bootstrap confidence intervals below; a compact side-by-side summary is sketched after the performance output.
# Performance of model from ShapleyVIC-assisted backward selection on test set:
fx_svic <- model.matrix(~ ., data = test_set[, x_svic])[, -1] %*% m_svic$beta
print_performance_ordinal(label = test_set$label, score = as.numeric(fx_svic), 
                          n_boot = 100, report_cindex = TRUE)
mAUC: 0.7291     95% CI: 0.7062-0.7471 (from 100 bootstrap samples)
Generalised c-index: 0.6990      95% CI: 0.6803-0.7146 (from 100 bootstrap samples)
# Performance of model from conventional backward selection on test set:
fx_back <- model.matrix(~ ., data = test_set[, x_back])[, -1] %*% m_back$beta
print_performance_ordinal(label = test_set$label, score = as.numeric(fx_back), 
                          n_boot = 100, report_cindex = TRUE)
mAUC: 0.7351     95% CI: 0.7165-0.7559 (from 100 bootstrap samples)
Generalised c-index: 0.7033      95% CI: 0.6881-0.7197 (from 100 bootstrap samples)
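  • The trade-off between the two final models can be summarised side by side; a minimal sketch using the fitted model objects (AIC values refer to the training data):
# Number of selected variables and training-data AIC of the two final models:
data.frame(
  model = c("ShapleyVIC-assisted", "Conventional backward"),
  n_variables = c(length(x_svic), length(x_back)),
  AIC = c(AIC(m_svic), AIC(m_back))
)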