5  ShapleyVIC for Ordinal Outcomes

As introduced in the ShapleyVIC paper, the method can be applied to regression models beyond logistic regression. This chapter provides a reproducible example of its application to ordinal outcomes, using a simulated dataset with an ordinal outcome from the AutoScore package. The data is described in detail in the AutoScore Guidebook.

Specifically, as demonstrated in a recent clinical application, we use ShapleyVIC to analyse the importance of all candidate variables in the simulated dataset, exclude variables that do not contribute significantly to prediction, and apply stepwise variable selection (starting from all significant contributors) to build a sparse regression model for prediction.

5.1 [R] Prepare data

This part of the workflow is implemented in R.

5.1.1 Load data

  • Load sample_data_ordinal from the AutoScore package.
  • The variable label is a simulated outcome with 3 ordered categories.
  • Among the 20 predictor variables, Gender, Util_A and the 5 comorbidity variables (Comorb_A to Comorb_E) are categorical, and the rest are continuous.
library(AutoScore)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(purrr)
data("sample_data_ordinal")
head(sample_data_ordinal)
  label Age Gender Util_A Util_B Util_C    Util_D Comorb_A Comorb_B Comorb_C
1     1  63 FEMALE     P2      0   0.00 3.5933333        0        0        0
2     1  41 FEMALE     P2      0   0.96 3.6288889        0        0        0
3     1  86   MALE     P1      0   0.00 2.6502778        0        0        0
4     1  51   MALE     P2      0   0.00 4.9711111        0        0        0
5     1  23 FEMALE     P1      0   0.00 0.5352778        0        0        0
6     1  32 FEMALE     P2      0   4.13 4.4008333        0        0        0
  Comorb_D Comorb_E Lab_A Lab_B Lab_C Vital_A Vital_B Vital_C Vital_D Vital_E
1        0        0   117   3.9   136      91      19     100      70     152
2        1        0   500   3.6   114      91      16     100      70     147
3        0        0    72   4.1   136     100      18      99      65     126
4        0        0    67   5.0   122      73      17      97      46     100
5        0        0  1036   4.1   138      74      18      98      89     114
6        0        0   806   4.1   136      77      18      98      74     157
  Vital_F
1    25.7
2    22.6
3    25.7
4    24.9
5    25.7
6    25.3
dim(sample_data_ordinal)
[1] 20000    21
summary(sample_data_ordinal)
 label          Age            Gender            Util_A          Util_B       
 1:16360   Min.   : 18.00   FEMALE:10137   P1       : 3750   Min.   : 0.0000  
 2: 2449   1st Qu.: 50.00   MALE  : 9863   P2       :11307   1st Qu.: 0.0000  
 3: 1191   Median : 64.00                  P3 and P4: 4943   Median : 0.0000  
           Mean   : 61.68                                    Mean   : 0.9267  
           3rd Qu.: 76.00                                    3rd Qu.: 1.0000  
           Max.   :109.00                                    Max.   :42.0000  
     Util_C            Util_D         Comorb_A  Comorb_B  Comorb_C  Comorb_D 
 Min.   :  0.000   Min.   : 0.09806   0:18445   0:17401   0:19474   0:18113  
 1st Qu.:  0.000   1st Qu.: 1.52819   1: 1555   1: 2599   1:  526   1: 1887  
 Median :  0.600   Median : 2.46306                                          
 Mean   :  3.535   Mean   : 2.76030                                          
 3rd Qu.:  3.970   3rd Qu.: 3.61472                                          
 Max.   :293.680   Max.   :23.39056                                          
 Comorb_E      Lab_A            Lab_B           Lab_C          Vital_A      
 0:19690   Min.   :  16.0   Min.   :1.500   Min.   :102.0   Min.   :  0.00  
 1:  310   1st Qu.:  66.0   1st Qu.:3.800   1st Qu.:133.0   1st Qu.: 70.00  
           Median :  83.0   Median :4.100   Median :136.0   Median : 81.00  
           Mean   : 146.9   Mean   :4.155   Mean   :135.2   Mean   : 82.67  
           3rd Qu.: 115.0   3rd Qu.:4.400   3rd Qu.:138.0   3rd Qu.: 93.00  
           Max.   :3534.0   Max.   :8.800   Max.   :170.0   Max.   :197.00  
    Vital_B         Vital_C          Vital_D          Vital_E     
 Min.   : 1.00   Min.   :  0.00   Min.   :  5.00   Min.   :  0.0  
 1st Qu.:17.00   1st Qu.: 97.00   1st Qu.: 62.00   1st Qu.:116.0  
 Median :18.00   Median : 98.00   Median : 70.00   Median :131.0  
 Mean   :17.86   Mean   : 97.96   Mean   : 71.23   Mean   :133.5  
 3rd Qu.:18.00   3rd Qu.: 99.00   3rd Qu.: 79.00   3rd Qu.:148.0  
 Max.   :48.00   Max.   :100.00   Max.   :180.00   Max.   :262.0  
    Vital_F     
 Min.   : 2.30  
 1st Qu.:21.10  
 Median :23.00  
 Mean   :22.82  
 3rd Qu.:24.80  
 Max.   :44.30  
  • Recode the outcome labels to start from 0, which is required by ShapleyVIC.
sample_data_ordinal$label <- as.ordered(as.numeric(sample_data_ordinal$label) - 1)
table(sample_data_ordinal$label)

    0     1     2 
16360  2449  1191 

5.1.2 Prepare training, validation, and test datasets

  • Given the large sample size (n=20000), split the data into training (70%), validation (10%) and test (20%) sets for regression model development.
  • Stratify by the outcome variable (label) when splitting the data; a quick check of the resulting outcome distributions is sketched after the code below.
set.seed(4)
out_split <- split_data(data = sample_data_ordinal, ratio = c(0.7, 0.1, 0.2), 
                        strat_by_label = TRUE)
train_set <- out_split$train_set
dim(train_set)
[1] 14000    21
validation_set <- out_split$validation_set
dim(validation_set)
[1] 2000   21
test_set <- out_split$test_set
dim(test_set)
[1] 4000   21
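  • As a quick sanity check of the stratified split, compare the outcome distribution across the three sets; a minimal sketch using base R only:
sapply(list(train = train_set, validation = validation_set, test = test_set),
       function(d) prop.table(table(d$label)))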
  • Prepare the ord_output directory for ShapleyVIC, using train_set as the training set and validation_set as the explanation data.
Important
  • As detailed in Chapter 1, check for and handle data issues before applying ShapleyVIC. This demo uses the data as-is because it is clean simulated data; a minimal sketch of such checks follows this callout.
  • In this example the validation_set has 2000 samples, which is a reasonable sample size for the explanation data. With larger sample sizes, users should use a smaller subset as the explanation data (see Chapter 1 for details).
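  • A minimal sketch of such pre-checks (illustrative, not exhaustive; the commented subsampling line shows one way to downsize a much larger explanation set):
anyNA(train_set)                            # any missing values?
lapply(Filter(is.factor, train_set), table) # any rare factor categories?
# With a much larger validation set, a random subset could be used as the
# explanation data instead, e.g.:
# expl_set <- validation_set[sample(nrow(validation_set), 2000), ]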
output_dir <- "ord_output"
if (!dir.exists(output_dir)) dir.create(output_dir)
write.csv(train_set, file = file.path(output_dir, "train_set.csv"), 
          row.names = FALSE)
write.csv(validation_set, 
          file = file.path(output_dir, "validation_set.csv"), 
          row.names = FALSE)

5.2 [Python] Compute ShapleyVIC values

This part of the workflow is implemented in Python.

  • Load data and set up input information.
  • For an ordinal outcome, specify outcome_type='ordinal'.
import os
import pandas as pd
output_dir = "ord_output"
dat_train = pd.read_csv(os.path.join(output_dir, 'train_set.csv'))
dat_expl = pd.read_csv(os.path.join(output_dir, 'validation_set.csv'))

y_name = 'label'
from ShapleyVIC import model
model_object = model.models(
    x=dat_train.drop(columns=[y_name]), y=dat_train[y_name], 
    x_names_cat=['Gender', 'Util_A', 'Comorb_A', 'Comorb_B', 'Comorb_C', 'Comorb_D', 'Comorb_E'],
    outcome_type='ordinal', output_dir=output_dir
)
  • Set values for hyper-parameters u1 and u2.
u1, u2 = model_object.init_hyper_params()
(u1, u2)
(0.5, 43.75)
  • Draw 250 nearly optimal models from 500 initial samples.
model_object.draw_models(u1=u1, u2=u2, m=500, n_final=250, random_state=1234)
model_object.models_plot

  • Compute ShapleyVIC values.
from ShapleyVIC import compute
m_svic = compute.compute_shapley_vic(
    model_obj=model_object, 
    x_expl=dat_expl.drop(columns=[y_name]), y_expl=dat_expl[y_name], 
    n_cores=20, # running on a PC with 40 logical processors
    threshold=0.05
)
Note
  • For users’ reference, the command above took approximately 18 hours on a PC (Windows 10 Education; Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz (2 processors); 128GB RAM).

5.3 [R] Develop prediction model

This part of the workflow is implemented in R.

5.3.1 Overall variable importance from ShapleyVIC

  • Compile ShapleyVIC output.
output_dir <- "ord_output"
library(ShapleyVIC)
model_object <- compile_shapley_vic(
  output_dir = output_dir, outcome_type = "ordinal", 
  x_names_cat = c('Gender', 'Util_A', 'Comorb_A', 'Comorb_B', 'Comorb_C', 
                  'Comorb_D', 'Comorb_E')
)
Compiling results for ordinal outcome using loss criterion to define nearly optimal models.
  • Visualize ShapleyVIC values for overall variable importance.
model_plots <- plot(model_object)
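  • The plots returned above can be saved to file for reporting; a minimal sketch, assuming model_plots can be printed like a regular R plot object (the file name is illustrative):
pdf(file.path(output_dir, "shapleyvic_importance_plots.pdf"), width = 8, height = 6)
print(model_plots)
dev.off()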

5.3.2 ShapleyVIC-assisted backward selection

  • Identify variables with significant overall importance.
vars_svic <- rank_variables(model_object, summarise = TRUE, as_vector = TRUE) %>%
  names()
vars_svic
[1] "Comorb_A" "Age"      "Vital_A"  "Util_D"   "Util_B"   "Vital_E"  "Vital_B" 
  • Starting with a model that includes all variables above, develop a sparse regression model using AIC-based stepwise selection (implemented in the MASS package).
  • Use the ordinal package to develop ordinal regression models, specifically a cumulative link model (CLM) with the logit link; the parameterisation is illustrated after the first model fit below. See the ordinal package documentation for detailed usage.
# Model with all ShapleyVIC-selected variables:
library(ordinal)

Attaching package: 'ordinal'
The following object is masked from 'package:dplyr':

    slice
m_svic_all <- clm(label ~ ., data = train_set[, c("label", vars_svic)])
summary(m_svic_all)
formula: label ~ Comorb_A + Age + Vital_A + Util_D + Util_B + Vital_E + Vital_B
data:    train_set[, c("label", vars_svic)]

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7726.66 15471.32 5(0)  7.24e-07 1.0e+07

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
Comorb_A1  1.5184021  0.0659862  23.011  < 2e-16 ***
Age        0.0195374  0.0013333  14.654  < 2e-16 ***
Vital_A    0.0105223  0.0012776   8.236  < 2e-16 ***
Util_D    -0.0768266  0.0140619  -5.463 4.67e-08 ***
Util_B     0.1349605  0.0087143  15.487  < 2e-16 ***
Vital_E   -0.0062609  0.0009181  -6.819 9.14e-12 ***
Vital_B   -0.0039738  0.0127336  -0.312    0.755    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1   2.8254     0.2961   9.543
1|2   4.1694     0.2980  13.993
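  • To make the coefficient signs and the threshold coefficients above concrete, the cumulative probabilities for one observation can be reconstructed by hand; a minimal sketch, assuming the usual clm parameterisation logit(P(label <= j | x)) = theta_j - x'beta, with thresholds stored in m_svic_all$alpha:
# Linear predictor for the first training observation, built with the same
# model.matrix construction used for the test-set scores later on:
x1 <- model.matrix(~ ., data = train_set[1, vars_svic])[, -1, drop = FALSE]
eta1 <- drop(x1 %*% m_svic_all$beta)
# Cumulative probabilities P(label <= 0) and P(label <= 1); P(label <= 2) = 1:
plogis(m_svic_all$alpha - eta1)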
# Backward selection:
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
m_svic <- stepAIC(object = m_svic_all, scope = list(upper = ~ ., lower = ~ 1), 
                  trace = FALSE)
summary(m_svic)
formula: label ~ Comorb_A + Age + Vital_A + Util_D + Util_B + Vital_E
data:    train_set[, c("label", vars_svic)]

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7726.71 15469.42 5(0)  7.24e-07 4.1e+06

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
Comorb_A1  1.5186641  0.0659814  23.017  < 2e-16 ***
Age        0.0195405  0.0013332  14.657  < 2e-16 ***
Vital_A    0.0105274  0.0012775   8.240  < 2e-16 ***
Util_D    -0.0768318  0.0140616  -5.464 4.66e-08 ***
Util_B     0.1349614  0.0087136  15.489  < 2e-16 ***
Vital_E   -0.0062629  0.0009181  -6.822 8.99e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1   2.8967     0.1885   15.37
1|2   4.2407     0.1914   22.15
# Variables selected:
(x_svic <- setdiff(names(m_svic$model), "label"))
[1] "Comorb_A" "Age"      "Vital_A"  "Util_D"   "Util_B"   "Vital_E" 
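  • Predicted class probabilities on the test set can also be obtained directly from the fitted CLM; a minimal sketch, assuming predict() for clm objects returns per-class probabilities in its fit component when the outcome column is omitted from newdata:
# Per-class predicted probabilities from the ShapleyVIC-assisted model:
pred_svic <- predict(m_svic, newdata = test_set[, x_svic], type = "prob")
head(pred_svic$fit)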

5.3.3 Compare with conventional backward selection

  • Backward selection from all candidate variables.
# Model with all variables:
m_all <- clm(formula = label ~ ., data = train_set)
summary(m_all)
formula: 
label ~ Age + Gender + Util_A + Util_B + Util_C + Util_D + Comorb_A + Comorb_B + Comorb_C + Comorb_D + Comorb_E + Lab_A + Lab_B + Lab_C + Vital_A + Vital_B + Vital_C + Vital_D + Vital_E + Vital_F
data:    train_set

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7706.19 15458.38 5(0)  8.70e-07 4.2e+08

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
Age              0.0196646  0.0013361  14.718  < 2e-16 ***
GenderMALE      -0.0890106  0.0452970  -1.965 0.049409 *  
Util_AP2         0.0036111  0.0601693   0.060 0.952144    
Util_AP3 and P4 -0.0231994  0.0696219  -0.333 0.738969    
Util_B           0.1360881  0.0087351  15.579  < 2e-16 ***
Util_C          -0.0003512  0.0024950  -0.141 0.888066    
Util_D          -0.0769431  0.0140759  -5.466 4.60e-08 ***
Comorb_A1        1.5267179  0.0661465  23.081  < 2e-16 ***
Comorb_B1        0.0322990  0.0664355   0.486 0.626846    
Comorb_C1        0.2021264  0.1351520   1.496 0.134771    
Comorb_D1        0.0261648  0.0771122   0.339 0.734377    
Comorb_E1       -0.1684798  0.1917073  -0.879 0.379489    
Lab_A            0.0004328  0.0001057   4.092 4.27e-05 ***
Lab_B           -0.0069865  0.0332327  -0.210 0.833489    
Lab_C           -0.0067104  0.0046660  -1.438 0.150389    
Vital_A          0.0106528  0.0012804   8.320  < 2e-16 ***
Vital_B         -0.0040444  0.0127466  -0.317 0.751019    
Vital_C         -0.0013990  0.0069806  -0.200 0.841155    
Vital_D          0.0006639  0.0016693   0.398 0.690843    
Vital_E         -0.0062020  0.0009198  -6.743 1.55e-11 ***
Vital_F         -0.0243927  0.0063573  -3.837 0.000125 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1    1.299      1.010   1.287
1|2    2.647      1.010   2.620
# Backward selection:
m_back <- stepAIC(object = m_all, scope = list(upper = ~ ., lower = ~ 1), 
                  trace = FALSE)
summary(m_back)
formula: 
label ~ Age + Gender + Util_B + Util_D + Comorb_A + Comorb_C + Lab_A + Vital_A + Vital_E + Vital_F
data:    train_set

 link  threshold nobs  logLik   AIC      niter max.grad cond.H 
 logit flexible  14000 -7708.06 15440.13 5(0)  8.70e-07 1.8e+07

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
Age         0.0196518  0.0013353  14.717  < 2e-16 ***
GenderMALE -0.0906919  0.0452646  -2.004  0.04511 *  
Util_B      0.1358021  0.0087232  15.568  < 2e-16 ***
Util_D     -0.0771132  0.0140754  -5.479 4.29e-08 ***
Comorb_A1   1.5262185  0.0660905  23.093  < 2e-16 ***
Comorb_C1   0.2026533  0.1350978   1.500  0.13360    
Lab_A       0.0004345  0.0001057   4.111 3.94e-05 ***
Vital_A     0.0106409  0.0012794   8.317  < 2e-16 ***
Vital_E    -0.0062111  0.0009193  -6.756 1.42e-11 ***
Vital_F    -0.0243158  0.0063537  -3.827  0.00013 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
0|1   2.3946     0.2376   10.08
1|2   3.7417     0.2397   15.61
# Variables selected:
(x_back <- setdiff(names(m_back$model), "label"))
 [1] "Age"      "Gender"   "Util_B"   "Util_D"   "Comorb_A" "Comorb_C"
 [7] "Lab_A"    "Vital_A"  "Vital_E"  "Vital_F" 
  • ShapleyVIC-assisted backward selection produced a more parsimonious model (6 variables) than conventional backward selection (10 variables) without significantly impairing performance, as suggested by the overlapping bootstrap confidence intervals below; a compact side-by-side summary is sketched after the performance output.
# Performance of model from ShapleyVIC-assisted backward selection on test set:
fx_svic <- model.matrix(~ ., data = test_set[, x_svic])[, -1] %*% m_svic$beta
print_performance_ordinal(label = test_set$label, score = as.numeric(fx_svic), 
                          n_boot = 100, report_cindex = TRUE)
mAUC: 0.7291     95% CI: 0.7062-0.7471 (from 100 bootstrap samples)
Generalised c-index: 0.6990      95% CI: 0.6803-0.7146 (from 100 bootstrap samples)
# Performance of model from conventional backward selection on test set:
fx_back <- model.matrix(~ ., data = test_set[, x_back])[, -1] %*% m_back$beta
print_performance_ordinal(label = test_set$label, score = as.numeric(fx_back), 
                          n_boot = 100, report_cindex = TRUE)
mAUC: 0.7351     95% CI: 0.7165-0.7559 (from 100 bootstrap samples)
Generalised c-index: 0.7033      95% CI: 0.6881-0.7197 (from 100 bootstrap samples)
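  • The trade-off between the two final models can be summarised side by side; a minimal sketch using the fitted model objects (AIC values refer to the training data):
# Number of selected variables and training-data AIC of the two final models:
data.frame(
  model = c("ShapleyVIC-assisted", "Conventional backward"),
  n_variables = c(length(x_svic), length(x_back)),
  AIC = c(AIC(m_svic), AIC(m_back))
)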