ShapleyVIC: Shapley Variable Importance Cloud for Interpretable Machine Learning

Author

Yilin Ning, Chenglin Niu, Mingxuan Liu, Siqi Li, Nan Liu

Published

2024-01-07

ShapleyVIC Introduction

Variable importance assessment is central to interpreting machine learning models. Current practice in interpretable machine learning focuses on explaining the final model that optimizes predictive performance. However, this does not fully address practical needs, where researchers may prefer models that are “good enough” but easier to understand or implement. Shapley variable importance cloud (ShapleyVIC) fills this gap by extending the assessment to a set of “good models” for comprehensive and robust inference. Building on a common theoretical basis (i.e., Shapley values for variable importance), ShapleyVIC seamlessly complements the widely adopted SHAP assessment of a single final model to avoid biased inference. Please visit the GitHub page for source code.

Usage

As detailed in Chapter 3, a ShapleyVIC analysis of variable importance consists of three general steps (a conceptual sketch follows this list):

  1. Train an optimal prediction model (e.g., a logistic regression model).
  2. Generate a reasonable number (e.g., 350) of nearly optimal models of the same model class (e.g., logistic regression).
  3. Evaluate Shapley-based variable importance from each nearly optimal model and pool the information for inference.
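
The sketch below walks through these three steps conceptually, using scikit-learn and the shap library rather than the ShapleyVIC API itself. The model class, the rejection rule for “nearly optimal” (loss within 5% of the optimal loss), the perturbation scale, and the pooling rule are illustrative assumptions, not the method's exact definitions.

# Conceptual sketch of the three-step workflow (NOT the ShapleyVIC API).
# Assumptions: binary outcome, logistic regression as the model class,
# "nearly optimal" = log-loss within 5% of the optimal model's loss.
from copy import deepcopy
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Step 1: train the optimal model.
optimal = LogisticRegression(max_iter=1000).fit(X, y)
optimal_loss = log_loss(y, optimal.predict_proba(X))

# Step 2: generate nearly optimal models by perturbing the optimal
# coefficients and keeping models whose loss stays within 5% of optimal.
rng = np.random.default_rng(0)
nearly_optimal = []
while len(nearly_optimal) < 50:  # e.g., 350 in a real analysis
    m = deepcopy(optimal)
    m.coef_ = optimal.coef_ + rng.normal(scale=0.1, size=optimal.coef_.shape)
    m.intercept_ = optimal.intercept_ + rng.normal(scale=0.1, size=1)
    if log_loss(y, m.predict_proba(X)) <= 1.05 * optimal_loss:
        nearly_optimal.append(m)

# Step 3: evaluate Shapley-based importance for each nearly optimal model
# and pool (here: mean absolute SHAP value per variable, averaged over models).
importance = []
for m in nearly_optimal:
    explainer = shap.LinearExplainer(m, X)
    importance.append(np.abs(explainer.shap_values(X)).mean(axis=0))
print("Pooled importance per variable:", np.mean(importance, axis=0))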

Chapter 3 demonstrates a ShapleyVIC application to binary outcomes, and Chapters 6 and 7 provide additional examples for ordinal and continuous outcomes, respectively.

ShapleyVIC does not require variable centering or standardization, but some data checking and pre-processing is needed for stable and smooth execution, as summarized in Chapter 2; the snippet below illustrates the kind of checks involved.
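
The following checks are illustrative assumptions about what such pre-processing might cover (missing values, constant columns, strongly correlated predictors); they are not the Chapter 2 checklist itself.

# Illustrative pre-analysis data checks (assumed, not the Chapter 2 checklist).
import pandas as pd

def check_data(df: pd.DataFrame, corr_threshold: float = 0.9) -> None:
    # Missing values typically need imputation or exclusion before modeling.
    missing = df.isna().sum()
    print("Columns with missing values:\n", missing[missing > 0])
    # Constant columns carry no information and can destabilize model fitting.
    constant = [c for c in df.columns if df[c].nunique() <= 1]
    print("Constant columns:", constant)
    # Strongly correlated predictors can make nearly optimal models unstable.
    corr = df.select_dtypes("number").corr().abs()
    pairs = [(a, b, round(corr.loc[a, b], 2))
             for i, a in enumerate(corr.columns)
             for b in corr.columns[i + 1:]
             if corr.loc[a, b] > corr_threshold]
    print("Highly correlated pairs:", pairs)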

The ShapleyVIC-based variable ranking can also be used with the AutoScore framework to develop interpretable clinical risk scores, as we demonstrate in Chapters 4 and 5.

Installation

The ShapleyVIC framework is implemented as a Python library, which trains the optimal model, generates nearly optimal models, and evaluates Shapley-based variable importance from these models, and an R package, which pools information across models into summary statistics and visualizations for inference. A conceptual sketch of the pooling step follows.
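
The pooling itself is done by the R package; the plain-Python sketch below only illustrates the idea of summarizing per-model importance values into means, intervals, and a ranking. The variable names, simulated values, and percentile intervals are assumptions for illustration.

# Conceptual pooling of per-model importance (the real pooling and
# visualization live in the ShapleyVIC R package; this is an assumed sketch).
import numpy as np

# importance: one row per nearly optimal model, one column per variable
# (simulated here; in practice these come from the Python library's output).
rng = np.random.default_rng(1)
importance = rng.normal(loc=[0.5, 0.2, 0.1], scale=0.05, size=(350, 3))
variables = ["age", "sbp", "bmi"]  # hypothetical variable names

# Pool across models: mean importance with simple 95% percentile intervals.
mean_imp = importance.mean(axis=0)
lower, upper = np.percentile(importance, [2.5, 97.5], axis=0)
for rank, i in enumerate(np.argsort(-mean_imp), start=1):
    print(f"{rank}. {variables[i]}: {mean_imp[i]:.3f} "
          f"[{lower[i]:.3f}, {upper[i]:.3f}]")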

Python library

  • Required: Python version 3.6 or higher.
    • Recommended: latest stable release of Python 3.9 or 3.10.
  • Required: latest version of git.

Execute the following command in Terminal/Command Prompt to install the Python library from GitHub:

  • Linux/macOS:
pip install git+"https://github.com/nliulab/ShapleyVIC#egg=ShapleyVIC&subdirectory=python"
  • Windows:
python.exe -m pip install git+"https://github.com/nliulab/ShapleyVIC#egg=ShapleyVIC&subdirectory=python"

R package

  • Required: R version 3.5.0 or higher.
    • Recommended: use latest version of R with RStudio.

Execute the following command in R/RStudio to install the R package from GitHub:

# Install devtools if needed, then install the ShapleyVIC R package from GitHub
if (!require("devtools", quietly = TRUE)) install.packages("devtools")
devtools::install_github("nliulab/ShapleyVIC/r")

Citation

Core paper

Method extension

Clinical applications

Contact

  • Yilin Ning (Email: yilin.ning@duke-nus.edu.sg)
  • Nan Liu (Email: liu.nan@duke-nus.edu.sg)