2  Data processing and checking

2.1 Load data

  • Read data from CSV or Excel files.
  • For this demo, use the integrated data samples in the package.
  • sample_data has 20000 simulated samples with binary outcomes, with the same distribution as the data in the MIMIC-III ICU database (https://mimic.mit.edu/).
  • sample_data_survival has 20000 simulated samples with survival outcomes, also based on the MIMIC-III ICU database.
  • sample_data_ordinal has 20000 simulated samples with a 3-category ordinal outcome, based on emergency department data from a tertiary hospital.
library(AutoScore)
data("sample_data")
data("sample_data_survival")
data("sample_data_ordinal")
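To load your own data rather than the packaged samples, base R's read.csv() is usually enough. The following is a minimal sketch, not part of AutoScore itself: the toy table, its values and the temporary file are made up so that the example runs on its own.

```r
# Write a tiny, made-up table to a temporary CSV file so the example is
# self-contained. In practice you would already have a file on disk.
toy <- data.frame(
  Age = c(65, 72, 58),
  Lab_A = c(1.2, 0.9, 1.5),
  Mortality_inpatient = c(0, 1, 0)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the CSV back in; replace csv_path with the path to your own file.
my_data <- read.csv(csv_path)
str(my_data)

# For Excel files, one option (assuming the readxl package is installed) is:
# my_data <- as.data.frame(readxl::read_excel("my_file.xlsx"))
```

The file name "my_file.xlsx" is a placeholder. Converting to a plain data frame with as.data.frame() is a cautious choice here, since read_excel() returns a tibble.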

2.2 Check outcomes

Ensure that the data contain a dependent variable (i.e., an outcome).

  • For binary and ordinal outcomes: Change the name of the outcome to "label" (make sure no other variable uses the same name).
  • For survival outcomes: Change the names of the time and status variables to "label_time" and "label_status".
  • You may use the following code to revise or check the names of outcomes.
names(sample_data)[names(sample_data) == "Mortality_inpatient"] <- "label"

# The survival and ordinal sample data already have appropriate variable
# names:
names(sample_data_survival)
 [1] "Vital_A"      "Vital_B"      "Vital_C"      "Vital_D"      "Vital_E"     
 [6] "Vital_F"      "Vital_G"      "Lab_A"        "Lab_B"        "Lab_C"       
[11] "Lab_D"        "Lab_E"        "Lab_F"        "Lab_G"        "Lab_H"       
[16] "Lab_I"        "Lab_J"        "Lab_K"        "Lab_L"        "Lab_M"       
[21] "Age"          "label_status" "label_time"  
names(sample_data_ordinal)
 [1] "label"    "Age"      "Gender"   "Util_A"   "Util_B"   "Util_C"  
 [7] "Util_D"   "Comorb_A" "Comorb_B" "Comorb_C" "Comorb_D" "Comorb_E"
[13] "Lab_A"    "Lab_B"    "Lab_C"    "Vital_A"  "Vital_B"  "Vital_C" 
[19] "Vital_D"  "Vital_E"  "Vital_F" 
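For survival data of your own, the same renaming pattern applies to both outcome columns. The sketch below uses a made-up data frame; the original column names time_to_event and event are hypothetical and stand in for whatever your dataset uses.

```r
# Hypothetical survival data with made-up column names and values.
toy_surv <- data.frame(
  Age = c(65, 72, 58),
  time_to_event = c(30, 12, 45),
  event = c(0, 1, 0)
)

# Rename the time and status columns to the names AutoScore expects.
names(toy_surv)[names(toy_surv) == "time_to_event"] <- "label_time"
names(toy_surv)[names(toy_surv) == "event"] <- "label_status"

# Confirm that each outcome name appears exactly once in the data.
outcome_names <- c("label_time", "label_status")
name_counts <- sapply(outcome_names, function(x) sum(names(toy_surv) == x))
print(name_counts)
stopifnot(all(name_counts == 1))
```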
  • Check whether the data fulfil the basic requirements of AutoScore, using the appropriate function for the outcome type.
check_data(sample_data)
Data type check passed. 
No NA in data. 
check_data_survival(sample_data_survival)
Data type check passed. 
No NA in data. 
check_data_ordinal(sample_data_ordinal)
Data type check passed. 
No NA in data. 
  • Fix the problem if you see any warnings.
  • Modify your data, and re-run the appropriate function to check the data again until there are no warning messages.

2.3 Check variables

Use check_data(), check_data_survival() or check_data_ordinal() to check whether the current data fulfil the following requirements:

  • No special characters in variable names, e.g., "[", "]", "(", ")" and ",". We suggest replacing them with "_" if needed.
  • Each variable name should be unique and should not be entirely contained in another variable name.
  • Independent variables should be numeric (class: numeric/integer) or categorical (class: factor/logical).
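To see what these requirements mean in practice, the following sketch reproduces them with base R on a made-up data frame. It is an illustration only and is not how the check functions are implemented; the column names (including the deliberately problematic "Lab(A)", "Age" inside "Age_group", and the character column Notes) are invented.

```r
# Made-up data with three deliberate problems: a name with parentheses,
# a name contained in another name, and a character column.
toy <- data.frame(
  Age = c(65, 72, 58),
  Age_group = factor(c("old", "old", "mid")),
  `Lab(A)` = c(1.2, 0.9, 1.5),
  Smoker = c(TRUE, FALSE, TRUE),
  Notes = c("a", "b", "c"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
vars <- names(toy)

# 1. Names containing any of the special characters [ ] ( ) ,
bad_chars <- vars[grepl("[\\[\\]\\(\\),]", vars, perl = TRUE)]

# 2. Duplicated names, and names entirely contained in another name.
has_duplicates <- anyDuplicated(vars) > 0
contained <- vars[sapply(vars, function(v) {
  any(grepl(v, setdiff(vars, v), fixed = TRUE))
})]

# 3. Columns that are neither numeric, factor nor logical.
ok_class <- sapply(toy, function(x) {
  is.numeric(x) || is.factor(x) || is.logical(x)
})
bad_class <- vars[!ok_class]

print(list(bad_chars = bad_chars, has_duplicates = has_duplicates,
           contained = contained, bad_class = bad_class))
```

A typical fix would be to rename the flagged columns, for example with gsub() to replace special characters by "_", and to convert character columns to factors with as.factor().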

2.3.1 Handle missing values

The data-checking functions check_data(), check_data_survival() and check_data_ordinal() display the missing rate as a warning for each variable that contains missing values.

  • As input, AutoScore requires a complete dataset (no missing values). If your data are complete and fulfil the other requirements, you can move forward to modelling.
  • If your dataset has missing values and you believe the missingness is informative and prevalent enough that you prefer to preserve it as NA rather than removing or imputing it, you can also move forward, because AutoScore automatically handles missing values by treating them as a new category named ‘Unknown’.
  • Otherwise, we suggest you first handle your missing values using appropriate imputation methods.
Note

Either way (imputation or treating missingness as a new category), variables with a high missing rate (e.g., >80%) may reduce model stability and should be analysed with caution.
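As a rough illustration of the options above, the sketch below computes missing rates with base R, flags variables above the 80% example threshold, and fills one numeric variable with its median. Median imputation is used here only as a simple stand-in, not as a recommended method, and the data frame is made up.

```r
# Made-up data with missing values in every column.
toy <- data.frame(
  Age = c(65, NA, 58, 70, 61, 49),
  Lab_A = c(NA, NA, NA, NA, NA, 1.5),
  Gender = factor(c("F", "M", NA, "M", "F", "F"))
)

# Missing rate per variable (proportion of NA values).
missing_rate <- colMeans(is.na(toy))
print(round(missing_rate, 2))

# Flag variables whose missing rate exceeds the illustrative 80% threshold.
high_missing <- names(missing_rate)[missing_rate > 0.8]
print(high_missing)

# Simple stand-in imputation: replace missing Age values with the median
# of the observed values. Other methods may suit your data better.
imputed <- toy
imputed$Age[is.na(imputed$Age)] <- median(imputed$Age, na.rm = TRUE)
```

Variables flagged in high_missing are the ones the note above asks you to analyse with caution, whichever way you handle them.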

2.3.2 Optional operations

  • Check variable distribution.
  • Handle outliers. The raw data may contain outliers caused by system errors or clerical mistakes. Users are advised to handle them before using AutoScore to ensure good performance.
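Both optional steps can be sketched with base R. The example below summarises one made-up variable and flags values outside 1.5 times the interquartile range, a common rule of thumb that is assumed here for illustration and is not prescribed by AutoScore. Setting flagged values to NA is only one possible way of handling them.

```r
# Made-up vital sign with one implausible value (999).
toy <- data.frame(Vital_A = c(72, 80, 65, 90, 77, 68, 85, 999))

# Check the distribution; hist(toy$Vital_A) gives a visual check.
print(summary(toy$Vital_A))

# Flag values outside the quartiles by more than 1.5 x IQR.
q <- unname(quantile(toy$Vital_A, c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- toy$Vital_A < lower | toy$Vital_A > upper
print(toy$Vital_A[is_outlier])

# One possible handling: set flagged values to NA so that they are treated
# as missing in the subsequent steps.
cleaned <- toy
cleaned$Vital_A[is_outlier] <- NA
```

Whether a flagged value is a true error or a genuine extreme measurement is a clinical judgement, so it is worth reviewing flagged values before changing them.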