Assessing the Performance of Machine Learning Models for Default Prediction under Missing Data and Class Imbalance: A Simulation Study

  • Lindani Dube Lecturer
  • Tanja Verster


In machine learning, robust model performance is essential for accurate predictions and informed decision-making. One critical challenge that hampers the effectiveness of machine learning algorithms is missing data: missing values are ubiquitous in real-world datasets and can significantly degrade predictive models. This study explores how increasing levels of missing values affect the performance of machine learning models. Simulated samples with between 5% and 50% missing values were generated, and several models were evaluated on each. The results showed a consistent trend: model performance deteriorated as the proportion of missing values increased, with accuracy falling across all models. Among the models evaluated, decision trees (DT) and random forests (RF) consistently achieved high accuracy across all sampling techniques, demonstrating their robustness to missing values. Logistic regression (LR) also performed relatively well, with consistent performance across the different levels of missing values. In contrast, the stochastic gradient descent classifier (SGDC), K-nearest neighbors (kNN), and naive Bayes (NB) models consistently produced lower accuracy scores across all sampling techniques, indicating limitations in handling missing values even when the dataset was more balanced. The study also found that SMOTE (Synthetic Minority Over-sampling Technique) outperformed the under-sampling approach: models trained with SMOTE consistently achieved higher accuracy at every level of missing values. This suggests that SMOTE handles imbalanced datasets effectively and enhances classification performance, particularly when missing values are present.
In an era where data fuels decision-making, this study's insights into the escalating impact of missing values on machine learning models stand as a clarion call for robust data handling techniques. As the quest for accurate predictions gains paramount importance, addressing the pervasive challenge of missing data emerges as a cornerstone for unlocking the true potential of machine learning in real-world applications.
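The simulation design summarized above can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not state the missingness mechanism, the imputation step, or the SMOTE settings, so the sketch assumes values are removed completely at random and uses a simplified SMOTE-style interpolation between minority-class neighbours. The function names (`inject_missing`, `undersample`, `smote_like`), the toy data, and the parameter `k` are hypothetical and chosen only for illustration.

```python
import random


def inject_missing(X, rate, rng):
    """Return a copy of X with a fraction `rate` of cells set to None.

    Assumption: cells are removed completely at random (MCAR).
    """
    cells = [(i, j) for i in range(len(X)) for j in range(len(X[0]))]
    masked = [list(row) for row in X]
    for i, j in rng.sample(cells, round(rate * len(cells))):
        masked[i][j] = None
    return masked


def undersample(X, y, rng):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    by_class = {}
    for idx, label in enumerate(y):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    keep = sorted(i for idxs in by_class.values() for i in rng.sample(idxs, n_min))
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_like(X, y, minority, rng, k=3):
    """Add synthetic minority rows until the minority matches the largest class.

    Simplified SMOTE-style step: pick a minority row, pick one of its k nearest
    minority neighbours, and interpolate a new point between them.
    Requires complete (already imputed) numeric rows.
    """
    counts = {label: y.count(label) for label in set(y)}
    n_new = max(counts.values()) - counts[minority]
    pool = [row for row, label in zip(X, y) if label == minority]
    X_out, y_out = [list(row) for row in X], list(y)
    for _ in range(n_new):
        base = rng.choice(pool)
        neighbours = sorted(
            (p for p in pool if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position of the synthetic point along the segment
        X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out


# Toy imbalanced data set: 90 majority rows (label 0), 10 minority rows (label 1).
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(90)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10

# Missingness levels spanning the 5% to 50% range described in the abstract.
for rate in (0.05, 0.25, 0.50):
    masked = inject_missing(X, rate, rng)
    n_missing = sum(value is None for row in masked for value in row)
    print(f"rate={rate:.2f}: {n_missing} missing cells")

X_under, y_under = undersample(X, y, rng)
X_smote, y_smote = smote_like(X, y, minority=1, rng=rng)
print("under-sampled size:", len(y_under), "SMOTE-style size:", len(y_smote))
```

In a full experiment, the masked data would be imputed before resampling, and each classifier would then be trained and scored at every missingness level. The sketch also makes one contrast concrete: under-sampling discards majority-class rows, whereas the SMOTE-style step keeps every original row and adds synthetic ones.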


Research Articles