International Journal of Web Research

Explainable Diabetes Prediction via Hybrid Data Preprocessing and Ensemble Learning

Document Type : Original Article

Authors
1 Data Mining Laboratory, Department of Computer Engineering, Faculty of Engineering, Alzahra University, Tehran, Iran.
2 Department of Computer Engineering, Faculty of Engineering, Alzahra University, Tehran, Iran.
Abstract
Accurate and early prediction of diabetes is crucial for initiating prompt treatment and minimizing the risk of long-term health issues. This study introduces a comprehensive machine learning model aimed at improving diabetes prediction by leveraging two clinical datasets: the PIMA Indians Diabetes Dataset and the Early-Stage Diabetes Dataset. The pipeline tackles common challenges in medical data, such as missing values, class imbalance, and feature relevance, through a series of advanced preprocessing steps, including class-specific imputation, engineered feature construction, and SMOTETomek resampling. To identify the most informative predictors, a hybrid feature selection strategy is employed, integrating recursive feature elimination, Random Forest-based importance, and gradient boosting. Model training uses Random Forest and Gradient Boosting classifiers, which are fine-tuned and combined through weighted ensemble averaging to boost predictive performance. The resulting model achieves 93.33% accuracy on the PIMA dataset and 98.44% accuracy on the Early-Stage dataset, outperforming previously reported approaches. To enhance transparency and clinical applicability, both local (LIME) and global (SHAP) explainability methods are applied, highlighting clinically relevant features. Furthermore, probability calibration is performed to ensure that predicted risk scores align with true outcome frequencies, increasing trust in the model’s use for clinical decision support. Overall, the proposed model offers a robust, interpretable, and clinically reliable solution for early-stage diabetes prediction.
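Two of the steps named in the abstract, class-specific imputation and weighted ensemble averaging, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which statistic is used for imputation or how the ensemble weights are chosen, so the per-class median, the function names `impute_by_class` and `weighted_average_proba`, and the example weights are all assumptions made for illustration.

```python
from statistics import median


def impute_by_class(rows, labels, missing=None):
    """Fill missing entries with the median of the same column, computed
    only over rows that share the same class label.

    Assumption: the median is used; the abstract does not name the statistic.
    Raises KeyError if a class has no observed value for a column.
    """
    # Collect the observed (non-missing) values per (label, column) pair.
    observed = {}
    for row, label in zip(rows, labels):
        for col, value in enumerate(row):
            if value is not missing:
                observed.setdefault((label, col), []).append(value)

    # Replace each gap with its class-specific median; keep other values.
    filled = []
    for row, label in zip(rows, labels):
        filled.append([
            median(observed[(label, col)]) if value is missing else value
            for col, value in enumerate(row)
        ])
    return filled


def weighted_average_proba(probas, weights):
    """Combine positive-class probabilities from several models.

    probas:  one list of probabilities per model, all of equal length.
    weights: one non-negative weight per model (hypothetical values here);
             they are normalized by their sum.
    """
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(len(probas[0]))
    ]


if __name__ == "__main__":
    rows = [[1.0, None], [3.0, 10.0], [None, 20.0], [5.0, 30.0]]
    labels = [0, 0, 1, 1]
    print(impute_by_class(rows, labels))

    # Hypothetical outputs of two models for two patients, weighted 3:1.
    print(weighted_average_proba([[0.2, 0.8], [0.6, 0.4]], [3, 1]))
```

In a real pipeline the per-class statistics would have to be computed on the training split only, since class labels are unavailable at prediction time; the sketch leaves that split out for brevity.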