A Comprehensive Data Enhancement Method for the Pima Dataset to Improve Diabetes Prediction Performance
Authors
Diabetes is a silent killer that can cause serious harm if left without medication and a real change in lifestyle. According to the International Diabetes Federation (IDF) Diabetes Atlas (2021), 10.5% of adults worldwide (20-79 years) have diabetes [1], and the number is rising. In this study, we therefore aim to build a prediction model using the Pima Indian Diabetes (PID) dataset. The dataset required heavy processing because of its low-quality characteristics, such as many missing values and class imbalance. This paper shows how enhancing data quality can effectively improve models' performance. Based on the conducted experiments, ensemble models such as Random Forest show the highest performance (0.86 AUC-ROC) and the largest improvement among all models, at around 4%.
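The abstract does not spell out the individual processing steps, so the following is only a minimal sketch of the general idea: treat zero-coded entries as missing, impute them, compensate for class imbalance, and score a Random Forest by AUC-ROC. It assumes scikit-learn is available and runs on synthetic data rather than the real PID dataset. The function names, the median imputation, and the use of class weighting for imbalance are illustrative assumptions, not the authors' method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def make_synthetic_pid_like(n_samples=600, seed=0):
    """Synthetic stand-in for a PID-like table (hypothetical, NOT the real dataset)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(loc=100.0, scale=20.0, size=(n_samples, 4))
    # Imbalanced labels (about 35% positive), driven mainly by the first column.
    score = X[:, 0] + rng.normal(scale=10.0, size=n_samples)
    y = (score > np.quantile(score, 0.65)).astype(int)
    # Mimic missing values recorded as zeros in roughly 10% of cells.
    mask = rng.random(X.shape) < 0.10
    X[mask] = 0.0
    return X, y


def evaluate(X, y, seed=0):
    """Impute zero-coded gaps, fit a class-weighted forest, return test AUC-ROC."""
    # Assumption: a zero in these columns means "not recorded".
    X = np.where(X == 0.0, np.nan, X)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    model = make_pipeline(
        # Inside a pipeline, the imputer is fitted on the training split only.
        SimpleImputer(strategy="median"),
        # Class weighting is one simple way to handle imbalance (an assumption here).
        RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=seed
        ),
    )
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, proba)


X, y = make_synthetic_pid_like()
auc = evaluate(X, y)
print(f"AUC-ROC on synthetic data: {auc:.3f}")
```

Keeping the imputer inside the pipeline avoids leaking test-set statistics into training. The score printed here describes the synthetic data only and says nothing about the results reported in the paper.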
Keywords:
Diabetes Prediction, Pima Indian Diabetes (PID) Dataset, Ensemble Models
References
[1] International Diabetes Federation (IDF). (2021). Diabetes Atlas, 10th Edition.
[2] World Health Organization (WHO). (2021). Global Report on Diabetes.
[3] American Diabetes Association (ADA), "Diagnosis and Classification of Diabetes Mellitus," Diabetes Care, vol. 37, no. 1, pp. S81–S90, 2023.
[4] WHO. (2020). Global Estimates of Undiagnosed Diabetes.
[5] Esteva, A., Kuprel, B., Novoa, R. A., Ko, J., Swetter, S. M., Blau, H. M., & Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542(7639), 115-118.
[6] Obermeyer, Z., Powers, B. J., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
[7] Beam, A. L., & Kohane, I. S. (2016). Big data and machine learning in health care. Jama, 316(21), 2363-2364.
[8] Jiang, F., Jiang, Y., Zhi, H., Dong, Y., Li, H., Ma, S., ... & Wang, Y. (2017). Artificial intelligence in healthcare: past, present and future. Stroke and vascular neurology, 2(4), 230-243.
[9] https://archive.ics.uci.edu/datasets/
[10] Abnoosian K., et al., (2023), “Prediction of diabetes disease using an ensemble of machine learning multi-classifier models”, https://doi.org/10.1186/s12859-023-05465-z
[11] Viswanatha V., et al., (2023), “Diabetes Prediction Using Machine Learning Approach”, DOI: 10.37896/sr10.8/008
[12] Saihood Q., et al., (2023), “A practical framework for early detection of diabetes using ensemble machine learning models”
[13] Ganie S., et al., (2023), “An ensemble learning approach for diabetes prediction using boosting techniques”, https://doi.org/10.3389/fgene.2023.1252159.
[14] Reza M., et al., (2024), “Improving diabetes disease patients classification using stacking ensemble method with PIMA and local healthcare data”, https://doi.org/10.1016/j.heliyon.2024.e24536.
[15] Kalyani K., et al., (2024), “Diabetes Prediction Using Random Forest”.
[16] Khan Q., et al., (2024), “An intelligent diabetes classification and perception framework based on ensemble and deep learning method PIMA Dataset”, DOI: 10.7717/peerj-cs.1914.
[17] Maryam I Mousa Al-Khuzaay, Waleed A Mahmoud Al-Jawher “New Proposed Mixed Transforms: CAW and FAW and Their Application in Medical Image Classification” International Journal of Innovative Computing, Volume 13, Issue 1-2, Pages 15-21, 2022.
[18] Salih M., et al., (2024), “Diabetic Prediction based on Machine Learning Using PIMA Indian Dataset”.
[19] Hamid M Hasan, AL Jouhar, Majid A Alwan “Face recognition using improved FFT based radon by PSO and PCA techniques” International Journal of Image Processing (IJIP), Volume 6, Issue 1, Pages 26-37, 2012.
[20] American Diabetes Association (ADA), "Diagnosis and Classification of Diabetes Mellitus," Diabetes Care, vol. 37, no. 1, pp. S81–S90, 2023.
[21] Rasha Ali Dihin, Ebtesam AlShemmary, Waleed Al-Jawher “Diabetic Retinopathy Classification Using Swin Transformer with Multi Wavelet” Journal of Kufa for Mathematics and Computer, Volume 10, Issue 2, Pages 167-172, 2023.
[22] AHM Al-Heladi, WA Mahmoud, HA Hali, AF Fadhel “Multispectral Image Fusion using Walidlet Transform” Advances in Modelling and Analysis B, Volume 52, Issue 1-2, Pages 1-20, 2009.
[23] Japkowicz, N., & Stephen, S. "The class imbalance problem: A systematic study," Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002.
[24] Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[25] Waleed A Mahmoud Al-Jawher, Sarah H Awad “A proposed brain tumor detection algorithm using Multi wavelet Transform (MWT)” Materials Today: Proceedings, Volume 65, Pages 2731-2737, 2022.
License
Copyright (c) 2025 Journal Port Science Research

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
How to Cite
- Published: 2025-06-11
- Issue: Vol. 8 No. 4 (2025): TRANSACTION ON ENGINEERING TECHNOLOGY AND THEIR APPLICATIONS
- Section: Articles


