A COMPARATIVE STUDY OF K-NEAREST NEIGHBORS AND RANDOM FOREST CLASSIFIERS ON DIABETES

I Putu Gede Abdi Sudiatmika; Putu Satya Saputra; Mochammad Rifki Ulil albaab

Authors

I Putu Gede Abdi Sudiatmika Politeknik Negeri Bali
Putu Satya Saputra Politeknik Negeri Bali
Mochammad Rifki Ulil albaab Politeknik Negeri Jember

Keywords

Diabetes Prediction, Machine Learning, K-Nearest Neighbors, Random Forest

Abstract

The early detection of diabetes is crucial for effective management and for preventing the complications associated with this chronic condition. Machine learning techniques offer promising approaches for developing predictive models that can support medical decision-making. This study investigates the performance of two popular classification algorithms, K-Nearest Neighbors (KNN) and Random Forest, in predicting diabetes from a dataset of key health indicators, including glucose level, blood pressure, insulin, BMI, and age. The primary goal is to assess which model provides higher predictive accuracy and reliability for medical applications.

The research applies data preprocessing steps, such as handling missing values and standardizing feature scales, to ensure model consistency. Hyperparameter tuning is performed for both algorithms. For KNN, the number of neighbors and the weighting scheme are tuned, while Random Forest is tuned over the number of trees, the maximum depth, and the minimum number of samples required for a split. These optimizations are implemented through grid search with cross-validation.

The models are evaluated using accuracy, precision, recall, and F1 score. Results show that the KNN model achieves an accuracy of 70.13%, while Random Forest reaches a higher accuracy of 75.97%. Random Forest also achieves better recall and F1 scores, indicating a stronger ability to correctly identify positive diabetes cases while balancing precision and recall. This suggests that Random Forest is robust in handling complex, multivariate data, which is advantageous for classifying diabetes cases. In contrast, KNN shows limited sensitivity, potentially because it relies on distance metrics, which may not fully capture the dataset's inherent complexity.

This study concludes that Random Forest is the more suitable model for diabetes prediction on this dataset, offering reliable and interpretable results that could support healthcare providers in early diagnosis. Future research may explore other advanced algorithms, data balancing methods, and feature engineering techniques to further improve prediction accuracy.
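The workflow described in the abstract (imputing missing values, standardizing features, tuning both classifiers with grid search cross-validation, and scoring them on accuracy, precision, recall, and F1) can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' implementation: the data are synthetic, and the imputation strategy, the parameter grids, the 80/20 split, the 5-fold setting, and the helper name `make_search` are all assumptions, so the printed scores will not match the figures reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in data with five numeric features, loosely
# mirroring glucose, blood pressure, insulin, BMI, and age.
X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=1, weights=[0.65, 0.35], random_state=0)

# ASSUMPTION: knock out about 5% of the values to imitate missing measurements.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)


def make_search(estimator, param_grid):
    """Hypothetical helper: impute, scale, then classify inside one pipeline
    so that preprocessing is re-fitted on each cross-validation fold."""
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # assumed strategy
        ("scale", StandardScaler()),
        ("clf", estimator),
    ])
    return GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")


# ASSUMPTION: small illustrative grids over the hyperparameters named in the abstract.
searches = {
    "KNN": make_search(
        KNeighborsClassifier(),
        {"clf__n_neighbors": [3, 5, 7, 9, 11],
         "clf__weights": ["uniform", "distance"]}),
    "Random Forest": make_search(
        RandomForestClassifier(random_state=0),
        {"clf__n_estimators": [50, 100],
         "clf__max_depth": [None, 5],
         "clf__min_samples_split": [2, 5]}),
}

results = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)  # uses the best pipeline refitted on the training set
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }
    print(name, search.best_params_, results[name])
```

Placing the imputer and scaler inside the pipeline keeps test-fold statistics out of the training step during cross-validation, which matters for KNN in particular because its distance computations are sensitive to feature scale.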
