A Comparative Study of Principal Component Analysis with Ensemble Learning for Classification of Medical Data
DOI:
https://doi.org/10.33102/mjosht.572

Keywords:
Principal Component Analysis, Random Forest, Extremely Randomized Trees, Dimensionality Reduction

Abstract
Dimensionality reduction is a critical component in the analysis of medical data, particularly when addressing challenges such as multicollinearity, noise, and high-dimensional feature spaces that can degrade classification performance. While principal component analysis (PCA) is a traditional choice, its utility on medical datasets is often hindered by outliers, corrupted observations, and low interpretability, as principal components are linear combinations of all original variables. This research compares PCA, robust PCA (RPCA), and sparse PCA (SPCA), each integrated with random forest (RF) and extremely randomized trees (ERT). A simulation study revealed that while all PCA variants struggle under low class separation, RPCA and SPCA significantly outperform standard PCA in the presence of outliers. The study then applied these methods to a diabetes dataset that underwent thorough preprocessing, including median imputation, normalization, and the synthetic minority over-sampling technique (SMOTE) to address class imbalance. Model optimization involved cross-validating the RPCA regularization parameter and the SPCA sparsity parameter against the area under the receiver operating characteristic (ROC) curve (AUC), while RF and ERT hyperparameters were tuned with a two-stage random-then-grid search. Empirical results demonstrate that the RPCA-ERT model performs best, achieving an accuracy of 0.8954 and a sensitivity of 0.9434, underscoring its effectiveness in handling contaminated medical data.
License
Copyright (c) 2026 Siti Amirah Batrisya Mohd Rahaizi, Wendy Ling Shinyie, Soo-Fen Fam

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright of this article is vested in the author(s), who grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, unless otherwise stated.