Cancer Classification Challenges in High-Dimensional Microarray Data: An In-Depth Exploration of Machine Learning Models

Wafa' Qasim Al-Jamal; Sakinah Ali Pitchay; Farida Ridzuan; Muhammad Harith Noor Azam

doi:10.33102/mjosht.531

Authors

Wafa' Qasim Al-Jamal Faculty of Science and Technology, Universiti Sains Islam Malaysia. 71800 Nilai, Negeri Sembilan, Malaysia.
Sakinah Ali Pitchay Faculty of Science and Technology, Universiti Sains Islam Malaysia. 71800 Nilai, Negeri Sembilan, Malaysia.
Farida Ridzuan Cybersecurity & Systems Research Unit, Universiti Sains Islam Malaysia, 71800, Negeri Sembilan, Malaysia.
Muhammad Harith Noor Azam Cybersecurity & Systems Research Unit, Universiti Sains Islam Malaysia, 71800, Negeri Sembilan, Malaysia.

Keywords:

Microarray, high dimensionality, biomedical data, cancer classification

Abstract

Microarray gene expression profiling has transformed biomedical research by enabling large-scale, parallel analysis of thousands of genes. Despite this promise, cancer classification using Machine Learning (ML) on microarray data still faces critical challenges, chiefly high dimensionality, limited sample sizes, and severe class imbalance. These factors contribute to overfitting, poor generalization, and inflated performance metrics, hindering the clinical translation of models. This Structured Literature Review (SLR) examines ML-based cancer classification studies published between 2015 and 2025, a period marked by the emergence of deep learning, synthetic data generation, and biologically informed modeling. Using a transparent selection protocol, we synthesize findings from over 20 peer-reviewed studies. The review focuses on three methodological pillars: biologically grounded feature selection, constrained data augmentation, and robust performance evaluation. We identify a growing trend toward hybrid feature selection methods that balance statistical relevance and biological interpretability, although comparative benchmarking across datasets remains limited. Data augmentation techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and Generative Adversarial Networks (GANs), are increasingly being adopted, but they often lack biological validation, which raises concerns about the plausibility of synthetic gene profiles. To address this, we recommend integrating pathway-level constraints and gene ontology checks during the augmentation process. Furthermore, we observe that many studies disproportionately emphasize accuracy, which can misrepresent a model's efficacy in imbalanced settings. Metrics such as the Matthews Correlation Coefficient (MCC), F1-score, and precision-recall curves offer more reliable insights and should be standardized across evaluations. External validation using independent datasets is also essential, both to assess generalizability and to mitigate dataset-specific bias. Based on these findings, we present a conceptual hybrid framework that integrates biologically informed feature selection, biologically constrained data augmentation, and balanced evaluation protocols. This framework is intended to enhance reproducibility, biological fidelity, and translational reliability in machine learning-based cancer diagnostics, thereby contributing to the advancement of precision oncology.
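
The abstract's point that accuracy can mislead under class imbalance can be made concrete with a small worked example. The sketch below is illustrative only and is not taken from the reviewed studies: the cohort (95 normal samples, 5 tumor samples) and both classifiers are hypothetical, and the helper names are assumptions introduced here. It compares accuracy, F1-score, and MCC for a classifier that always predicts the majority class against one that detects most tumors at the cost of a few false alarms.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    # F1 ignores true negatives; defined as 0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient; conventionally 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced cohort: 95 normal (0) and 5 tumor (1) samples.
y_true = [0] * 95 + [1] * 5

# Hypothetical classifier A: always predicts the majority class.
y_majority = [0] * 100

# Hypothetical classifier B: 5 false alarms, 4 of 5 tumors detected.
y_model = [0] * 90 + [1] * 5 + [1] * 4 + [0] * 1

for name, y_pred in [("majority", y_majority), ("model", y_model)]:
    counts = confusion_counts(y_true, y_pred)
    print(f"{name:9s} accuracy={accuracy(*counts):.3f} "
          f"F1={f1_score(*counts):.3f} MCC={mcc(*counts):.3f}")
```

In this constructed case the majority-class predictor reaches the higher accuracy (0.95 versus 0.94) while detecting no tumors at all, and only F1 and MCC separate the two classifiers. In practice, library implementations such as those in scikit-learn would normally be used instead of hand-written helpers.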
