Feature Importance-Guided Ensemble Classification for Predicting Recurrence in Differentiated Thyroid Cancer
DOI:
https://doi.org/10.33005/jasid.v1i2.22
Keywords:
ensemble learning, feature selection, machine learning, recurrence prediction, thyroid cancer
Abstract
Accurate prediction of cancer recurrence is critical for improving patient monitoring and personalized treatment planning. In this study, we propose a machine learning framework to predict recurrence in patients with differentiated thyroid cancer using statistically selected clinical features. Feature relevance was assessed using ANOVA for ordinal and numerical variables and the Chi-square test for one-hot encoded categorical variables, which allowed us to identify the most informative predictors. We then trained three distinct classifiers (Random Forest, Logistic Regression, and XGBoost) and combined them using a hard voting ensemble strategy. The proposed ensemble achieved an accuracy of 98.7% on the test set, with particularly strong precision and recall for the recurrent class, indicating its potential clinical utility. All three base classifiers produced identical predictions on the test data, which suggests that the dataset has a strong internal structure and that the feature selection process was effective. This work highlights the value of integrating statistical feature selection with ensemble modeling for robust and interpretable prediction in clinical oncology applications.
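The hard voting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `hard_vote`, the tie-breaking rule, and the toy label lists are assumptions made for illustration, and the labels are hypothetical rather than taken from the study's data. The sketch shows how a majority vote combines per-model predictions, and why an ensemble of models that agree on every sample returns exactly the predictions of each base model.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Majority vote across models for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    Ties go to the tied label that appears first in model order
    (an assumed rule for this sketch).
    """
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")

    voted = []
    # zip(*...) groups the votes cast by every model for one sample
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        voted.append(next(v for v in sample_votes if counts[v] == top))
    return voted


# Hypothetical predictions: 1 = recurrence, 0 = no recurrence
rf_pred = [1, 0, 0, 1, 0]   # stand-in for Random Forest output
lr_pred = [1, 0, 1, 1, 0]   # stand-in for Logistic Regression output
xgb_pred = [1, 0, 0, 0, 0]  # stand-in for XGBoost output

# Models disagree on samples 3 and 4; the majority label wins
ensemble_pred = hard_vote([rf_pred, lr_pred, xgb_pred])

# When every model agrees, the vote reproduces the shared predictions
identical = hard_vote([rf_pred, rf_pred, rf_pred])
```

In the second call the ensemble output equals the input predictions, so when all base classifiers agree on every test sample, the ensemble's test metrics match those of each base classifier.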
References
American Cancer Society, “What Is Thyroid Cancer?” [Online]. Available: https://www.cancer.org/cancer/thyroid-cancer/about/what-is-thyroid-cancer.html
J. R. Sijmons, L. W. M. P. Timmers, et al., “Management of thyroid cancer: a practical guide for clinicians,” Cancer Treatment Reviews, vol. 41, no. 6, pp. 501–510, 2015.
L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
D. W. Hosmer, S. Lemeshow, and R. X. Sturdivant, Applied Logistic Regression, 3rd ed. Wiley, 2013.
T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD '16), San Francisco, CA, USA, 2016, pp. 785–794.
R. A. Fisher, “The Use of Multiple Measurements in Taxonomic Problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936. (Cited for ANOVA)
K. Pearson, “On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that It Can be Reasonably Supposed to Have Arisen from Random Sampling,” Philosophical Magazine, vol. 50, no. 302, pp. 157–175, 1900. (Cited for Chi-square test)
Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms, CRC Press, 2012.