Data Modeling
KMFusionNet: An Alternating Tree-Estimator Boosting Framework for Imbalanced Binary Classification
Shazib Sheikh1*, Ethan Debnath2, Zulkarnain Saurav3, Kamruzzaman Mithu3, Swakkhar Shatabda4
Data Modeling 5(1), 1–8. https://doi.org/10.25163/data.5110764
Submitted: 29 September 2024 Revised: 05 December 2024 Accepted: 13 December 2024 Published: 16 December 2024
Abstract
Class imbalance remains one of the more stubborn and frequently underestimated problems in applied machine learning, particularly in domains where the minority class is precisely the one that matters most, such as medical diagnosis, fraud detection, and fault prediction. Conventional classification algorithms tend to optimize for aggregate accuracy, so minority class instances can be misclassified at little cost to the overall metric. This study introduces KMFusionNet, a hybrid adaptive boosting framework that alternates between two complementary tree-based weak learners, the C4.5 Decision Tree and the Extra Tree classifier, within a modified AdaBoost architecture, augmented by an early stopping criterion governed by a stagnation window. The model was evaluated against six established benchmarks (AdaBoost, RUSBoost, SMOTEBoost, EUSBoost, DataBoost, and Easy Ensemble) across 12 imbalanced datasets drawn from the KEEL repository, with imbalance ratios ranging from 1.87 to 41.03. Performance was measured using the area under the receiver operating characteristic curve (auROC), with each experiment repeated across 10 independent runs under 5-fold cross-validation. KMFusionNet achieved the highest auROC on 11 of the 12 benchmark datasets, with particularly pronounced gains at higher imbalance ratios. Its computational cost remained markedly lower than that of approaches using Random Forest or SVM as base learners, suggesting a practical efficiency advantage. These findings indicate that combining lightweight, structurally diverse tree classifiers within a boosting mechanism can meaningfully improve minority class discrimination without the overhead of more complex ensembles.
Keywords: class imbalance; adaptive boosting; ensemble learning; decision tree; extra tree classifier
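The alternating scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes discrete AdaBoost with labels in {-1, +1}, uses scikit-learn's entropy-based `DecisionTreeClassifier` as a stand-in for C4.5 (scikit-learn does not provide C4.5), and measures stagnation on the ensemble's training error, which the abstract does not specify. The names `fit_alternating_boost` and `decision_scores`, and the values of `window` and `max_depth`, are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier


def fit_alternating_boost(X, y, n_rounds=50, window=5, max_depth=3, seed=0):
    """Discrete AdaBoost with alternating tree learners; y must hold -1/+1 labels."""
    y = np.asarray(y)
    n = len(y)
    w = np.full(n, 1.0 / n)            # uniform starting weights
    models, alphas = [], []
    margin = np.zeros(n)               # running weighted vote on the training set
    best_err, stale = np.inf, 0
    for t in range(n_rounds):
        if t % 2 == 0:
            # Assumption: entropy-based CART stands in for C4.5.
            clf = DecisionTreeClassifier(criterion="entropy",
                                         max_depth=max_depth,
                                         random_state=seed + t)
        else:
            clf = ExtraTreeClassifier(max_depth=max_depth,
                                      random_state=seed + t)
        clf.fit(X, y, sample_weight=w)
        pred = clf.predict(X)
        err = float(np.sum(w[pred != y]))   # weighted training error
        improved = False
        if err < 0.5:                       # keep only better-than-chance learners
            err = max(err, 1e-10)           # avoid division by zero
            alpha = 0.5 * np.log((1.0 - err) / err)
            w = w * np.exp(-alpha * y * pred)
            w /= w.sum()
            models.append(clf)
            alphas.append(alpha)
            margin += alpha * pred
            ens_err = float(np.mean(np.sign(margin) != y))
            if ens_err < best_err:
                best_err, improved = ens_err, True
        # Assumption: stop after `window` consecutive rounds without improvement.
        stale = 0 if improved else stale + 1
        if stale >= window:
            break
    return models, alphas


def decision_scores(models, alphas, X):
    """Weighted vote of the fitted learners; larger means more positive."""
    return sum(a * m.predict(X) for m, a in zip(models, alphas))


if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic 9:1 imbalanced data, not one of the KEEL datasets.
    X, y01 = make_classification(n_samples=600, n_features=10,
                                 weights=[0.9, 0.1], random_state=0)
    y = np.where(y01 == 1, 1, -1)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, stratify=y, test_size=0.3, random_state=0)
    models, alphas = fit_alternating_boost(X_tr, y_tr)
    auc = roc_auc_score(y_te, decision_scores(models, alphas, X_te))
    print(f"{len(models)} learners kept, test auROC = {auc:.3f}")
```

Alternating by round parity is only one way to interleave the two learners. The paper's protocol of 10 runs under 5-fold cross-validation would wrap a loop of this kind around repeated stratified splits.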
References
Alcalá-Fdez, J., Fernandez, A., Luengo, J., Derrac, J., García, S., Sánchez, L., & Herrera, F. (2011). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2–3), 255–287.
Bühlmann, P., & Yu, B. (2003). Boosting with the L2 loss: Regression and classification. Journal of the American Statistical Association, 98(462), 324–339.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357.
Chawla, N. V., Lazarevic, A., Hall, L. O., & Bowyer, K. W. (2003). SMOTEBoost: Improving prediction of the minority class in boosting. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases, 107–119.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
De Souza, É., & Matwin, S. (2011). Extending AdaBoost to iteratively vary its base classifiers. Advances in Artificial Intelligence, 384–389.
Farid, D. M., Al-Mamun, M. A., Manderick, B., & Nowe, A. (2016). An adaptive rule-based classifier for mining big biological data. Expert Systems with Applications, 64, 305–316.
Farid, D. M., Zhang, L., Rahman, C. M., Hossain, M., & Strachan, R. (2014). Hybrid decision tree and naïve Bayes classifiers for multi-class classification tasks. Expert Systems with Applications, 41(4), 1937–1946.
Farid, D. M., Zhang, L., Hossain, A., Rahman, C. M., Strachan, R., Sexton, G., & Dahal, K. (2013). An adaptive ensemble classifier for mining concept drifting data streams. Expert Systems with Applications, 40(15), 5895–5906.
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. International Conference on Machine Learning, 96, 148–156.
Galar, M., Fernández, A., Barrenechea, E., & Herrera, F. (2013). EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition, 46(12), 3460–3471.
Geurts, P., Ernst, D., & Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1), 3–42.
Guo, H., & Viktor, H. L. (2004). Learning from imbalanced data sets with boosting and data generation: The DataBoost-IM approach. ACM SIGKDD Explorations Newsletter, 6(1), 30–39.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Jiang, W. (2004). Process consistency for AdaBoost. Annals of Statistics, 32(1), 13–29.
Kubat, M., & Matwin, S. (1997). Addressing the curse of imbalanced training sets: One-sided selection. International Conference on Machine Learning, 97, 179–186.
Laurikkala, J. (2001). Improving identification of difficult small classes by balancing class distribution. Artificial Intelligence in Medicine, 63–66.
Liaw, A., & Wiener, M. (2002). Classification and regression by random forest. R News, 2(3), 18–22.
Liu, X.-Y., Wu, J., & Zhou, Z.-H. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2), 539–550.
Mani, I., & Zhang, J. (2003). kNN approach to unbalanced data distributions: A case study involving information extraction. Proceedings of Workshop on Learning from Imbalanced Datasets, 126.
Quinlan, J. R. (1996). Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4, 77–90.
Rayhan, F., Ahmed, S., Shatabda, S., Farid, D. M., Mousavian, Z., Dehzangi, A., & Rahman, M. S. (2017). iDTI-ESBoost: Identification of drug target interaction using evolutionary and structural features with boosting. arXiv preprint arXiv:1707.00994.
Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. (2010). RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans, 40(1), 185–197.
Sun, Y., Wong, A. K. C., & Kamel, M. S. (2009). Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(4), 687–719.
Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., & Zhou, Y. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48(5), 1623–1637.
Yao, Y., Rosasco, L., & Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation, 26(2), 289–315.
Yen, S.-J., & Lee, Y.-S. (2009). Cluster-based under-sampling approaches for imbalanced data distributions. Expert Systems with Applications, 36(3), 5718–5727.