Data Modeling
HCNN-LSTM with BERT Embeddings and TomekLinks for Imbalanced News Text Classification
Kamruzzaman Mithu 1*, Md. Nesar Uddin 1, Md. Ataur Rahman 1, Sayed Rokibul Hossain 1, Mohammad Nurul Huda 1
Data Modeling 1 (1) 1-8 https://doi.org/10.25163/data.1110689
Submitted: 07 January 2026 Revised: 02 March 2026 Accepted: 10 March 2026 Published: 11 March 2026
Abstract
Text classification remains a central task in natural language processing, yet its effectiveness can diminish markedly when datasets contain uneven category distributions. Minority categories, despite often carrying important or nuanced information, tend to be overlooked or misclassified, which poses a persistent challenge for many real-world text analytics applications. In this study, we address this issue with a hybrid deep learning framework, HCNN-LSTM, designed for imbalanced news text classification. The model integrates convolutional neural networks and long short-term memory networks with contextual BERT embeddings and the TomekLinks undersampling technique. The BBC News dataset, comprising 2,225 articles across five categories (business, entertainment, politics, sport, and technology), served as the experimental benchmark. Before training, a structured preprocessing pipeline was applied: text normalization, stop-word removal, lemmatization, and feature preparation, followed by class balancing. This pipeline reduces noise while preserving meaningful linguistic patterns in the dataset. Architecturally, the hybrid model captures both short-range lexical cues and longer contextual dependencies, while BERT-based embeddings provide richer semantic representations of the textual content. On the imbalanced dataset, the model achieved an accuracy of 0.95 and a macro-averaged F1-score of 0.95; on a balanced version of the dataset, both metrics improved to 0.99. Additional analysis using mean squared error indicated that the proposed hybrid architecture produced the lowest prediction error among all compared approaches.
Taken together, these results suggest that combining contextual embeddings with hybrid neural architectures may provide a practical and effective strategy for improving classification performance in imbalanced text datasets.
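The class-balancing step described above relies on TomekLinks undersampling: a Tomek link is a pair of opposite-class samples that are each other's nearest neighbors, and removing the majority-class member of each link cleans the class boundary. The following is a minimal pure-Python sketch of this idea, not the paper's implementation; the one-dimensional feature space, Euclidean distance, and all names and data are illustrative assumptions.

```python
# Minimal sketch of TomekLinks undersampling on 1-D numeric features.
# Illustrative only; real pipelines would operate on document vectors
# (e.g. BERT embeddings) with a proper distance metric.

def nearest_neighbor(i, points):
    """Index of the closest point to points[i], excluding i itself."""
    best, best_d = None, float("inf")
    for j, p in enumerate(points):
        if j == i:
            continue
        d = abs(points[i] - p)
        if d < best_d:
            best, best_d = j, d
    return best

def tomek_links_undersample(X, y, majority_label):
    """Drop the majority-class member of every Tomek link.

    A Tomek link is a pair (i, j) with y[i] != y[j] where i and j are
    mutual nearest neighbors; removing the majority member thins the
    majority class exactly where it crowds the decision boundary.
    """
    to_drop = set()
    for i in range(len(X)):
        j = nearest_neighbor(i, X)
        if y[i] != y[j] and nearest_neighbor(j, X) == i:
            if y[i] == majority_label:
                to_drop.add(i)
            if y[j] == majority_label:
                to_drop.add(j)
    keep = [k for k in range(len(X)) if k not in to_drop]
    return [X[k] for k in keep], [y[k] for k in keep]

# Toy data: the majority class (0) crowds the boundary near the single
# minority sample at 2.1, so the point at 2.0 forms a Tomek link with it.
X = [0.0, 1.0, 2.0, 2.1, 5.0]
y = [0, 0, 0, 1, 0]
X_res, y_res = tomek_links_undersample(X, y, majority_label=0)
```

In this toy run only the majority point at 2.0 is removed, which mirrors how TomekLinks trims borderline majority samples rather than resampling the whole dataset.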
Keywords: Text classification; imbalanced data; BBC News dataset; BERT embeddings; CNN-LSTM; HCNN-LSTM; TomekLinks; natural language processing; deep learning; news categorization
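The macro-averaged F1-score reported in the abstract gives each class equal weight regardless of its size, which is why it is the preferred headline metric for imbalanced data. A minimal sketch of that computation follows; the helper name and the example labels (which echo the BBC categories) are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of macro-averaged F1: compute per-class F1 from
# one-vs-rest counts, then average with equal weight per class.

def macro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        # One-vs-rest counts for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        f1_scores.append(f1)
    # Unweighted mean: a small class counts as much as a large one.
    return sum(f1_scores) / len(f1_scores)

y_true = ["sport", "sport", "tech", "business", "tech"]
y_pred = ["sport", "tech", "tech", "business", "tech"]
score = macro_f1(y_true, y_pred)
```

Because the average is unweighted, a model that ignores a minority class is penalized here even when overall accuracy stays high, which is the behavior the abstract's evaluation is designed to expose.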