I. Introduction
Bangladesh's Secondary School Certificate (SSC) examination remains one of the most consequential milestones in a student's academic life, yet its outcomes continue to raise concern. The pass rate for the 2022 SSC examination stood at 89.20% (The Daily Star, 2019), a figure that, while seemingly respectable, masks a troubling undercurrent: in 2023, forty-eight institutions recorded a 0% pass rate, down from 109 in 2021 but still alarming (Kolorob, 2020). Numbers like these are not just statistics; they represent thousands of students whose academic trajectories may have been derailed, perhaps avoidably, at a relatively early stage. It is this gap between aggregate success and pockets of near-total failure that motivates the present study, although the underlying causes are more tangled than any single explanation can capture.
One contributing factor, perhaps among several, is the group selection process that occurs at the transition from grade eight to grade nine. At this juncture, students in Bangladesh must choose among three broad academic streams: Science, Business Studies, and Humanities (Kolorob, 2020). The Science stream emphasizes physics, chemistry, biology, and advanced mathematics; Business Studies covers accounting, finance and banking, and entrepreneurship; and Humanities encompasses sociology, geography, history, civics, and economics. This decision, made by adolescents who are often only thirteen or fourteen years old, tends to shape, sometimes irrevocably, the course of their subsequent education and, by extension, their career options. Yet many students approach this choice with limited guidance, relying on peer influence, parental preference, or social prestige rather than a grounded assessment of their own aptitudes and interests. The resulting mismatch between student and stream may well be one of the quieter contributors to the failure rates noted above; establishing this link empirically is one of the aims of this study.
Could a data-driven approach help here? There is growing reason to think so. Over the past decade or so, Educational Data Mining (EDM) and machine learning have been applied with increasing sophistication to a range of pedagogical problems, including predicting course failure (Guo et al., 2019; Pedro et al., 2013) and forecasting dropout (Sara et al., 2015; Verma & Illés, 2019), among other tasks (Romero & Ventura, 2010). Hussain et al. (2019), for instance, used internal assessment data from over ten thousand students and achieved a classification accuracy exceeding 95% using deep learning techniques. Delen (2010), working with a five-year institutional dataset of over sixteen thousand students, demonstrated that data mining methods could predict freshman attrition with roughly 80% accuracy, a result that was considered fairly compelling at the time.
Career and stream prediction specifically has also received some attention, albeit less than one might expect given its importance. Nazareno et al. (2019) applied an artificial neural network to predict career paths from grades in five core subjects, reporting an accuracy of about 74%, which is respectable but leaves considerable room for improvement. Bartnick et al. (1985), in an older but still-cited study, examined how psychological inventories could anticipate medical students' career choices, while Ade and Deshmukh (2014) and Roy et al. (2018) explored ensemble and advanced machine learning techniques for similar prediction tasks. Lichtenberger and George-Jackson (2013) took a slightly different angle, using binary logistic regression on a dataset of 2,700 students to examine how individual, family, and school-level factors influenced students' likelihood of pursuing STEM fields after secondary school. Other work has addressed academic performance prediction more broadly (Saa, 2016; Kumar & Salal, 2019) and the factors influencing it (Ramesh et al., 2012).
What is notably absent from this body of work, at least as far as we have been able to determine, is any system tailored to the specific context of Bangladeshi secondary education, where the choice among Science, Business Studies, and Humanities carries such outsized weight. The techniques that underpin many of these prior studies offer a promising toolkit: data mining concepts (Han, Pei, & Kamber, 2011), missing-data imputation (Eekhout, 2019), naive Bayes classification (Langley et al., 1992; Taheri & Mammadov, 2013), support vector machines (Platt, 1998), random forests (Breiman, 2001), and ensemble bagging methods (Hakim et al., 2019). It is this toolkit, applied to a problem that has so far gone largely unaddressed, that this study sets out to explore. The objectives are threefold: to collect real-world data from Bangladeshi institutions and online sources, to develop a predictive model for stream selection, and to evaluate its performance using multiple metrics. The agenda is admittedly ambitious, but it seems worth attempting given the stakes involved.