Data Modeling

Mathematical and Computational Data Modeling
RESEARCH ARTICLE   (Open Access)

Beyond Grades: A Machine Learning-Based Decision Support System for Academic Stream Selection in Bangladeshi Secondary Schools

Md. Mobarak Hossain


Data Modeling 3 (1) 1-8 https://doi.org/10.25163/data.3110802

Submitted: 24 February 2022 Revised: 12 April 2022  Published: 22 April 2022 


Abstract

Choosing between Science, Business Studies, and Humanities is, for a thirteen-year-old in Bangladesh, a decision with outsized consequences, and one that, more often than not, is still made on the basis of prior exam results, family expectations, or the lingering prestige attached to the Science stream. This review-style study examines an attempt to bring something more systematic to that process: a machine learning–based decision support system trained on data from 364 students across three institutions. Twelve relatively simple attributes, covering academic history, study habits, and personal preferences such as feared subjects or hobbies, were fed into three classifiers: Naïve Bayes, Sequential Minimal Optimization, and Random Forest. The results were, perhaps unsurprisingly, modest but encouraging: Random Forest edged ahead with 84.9% accuracy, followed closely by Naïve Bayes (82.76%) and SMO (80.88%), figures that compare reasonably against prior career-prediction work despite the considerably smaller dataset. What stands out is less the headline number than the underlying argument: that even a handful of easily collectible, non-academic indicators can meaningfully supplement (not replace) the guidance students currently receive. Whether this constitutes a genuine step toward more equitable, individualized stream allocation, or simply a promising proof of concept awaiting larger and more diverse data, remains an open question this paper invites readers to weigh for themselves.

Keywords: Artificial Intelligence, Machine Learning, Education, Data Mining

1. Introduction

Bangladesh's Secondary School Certificate (SSC) examination remains one of the most consequential milestones in a student's academic life, and yet its outcomes continue to raise concern. The pass rate for the 2022 SSC examination stood at 89.20% (The Daily Star, 2019), a figure that, while seemingly respectable, masks a troubling undercurrent: in 2023, forty-eight institutions recorded a 0% pass rate, fewer than the 109 recorded in 2021 but still an alarming number (Kolorob, 2020). Numbers like these are not just statistics; they represent thousands of students whose academic trajectories may have been derailed, perhaps avoidably, at a relatively early stage. It is this gap between aggregate success and pockets of near-total failure that motivates the present study, though the underlying causes are, admittedly, more tangled than any single explanation can capture.

One contributing factor—perhaps among several—is the group selection process that occurs at the transition from grade eight to grade nine. At this juncture, students in Bangladesh must choose among three broad academic streams: Science, Business Studies, and Humanities (Kolorob, 2020). The Science stream emphasizes physics, chemistry, biology, and advanced mathematics; Business Studies covers accounting, finance and banking, and entrepreneurship; and Humanities encompasses sociology, geography, history, civics, and economics. This decision, made by adolescents who are often only thirteen or fourteen years old, tends to shape—sometimes irrevocably—the course of their subsequent education and, by extension, their career options. Yet many students approach this choice with limited guidance, relying on peer influence, parental preference, or social prestige rather than any grounded assessment of their own aptitudes and interests. The resulting mismatch between student and stream may well be one of the quieter contributors to the failure rates noted above, though establishing this link empirically is part of what this study hopes to do.

Could a data-driven approach help here? There is growing reason to think so. Educational Data Mining (EDM) and machine learning have, over the past decade or so, been applied with increasing sophistication to a range of pedagogical problems—predicting course failure (Guo et al., 2019; Pedro et al., 2013), forecasting dropout (Sara et al., 2015; Verma & Illés, 2019), and more (Romero & Ventura, 2010). Hussain et al. (2019), for instance, used internal assessment data from over ten thousand students and achieved a classification accuracy exceeding 95% using deep learning techniques. Delen (2010), working with a five-year institutional dataset of over sixteen thousand students, demonstrated that data mining methods could predict freshman attrition with roughly 80% accuracy—a result that, at the time, was considered fairly compelling.

Career and stream prediction specifically has also received some attention, albeit less than one might expect given its importance. Nazareno et al. (2019) applied an artificial neural network to predict career paths from grades in five core subjects, reporting an accuracy of about 74%—respectable, though leaving considerable room for improvement. Bartnick et al. (1985), in an older but still-cited study, examined how psychological inventories could anticipate medical students' career choices, while Ade and Deshmukh (2014) and Roy et al. (2018) explored ensemble and advanced machine learning techniques for similar prediction tasks. Lichtenberger and George-Jackson (2013) took a slightly different angle, using binary logistic regression on a dataset of 2,700 students to examine how individual, family, and school-level factors influenced students' likelihood of pursuing STEM fields after secondary school. Other work has addressed academic performance prediction more broadly (Saa, 2016; Kumar & Salal, 2019) and the factors influencing it (Ramesh et al., 2012).

What is notably absent from this body of work, however—at least as far as we have been able to determine—is any system tailored to the specific context of Bangladeshi secondary education, where the Science/Business Studies/Humanities decision carries such outsized weight. The techniques that underpin many of these prior studies—data mining concepts (Han, Pei, & Kamber, 2011), missing-data imputation (Eekhout, 2019), naive Bayes classification (Langley et al., 1992; Taheri & Mammadov, 2013), support vector machines (Platt, 1998), random forests (Breiman, 2001), and ensemble bagging methods (Hakim et al., 2019)—offer a promising toolkit. It is this toolkit, applied to a problem that has so far gone largely unaddressed, that this study sets out to explore. The objectives, broadly, are to collect real-world data from Bangladeshi institutions and online sources, develop a predictive model for stream selection, and evaluate its performance using multiple metrics—an admittedly ambitious agenda, but one that seems worth attempting given the stakes involved.

2. Methodology

The overall methodology of the system is shown in Fig. 1, and each step is described below.

2.1. Dataset & Features

This study required data from students who had already completed both the secondary and higher secondary levels, because only for such students is it possible to judge, in hindsight, whether the group they chose at the secondary level suited them. The dataset was collected from educational institutions in Bangladesh, of which only one is named here: New Model Degree College, Dhaka. It contains 364 instances in total, with 12 features and one class attribute. Fig. 2 shows the share of the data belonging to each group as a percentage. A complete list of the attributes obtained from the student databases is given in Table 1. The data were collected in such a way that every attribute could be represented in nominal form.

2.2. Data Pre-processing

Real-world raw data are highly susceptible to noise, missing values, and inconsistencies because of their size and their multiple, heterogeneous sources. Low-quality data lead to lower accuracy in mining results. As mentioned earlier, building this model requires data from students who, judged by their performance, can be considered well matched to their respective groups. For this reason, the records of students who changed groups at the higher secondary level were removed. Changes in attendance percentage between the secondary and higher secondary levels were also examined. Variations in attendance were found to have little effect on students' performance, although attendance in class 8 has a slight influence on students' attentiveness. A very small number of values were missing. These were handled with the “Most Frequent” method, which replaces a missing value of a given attribute with the most frequent value of that attribute among the available cases. Table 2 displays the number of instances in each group.
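The "Most Frequent" step can be illustrated with a minimal sketch. The toy records, the column meanings, and the use of `None` to mark a missing answer are assumptions made for illustration; the study does not say which tool or encoding was used.

```python
from collections import Counter

MISSING = None  # assumption: a missing answer is stored as None


def most_frequent_impute(rows, missing=MISSING):
    """Fill each missing cell with the most frequent observed value of its column."""
    n_cols = len(rows[0])
    modes = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not missing]
        # most_common(1) returns [(value, count)]; a column with no observed
        # values at all would raise IndexError and needs separate handling.
        modes.append(Counter(observed).most_common(1)[0][0])
    return [
        [modes[col] if value is missing else value for col, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy records: (hobby, feared subject)
records = [
    ["reading", "math"],
    ["cricket", None],
    ["reading", "math"],
    [None, "english"],
]
filled = most_frequent_impute(records)
```

Because every attribute in the dataset is nominal, the mode is a natural choice here; a mean or median would not be defined for values such as hobbies.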

2.3. Model Building Phase

Fig. 1 shows the workflow for predicting a student's group (Science, Business Studies, or Humanities). The workflow is divided into three major steps: the data are collected and pre-processed; classification is carried out with three machine learning algorithms, namely Naïve Bayes, Sequential Minimal Optimization (SMO), and Random Forest; and the performance of each classifier is evaluated.

2.4. Classification Using Machine Learning Method

1) Naïve Bayes Classifier: The Naïve Bayes (NB) classifier is based on the assumption that each feature has only the class as a parent. It is widely used for classification because it is fast, effective, and simple to construct. It also copes well with high-dimensional data because the features are assumed to be conditionally independent given the class; that is, once the class is known, the value of one attribute has no effect on the probabilities of the others. Naïve Bayes applies Bayes' theorem. Let y denote the class of an observation and X its feature vector. By Bayes' rule, the posterior probability of class y given X is

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)} \tag{1}$$

Here X is given as

$$X = (x_1, x_2, x_3, \ldots, x_n) \tag{2}$$

where x_1, x_2, ..., x_n are the individual features. Substituting these features into equation (1) and applying the independence assumption gives

$$P(y \mid x_1, x_2, \ldots, x_n) = \frac{P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)\,P(y)}{P(x_1, x_2, \ldots, x_n)} \tag{3}$$

For a given observation, the denominator is the same for every class and can be dropped, which leaves the proportionality

$$P(y \mid x_1, x_2, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{4}$$

The predicted class is the one that maximizes the right-hand side. This holds whether the class variable has two outcomes or, as in the present three-group problem, more than two:

$$\hat{y} = \arg\max_{y}\; P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{5}$$

Equation (5) therefore yields the class with the highest posterior probability.
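The decision rule in equation (5) can be sketched for nominal attributes as follows. This is an illustrative implementation, not the one used in the study: the toy records are hypothetical, the sum of logarithms replaces the product to avoid numerical underflow, and the add-one (Laplace) smoothing controlled by `alpha` is an added assumption so that unseen attribute values do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count classes and per-class attribute values for nominal features."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
            feature_values[i].add(v)
    return class_counts, value_counts, feature_values


def predict_nb(model, row, alpha=1.0):
    """Return argmax_y of log P(y) + sum_i log P(x_i | y), as in equation (5)."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for y, n_y in class_counts.items():
        score = math.log(n_y / total)                    # log prior P(y)
        for i, v in enumerate(row):
            k = len(feature_values[i])                   # distinct values of feature i
            count = value_counts[(i, y)][v]
            score += math.log((count + alpha) / (n_y + alpha * k))  # smoothed P(x_i | y)
        if score > best_score:
            best_class, best_score = y, score
    return best_class


# Hypothetical toy records: (favourite subject, study time)
rows = [
    ("math", "long"), ("math", "long"),
    ("accounting", "short"), ("accounting", "long"),
    ("history", "short"), ("history", "short"),
]
labels = ["Science", "Science", "Business Studies", "Business Studies",
          "Humanities", "Humanities"]

model = train_nb(rows, labels)
prediction = predict_nb(model, ("math", "long"))
```

With these toy counts the smoothed scores are proportional to 0.15 for Science, about 0.033 for Business Studies, and about 0.017 for Humanities, so the sketch predicts Science for a student who likes mathematics and studies for long hours.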

2) SMO: We used Sequential Minimal Optimization (SMO) in our system because of its low memory and time requirements. Training a Support Vector Machine (SVM) requires solving a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of the smallest possible QP sub-problems, each of which can be solved analytically. Platt (1998) reports that this can make training more than 1000 times faster than the earlier chunking algorithm on some sparse data sets. Because its memory consumption is low, SMO can also handle large training sets.
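For reference, the QP problem in question is the standard soft-margin SVM dual, shown here in its usual textbook form rather than as notation taken from this study. The symbols are assumptions local to this block: α_i are the Lagrange multipliers, y_i ∈ {−1, +1} are binary labels (unrelated to the class variable y of equations 1 to 5), K is the kernel function, C is the regularization constant, and N is the number of training instances.

```latex
\max_{\alpha}\; \sum_{i=1}^{N} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i\, y_j\, K(x_i, x_j)\, \alpha_i\, \alpha_j
\qquad \text{subject to} \qquad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{N} y_i\, \alpha_i = 0 .
```

The equality constraint means that no single multiplier can be changed on its own, so the smallest sub-problem involves two multipliers; SMO repeatedly picks such a pair and optimizes it in closed form. Since this formulation is binary, a three-group task has to be handled by combining several binary classifiers (for example pairwise); which scheme was used in the study is not stated.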

3) Random Forest: Random Forest is a supervised ensemble technique that builds many decision trees and combines them to improve accuracy. It can be used for both classification and regression. Each tree is grown on a random sample of the training data, and each tree then makes its own prediction. For classification, the forest's final prediction is the majority vote over the predictions of the individual trees.
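The sample-then-vote idea can be sketched in a few lines. This is a deliberately simplified stand-in, not a full random forest: each "tree" below is only a one-attribute rule fitted on a bootstrap sample with one randomly chosen feature, and the toy records, `n_trees`, and `seed` are all hypothetical.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """One-attribute rule: map each value of `feature` to its most common class."""
    by_value = {}
    for row, y in zip(rows, labels):
        by_value.setdefault(row[feature], Counter())[y] += 1
    rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    default = majority_vote(labels)  # fallback for values unseen in this sample
    return feature, rule, default


def fit_ensemble(rows, labels, n_trees=25, seed=0):
    """Fit each learner on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feature = rng.randrange(n_features)          # random feature for this learner
        ensemble.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return ensemble


def predict(ensemble, row):
    """Collect one vote per learner and return the majority class."""
    votes = [rule.get(row[feature], default) for feature, rule, default in ensemble]
    return majority_vote(votes)


# Hypothetical toy records: (favourite subject, study time)
rows = [
    ("math", "long"), ("math", "long"), ("accounting", "short"),
    ("accounting", "long"), ("history", "short"), ("history", "short"),
]
labels = ["Science", "Science", "Business Studies", "Business Studies",
          "Humanities", "Humanities"]

ensemble = fit_ensemble(rows, labels, n_trees=11)
vote = predict(ensemble, ("math", "long"))
```

Individual learners fitted on different samples make different mistakes, and the vote tends to cancel some of them out, which is the intuition behind the ensemble's robustness to noisy attributes.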

2.5. Classification Performance Evaluation

Four evaluation metrics are used to evaluate the proposed classification model (Hakim et al., 2019). They are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{6}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{9}$$

where TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives, respectively. Recall is also known as sensitivity.
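Because the task has three classes, precision, recall, and F1 are most naturally computed per class, treating each group in turn as the positive class. The sketch below shows one way to do this from a confusion matrix; the matrix values are invented for illustration and are not results from the study.

```python
def per_class_metrics(confusion, classes):
    """confusion[i][j] = number of instances of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(len(classes)))
    report = {}
    for k, name in enumerate(classes):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                   # true class k, predicted otherwise
        fp = sum(row[k] for row in confusion) - tp    # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[name] = (precision, recall, f1)
    return correct / total, report


classes = ["Science", "Business Studies", "Humanities"]
# Hypothetical confusion matrix (rows: true class, columns: predicted class)
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [1, 1, 3],
]
accuracy, report = per_class_metrics(confusion, classes)
```

In this made-up example the overall accuracy is 0.72, while recall is 0.8 for Science and only 0.6 for Humanities, which shows how a single accuracy figure can hide weaker performance on the smallest class.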

3. Result & Discussion

Before turning to the numbers themselves, it is worth recalling what exactly was being fed into the three classifiers. The twelve attributes listed in Table 1—ranging from fairly mundane items such as roll number in class 8 to more behaviourally revealing ones like hobby, favourite indoor game, study time, and the subject a student feared most—were used as predictors, with the final stream (Science, Business Studies, or Humanities) serving as the class label. Table 2 shows how these 364 records were distributed across the three groups: 142 in Science, 161 in Business Studies, and only 61 in Humanities. That imbalance is not trivial, and we will return to it shortly. The dataset was split 70/30 for training and testing, and 10-fold cross-validation was layered on top of that to get a more stable read on accuracy, since with a sample this size a single train-test split could easily have been a lucky—or unlucky—draw.
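The 10-fold procedure can be sketched as an index-splitting routine. This is a plain, unstratified version under assumed settings (`seed`, shuffling before splitting); the study does not specify how the folds were drawn or exactly how cross-validation was combined with the 70/30 split.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists so that each index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(364, k=10))
fold_sizes = sorted(len(test) for _, test in splits)
```

With 364 records, four folds hold 37 instances and six hold 36, so each model is trained on roughly 327 records and tested on the rest. Given that Humanities has only 61 instances, a stratified variant that preserves class proportions in every fold would be a reasonable refinement.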

As for the headline figures: Random Forest came out on top with an accuracy of 84.9%, ahead of Naïve Bayes at 82.76% and SMO at 80.88%. None of these are dramatic gaps (Random Forest leads Naïve Bayes by about two percentage points and SMO by about four), but the ordering held up consistently enough across the cross-validation folds that it seems reasonable to call Random Forest the stronger performer here, at least for this particular dataset. Figure 3 lays out the comparison across the additional evaluation metrics (precision, recall, and F1, as defined in equations 7–9), and the pattern there largely mirrors the accuracy story: Random Forest edges ahead on most of the metrics shown, though not by a wide margin, and Naïve Bayes in particular trails only slightly behind despite its much simpler, conditional-independence-based design.

How does this compare with what others have found? Not as favourably as one might hope, at first glance. Hussain et al. (2019) reported classification accuracies above 95% using deep learning on a dataset of more than ten thousand students, and Delen (2010) achieved roughly 80% accuracy in predicting freshman attrition—though that was working with over sixteen thousand records collected across five years. Set against those numbers, 84.9% on 364 instances might look almost unremarkable, or perhaps even suspiciously high given the sample size. But the comparison is not entirely fair, since those studies were addressing different prediction tasks with vastly larger datasets, and the present study is constrained by the practical difficulty of locating Bangladeshi students whose stream choice and subsequent outcomes could both be tracked. Against the career-and-stream-prediction literature specifically, the result looks somewhat better: Nazareno et al. (2019), using an artificial neural network on grades in five subjects, reported around 74% accuracy, which our Random Forest model exceeds by roughly ten percentage points—though again, the underlying populations and feature sets differ enough that this should be read as a loose benchmark rather than a direct comparison.

Why might Random Forest have come out ahead, even if only narrowly? The most obvious explanation lies in its design: by building an ensemble of decision trees over randomly sampled subsets of the data and then aggregating their votes (Breiman, 2001), it tends to be more forgiving of the kind of noisy, nominal, survey-derived attributes that make up this dataset—hobby, favourite subject, feared subject, and so on—than a single model would be. The bagging-style averaging that underlies Random Forest, similar in spirit to the modified bagging approach described by Hakim et al. (2019), seems to smooth over some of the idiosyncrasies in individual responses. SMO, by contrast, was included largely for its computational efficiency—Platt (1998) designed it to break a large quadratic programming problem into much smaller, faster sub-problems—and that efficiency does not appear to have translated into a higher accuracy here, though it remained reasonably competitive at just under 81%. Naïve Bayes, despite resting on the rather strong (and arguably unrealistic) assumption of conditional independence among the twelve features, performed almost as well as Random Forest, which is perhaps a little surprising, though not unprecedented; Langley et al. (1992) and Taheri and Mammadov (2013) have both noted that the classifier often performs better in practice than its simplifying assumptions would suggest.

A few caveats seem worth raising, even at the risk of undercutting the headline result somewhat. The class imbalance visible in Table 2—and again in the proportional breakdown shown in Fig. 2—means that Humanities, with only 61 instances against 161 for Business Studies and 142 for Science, is comparatively under-represented. It is not hard to imagine that this skews the models toward predicting the more populous groups, and an accuracy figure averaged across all three streams could be masking weaker performance specifically on Humanities cases; the precision and recall breakdowns in Fig. 3 hint at this but do not, on their own, settle the question. Sample size is the other obvious limitation: 364 students from what is described as essentially one institution (New Model Degree College, Dhaka) is a modest foundation on which to build claims about Bangladeshi students more broadly, however encouraging the cross-validated numbers might look. None of this is meant to dismiss the result—an 84.9% accuracy from twelve relatively simple, easily collectible attributes is still a meaningful signal, and arguably more actionable than the purely results-based heuristics described in the introduction (Kolorob, 2020) that currently dominate group-selection decisions in Bangladesh.
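One way to put the imbalance in perspective is to compute the accuracy of a trivial model that always predicts the largest group. The short calculation below uses only the class counts reported in Table 2; the function name is illustrative.

```python
def majority_baseline(counts):
    """Return class shares and the accuracy of always predicting the largest class."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    largest = max(counts, key=counts.get)
    return shares, largest, counts[largest] / total


# Class counts as reported in Table 2
counts = {"Science": 142, "Business Studies": 161, "Humanities": 61}
shares, largest, baseline = majority_baseline(counts)
```

Always predicting Business Studies would be right for 161 of 364 students, about 44.2%, while never identifying a single Humanities student (about 16.8% of the sample). The reported accuracies sit well above that baseline, but the baseline also makes clear why per-class recall matters more than overall accuracy for the smallest group.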

Taken together, then, the results offer a cautiously positive answer to the question posed in the introduction: yes, a handful of features collected through a simple questionnaire—attendance patterns, study habits, feared subjects, and the like—do appear to carry enough signal to predict, with reasonable though imperfect accuracy, which stream a student is likely to thrive in. Whether 84.9% is “good enough” probably depends on what the model is meant to do. As a hard gatekeeping mechanism, it almost certainly is not; nearly one in six predictions would be wrong, after all. But as one input among several—a decision-support nudge offered alongside guidance counsellors, parents, and the students themselves—it seems plausible that even this level of accuracy could help surface mismatches that might otherwise go unnoticed until results, as outlined in the workflow in Fig. 1, start to disappoint.

4. Limitations

For all its promise, the study rests on a fairly narrow foundation. The dataset—364 students drawn largely from a single institution, New Model Degree College—is small by the standards of educational data mining, where comparable studies have worked with ten thousand or more records; this naturally raises questions about how far the 84.9% accuracy figure would travel beyond this particular sample. There's also the matter of class imbalance: Humanities accounts for only 61 of the 364 instances, against 142 for Science and 161 for Business Studies, which may quietly bias the models toward the more populous streams without this being obvious from accuracy alone. The twelve predictive attributes, while practical and easy to collect, are largely self-reported and nominal in nature—hobbies, feared subjects, and similar items—introducing a degree of subjectivity that's difficult to fully account for. And the model remains, at this stage, a prototype rather than a deployed system tested in real decision-making contexts.

5. Conclusion

Taken as a whole, this study offers a cautiously optimistic case for rethinking how Bangladeshi students choose their academic stream. Random Forest's modest edge over Naïve Bayes and SMO suggests that ensemble methods handle this kind of noisy, survey-derived data somewhat more gracefully, though none of the three models performed so poorly as to be dismissed outright. More importantly, the broader message—that twelve accessible, non-results-based attributes carry meaningful predictive signal—seems worth taking seriously, even if 84.9% accuracy falls short of a standalone gatekeeping tool. As a supplementary input alongside counsellors and parents, it may help surface mismatches earlier, provided future work addresses the sample size and imbalance concerns honestly.

 

References


The Daily Star. (2019, May 6). 107 schools with 100% fail rates. The Daily Star. https://www.thedailystar.net/country/news/107-schools-100-fail-rates-1739488

Kolorob. (2020, January). SSC routine 2020 PDF all education board. https://kolorob.com.bd/ssc-routine-2020/

Guo, Q., Chen, M., An, D., & Ye, L. (2019). Prediction of students' course failure based on campus card data. In 2019 International Conference on Robots & Intelligent System (ICRIS) (pp. 361–364). IEEE.

Pedro, M. O., Baker, R., Bowers, A., & Heffernan, N. (2013). Predicting college enrollment from student interaction with an intelligent tutoring system in middle school. In Educational Data Mining 2013.

Sara, N.-B., Halland, R., Igel, C., & Alstrup, S. (2015). High-school dropout prediction using machine learning: A Danish large-scale study. In ESANN 2015 Proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence (pp. 319–324).

Verma, C., & Illés, Z. (2019). Attitude prediction towards ICT and mobile technology for the real-time: An experimental study using machine learning. In The International Scientific Conference eLearning and Software for Education (Vol. 3, pp. 247–254). "Carol I" National Defence University.

Romero, C., & Ventura, S. (2010). Educational data mining: A review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(6), 601–618.

Hussain, S., Muhsion, Z. F., Salal, Y. K., Theodoru, P., Kurtoglu, F., & Hazarika, G. (2019). Prediction model on student performance based on internal assessment using deep learning. International Journal of Emerging Technologies in Learning (iJET), 14(8), 4–22.

Delen, D. (2010). A comparative analysis of machine learning techniques for student retention management. Decision Support Systems, 49(4), 498–506.

Nazareno, A., Lopez, M., Gestiada, G., Martinez, M., & Roxas-Villanueva, R. (2019). An artificial neural network approach in predicting career strand of incoming senior high school students. Journal of Physics: Conference Series, 1245, 012005.

Bartnick, L., Kappelman, M., Berger, J., & Sigman, B. (1985). The value of the California Psychological Inventory in predicting medical students' career choice. Medical Education, 19(2), 143–147.

Ade, R., & Deshmukh, P. (2014). An incremental ensemble of classifiers as a technique for prediction of student's career choice. In 2014 First International Conference on Networks & Soft Computing (ICNSC2014) (pp. 384–387). IEEE.

Roy, K. S., Roopkanth, K., Teja, V. U., Bhavana, V., & Priyanka, J. (2018). Student career prediction using advanced machine learning techniques. International Journal of Engineering and Technology, 7(2.20).

Lichtenberger, E., & George-Jackson, C. (2013). Predicting high school students' interest in majoring in a STEM field: Insight into high school students' postsecondary plans. Journal of Career and Technical Education, 28(1), 19–38.

Saa, A. A. (2016). Educational data mining & students' performance prediction. International Journal of Advanced Computer Science and Applications, 7(5), 212–220.

Kumar, M., & Salal, Y. K. (2019). Systematic review of predicting student's performance in academics. International Journal of Engineering and Advanced Technology, 8(3), 54.

Ramesh, V., Thenmozhi, P., & Ramar, K. (2012). Study of influencing factors of academic performance of students: A data mining approach. International Journal of Scientific & Engineering Research, 3(7), 1–5.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Elsevier.

Eekhout, I. (2019). Missing data methods: Imputation methods. https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods

Langley, P., Iba, W., & Thompson, K. (1992). An analysis of Bayesian classifiers. In AAAI (Vol. 90, pp. 223–228).

Taheri, S., & Mammadov, M. (2013). Learning the naive Bayes classifier with optimization models. International Journal of Applied Mathematics and Computer Science, 23(4), 787–795.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines (Technical Report MSR-TR-98-14). Microsoft Research.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Hakim, M. A., Hasan, M. Z., Alam, M. M., Hasan, M. M., & Huda, M. N. (2019). An efficient modified bagging method for early prediction of brain stroke. In 2019 International Conference on Computer, Communication, Chemical, Materials and Electronic Engineering (IC4ME2) (pp. 1–4). IEEE.

 

