2.1 Study Design and Analytical Framework
We approached this problem as a two-stage predictive modeling task rather than a single end-to-end pipeline, and that distinction matters for how the rest of this section is organized. The first stage concerns itself purely with language: classifying the emotional valence of financial news text. The second stage is concerned with time: forecasting next-day stock price movement using both historical price and the sentiment signal produced in stage one. We deliberately kept these stages separable and independently evaluable, partly because it lets us diagnose where errors originate — a weak sentiment classifier versus a weak time-series model produce very different failure signatures — and partly because it mirrors how most of the prior literature in this space has framed the problem (Li et al., 2014; Li, Wu, & Wang, 2020; Mohan et al., 2019). Figure 1 summarizes the overall architecture, and the subsections below walk through each component in the order data actually flowed through the system: acquisition, preprocessing, feature construction, model training, and evaluation.
2.2 Data Sources and Acquisition
Because no sentiment-labeled dataset existed for our target company's news coverage, we first needed a separate, independently labeled corpus large enough to train a general-purpose financial sentiment classifier. For this purpose we used the India Financial News Headlines Sentiments dataset, retrieved from Kaggle (available at https://www.kaggle.com/datasets/harshrkh/india-financial-news-headlines-sentiments). We chose this corpus for two practical reasons: its scale, and its provenance. It contains more than 200,000 financial news headlines spanning 2017 through 2021, each pre-labeled by sentiment polarity, which gave us enough volume to train a supervised classifier without immediately running into data scarcity problems — a concern that is not trivial in financial NLP, where labeled corpora tend to be small and domain-specific. From the full dataset we retained three fields relevant to our task: the sentiment label, the headline title, and the publication date. After filtering to the positive and negative classes used for binary classification, the working corpus comprised 92,383 positive headlines and 108,118 negative headlines — a class distribution that, while not perfectly balanced, was close enough that we did not apply additional resampling.
2.3 Target-Company News Corpus
The sentiment classifier trained above is only useful if it can be applied to news about the specific company we are trying to forecast, and that corpus had to be built separately. We constructed a custom web scraper to collect news headlines from a recognized financial news portal, restricting collection to outlets with an established editorial track record in order to limit the risk of incorporating fabricated or low-credibility "fake news" into the sentiment pipeline — a concern that, anecdotally, comes up often enough in financial NLP work that we felt it warranted explicit mention rather than assuming source quality. This process yielded a corpus of more than 500 news items pertaining to the target company.
2.4 Historical Stock Price Data
Daily historical price data for the target company were obtained from a publicly available Kaggle repository covering Apple Inc. (ticker: AAPL), spanning a seven-year period. We used Apple as the initial test case for the proposed pipeline, partly because of data availability and partly because a heavily traded, news-saturated stock offers a reasonably stringent test of whether sentiment signal can be extracted at all. The raw download included multiple price fields (open, high, low, close, volume); following standard practice in related forecasting work (Mohan et al., 2019), we retained only the daily closing price as the price-based feature for downstream modeling, since closing price is the figure most consistently used across the literature for end-of-day movement prediction and avoids the added noise of intraday fluctuation.
2.5 Data Preprocessing
2.5.1 Text Preprocessing for Sentiment Analysis
Before any classifier could be trained, the raw headline text needed to be normalized — this step is unglamorous but, in our experience, disproportionately affects downstream accuracy. Two preprocessing operations were applied to every headline in both the training corpus and the target-company corpus: conversion of all text to lowercase, which reduces token sparsity that would otherwise arise from inconsistent capitalization (e.g., treating "Stock" and "stock" as distinct tokens), and removal of common English stop words, which allows the classifier to concentrate its discriminative capacity on lexically meaningful terms rather than high-frequency function words that carry little sentiment information.
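The two normalization steps can be sketched as follows. This is a minimal illustration rather than the code used in the study: the stop-word list is a small hypothetical sample (the specific list used is not stated above), and whitespace tokenization is an assumption.

```python
from typing import List

# Hypothetical, deliberately small stop-word list for illustration only;
# the actual list used in the study is not specified.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "for", "as", "at", "by"}


def preprocess(headline: str) -> List[str]:
    """Lowercase a headline, split on whitespace, and drop stop words."""
    tokens = headline.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


example = preprocess("Stock of the Company Rises on Strong Earnings")
print(example)
```

After lowercasing, "Stock" and "stock" map to the same token, and function words such as "of", "the", and "on" no longer occupy slots in the feature vector.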
2.6 Feature Engineering for Stock Prediction
Following sentiment classification (described below), each headline in the target-company corpus was assigned a predicted sentiment label, and these labels were then aggregated to the daily level: for every trading day, we counted the number of positive, neutral, and negative news items published. This daily aggregation was necessary because the price-prediction model operates on a daily time step, while the raw news data arrive at an irregular, sub-daily frequency; aggregating per day is what allowed us to align the two data streams into a single coherent feature table. The resulting integrated dataset comprised five features per day, summarized in Table I: closing price, compound sentiment polarity, and the confidence scores associated with the positive, negative, and neutral news classifications.
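The per-day counting step can be sketched as below. The dates, labels, and the function name `daily_counts` are hypothetical and chosen only for illustration; the sketch covers the counting of labels per trading day and does not attempt to reproduce how the compound or confidence features were derived.

```python
from collections import Counter, defaultdict

# Hypothetical (publication_date, predicted_label) pairs for illustration.
predictions = [
    ("2016-11-09", "positive"),
    ("2016-11-09", "negative"),
    ("2016-11-09", "positive"),
    ("2016-11-10", "neutral"),
]

LABELS = ("positive", "neutral", "negative")


def daily_counts(items):
    """Count how many headlines of each sentiment label fall on each day."""
    per_day = defaultdict(Counter)
    for day, label in items:
        per_day[day][label] += 1
    # Emit every label for every day so that each day has the same columns.
    return {day: {lab: c[lab] for lab in LABELS} for day, c in per_day.items()}


table = daily_counts(predictions)
print(table)
```

Each key of `table` can then be joined to the closing price for the same date to form one row of the daily feature table.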
2.7 Feature Extraction for Text Representation
Raw text cannot be fed directly into a machine learning classifier, so a numerical representation step was required to bridge the gap between unstructured language and the vector-based input that supervised models expect. We considered three families of text-vectorization approaches for this purpose.
Term frequency–inverse document frequency (TF-IDF) was used as a candidate representation; this statistic captures how important a given word is to a document, with the underlying intuition that a term's weight should increase with its frequency within a document but be discounted if it appears ubiquitously across the entire corpus (Su et al., 2011). We also used the CountVectorizer implementation from the Scikit-learn package (Géron, 2022), which converts text into a numerical vector based on simple frequency counts of each token, providing a more straightforward bag-of-words baseline against which the TF-IDF representation could be compared.
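To make the difference between the two representations concrete, the following sketch applies Scikit-learn's `CountVectorizer` and `TfidfVectorizer` to three made-up headlines. The headlines are hypothetical and both vectorizers are left at their default settings, which is an assumption; the settings used in the study are not stated.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical headlines, already lowercased, for illustration only.
headlines = [
    "profit rises as sales grow",
    "profit falls as costs grow",
    "shares fall after weak sales",
]

# Bag-of-words: each column holds the raw count of one token.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(headlines)

# TF-IDF: counts are reweighted so that tokens appearing in many
# documents receive a lower weight than tokens appearing in few.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(headlines)

vocab = tfidf_vec.vocabulary_  # maps token -> column index
rises_weight = tfidf[0, vocab["rises"]]
profit_weight = tfidf[0, vocab["profit"]]
print(counts.shape, rises_weight > profit_weight)
```

In the first headline, "rises" occurs in only one document while "profit" occurs in two, so TF-IDF assigns "rises" the larger weight even though both have a raw count of one.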
In addition, we considered embedding-based approaches — specifically Word2Vec and FastText — both of which pursue the same underlying objective of learning dense vector representations for words, but differ in how they handle the substructure of language. FastText extends the Word2Vec approach by incorporating subword (n-gram) information, which comes at the cost of longer training time, owing to the substantially larger number of n-gram units relative to whole words, but in exchange offers improved handling of rare or out-of-vocabulary terms — a property that is particularly relevant to financial headline text, where novel company names, ticker symbols, or industry-specific terminology routinely fall outside a fixed vocabulary.
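The subword idea can be illustrated with a short, library-free sketch that lists the character n-grams of a single token. FastText wraps each word in the boundary markers `<` and `>` before extracting n-grams; the function name `char_ngrams` and the narrow n-gram range used in the example are choices made here for brevity, not settings reported in the study.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, FastText-style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of a word.
    """
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams


# A ticker-like token that a whole-word vocabulary may not contain.
grams = char_ngrams("aapl", n_min=3, n_max=4)
print(grams)
```

A single four-letter token already yields seven subword units at this range, which shows both why training is slower and why an unseen word can still receive a vector built from n-grams it shares with known words.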
2.8 Sentiment Classification Models
Table I. Feature Description. These five features (closing price, compound sentiment, and the positive, negative, and neutral confidence scores) constitute the input feature set for the Long Short-Term Memory (LSTM) stock prediction model described in Sections 2.10 through 2.13. Sentiment-derived features were generated by aggregating the outputs of the Naive Bayes and Support Vector Machine classifiers across all news items published on a given trading day.
| Feature | Meaning |
| --- | --- |
| Price | The closing price of a company |
| Compound | Polarity of news sentiment |
| Positive | Confidence of positive news |
| Negative | Confidence of negative news |
| Neutral | Confidence of neutral news |

Fig. 1. Architecture of the Proposed Stock Movement Prediction System. Schematic overview of the end-to-end system pipeline. Unlabeled financial news text is first passed through the sentiment analysis model, which outputs a categorical sentiment label (positive, negative, or neutral) for each news item. These outputs, together with daily closing price data obtained independently from stock market records, are combined into a unified feature set comprising sentiment score and closing price. This combined feature set is then supplied to the stock prediction model, which outputs a forecast of stock price movement for the following trading day. Dashed horizontal lines demarcate the three functional stages of the pipeline: text-based sentiment inference, feature integration, and price-movement forecasting.
After vectorization, we trained two supervised classifiers — Naive Bayes and Support Vector Machine — selected because both have demonstrated reasonable performance on short-text sentiment classification tasks in prior financial NLP work (Pavitha et al., 2022), while remaining computationally lightweight relative to transformer-based alternatives. Figure 3 presents the overall architecture of the sentiment classification component.
The Naive Bayes classifier rests on Bayes' theorem, which gives the probability of a class conditional on the observed evidence, combined with a strong (and, in practice, often violated) assumption that the features are conditionally independent given the class:
P(A|B) = [P(A) × P(B|A)] / P(B)
where P(A) denotes the prior probability of class A, P(A|B) denotes the posterior probability of class A given evidence B, P(B|A) denotes the likelihood of observing evidence B given class A, and P(B) denotes the marginal probability of the evidence B.
For the specific text-classification context here, we implemented the Multinomial Naive Bayes variant, which is the form most commonly applied in natural language processing tasks. This variant operates on term frequencies — counts of how often a given word occurs within a document — which are first normalized by document length and then used to compute maximum-likelihood estimates of the conditional probabilities from the training data (Su et al., 2011).
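A minimal sketch of a Multinomial Naive Bayes text classifier, using Scikit-learn, is shown below. The four training headlines and their labels are hypothetical. Two details are assumptions that differ from the description above: the sketch feeds raw token counts to the classifier rather than length-normalized frequencies, and it keeps Scikit-learn's default additive smoothing (`alpha=1.0`) rather than pure maximum-likelihood estimates.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled headlines for illustration only.
train_text = [
    "profit rises strongly",
    "record sales growth",
    "shares plunge sharply",
    "losses widen again",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Token counts feed the multinomial likelihood P(word | class).
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(train_text, train_labels)

query = ["sales growth rises"]
pred = nb_model.predict(query)
proba = nb_model.predict_proba(query)  # columns follow nb_model.classes_
print(pred[0], proba)
```

Because every token in the query occurs only in the positive training headlines, the posterior for the positive class dominates.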
2.9 Support Vector Machine (SVM) Classifier
We also implemented a Support Vector Machine classifier, one of the most widely adopted supervised algorithms for textual polarity detection. SVMs can be used for both classification (predicting a discrete label) and regression (predicting a continuous value); for sentiment classification, the algorithm seeks the hyperplane that separates the classes with the largest margin in an n-dimensional feature space. Where the data are not linearly separable in their original space, non-linear kernel functions (for example sigmoid, radial basis function (RBF), or polynomial kernels) can be used in place of the linear kernel to transform the feature space so that separation becomes possible.
For this study, we used a linear kernel, on the grounds that text classification tasks typically involve a very high-dimensional feature space (each distinct word or token effectively constitutes its own feature), and linearly separable structure tends to emerge naturally at that dimensionality. Practically speaking, the linear kernel also trains substantially faster than non-linear alternatives, and tends to perform well whenever a reasonably clear margin separates the classes — both considerations that were relevant given the dataset size involved.
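A linear-kernel SVM for headline polarity might be set up as in the sketch below, using Scikit-learn. The training headlines are hypothetical, and pairing the classifier with TF-IDF features and the default regularization strength are assumptions made for illustration, not reported settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical labeled headlines for illustration only.
train_text = [
    "profit rises strongly",
    "record sales growth",
    "shares plunge sharply",
    "losses widen again",
]
train_labels = ["positive", "positive", "negative", "negative"]

# kernel="linear" keeps the decision boundary a hyperplane in the
# (high-dimensional, sparse) TF-IDF feature space.
svm_model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
svm_model.fit(train_text, train_labels)

svm_pred = svm_model.predict(["shares plunge again"])
print(svm_pred[0])
```

For corpora on the scale of hundreds of thousands of headlines, Scikit-learn's `LinearSVC` is a commonly used alternative that fits a linear SVM considerably faster than `SVC(kernel="linear")`.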
Stock Price Movement Prediction Model
2.10 Rationale for Model Selection
For the price-forecasting component, we selected a Long Short-Term Memory (LSTM) recurrent neural network architecture. LSTM networks have become a common choice in stock market prediction tasks because of their explicit design for capturing long-range temporal dependencies in sequential data (Li, Wu, & Wang, 2020) — a property that is directly relevant here, since closing-price data is itself a time series, and short-term memory architectures (or simple feedforward models) tend to lose access to longer-horizon patterns that may be predictive of future movement.
2.11 Data Partitioning
The combined dataset — closing price plus daily aggregated sentiment features — was partitioned chronologically into a training set comprising the first 80% of observations and a held-out test set comprising the remaining 20%. We deliberately avoided a random train/test split here, since shuffling time-series data would allow the model to be evaluated on dates that precede some of its training examples, producing an artificially optimistic and practically meaningless estimate of forecasting performance. The chronological split instead ensures that the test set strictly represents future time periods relative to training, which is the only evaluation setup that meaningfully approximates how the model would be used in practice.
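The chronological 80/20 partition amounts to cutting the date-ordered table at a single index, as in this sketch. The function name `chronological_split` and the toy data are illustrative; the rows are assumed to be sorted by date already.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split date-ordered rows into (train, test) without shuffling."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Ten hypothetical trading days, indexed 0..9 in date order.
days = list(range(10))
train_days, test_days = chronological_split(days)
print(train_days, test_days)
```

Every index in the test portion is later than every index in the training portion, which is the property a shuffled split would violate.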
2.12 Input Normalization
Prior to model training, all input features were normalized to a common scale, typically the [0, 1] interval. This step matters for two reasons: it prevents features with larger raw numeric ranges (such as closing price, which can run into the hundreds of dollars) from dominating the learning process relative to smaller-scale sentiment confidence scores, and it generally improves the numerical stability and convergence behavior of gradient-based optimization during neural network training.
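Min-max scaling to the [0, 1] interval can be sketched as follows. Fitting the minimum and maximum on the training portion only and reusing them for the test portion is an assumption made here to avoid leaking future information; the text above does not state how the scaling parameters were fitted. The prices are hypothetical.

```python
def fit_min_max(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Map values linearly so that lo -> 0.0 and hi -> 1.0."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical closing prices for illustration only.
train_prices = [100.0, 110.0, 120.0]
test_prices = [115.0, 130.0]

lo, hi = fit_min_max(train_prices)  # assumption: fitted on training data only
train_scaled = apply_min_max(train_prices, lo, hi)
test_scaled = apply_min_max(test_prices, lo, hi)
print(train_scaled, test_scaled)
```

Under this assumption, test values can fall outside [0, 1] when the price moves beyond the range seen during training, as the second test value does here.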

Fig. 2. Class Distribution of the Sentiment-Labeled News Headline Corpus. Bar chart showing the number of headlines per sentiment class in the labeled training corpus (India Financial News Headlines Sentiments dataset; n = 200,501) used to train the sentiment classification models. The negative class comprised 108,118 headlines and the positive class comprised 92,383 headlines, reflecting a modest class imbalance that was not corrected with additional resampling prior to model training.

Fig. 3. Architecture of the Financial News Sentiment Analysis Model. Flow diagram of the sentiment classification pipeline, from raw input to final sentiment output. Raw news headlines (DataSet [News]) are first processed through stop-word filtering and noise removal, after which four candidate text-vectorization approaches

Fig. 4. Relationship Between Daily News Sentiment and Stock Closing Price. Stacked bar chart comparing daily closing price (light blue) and net news sentiment polarity (dark blue) for the target stock across the study period (December 2006–February 2007).

Fig. 5. LSTM-Predicted Versus Actual Closing Price Using Closing Price Alone. Line plot comparing actual adjusted closing price (Real price, blue) against price predicted by the Long Short-Term Memory (LSTM) model (Predicted price, orange) on the held-out test set (November 9–December 1, 2016),

Fig. 6. LSTM-Predicted Versus Actual Closing Price Using Closing Price and News Sentiment. Line plot comparing actual adjusted closing price (Real price, blue) against price predicted by the Long Short-Term Memory (LSTM) model (Predicted price, orange) on the same held-out test set (November 9–December 1, 2016)

Because LSTM networks operate on sequences rather than single time points, we constructed fixed-length input windows from the normalized time series, where each window comprises a contiguous span of historical daily observations and the associated target is the subsequent day's stock movement. We set the window length to 14 days. This choice reflects a practical compromise: a window too short risks omitting genuinely predictive longer-term patterns, while a window too long increases the parameter burden of the model relative to the size of the training data. A window of 14 trading days (roughly three calendar weeks) struck a reasonable balance given our dataset size, though we note this parameter was not exhaustively tuned and represents an area for refinement in future work.
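The construction of 14-day input windows can be sketched as below. For simplicity the sketch uses a single univariate series and takes the next value of that series as the target; how the target was encoded in the study (for example as a price or a movement) is not reproduced here, and the function name `make_windows` is illustrative.

```python
def make_windows(series, window=14):
    """Slice a series into (inputs, targets) for next-step prediction.

    Each input is `window` consecutive observations; its target is the
    observation that immediately follows the window.
    """
    inputs, targets = [], []
    for start in range(len(series) - window):
        inputs.append(series[start:start + window])
        targets.append(series[start + window])
    return inputs, targets


# Twenty hypothetical normalized observations in date order.
series = [float(i) for i in range(20)]
X, y = make_windows(series, window=14)
print(len(X), X[0][0], X[0][-1], y[0])
```

A series of length T yields T minus 14 training examples, so longer windows reduce the number of examples available from the same data.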
2.13 Model Architecture and Training Procedure
The LSTM network was trained on the windowed training data, learning to map each 14-day input sequence to its corresponding target value. Model parameters (weights and biases) were updated iteratively via gradient-based optimization to minimize prediction error between the model's output and the true target. Specifically, we used mean squared error as the loss function and the Adam optimizer to perform parameter updates — a pairing that is fairly standard for regression-style sequence prediction tasks of this kind, owing to Adam's adaptive learning rate behavior, which tends to produce more stable convergence than plain stochastic gradient descent, particularly when feature scales or gradient magnitudes vary across training.
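For reference, the mean squared error minimized during training can be written as follows. The notation is introduced here for illustration: N is the number of training windows, y_i is the true target for window i, and \hat{y}_i is the corresponding model output.

```latex
\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```

Because the errors are squared, large deviations between predicted and actual values are penalized more heavily than small ones.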
Two variants of this model were trained and compared: one using only the normalized closing price as input (Figure 5), and a second incorporating both closing price and the daily-aggregated sentiment features described above (Figure 6). This comparison was the central manipulation of the study, allowing us to isolate whatever incremental contribution sentiment information makes over and above price history alone.
2.14 Model Evaluation
Once trained, each LSTM variant was applied to the held-out test set to generate predictions on previously unseen data, and predicted closing prices were compared against actual observed prices across the test period. We additionally examined the correlation between daily news sentiment polarity and same-day closing price movement (Figure 4) as a complementary, model-independent check on whether the underlying sentiment-price relationship the system depends on was actually present in the data, rather than relying solely on downstream forecasting accuracy to make that case indirectly. Because the sentiment classifier's own accuracy directly bounds how much useful signal can flow into the price model, we also report classifier-level performance (Naive Bayes and SVM accuracy) separately from the downstream LSTM forecasting results, so that errors attributable to language classification versus time-series modeling can be distinguished rather than conflated into a single end-to-end metric.
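The two comparisons described above can be quantified in several ways. The sketch below shows two common choices, root mean squared error for predicted versus actual prices and the Pearson correlation coefficient for the sentiment and price relationship. Both are assumptions about suitable measures, since the specific statistics used are not named in the text, and all numbers are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical actual and predicted closing prices on a test period.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 102.0, 104.0]
error = rmse(actual, predicted)

# Hypothetical daily sentiment polarity and price change.
corr = pearson([0.2, -0.1, 0.4, 0.0], [1.0, -0.5, 2.0, 0.1])
print(error, corr)
```

Computing the same error measure for the price-only and the price-plus-sentiment variants on the identical test period gives a like-for-like comparison of the two models.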