Stock Market Prediction using LSTM and Text Mining Method


Nishaat Fakhr

Department of Electronics and Communication, UMIT, SNDTWU, Mumbai

Smruti Dalvi

Department of Electronics and Communication, UMIT, SNDTWU, Mumbai

Divya Dumbre

Department of Electronics and Communication, UMIT, SNDTWU, Mumbai

Bharat Patil

Department of Electronics and Communication, UMIT, SNDTWU, Mumbai

Abstract: The stock market is one of the most complex and difficult-to-predict, yet lucrative, ways to earn money. When investing, the focus is always on obtaining higher returns. Investing in the stock market requires studying various associated factors and extracting useful information for reliable forecasting. Earlier papers focus primarily either on different machine learning algorithms or on the use of historical data to provide forecasts. This paper focuses on the use of LSTM (Long Short-Term Memory) and text mining methods to predict stock values. The sentiment behind each tweet is examined to determine whether the text is positive or negative. This data is then combined with historical data to visualize the stock price. The accuracy of the model depends on the quality of the data available. The paper provides a model that visualizes the trends of the past 7 days in the form of graphs and predicts the value for the next day.

Keywords: LSTM, Text Mining, Sentiment Analysis, RNN, Stock Price Prediction


    The stock market is a place where shares of publicly listed companies are traded. Stock prices are influenced by several factors such as political upheaval, interest rates, war and natural calamities. There are two major benefits of investing in the stock market:

    • Purchasing a stock essentially represents owning a stake in a company, and therefore the owner is entitled to a share of the company's profits and assets.

    • The difference between the purchase price and the selling price can generate a large profit.

    A clear understanding of the market helps to decide the best time to buy or sell stocks in order to generate profits or minimize losses. Such decisions rest on predictions that draw on multiple factors, which can be broadly divided into two categories: historical data and textual data.

    Historical data comprises the open, close, high and low values of the shares traded. Textual data comprises news articles, tweets, blog articles, etc. The availability of a substantial amount of data, together with advances in Machine Learning and Artificial Intelligence, makes it possible to analyze historical and textual data technically. This paper focuses on the use of LSTM and text mining methods to anticipate future stock prices.

    The dataset has three main attributes: high value, close value and adjusted close value. The adjusted closing price has higher significance because it is the closing price corrected for corporate actions such as dividends and stock splits. This paper aims to develop a system that presents a sentiment score along with a graph of real vs. predicted values for the past 7 days and the forecasted value.


    After reading different papers on stock market prediction, we identified many models and machine learning techniques. The papers differed considerably from each other and were published between 2018 and 2021. The first paper we came across was "A Survey on Stock Market Prediction Using Machine Learning Techniques", in which Polamuri Subba Rao, K. Srinivas and A. Krishna Mohan [11] provide a review and comparative analysis of different stock market prediction techniques, i.e. ARIMA, RNN, TSLM and ANN. The next paper, "Stock Market Prediction Using Machine Learning", is an IEEE paper published in May 2018. In it, Ishita Parmar, Navanshu Agarwal, Sheirsh Saxena, Ridam Arora, Shikhin Gupta, Himanshu Dhiman and Lokesh Chouhan [1] predicted stock prices using regression and LSTM based ML models. The authors found the LSTM model more efficient and accurate.

    Sahil Vazirani, Abhishek Sharma and Pavika Sharma [3] proposed a hybrid model with linear regression and achieved minimal error. Similarly, Ankit Thakkar and Kinjal Chaudhari [4] used fusion to predict the stock market. Padmanayana, Varsha and Bhavya K [6] investigated the sentiment behind each tweet and how it can be used in stock market prediction. Saloni Mohan, Sahitya Mullapudi, Sudheer Sammeta, Parag Vijayvergia and David C. Anastasiu [7] took into consideration the financial news of a company along with its past stock prices. However, the authors believe that the amount of textual data collected and analyzed in past studies has been insufficient and that the predictions are therefore of low accuracy.

    Noemi Pinto, Luciano da Silva Figueiredo and Ana Cristina Garcia [8] observed that data from social media and internet sites is a composite source of information that helps in providing better predictions. Nemanja S. Malinović, Bratislav B. Predić and Milos Roganović [5] used an uncommon architecture that includes a ConvLSTM layer as part of the network. Nohyoon Seong and Kihwan Nam [9] experimented with three years of data and compared the results with existing methods. They made their predictions using Multiple Kernel Learning.

    Christy Jackson J., Prassanna J., Abdul Quadir Md. and Sivakumar V. [2] were motivated to work on forecasting using non-linear data, as there is no evidence that stock data is linear. The objective of the authors was to understand and predict stock behavior through statistical calculations and visualizations of historical data.

    Gondaliya et al. [10] focused on the Covid-19 pandemic, investigating the classification accuracy of selected ML algorithms under natural language processing for sentiment analysis and prediction of the Indian stock market.

    A. Mahadik, D. Vaghela and A. Mhaisgawali [14] stated that LSTM outperforms other approaches since it concentrates only on the factors that are important for prediction, although it does not perform well when any attribute values are absent. Similarly, A. Maiti and P. Shetty D [12] predicted the stock prices of five companies listed on India's NSE using two models, an LSTM model and a GAN model. The authors used rolling segmentation to partition the data into training and testing sets and examined the effect of different interval partitions on prediction performance.

    A. Maalla, C. Y. Zhuang, Q. H. Feng and L. Shen [15] used a comprehensive strategy gradient training method, combined with current market conditions and historical data, for automatic trading. They also conducted a single-share trading test and a multi-share mixed investment trading test.

    D. Y. N. Le, A. Maag and S. Senthilananthan [13] presented several machine learning and deep learning approaches for stock market prediction. The authors found that deep learning models such as LSTM outperform models such as ARIMA and SVM, and that they are the best methodologies for stock market forecasting. N. Adlakha, Ridhima and A. Katal [16] proposed a framework to analyse every company's stock using mathematical technical metrics. The authors used stacked LSTM, linear regression, random forest and K-nearest neighbors algorithms to forecast stock trends on the basis of the price history.


    In this paper, we extracted data from Yahoo! Finance as well as the Twitter API. First, we stored the tweets from the Twitter API and processed them for further analysis, which included Natural Language Processing (NLP) and sentiment analysis. The Yahoo! Finance data was normalized and then combined with the Twitter data to form the final dataset. Because stock values are missing on weekends while Twitter data is available on all days, the missing stock values were filled in using the formula:

    $$y = \frac{x_{\text{previous}} + x_{\text{next}}}{2} \tag{1}$$

    where $y$ is the missing value, $x_{\text{previous}}$ is the previous known value and $x_{\text{next}}$ is the next known value.
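A minimal sketch of formula (1) in Python is given here for illustration. The function name `fill_missing`, the use of `None` to mark a missing value and the sample prices are assumptions made for this example and are not taken from the original implementation.

```python
from typing import List, Optional


def fill_missing(prices: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each gap with the mean of the nearest known values on either side."""
    filled = list(prices)
    for i, value in enumerate(prices):
        if value is not None:
            continue
        # Nearest known value before and after position i (None if there is none).
        prev_known = next((p for p in reversed(prices[:i]) if p is not None), None)
        next_known = next((p for p in prices[i + 1:] if p is not None), None)
        if prev_known is not None and next_known is not None:
            filled[i] = (prev_known + next_known) / 2  # formula (1)
    return filled


# Friday close, missing Saturday and Sunday, Monday close (made-up numbers).
week = [100.0, None, None, 110.0]
print(fill_missing(week))  # [100.0, 105.0, 105.0, 110.0]
```

Both weekend days receive the same value because each is computed from the same pair of known neighbours. A gap at the very start or end of the series has only one neighbour and is left unfilled in this sketch.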

    After creating the dataset, we built a model and tested its predictions. The system then reports the predicted value together with the sentiment value.
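The text states that the price data was normalized but does not name the method. The sketch below assumes min-max scaling to the range [0, 1], a common choice before training an LSTM. The function names and sample values are hypothetical.

```python
from typing import List, Tuple


def min_max_scale(values: List[float]) -> Tuple[List[float], float, float]:
    """Scale values to [0, 1]; also return the min and max needed to undo it."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant series carries no variation; map it to zeros.
        return [0.0 for _ in values], lo, hi
    return [(v - lo) / span for v in values], lo, hi


def inverse_scale(scaled: List[float], lo: float, hi: float) -> List[float]:
    """Map scaled predictions back to the original price range."""
    return [s * (hi - lo) + lo for s in scaled]


closes = [10.0, 15.0, 20.0]  # made-up closing prices
scaled, lo, hi = min_max_scale(closes)
print(scaled)                         # [0.0, 0.5, 1.0]
print(inverse_scale(scaled, lo, hi))  # [10.0, 15.0, 20.0]
```

Keeping `lo` and `hi` from the training data allows a predicted value to be converted back to rupees before it is reported.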

    Fig.1 shows our system architecture of the model.

    Fig. 1. System Architecture of the model


    This paper aims to build an effective model to predict and analyze stock market values. It presents two methods, LSTM and text mining, for stock forecasting, which will help investors and buyers make more informed decisions.

    1. LSTM

      LSTM stands for Long Short-Term Memory, an advanced version of the recurrent neural network (RNN) model. It is widely used in deep learning and is adept at learning long-term dependencies. It also helps to overcome two technical issues of simple RNNs, i.e. vanishing and exploding gradients. Predictions depend on huge amounts of data, either historical or textual. LSTM helps the RNN retain information from earlier time steps, which keeps the error in check and makes the prediction more accurate. This makes LSTM a more reliable model than a simple RNN for long sequences. An LSTM cell consists of three gates, as shown in Fig. 2, and a memory cell. The output of an LSTM network depends on two things, the hidden state, which is also an element of a simple RNN, and the cell state:

      • Hidden state – Short term memory

      • Cell State – Long term memory

        Fig. 2 shows the memory cell of the LSTM model. The LSTM cell has three parts, known as gates.

        Fig. 2. Memory cell of LSTM

        1. Forget gate: In a cell of the LSTM network, the forget gate decides whether to keep the information from the previous timestamp or to forget it as irrelevant. The equation of the forget gate is as follows:

          $$f_t = \sigma(x_t U_f + h_{t-1} W_f) \tag{2}$$

          where $h_{t-1}$ is the hidden state from the previous cell, $x_t$ is the input at the current timestamp, $U_f$ is the weight matrix associated with the input and $W_f$ is the weight matrix associated with the hidden state.

          The sigmoid function $\sigma$ makes $f_t$ a number between 0 and 1, so its main purpose is to decide whether to keep or discard information. If the output is 0 for a specific value in the cell state, the network forgets that information; if the output is 1, the network remembers it entirely.

        2. Input gate: The input gate adds new information to the cell. This happens in three steps.

      • The first step is very similar to the forget gate. The equation is as follows:

        $$i_t = \sigma(x_t U_i + h_{t-1} W_i) \tag{3}$$

        The sigmoid function keeps the values between 0 and 1.

      • The new information is computed as:

        $$N_t = \tanh(x_t U_c + h_{t-1} W_c) \tag{4}$$

        Here, $N_t$ holds the candidate values that can be added to the cell state. Due to the tanh function, the $N_t$ values are between -1 and +1.

      • In the last step, since $N_t$ cannot be added directly, the cell state is updated as

        $$C_t = f_t \odot C_{t-1} + i_t \odot N_t \tag{5}$$

        where $C_t$ and $C_{t-1}$ are the cell states at the current and previous timestamps respectively, and $\odot$ denotes element-wise multiplication.

        3. Output gate: In the output gate, the cell passes the updated information from the current timestamp to the next timestamp. The equation of the output gate is similar to those of the two previous gates.

        $$o_t = \sigma(x_t U_o + h_{t-1} W_o) \tag{6}$$

        The value of $o_t$ also lies between 0 and 1 because the sigmoid function is applied. The current hidden state is the element-wise product of $o_t$ and $\tanh(C_t)$:

        $$h_t = o_t \odot \tanh(C_t) \tag{7}$$

        Here, the hidden state is a function of the cell state, i.e. the long-term memory ($C_t$), and the current output.
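The gate equations (2) to (7) can be traced in a few lines of plain Python. This is an illustrative sketch of a single LSTM time step, not the implementation used for the results. It follows the row-vector convention of the equations and, like them, omits bias terms. The helper names, the parameter dictionary keys and the sample weights are assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def vec_mat(v: Vector, m: Matrix) -> Vector:
    """Row vector times matrix: result[j] = sum_i v[i] * m[i][j]."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def gate(x: Vector, h: Vector, U: Matrix, W: Matrix,
         activation: Callable[[float], float]) -> Vector:
    """Compute activation(x U + h W) element by element."""
    return [activation(a + b) for a, b in zip(vec_mat(x, U), vec_mat(h, W))]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              p: Dict[str, Matrix]) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns the new hidden state and cell state."""
    f_t = gate(x_t, h_prev, p["Uf"], p["Wf"], sigmoid)    # forget gate, Eq. (2)
    i_t = gate(x_t, h_prev, p["Ui"], p["Wi"], sigmoid)    # input gate, Eq. (3)
    n_t = gate(x_t, h_prev, p["Uc"], p["Wc"], math.tanh)  # new information, Eq. (4)
    c_t = [f * c + i * n                                  # cell state, Eq. (5)
           for f, c, i, n in zip(f_t, c_prev, i_t, n_t)]
    o_t = gate(x_t, h_prev, p["Uo"], p["Wo"], sigmoid)    # output gate, Eq. (6)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]    # hidden state, Eq. (7)
    return h_t, c_t


# One input feature and two hidden units, with small made-up weights.
params = {
    "Uf": [[0.1, 0.2]], "Wf": [[0.0, 0.1], [0.1, 0.0]],
    "Ui": [[0.3, 0.1]], "Wi": [[0.2, 0.0], [0.0, 0.2]],
    "Uc": [[0.5, -0.5]], "Wc": [[0.1, 0.1], [0.1, 0.1]],
    "Uo": [[0.2, 0.2]], "Wo": [[0.0, 0.0], [0.0, 0.0]],
}
h, c = [0.0, 0.0], [0.0, 0.0]
for x in [0.2, 0.4, 0.6]:  # a short, already normalized price sequence
    h, c = lstm_step([x], h, c, params)
print(h, c)
```

In practice a deep learning library would supply the LSTM layer and learn the weight matrices. The sketch only makes the data flow between the gates explicit.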

    2. Text Mining

      Text mining is the process of deriving useful information from large amounts of available text. It is also known as text data mining or text analytics. It works by transforming unstructured text into structured data that can be analysed. Different dimensions can be taken into account when categorising stock market prediction systems:

      1. Input Data: The input data consists of historical tweets and news from Twitter.

      2. Prediction Goal: The pre-processed data is classified at the sentence level on the basis of the polarity of a given text, i.e. whether a tweet is positive or negative.

      3. Prediction horizon: It is the time taken to predict the data, which depends upon the size of a tweet, i.e. short tweets take less time and long tweets take more time to process.

      Sentiment analysis: It is a contextual mining of text which identifies and extracts the subjective information in source material. It identifies the emotion behind a series of words. The texts are sourced from either Twitter or online financial news.

      Determining whether the sentiment of each tweet is positive or negative relies on a variety of techniques. The Natural Language Toolkit (NLTK) is used for the following:

      • Tokenization: In this process each tweet is split on the spaces between words to make a list of individual words (tokens).

      • Removing Stop Words: Words that are repeated often or that do not carry any sentiment value are listed as stop words. These stop words are excluded.

      • Twitter Symbols: Twitter symbols such as !, #, @ and URLs are filtered out entirely as they do not add any value to the text.
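The three cleaning steps above can be sketched as follows. To keep the example self-contained it uses regular expressions and a tiny hand-written stop-word set instead of NLTK, so the stop-word list, the patterns and the function name are assumptions rather than the toolkit's behaviour.

```python
import re
from typing import List

# Tiny illustrative stop-word set (assumption); NLTK provides a much fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def preprocess_tweet(text: str) -> List[str]:
    """Lower-case a tweet, strip Twitter symbols and URLs, and drop stop words."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)         # remove URLs
    text = MENTION_PATTERN.sub(" ", text)     # remove @mentions entirely
    text = NON_LETTER_PATTERN.sub(" ", text)  # remove '#', '!', digits, punctuation
    tokens = text.split()                     # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


tweet = "@trader RELIANCE is soaring!! #stocks https://example.com/news"
print(preprocess_tweet(tweet))  # ['reliance', 'soaring', 'stocks']
```

In this sketch the word following a hashtag is kept and only the `#` symbol is dropped, since hashtag words often carry sentiment. Dropping the whole hashtag would be an equally simple variation.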

    3. Algorithm

    Input: Historical stock price dataset, Twitter sentiment dataset.

    Output: Stock price prediction based on stock price analysis.

    1. Import the necessary libraries.

    2. Read the data: the Tech Mahindra and RELIANCE datasets.

    3. Apply the WordNet lemmatizer (WNL).

    4. Encode the dataset and check the polarity.

    5. Inspect the first n rows with the data head.

    6. Sample 2 lakh (200,000) records.

    7. Filter out symbols, stopwords and URLs.

    8. Print the stopwords.

    9. Confirm that the texts are processed successfully.

    10. Combine the tweets.

    11. Check the sentiment (if the score is > 0.5 then positive, else negative).

    12. Plot the visualization graph.

    13. Plot a pie chart of positive and negative tweets.

    14. Evaluate the accuracy using the RMSE (root mean squared error) function.

    15. Plot the predicted prices of the RELIANCE and TECH MAHINDRA datasets as adjusted close price, closing price and high price.

    16. Hence, the stock price is predicted.

    17. Use the R2 score to determine the accuracy of the prediction.
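Steps 14 and 17 rely on two standard error measures. The sketch below shows how RMSE and the R2 score are defined, using plain Python and made-up prices. The function names are chosen for this example, and a library implementation would normally be used instead.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between real and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)


def r2_score(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


real = [1.0, 2.0, 3.0, 4.0]       # made-up real prices
forecast = [1.0, 2.0, 3.0, 6.0]   # made-up predicted prices
print(rmse(real, forecast))       # 1.0
print(round(r2_score(real, forecast), 4))  # 0.2
```

An R2 score close to 1 means the predictions explain most of the variation in the real prices, while RMSE expresses the typical error in the same unit as the price.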


    The model was built using data from January 1, 2021, to April 25, 2022.

      1. RELIANCE: The prediction plot for the RELIANCE dataset is shown below in Fig. 3. The graph depicts RELIANCE's real and predicted stock prices over the last seven days.

        Fig. 3. Stock Price Prediction of RELIANCE

        TABLE I. Columns: Sr no., Real Price (Rs), Predicted Price (Rs). [Table values missing in the source text.]

        RELIANCE's next predicted value, for April 26, 2022, is Rs. 2578.41. The accuracy of the forecast was calculated using the R2 score, which is 0.9853.

        The sentiment of individuals for RELIANCE is shown in the pie chart (see Fig.4)

        Fig. 4. Sentiment analysis of RELIANCE

        Positive tweets account for 23.9% of the pie chart above, while negative tweets account for 76.1%.

        Fig. 5. Visualisation of RELIANCE

        The model visualises (see Fig. 5) the adjusted close price, close price and high price trends over a 25-month period (January 2021 to January 2023). It also creates a preliminary forecast for the upcoming months.

      2. TECH MAHINDRA: The prediction plot for the TECH MAHINDRA dataset is shown below in Fig. 6. The graph depicts TECH MAHINDRA's real and predicted stock prices over the last seven days.

    Fig. 6. Stock Price Prediction of TECH MAHINDRA



    [Table] Columns: Sr no., Real Price (Rs), Predicted Price (Rs). [Table values missing in the source text.]

    TECH MAHINDRA's next predicted value, for April 26, 2022, is Rs. 1419.25. The accuracy of the forecast was calculated using the R2 score, which is 0.9811.

    The sentiment of individuals for TECH MAHINDRA is shown in the pie chart (see Fig.7)

    Positive tweets account for 54.5% of the pie chart below, while negative tweets account for 45.5%.

    Fig. 7. Sentiment analysis of Tech Mahindra

    Fig. 8. Visualization of Tech Mahindra

    The model visualises (see Fig. 8) the adjusted close price, close price and high price trends over a 25-month period (January 2021 to January 2023). It also creates a preliminary forecast for the upcoming months.

    The MSE and RMSE values are listed below:



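    [Table] MSE and RMSE per company. Only the column label "Sr no." and the row label "Tech Mahindra" survive; the values are missing in the source text.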



    $$\text{Accuracy} = 100 - \text{RMSE} \tag{8}$$

    Using formula (8), the accuracy of each company is calculated in the table below:



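    [Table] Accuracy per company. Only the column label "Sr no." and the row label "Tech Mahindra" survive; the values are missing in the source text.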



Providing investors with a combination of two methods helps them better comprehend stock patterns. The goal of the paper is to create models that can summarise market movements and offer investors a general overview of market patterns. The potential of forecasting stock prices was investigated using the LSTM and text mining methods. LSTM was used since it focuses on the essential variables and, in the studies reviewed, is more accurate than the other methods. The availability of data is crucial in stock price prediction because the sentiment component, and hence the accuracy, depends on it. Because each organization has a different amount of data available, the accuracy varies from company to company.

In the future, we would like to provide more data, both sentimental data as well as historical data, to expand this research. With more time and resources the potential is great. Additionally, in the future, a lexicon of stock terms can be created based on the most frequently used terms to make the predictions more accurate. Other factors, such as ratios and balance sheet data, will be included as well.


[1] I. Parmar et al., Stock Market Prediction Using Machine Learning, 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), 2018, pp. 574-576, doi: 10.1109/ICSCCC.2018.8703332.

[2] Jackson, Christy; Jayachandran, Prassanna; Md, Abdul and Sivakumar, V.. (2021). Stock market analysis and prediction using time series analysis. Materials Today: Proceedings. 10.1016/j.matpr.2020.11.364.

[3] S. Vazirani, A. Sharma and P. Sharma, Analysis of various machine learning algorithm and hybrid model for stock market prediction using python, 2020 International Conference on Smart Technologies in Computing, Electrical and Electronics (ICSTCEE), 2020, pp. 203-207, doi: 10.1109/ICSTCEE49637.2020.9276859.

[4] Ankit Thakkar, Kinjal Chaudhari, Fusion in stock market prediction: A decade survey on the necessity, recent developments, and potential future directions, Information Fusion, Volume 65, 2021, Pages 95-107, ISSN


[5] N. S. Malinović, B. B. Predić and M. Roganović, Multilayer Long Short-Term Memory (LSTM) Neural Networks in Time Series Analysis, 2020 55th International Scientific Conference on Information, Communication and Energy Systems and Technologies (ICEST), 2020, pp. 11-14, doi: 10.1109/ICEST49890.2020.9232710.

[6] Padmanayana, and Varsha, and K, Bhavya. (2021). Stock Market Prediction Using Twitter Sentiment Analysis. International Journal of Scientific Research in Science and Technology. 265-270. 10.32628/CSEIT217475.

[7] S. Mohan, S. Mullapudi, S. Sammeta, P. Vijayvergia and D. C. Anastasiu, Stock Price Prediction Using News Sentiment Analysis, 2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService), 2019, pp. 205-208, doi: 10.1109/BigDataService.2019.00035.

[8] N. Pinto, L. da Silva Figueiredo and A. C. Garcia, Automatic Prediction of Stock Market Behavior Based on Time Series, Text Mining and Sentiment Analysis: A Systematic Review, 2021 IEEE 24th International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2021, pp. 1203-1208, doi: 10.1109/CSCWD49262.2021.9437732.

[9] Nohyoon Seong, Kihwan Nam, Predicting stock movements based on financial news with segmentation, Expert Systems with Applications, Volume 164, 2021, 113988, ISSN 0957-4174.

[10] Gondaliya, Chetan Patel, Ajay Shah, Tirthank. (2021). Sentiment analysis and prediction of Indian stock market amid Covid-19 pandemic. IOP Conference Series: Materials Science and Engineering. 1020. 012023. 10.1088/1757-899X/1020/1/012023.

[11] Polamuri, Subba and Srinivas, Kudipudi and Mohan, A.. (2020). A Survey on Stock Market Prediction Using Machine Learning Techniques. 10.1007/978-981-15-1420-3-101.

[12] A. Maiti and P. Shetty D, Indian Stock Market Prediction using Deep Learning, 2020 IEEE REGION 10 CONFERENCE (TENCON), 2020, pp. 1215-1220, doi: 10.1109/TENCON50793.2020.9293712.

[13] D. Y. N. Le, A. Maag and S. Senthilananthan, Analysing Stock Market Trend Prediction using Machine and Deep Learning Models: A Comprehensive Review, 2020 5th International Conference on Innovative Technologies in Intelligent Systems and Industrial Applications (CITISIA), 2020, pp. 1-10, doi: 10.1109/CITISIA50690.2020.9371852.

[14] A. Mahadik, D. Vaghela and A. Mhaisgawali, Stock Price Prediction using LSTM and ARIMA, 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), 2021, pp. 1594-1601, doi: 10.1109/ICESC51422.2021.9532655.

[15] A. Maalla, C. Y. Zhuang, Q. H. Feng and L. Shen, Research on Stock Market Analysis Based on Deep Learning, 2021 IEEE 4th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), 2021, pp. 1776-1780, doi: 10.1109/IMCEC51613.2021.9482065.

[16] N. Adlakha, Ridhima and A. Katal, Real Time Stock Market Analysis, 2021 International Conference on System, Computation, Automation and Networking (ICSCAN), 2021, pp. 1-5, doi: 10.1109/ICSCAN53069.2021.9526506.
