Tagging Stack-Overflow Questions using Supervised Machine Learning Techniques

DOI : 10.17577/IJERTV11IS050235


Ms. Garima Gupta Asst. Prof., MAIT, Delhi

Disha Sharma Student, MAIT, Delhi

Harshit Aggarwal Student, MAIT, Delhi

Ishan Agarwal Student, MAIT, Delhi

Abstract:- Recent years have seen tremendous growth in technology and the rise of new technologies. This has been accompanied by a growth in questions on sites such as StackOverflow. These questions need to be categorized by tags. The problem of tag recommendation arises when the user provides no tags, or when the tags the user specifies are irrelevant to the question. Our project aims to solve this problem by auto-tagging questions using supervised machine learning techniques.

Keywords: Natural Language Processing, Information Retrieval, Supervised Learning, SVM, Random Forest, SGD Classifier, MultinomialNB, LinearSVC, LogisticRegression

Tagging provides a convenient way to assign identifying tokens to papers, articles, or questions, which facilitates the recommendation, search, and disposition of queries. It provides a way to share and associate relevant keywords or tags. The problem of tag recommendation arises when the user provides no tags, or when the user-specified tags are irrelevant to the document. Such user-centered approaches are not very practical for tag recommendation. In tagging systems, manual tagging becomes error-prone and time-consuming when there are too many papers or questions. This project approaches the auto-tagging of StackOverflow questions using supervised machine learning techniques and returns precise tags related to the question. One of the key parts of the study is filtering and pre-processing the dataset to extract the most relevant tags and keywords. The approach relies on extracting keywords using the TF-IDF vectorizer weighted score [1] and then applying supervised machine learning algorithms to train the model.

We explored different datasets, such as the arXiv library dataset and the 10% StackOverflow Q&A dataset [2]. We chose the StackOverflow dataset because it matches our survey of other research papers and because we can apply supervised machine learning algorithms to it more efficiently. The dataset contains four relevant columns: the title, question, answer, and tags. We consolidate this data and remove irrelevant columns and tags. We then further filter the tags based on the most frequently seen tags. We then vectorize the body and title and consolidate them into a single column. Further, we apply the necessary pre-processing for various algorithms, namely Multinomial Naive Bayes, Logistic Regression [3], Support Vector Machine (SVM) [4], and the Stochastic Gradient Descent (SGD) classifier. Finally, we compare the algorithms' results on metrics such as Hamming score and precision.

Paper 1:

The paper titled "Supervised ML-based approach for auto-tagging of scientific literature" [5] explores supervised machine learning approaches for assigning tags to documents in the scientific domain. The authors approach the problem with Natural Language Processing (NLP), extracting keywords and applying appropriate text classification techniques. They use the arXiv dataset, a library maintained by Cornell University that holds millions of documents. This dataset is based on the ACM Computing Classification System. The main feature extraction techniques used in the paper are word embeddings and text vectorization. The paper reports its best result with TF-IDF vectorization and a Support Vector Classifier as the training algorithm.

Paper 2:

The paper titled "Natural Language Processing for Information Extraction" [6] is about the use of Natural Language Processing (NLP) for information extraction. With the rise of the digital age, there is an explosion of information in the form of news, articles, social media, and so on. Much of this data lies in unstructured form, and manually managing and effectively using it is tedious, boring, and labor-intensive. This explosion of information, and the need for more sophisticated and efficient information handling tools, gives rise to Information Extraction (IE) and Information Retrieval (IR) technology. Information Extraction systems take natural language text as input and produce structured information specified by certain criteria relevant to a particular application. Various sub-tasks of IE, such as Named Entity Recognition, Coreference Resolution, Named Entity Linking, Relation Extraction, and Knowledge Base reasoning, form the building blocks of high-end NLP tasks such as Machine Translation, Question-Answering Systems, Natural Language Understanding, Text Summarization, and digital assistants like Siri, Cortana, and Google Now. The paper introduces Information Extraction technology and its sub-tasks, and highlights state-of-the-art research in various IE subtasks, current challenges, and future research directions.

Paper 3:

In the paper titled "Automatically Labeling Low Quality Content on Wikipedia by Leveraging Patterns in Editing Behaviors" [7], the authors begin by laying out an ontology-based approach to auto-tagging. An ontology is an information model in which terms are represented as a hierarchy. The approach consists of classification and tag selection. A term-weight matrix and cosine similarity are used for the classification process, while the tag selection process depends on the ontology weight. A large training dataset consisting of titles and abstracts is used as input, and TF-IDF is used to build the term-weight matrix. In this work, tags are ranked by similarity as well as by frequency. A hybrid approach is used to extract header information, which can be used to explore the dynamics of a research area among research communities. Data integration and validation draw on header-extraction tools such as GROBID and ParsCit, since it is argued that no single tool can give effective results on all sample research articles.


    • We chose the 10% StackOverflow questions dataset and used the Python pandas library to import it from the CSV file. We then created two files, questions.csv and Tags.csv, from which we obtained the title, body, and tags of each question.

    • The first step of pre-processing involved consolidating the questions and tags from the input dataset into a single data frame and mapping the tags to questions by the question id given in the dataset.

    • The next step involved the grouping of tags and removal of repeated tags.

    • Then we filtered the questions data frame and dropped columns such as ownerUserId, CreationDate, and ClosedDate.

    • We then discarded the rows with a score below 5. The score of a question is its number of upvotes minus its number of downvotes.

    • After filtering on id and score, we dropped these two columns because they are not needed for model training.

    • Then we processed the tags and filtered the unique ones. A frequency distribution computed with NLTK showed more than 14,000 unique tags with more than 220,000 occurrences in total. Using this distribution, we extracted the 100 most frequent tags in the dataset, which turned out to relate primarily to technology stacks.

    • The next task was to filter the content of the body and title columns. HTML tags and other unnecessary elements had to be removed from the text, and we used the bs4 library for this purpose.

    • Then came the essential part of the project, processing the body column. We first cleaned the data by removing redundant spaces between words, and then cleaned the punctuation by removing any special characters present in the data.

    • Then we performed lemmatization on the data. Lemmatization reduces each word to its simplest (base) form using vocabulary and morphological analysis.

    • Then we removed stopwords using the Natural Language Toolkit (NLTK). Stopwords are words that occur frequently in documents but do not contribute to the context of a document, so it is essential to remove them.

    • Then we used a multi-label binarizer (MultiLabelBinarizer) to convert the tags into binary data. This step is also called bucketization.

    • Then we used the TF-IDF vectorizer to score the words in each document and applied the resulting vectorization to the dataset.

    • After processing the dataset, we trained each model on 80% of the dataset and tested it on the remaining 20%. We then compared the results given by the different models based on Hamming loss and precision.
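The filtering and cleaning steps listed above can be illustrated with a minimal sketch. It is not the authors' code: the rows are hypothetical toy data, the standard-library `HTMLParser` stands in for bs4, a tiny hand-written stopword set stands in for NLTK's list, lemmatization is omitted, and the function names (`strip_html`, `clean_text`, `group_tags`, `preprocess`) are invented for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical toy rows standing in for the questions and tags CSV files.
questions = [
    {"Id": 1, "Score": 12, "Title": "Sort a list",
     "Body": "<p>How do I <b>sort</b> a list in Python?</p>"},
    {"Id": 2, "Score": 2, "Title": "Low score",
     "Body": "<p>Ignored question</p>"},
    {"Id": 3, "Score": 7, "Title": "Join tables",
     "Body": "<p>How to join two tables?</p><pre>SELECT 1;</pre>"},
]
tags = [(1, "python"), (1, "sorting"), (1, "python"),
        (2, "java"), (3, "sql"), (3, "obscure-tag")]

# Tiny illustrative stopword set (an assumption; the paper uses NLTK's list).
STOPWORDS = {"a", "do", "how", "i", "in", "the", "to"}


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Remove HTML tags (stand-in for the bs4 step)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def clean_text(text):
    """Strip HTML, lowercase, drop punctuation, extra spaces, and stopwords."""
    text = strip_html(text).lower()
    text = re.sub(r"[^a-z0-9#+\s]", " ", text)  # keeps '#' and '+' for c#, c++
    return " ".join(t for t in text.split() if t not in STOPWORDS)


def group_tags(tag_rows):
    """Group tags by question id and drop repeated tags."""
    grouped = {}
    for qid, tag in tag_rows:
        bucket = grouped.setdefault(qid, [])
        if tag not in bucket:
            bucket.append(tag)
    return grouped


def preprocess(questions, tag_rows, min_score=5, top_n=100):
    """Keep questions with score >= min_score and only the top_n most frequent tags."""
    grouped = group_tags(tag_rows)
    kept = [q for q in questions if q["Score"] >= min_score]
    counts = Counter(tag for q in kept for tag in grouped.get(q["Id"], []))
    top_tags = {tag for tag, _ in counts.most_common(top_n)}
    rows = []
    for q in kept:
        labels = [t for t in grouped.get(q["Id"], []) if t in top_tags]
        if labels:
            rows.append({"text": clean_text(q["Title"] + " " + q["Body"]),
                         "tags": labels})
    return rows


rows = preprocess(questions, tags)
for row in rows:
    print(row["tags"], "|", row["text"])
```

On a real dataset the same steps would more naturally be written with pandas (a merge on the question id, a boolean filter on the score column), with bs4 and NLTK doing the text cleaning as described in the list.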


After training and testing the model with four different algorithms separately, we compared them based on Hamming loss and precision. The result is shown in the following figures:

The best model is the SGDClassifier model because it has the lowest Hamming loss (0.0096) and the highest precision (82.7%). Based on Hamming loss alone, the LinearSVC and SGDClassifier models give the best result, with the lowest score of 0.0096.

Based on precision alone, the SGDClassifier model gives the best result, with the highest precision of 82.6%.
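The comparison described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the corpus is a hypothetical toy set, each classifier is wrapped in `OneVsRestClassifier` to handle multiple tags per question, precision is micro-averaged, and the TF-IDF vectorizer is fitted on the training split only. The paper does not state these choices, and `compare_models` is an invented name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import hamming_loss, precision_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the cleaned title + body text.
templates = [
    ("sort list python lambda key", ["python"]),
    ("java class interface inheritance", ["java"]),
    ("sql join select table", ["sql"]),
    ("python sqlite query select table", ["python", "sql"]),
]
texts = [text for text, _ in templates] * 5
tags = [labels for _, labels in templates] * 5


def compare_models(texts, tags, test_size=0.2, seed=0):
    """Train four classifiers and return {name: (hamming_loss, precision)}."""
    # Convert tag lists into a binary indicator matrix, one column per tag.
    y = MultiLabelBinarizer().fit_transform(tags)

    # 80/20 train/test split, as described in the text.
    x_train_txt, x_test_txt, y_train, y_test = train_test_split(
        texts, y, test_size=test_size, random_state=seed
    )

    # TF-IDF features; fitted on the training split only (an assumption).
    vectorizer = TfidfVectorizer()
    x_train = vectorizer.fit_transform(x_train_txt)
    x_test = vectorizer.transform(x_test_txt)

    models = {
        "MultinomialNB": MultinomialNB(),
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "LinearSVC": LinearSVC(),
        "SGDClassifier": SGDClassifier(random_state=seed),
    }

    results = {}
    for name, base in models.items():
        # One binary classifier per tag (assumed multi-label strategy).
        clf = OneVsRestClassifier(base).fit(x_train, y_train)
        pred = clf.predict(x_test)
        results[name] = (
            hamming_loss(y_test, pred),
            precision_score(y_test, pred, average="micro", zero_division=0),
        )
    return results


results = compare_models(texts, tags)
for name, (loss, precision) in results.items():
    print(f"{name}: hamming_loss={loss:.4f} precision={precision:.3f}")
```

Hamming loss is the fraction of individual tag decisions that are wrong, so lower is better, while precision is the fraction of predicted tags that are correct, so higher is better. The toy corpus is trivially separable, so its scores say nothing about the figures reported above.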


    • Our solution provides an efficient and effective way of tagging StackOverflow questions with relevant tags.

    • The imported dataset is processed to perform tagging, and noisy data is removed.

    • Our project also generates reports showing the influence of different models and their behavior.

    • The maximum precision of the ML model came to around 80%, which suggests that it can be used in real-world software to automate the tagging of questions and help readers reach the most relevant search results.

    • The algorithm can be improved over time as the databases of websites like StackOverflow grow, and it can also be enhanced with newer deep learning techniques.

    • This project can be used to recommend and analyze questions based on a specific tag.


[1] sklearn.feature_extraction.text.TfidfVectorizer. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

[2] StackSample: 10% of Stack Overflow Q&A. [Online]. Available: https://www.kaggle.com/stackoverflow/stacksample

[3] Logistic Regression in Machine Learning – Javatpoint. [Online]. Available: https://www.javatpoint.com/logistic-regression-in-machine-learning

[4] Support Vector Machine (SVM) Algorithm – Javatpoint. [Online]. Available: https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm

[5] M. Zdravković, "Supervised ML-based approach for auto-tagging of scientific literature," in 2021 20th International Symposium INFOTEH-JAHORINA (INFOTEH), 2021, pp. 1-5.

[6] S. Singh, "Natural language processing for information extraction," arXiv preprint arXiv:1807.02383, 2018.

[7] S. Asthana, S. Tobar Thommel, A. L. Halfaker, and N. Banovic, "Automatically labeling low quality content on Wikipedia by leveraging patterns in editing behaviors," Proc. ACM Hum.-Comput. Interact., vol. 5, no. CSCW2, Oct. 2021. [Online]. Available: https://doi.org/10.1145/3479503
