Paper summary: An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation


Introduction

The current era has made it easy for people to share their thoughts and opinions online, and this decade has become a digital book in which everybody's opinions contribute to the sentiment around different topics. This has helped companies gather user reviews and opinions about their products and services. Having an efficient way to predict user sentiment about a product or service is very important, as future sales depend on the sentiments and perceptions of previous buyers.

Method

This paper demonstrates a study of three different techniques for building text classification models. The first two are the binary bag of words model and the TF-IDF model, while the third is the technique this paper proposes for text sentiment classification: term frequency-inverse document frequency (TF-IDF) combined with next word negation.

What is the binary bag of words model?

The binary bag of words model is a representation of text that records the occurrence of words in a binary (0 and 1) format, whereas the regular bag of words (BoW) model deals with counts of the total occurrences of words. In either form, the bag of words model disregards grammatical details and word order.
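As a quick illustration, here is a minimal sketch of a binary bag of words encoding in plain Python (my own toy example, not the paper's implementation):

```python
# Minimal binary bag of words sketch: each document becomes a 0/1 vector over
# the corpus vocabulary, recording only whether each word occurs.
def binary_bow(docs):
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = [[1 if word in set(doc.lower().split()) else 0 for word in vocab]
               for doc in docs]
    return vocab, vectors

vocab, vectors = binary_bow(["the movie was great", "the plot was not great"])
print(vocab)    # ['great', 'movie', 'not', 'plot', 'the', 'was']
print(vectors)  # [[1, 1, 0, 0, 1, 1], [1, 0, 1, 1, 1, 1]]
```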

What is the TF-IDF model?

The TF-IDF model assigns each term in a document a weight based on its term frequency (TF) and inverse document frequency (IDF). Terms with higher weight scores are considered more important. The TF-IDF model works better than the BoW model because it gives relevance to uncommon words instead of treating all words as equal, as BoW does.

For a term t in a document d, drawn from a corpus of N documents, the standard definitions are:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)
IDF(t) = log(N / number of documents containing t)

What is TF-IDF-NWN?

TF-IDF with next word negation (TF-IDF-NWN) is the method proposed in this research work. It applies a negation strategy in which words are negated based on prior knowledge of polar expressions: whenever a negation word is detected, a change is made to the word succeeding it, for example turning "won't mind" into the single token "not_mind".

Many earlier models have used a negation strategy before. However, in those models, whenever a negation word is detected, every word after it is prefixed with not_ until a punctuation mark is reached; "won't mind seeing the pair again." would become "won't not_mind not_seeing not_the not_pair not_again." This approach introduces many unwanted tokens into the corpus and hence does not seem realistic.

Overview of proposed work

For this research work three datasets were used:
  • The IMDB movie review dataset
  • Amazon product review dataset
  • SMS spam collection dataset
Each of these datasets first goes through a preprocessing pipeline, which the paper illustrates with a flow diagram, before any model is built.
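As a rough sketch of the cleaning steps the paper mentions later (punctuation and stop-word removal), something like the following; the stop-word list here is a tiny stand-in, not the paper's:

```python
import string

# Tiny illustrative stop-word list; the paper does not say which list it used.
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "this", "its"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]

print(preprocess("The movie was a very indulging cinematic experience."))
# ['movie', 'very', 'indulging', 'cinematic', 'experience']
```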

This research work tested the performance of three different text representation models.

Starting with the simple binary bag of words model: each document is represented as a fixed-size vector of 0s and 1s, where a word that appears in the document gets a 1 and a word that does not gets a 0. The sentences below are used as examples to aid further understanding.

      D1: the movie was a very indulging cinematic experience.
      D2: standard of this movie is above its contemporaries.
      D3: director brought out the best of the pair.
      D4: moviegoers won’t mind seeing the pair again.

(The paper shows the resulting table of binary bag of words vectors for D1–D4.)
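To reproduce that table, here is a small sketch using scikit-learn's CountVectorizer in binary mode on the four example documents (my own reconstruction, not the paper's code; note that the default tokenizer drops one-letter words such as "a"):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was a very indulging cinematic experience",
    "standard of this movie is above its contemporaries",
    "director brought out the best of the pair",
    "moviegoers won't mind seeing the pair again",
]

# binary=True records only presence/absence of each word, giving 0/1 vectors.
vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())  # one 0/1 row per document
```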

The binary bag of words model represents only the existence of words; it does not take into account the importance of specific words in a document. In the first document, for instance, "indulging" seems much more important for measuring polarity than the other words, yet every word is reduced to the same two values, 0 and 1.


The second model tested is the bag of words model with term frequency-inverse document frequency (TF-IDF) scores. The documents are again represented as vectors, but instead of 0s and 1s the vectors contain a score for each word. The score of any word in any document is given by the equation below:

TF-IDF(t, d) = TF(t, d) × IDF(t)

with TF and IDF defined as earlier.
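As an illustration (mine, not the paper's), the plain TF-IDF scores for the four example documents can be computed with scikit-learn, which uses a smoothed variant of the IDF formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the `docs` list from the binary bag of words sketch above.
tfidf_plain = TfidfVectorizer()
plain_scores = tfidf_plain.fit_transform(docs)

# Each row now holds real-valued TF-IDF weights instead of 0/1 flags.
print(tfidf_plain.get_feature_names_out())
print(plain_scores.toarray().round(2))
```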


The paper also gives the preprocessing algorithm for the TF-IDF-NWN model. The steps that remove punctuation and stop words are omitted from the algorithm and only the negation part is displayed, to ease understanding. The algorithm takes punctuation-free documents as input, loops through the whole corpus, and performs the NWN technique on each document.
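A minimal sketch of that negation step, assuming a small hand-picked set of negation cues (the paper's actual list may differ) and reading the technique as merging each negation word with its successor:

```python
# Illustrative negation cues (post punctuation-stripping, so "won't" -> "wont");
# the paper's actual list may be larger.
NEGATION_WORDS = {"not", "no", "never", "wont", "dont", "didnt", "cant", "isnt"}

def next_word_negation(tokens):
    """Merge each negation word with the word that follows it into a single
    not_-prefixed token, e.g. ['wont', 'mind'] -> ['not_mind']."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATION_WORDS and i + 1 < len(tokens):
            out.append("not_" + tokens[i + 1])
            i += 2  # skip both the negation word and the word it negates
        else:
            out.append(tokens[i])
            i += 1
    return out

print(next_word_negation("moviegoers wont mind seeing the pair again".split()))
# ['moviegoers', 'not_mind', 'seeing', 'the', 'pair', 'again']
```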

After this preprocessing, a TF-IDF model is formed in the same way as before. The example sentences converted into the TF-IDF-NWN representation are shown below.

(The paper shows the TF-IDF-NWN score table for D1–D4.)
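Chaining the steps together, a sketch of how the TF-IDF-NWN vectors could be produced, reusing `docs` and `next_word_negation` from the sketches above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Strip apostrophes so "won't" matches the "wont" negation cue, then negate.
negated_docs = [" ".join(next_word_negation(doc.replace("'", "").split()))
                for doc in docs]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(negated_docs)
print(tfidf.get_feature_names_out())  # now includes the merged token 'not_mind'
```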

The proposed model does a better job of attaching importance to words: the merged token "not_mind" receives a higher score than either "won't" or "mind" does in the regular TF-IDF model.

Three different ML algorithms (Linear SVM, Multinomial Naïve Bayes, and Max Entropy Random Forest) were used to fit the preprocessed text data, with an 80:20 split between training and testing. The results of these experiments are summarized in the conclusion below.
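A sketch of what that evaluation could look like in scikit-learn, with the 80:20 split from the paper; `labels` is a hypothetical array of sentiment labels, and the classifier settings are my own defaults:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# `scores` is the TF-IDF-NWN matrix from above; `labels` is a placeholder
# array of 0/1 sentiment labels, one per document.
X_train, X_test, y_train, y_test = train_test_split(
    scores, labels, test_size=0.2, random_state=0)  # 80:20, as in the paper

for clf in (LinearSVC(), MultinomialNB()):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(type(clf).__name__, f"accuracy: {acc:.4f}")
```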

Conclusion

This research work conducted experiments on three datasets, IMDB movie reviews, Amazon product reviews, and the SMS spam collection dataset, with the IMDB movie reviews as the primary dataset. Performing sentiment analysis on the primary dataset using the binary bag of words model and the TF-IDF model gave accuracies of 86.75% and 89% respectively. Conducting the experiment with the proposed model, i.e. TF-IDF with NWN, brought a good increase in accuracy, to around 89.91%.

The accuracy percentages for the IMDB movie review, Amazon product review, and SMS spam datasets came out at 89.91%, 88.86%, and 96.83% respectively. From these experiments, the researchers concluded that coupling the TF-IDF model with next word negation increases the performance of the sentiment classifier by a significant percentage.

Personal thoughts and closing notes

I find this particular research work very interesting because it does not just propose a new text preprocessing model but also compares it against the previous ones and shows that it outperforms them. The preprocessed texts were fitted with three popular ML algorithms and all performed well, but what happens when we use a different dataset and a different ML algorithm?

The authors state that, in the future, they intend to further improve the model's accuracy by working on the contextual opposite of the word following the negation word.

This research paper was published in 2018 and has been cited by 42 different research works.

Thank you very much for reading. I am open to suggestions on how to improve my paper summary series, and to collaborations. Feel free to ask questions or drop your thoughts in the comments.