Paper summary: An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation


Introduction

The current era has made it easy for people to share their thoughts and opinions online, and this decade has become a digital book in which everybody's opinions contribute to the sentiment around different topics. This has helped companies gather user reviews and opinions about their products and services. Having an efficient way to predict user sentiment about a product or service is very important, as future sales depend on the sentiments and perceptions of previous buyers.

Method

This paper demonstrates a study of three different techniques for building text classification models. The first two are the binary bag of words model and the TF-IDF model, while the third is the technique this paper proposes for text sentiment classification: term frequency-inverse document frequency (TF-IDF) combined with next word negation.

What is the binary bag of words model?

The binary bag of words model is a representation of text that records the occurrence of words in a binary (0 and 1) format, whereas the regular bag of words (BoW) model deals with counts of the total occurrences of words. In either form, the bag of words model disregards grammatical details and word order.
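As a quick illustration, here is a minimal sketch of a binary bag of words encoding in plain Python (my own toy example, not the paper's implementation):

```python
# Minimal binary bag of words sketch: each document becomes a 0/1 vector over
# the corpus vocabulary, recording only whether each word occurs.
def binary_bow(docs):
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = [[1 if word in set(doc.lower().split()) else 0 for word in vocab]
               for doc in docs]
    return vocab, vectors

vocab, vectors = binary_bow(["the movie was great", "the plot was not great"])
print(vocab)    # ['great', 'movie', 'not', 'plot', 'the', 'was']
print(vectors)  # [[1, 1, 0, 0, 1, 1], [1, 0, 1, 1, 1, 1]]
```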

What is the TF-IDF model?

The TF-IDF model assigns each term in a document a weight based on its term frequency (TF) and inverse document frequency (IDF). Terms with higher weight scores are considered more important. The TF-IDF model works better than the BoW model because it gives relevance to uncommon words instead of treating all words as equal, as BoW does.

For a term t in a document d, drawn from a corpus of N documents, the standard definitions are:

TF(t, d) = (number of times t appears in d) / (total number of terms in d)
IDF(t) = log(N / number of documents containing t)

What is TF-IDF-NWN?

TF-IDF with next word negation (TF-IDF-NWN) is the method proposed in this research work. It applies a negation strategy in which words are negated based on prior knowledge of polar expressions: whenever a negation word is detected, a change is made to the word succeeding it, for example turning "won't mind" into the single token "not_mind".

Many earlier models have used a negation strategy before. However, in those models, whenever a negation word is detected, every word after it is prefixed with not_ until a punctuation mark is reached; "won't mind seeing the pair again." would become "won't not_mind not_seeing not_the not_pair not_again." This approach introduces many unwanted tokens into the corpus and hence does not seem realistic.

Overview of proposed work

For this research work three datasets were used:
  • The IMDB movie review dataset
  • Amazon product review dataset
  • SMS spam collection dataset
Each of these datasets first goes through a preprocessing pipeline, which the paper illustrates with a flow diagram, before any model is built.
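As a rough sketch of the cleaning steps the paper mentions later (punctuation and stop-word removal), something like the following; the stop-word list here is a tiny stand-in, not the paper's:

```python
import string

# Tiny illustrative stop-word list; the paper does not say which list it used.
STOP_WORDS = {"a", "an", "the", "is", "was", "of", "this", "its"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in STOP_WORDS]

print(preprocess("The movie was a very indulging cinematic experience."))
# ['movie', 'very', 'indulging', 'cinematic', 'experience']
```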

This research work tested the performance of three different text representation models.

Starting with the simple binary bag of words model: each document is represented as a fixed-size vector of 0s and 1s, where a word that appears in the document gets a 1 and a word that does not gets a 0. The sentences below are used as examples to aid further understanding.

      D1: the movie was a very indulging cinematic experience.
      D2: standard of this movie is above its contemporaries.
      D3: director brought out the best of the pair.
      D4: moviegoers won’t mind seeing the pair again.

(The paper shows the resulting table of binary bag of words vectors for D1–D4.)
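To reproduce that table, here is a small sketch using scikit-learn's CountVectorizer in binary mode on the four example documents (my own reconstruction, not the paper's code; note that the default tokenizer drops one-letter words such as "a"):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the movie was a very indulging cinematic experience",
    "standard of this movie is above its contemporaries",
    "director brought out the best of the pair",
    "moviegoers won't mind seeing the pair again",
]

# binary=True records only presence/absence of each word, giving 0/1 vectors.
vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(matrix.toarray())  # one 0/1 row per document
```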

The binary bag of words model represents only the existence of words; it does not take into account the importance of specific words in a document. In the first document, for instance, "indulging" seems much more important for measuring polarity than the other words, yet every word is reduced to the same two values, 0 and 1.


The second model tested is the bag of words model with term frequency-inverse document frequency (TF-IDF) scores. The documents are again represented as vectors, but instead of 0s and 1s the vectors contain a score for each word. The score of any word in any document is given by the equation below:

TF-IDF(t, d) = TF(t, d) × IDF(t)

with TF and IDF defined as earlier.
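As an illustration (mine, not the paper's), the plain TF-IDF scores for the four example documents can be computed with scikit-learn, which uses a smoothed variant of the IDF formula above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the `docs` list from the binary bag of words sketch above.
tfidf_plain = TfidfVectorizer()
plain_scores = tfidf_plain.fit_transform(docs)

# Each row now holds real-valued TF-IDF weights instead of 0/1 flags.
print(tfidf_plain.get_feature_names_out())
print(plain_scores.toarray().round(2))
```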


The paper also gives the preprocessing algorithm for the TF-IDF-NWN model. The steps that remove punctuation and stop words are omitted from the algorithm and only the negation part is displayed, to ease understanding. The algorithm takes punctuation-free documents as input, loops through the whole corpus, and performs the NWN technique on each document.
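A minimal sketch of that negation step, assuming a small hand-picked set of negation cues (the paper's actual list may differ) and reading the technique as merging each negation word with its successor:

```python
# Illustrative negation cues (post punctuation-stripping, so "won't" -> "wont");
# the paper's actual list may be larger.
NEGATION_WORDS = {"not", "no", "never", "wont", "dont", "didnt", "cant", "isnt"}

def next_word_negation(tokens):
    """Merge each negation word with the word that follows it into a single
    not_-prefixed token, e.g. ['wont', 'mind'] -> ['not_mind']."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] in NEGATION_WORDS and i + 1 < len(tokens):
            out.append("not_" + tokens[i + 1])
            i += 2  # skip both the negation word and the word it negates
        else:
            out.append(tokens[i])
            i += 1
    return out

print(next_word_negation("moviegoers wont mind seeing the pair again".split()))
# ['moviegoers', 'not_mind', 'seeing', 'the', 'pair', 'again']
```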

After this preprocessing, a TF-IDF model is formed in the same way as before. The example sentences converted into the TF-IDF-NWN representation are shown below.

(The paper shows the TF-IDF-NWN score table for D1–D4.)
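Chaining the steps together, a sketch of how the TF-IDF-NWN vectors could be produced, reusing `docs` and `next_word_negation` from the sketches above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Strip apostrophes so "won't" matches the "wont" negation cue, then negate.
negated_docs = [" ".join(next_word_negation(doc.replace("'", "").split()))
                for doc in docs]

tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(negated_docs)
print(tfidf.get_feature_names_out())  # now includes the merged token 'not_mind'
```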

The proposed model does a better job of attaching importance to words: the merged token "not_mind" receives a higher score than either "won't" or "mind" does in the regular TF-IDF model.

Three different ML algorithms (Linear SVM, Multinomial Naïve Bayes, and Max Entropy Random Forest) were used to fit the preprocessed text data, with an 80:20 split between training and testing. The results of these experiments are summarized in the conclusion below.
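A sketch of what that evaluation could look like in scikit-learn, with the 80:20 split from the paper; `labels` is a hypothetical array of sentiment labels, and the classifier settings are my own defaults:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# `scores` is the TF-IDF-NWN matrix from above; `labels` is a placeholder
# array of 0/1 sentiment labels, one per document.
X_train, X_test, y_train, y_test = train_test_split(
    scores, labels, test_size=0.2, random_state=0)  # 80:20, as in the paper

for clf in (LinearSVC(), MultinomialNB()):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(type(clf).__name__, f"accuracy: {acc:.4f}")
```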

Conclusion

This research work conducted experiments on three datasets, IMDB movie reviews, Amazon product reviews, and the SMS spam collection dataset, with the IMDB movie reviews as the primary dataset. Performing sentiment analysis on the primary dataset using the binary bag of words model and the TF-IDF model gave accuracies of 86.75% and 89% respectively. Conducting the experiment with the proposed model, i.e. TF-IDF with NWN, brought a good increase in accuracy, to around 89.91%.

The accuracy percentages for the IMDB movie review, Amazon product review, and SMS spam datasets came out at 89.91%, 88.86%, and 96.83% respectively. From these experiments, the researchers concluded that coupling the TF-IDF model with next word negation increases the performance of the sentiment classifier by a significant percentage.

Personal thoughts and closing notes

I find this particular research work very interesting because it does not just propose a new text preprocessing model but also compares it against the previous ones and shows that it outperforms them. The preprocessed texts were fitted with three popular ML algorithms and all performed well, but what happens when we use a different dataset and a different ML algorithm?

The authors state that, in the future, they intend to further improve the model's accuracy by working on the contextual opposite of the word following the negation word.

This research paper was published in 2018 and has been cited by 42 different research works.

Thank you very much for reading. I am open to suggestions on how to improve my paper summary series, and to collaborations. Feel free to ask questions or drop your thoughts in the comments.