Movie Reviews: Text Sentiment Analysis
Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, NLTK, Gensim
Final
This project builds a binary text classification system that predicts the sentiment (positive or negative) of movie reviews from the IMDb dataset. Using a machine learning pipeline, it compares three text vectorization methods (CountVectorizer, TfidfVectorizer, and Word2Vec), each combined with a logistic regression classifier. Model evaluation and error analysis examine the trade-offs of each approach and identify common misclassification patterns.
Models
CountVectorizer
CountVectorizer transforms text into frequency-based vectors by counting how often each word appears in a review. In this project, it served as the baseline approach, achieving an accuracy of 0.883 and an F1-score of 0.88 with a logistic regression classifier. While computationally efficient, it has no notion of word meaning and counts every word the same way regardless of how informative it is, which makes the model sensitive to frequent but sentiment-neutral words. This produced some misleading features; for example, the model incorrectly treated “disappoint” as a positive term.
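The baseline setup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the four toy reviews and their labels are hypothetical stand-ins for the IMDb data, and both scikit-learn components are used with default settings, which may differ from the settings used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for IMDb data (1 = positive, 0 = negative).
reviews = [
    "an excellent film with a great cast",
    "great story and excellent acting",
    "the worst film I have seen",
    "a boring plot and the worst acting",
]
labels = [1, 1, 0, 0]

# Bag-of-words counts feeding a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, labels)

vectorizer = model.named_steps["countvectorizer"]
classifier = model.named_steps["logisticregression"]

# Map each vocabulary term to its learned weight; a positive weight pushes
# a review toward the positive class, a negative weight toward the negative class.
weights = {
    term: classifier.coef_[0][idx]
    for term, idx in vectorizer.vocabulary_.items()
}
for term in ("excellent", "worst"):
    print(term, round(weights[term], 3))
```

Inspecting the learned weights in this way is one route to spotting terms whose sign looks wrong, such as the “disappoint” case mentioned above.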
TfidfVectorizer
TfidfVectorizer improves on raw word counts by incorporating inverse document frequency (IDF), which down-weights words that appear in many documents and so emphasizes terms that are distinctive to individual reviews. In this project, it improved on the CountVectorizer baseline, raising accuracy from 0.883 to 0.897 and the F1-score from 0.88 to 0.90. The selected features aligned better with sentiment polarity (keywords like “excellent” and “worst” were weighted appropriately), and frequent but uninformative words had less influence. This method was ultimately chosen for the final model.
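To illustrate the IDF effect, the sketch below fits scikit-learn's TfidfVectorizer on a hypothetical four-review corpus and compares the IDF of a word that occurs in every review with one that occurs in only one. It assumes the library defaults (smoothed IDF, L2 normalization); the project's actual settings are not stated here.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "film" appears in every review, "excellent" in one.
reviews = [
    "the film was excellent",
    "the film was dull",
    "the film was the worst",
    "the film was fine",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(reviews)
vocab = vectorizer.vocabulary_

# idf_ holds one IDF value per vocabulary term, indexed via vocabulary_.
idf = {term: vectorizer.idf_[idx] for term, idx in vocab.items()}

# With the default smoothing, idf = ln((1 + n) / (1 + df)) + 1, so a term
# that appears in all n documents gets the minimum value of 1.0.
print("film:", idf["film"])
print("excellent:", idf["excellent"])
```

Within the first review, both words occur once, so the higher IDF alone gives “excellent” a larger TF-IDF weight than “film”.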
Word2Vec
Word2Vec maps words to dense, low-dimensional vectors; this project used pre-trained embeddings. Each review was represented by the average of its word vectors. Although this representation can in principle capture semantic relationships, averaging dilutes the contribution of individual sentiment-bearing words and discards word order, and some words were missing from the embedding vocabulary. As a result, classification performance was lower than with TF-IDF, with an accuracy of 0.828 and an F1-score of 0.83. Word2Vec embeddings are likely better suited to models that capture richer context than to averaging followed by a simple linear classifier.
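The averaging step can be sketched as below. The two-dimensional embedding table is entirely hypothetical and is used so the example runs without downloading a model; real pre-trained vectors have far more dimensions and would typically be loaded through a library such as Gensim. The helper name average_vector is also an illustrative choice, not taken from the project.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for illustration only.
embeddings = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "film": np.array([0.0, 1.0]),
}


def average_vector(tokens, embeddings, dim=2):
    """Mean of the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Opposing sentiment words cancel out in the average (diluted signal).
mixed = average_vector(["good", "film", "bad", "film"], embeddings)

# Out-of-vocabulary tokens are skipped, so they contribute nothing.
oov = average_vector(["good", "unseenword"], embeddings)

print(mixed, oov)
```

In this toy setup the sentiment dimension of the mixed review averages to zero, which mirrors the dilution described above.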
Conclusion
This project builds a sentiment classification model based on the IMDb movie reviews dataset and compares three common text representation methods: CountVectorizer, TfidfVectorizer, and Word2Vec. Among them, TfidfVectorizer performed best in terms of accuracy and stability, as it reduces the influence of frequent but uninformative words and gives more weight to distinctive terms. In contrast, CountVectorizer is more sensitive to common words, while Word2Vec, although capable of capturing semantic meaning, performed less effectively in this case because averaging the word embeddings weakened the sentiment cues in the text.
The error analysis reveals that the model struggles with more complex sentence structures, such as mixed sentiments, sentiment reversals, or words whose meaning depends on context. These cases highlight the limitations of the current approach in capturing contextual and structural nuances in text.
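One reason sentiment reversals are hard for the representations compared here is that they ignore word order. The sketch below is an illustration of that limitation under default CountVectorizer settings, not a reproduction of the project's error analysis; the two example sentences are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews with opposite sentiment but the same words.
pair = [
    "not good, just bad",
    "not bad, just good",
]

vectors = CountVectorizer().fit_transform(pair).toarray()

# Both reviews map to exactly the same count vector, so any classifier
# built on these features must give them the same prediction.
identical = bool((vectors[0] == vectors[1]).all())
print(identical)
```

Adding n-gram features or using a model that reads tokens in sequence are common ways to address this, though neither was evaluated in this project.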