Movie Reviews: Text Sentiment Analysis

Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, NLTK, Gensim

Final Project

This project builds a binary text classification system that predicts the sentiment (positive or negative) of movie reviews from the IMDb dataset. Using a machine learning pipeline, it compares three text vectorization methods (CountVectorizer, TfidfVectorizer, and Word2Vec), each combined with a logistic regression classifier. Model evaluation and error analysis examine the trade-offs of each approach and identify common misclassification patterns.


Models

CountVectorizer

CountVectorizer transforms text into frequency-based vectors by counting how often each word appears in a review. In this project it served as the baseline, achieving an accuracy of 0.883 and an F1-score of 0.88 with a logistic regression classifier. While computationally efficient, it ignores word meaning and context and gives every occurrence the same weight regardless of how informative the word is, which makes the model sensitive to frequent but sentiment-neutral words. As a result, some learned features did not match their true sentiment; for example, “disappoint” was incorrectly identified as a positive term.

TfidfVectorizer

TfidfVectorizer improves on simple word counts by incorporating inverse document frequency (IDF), which down-weights words that appear in many reviews and gives more weight to words that are distinctive to a few. In this project it improved on the baseline, raising accuracy from 0.883 to 0.897 and the F1-score from 0.88 to 0.90. The selected features aligned better with sentiment polarity: keywords such as “excellent” and “worst” were weighted appropriately, and the influence of frequent but uninformative words was reduced. This method was ultimately chosen for the final model.

Word2Vec

Word2Vec maps words to dense, low-dimensional vectors; this project used pre-trained word embeddings. Each review was represented by averaging the vectors of all its words. Although this representation can in theory capture semantic relationships, averaging dilutes the contribution of important words, discards word order, and skips words missing from the embedding vocabulary. As a result, classification performance was lower than with TF-IDF, with an accuracy of 0.828 and an F1-score of 0.83. Word2Vec is better suited to tasks that involve richer contextual modeling than to simple linear classification.
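
The averaging step and its limitations can be illustrated with a short sketch. The three-dimensional `toy_embeddings` table and the `average_vector` helper are hypothetical, invented for this example; in the project the vectors would presumably come from a pre-trained model loaded with Gensim, which is not reproduced here.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real pre-trained vectors
# typically have a few hundred dimensions.
toy_embeddings = {
    "good": np.array([0.9, 0.1, 0.0]),
    "bad": np.array([-0.8, 0.2, 0.1]),
    "not": np.array([0.0, -0.5, 0.3]),
    "film": np.array([0.1, 0.0, 0.9]),
}

def average_vector(tokens, embeddings, dim=3):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No token is in the vocabulary: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Same words in a different order yield the same representation.
a = average_vector("good film not bad".split(), toy_embeddings)
b = average_vector("bad film not good".split(), toy_embeddings)

# A review made only of out-of-vocabulary words carries no signal at all.
oov = average_vector("meh".split(), toy_embeddings)
```

Because `a` and `b` are identical, a linear classifier on top of these features cannot tell "good film not bad" from "bad film not good", which is one way to see the loss of word order described above.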

Conclusion

This project builds a sentiment classification model on the IMDb movie reviews dataset and compares three common text representation methods: CountVectorizer, TfidfVectorizer, and Word2Vec. TfidfVectorizer performed best in terms of accuracy and stability (accuracy 0.897, against 0.883 for CountVectorizer and 0.828 for Word2Vec), because it reduces the influence of frequent but uninformative words and gives more weight to distinctive terms. CountVectorizer is more sensitive to common words. Word2Vec can capture semantic meaning, but it performed worst here because averaging word embeddings ignores context and word order and weakens the sentiment cues in the text.

The error analysis reveals that the model struggles with more complex sentence structures, such as mixed sentiments, sentiment reversals, or words whose meaning depends on context. These cases highlight the limitations of the current approach in capturing contextual and structural nuances in text.
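
One simple way to carry out this kind of error analysis is to collect the misclassified reviews and read them by error type. The sketch below assumes the true labels and predictions are available as plain lists; the reviews, labels, predictions, and the `misclassified` helper are all made up for illustration and are not results from the project.

```python
# Hypothetical reviews, labels, and predictions, invented for this example
# (1 = positive, 0 = negative).
texts = [
    "great cast, but the plot falls apart",  # mixed sentiment
    "I expected to hate it and loved it",    # sentiment reversal
    "a flawless, moving film",
    "dull and far too long",
]
y_true = [0, 1, 1, 0]
y_pred = [1, 0, 1, 0]

def misclassified(texts, y_true, y_pred):
    """Return (text, true_label, predicted_label) for every wrong prediction."""
    return [(t, yt, yp) for t, yt, yp in zip(texts, y_true, y_pred) if yt != yp]

errors = misclassified(texts, y_true, y_pred)

# Split the errors by direction so each group can be read separately.
false_positives = [t for t, yt, yp in errors if yp == 1]
false_negatives = [t for t, yt, yp in errors if yp == 0]
```

Reading the two groups side by side makes patterns such as mixed sentiment or sentiment reversal easier to spot than aggregate metrics alone.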
