Movie Reviews: Text Sentiment Analysis

Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, NLTK, Gensim

Final Project

Notebook link (best viewed in Google Colab)

This project classifies IMDb movie reviews as positive or negative using linear models. It compares three text representations (CountVectorizer, TfidfVectorizer, and Word2Vec embeddings) and evaluates their effect on model performance. It also examines how the regularization penalty (L1 vs. L2) and the regularization parameter C affect classification accuracy and interpretability. Feature importance analysis and inspection of misclassified reviews give further insight into model behavior.

Applied techniques

  • Text preprocessing (HTML removal, non-alphabetic filtering, contraction expansion)

  • Text vectorization (CountVectorizer, TfidfVectorizer, Word2Vec embeddings)

  • Logistic Regression with L1 and L2 regularization

  • Hyperparameter tuning over the inverse regularization strength C (0.1, 1, 10)

  • Feature importance extraction and coefficient visualization

  • Manual error analysis on misclassified reviews for interpretability
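
The preprocessing step listed above (HTML removal, contraction expansion, non-alphabetic filtering) might look something like the sketch below. It uses only the standard library. The function name `clean_review` and the short `CONTRACTIONS` map are illustrative assumptions, since the project's actual contraction list is not shown here.

```python
import html
import re

# Small illustrative contraction map (an assumption, not the project's list).
# Whole-word forms come before the generic suffixes so they are matched first.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "n't": " not",
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}


def clean_review(text):
    """Normalize one raw review into lowercase alphabetic tokens."""
    text = html.unescape(text)              # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)    # strip HTML tags like <br />
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)    # expand contractions
    text = re.sub(r"[^a-z\s]", " ", text)   # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


raw = "I can't believe it!<br />It's GREAT &amp; they've won 2 awards."
cleaned = clean_review(raw)
print(cleaned)
```

Expanding contractions before the non-alphabetic filter matters: if the apostrophes were stripped first, "can't" would split into "can" and "t" and the negation would be lost.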