
Movie Reviews: Text Sentiment Analysis
Tools: Python, Pandas, NumPy, Matplotlib, Seaborn, scikit-learn, NLTK, Gensim
Final Project
Notebook link (best viewed in Google Colab)
This project focuses on classifying IMDb movie reviews as either positive or negative using linear models. It investigates various text representation methods—including CountVectorizer, TfidfVectorizer, and Word2Vec embeddings—and evaluates their effects on model performance. The project further explores the impact of different regularization strategies (L1 vs. L2) and hyperparameter tuning (e.g., different values of C) to optimize classification accuracy and interpretability. Feature importance analysis and error inspection are also conducted to gain deeper insight into model behavior.
Applied techniques
Text preprocessing (HTML removal, non-alphabetic filtering, contraction expansion)
Text vectorization (CountVectorizer, TfidfVectorizer, Word2Vec embeddings)
Logistic Regression with L1 and L2 regularization
Hyperparameter tuning using different C values (0.1, 1, 10)
Feature importance extraction and coefficient visualization
Manual error analysis on misclassified reviews for interpretability
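The preprocessing steps listed above (HTML removal, contraction expansion, non-alphabetic filtering) could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual cleaning code: the function name `clean_review` and the short `_CONTRACTIONS` table are hypothetical, and a real pipeline would likely use a fuller contraction list.

```python
import html
import re

# Hypothetical, deliberately short contraction table (pattern, replacement).
# Specific forms come first so the generic "n't" rule does not mangle them.
_CONTRACTIONS = [
    (r"\bcan't\b", "cannot"),
    (r"\bwon't\b", "will not"),
    (r"n't\b", " not"),
    (r"'re\b", " are"),
    (r"'ve\b", " have"),
    (r"'ll\b", " will"),
    (r"\bit's\b", "it is"),
    (r"\bi'm\b", "i am"),
]


def clean_review(text: str) -> str:
    """Lowercase a review, strip HTML, expand contractions, keep letters only."""
    text = html.unescape(text)            # "&amp;" -> "&", etc.
    text = re.sub(r"<[^>]+>", " ", text)  # drop tags such as <br />
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for pattern, replacement in _CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"[^a-z\s]", " ", text)  # remove non-alphabetic characters
    return " ".join(text.split())          # collapse repeated whitespace


sample = "I didn't like it.<br />It's <b>NOT</b> good &amp; won't rewatch!"
print(clean_review(sample))
# -> i did not like it it is not good will not rewatch
```

Contractions are expanded before the non-alphabetic filter runs; otherwise the apostrophe would be removed first and "didn't" would become "didn t", losing the negation that matters for sentiment.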