
YouTube Trending Video Analysis and Prediction
Tools: Python, pandas, numpy, seaborn, matplotlib, scikit-learn, gensim + pyLDAvis, TF-IDF, KMeans
Final Project
GitHub Link - viewing in Google Colab is recommended.
This project analyzes over 3 million trending YouTube videos from 113 countries to explore the factors behind video virality. It consists of three main parts:
Exploratory Data Analysis (EDA):
Cleaned and explored a large-scale dataset to understand view counts, engagement metrics, and distribution patterns over time and across countries.
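As a rough illustration of this kind of cleaning and aggregation, the sketch below runs on a tiny synthetic table. The column names (video_id, country, snapshot_date, view_count, like_count, comment_count) and the engagement_rate definition are assumptions made for the example, not the project's actual schema or metric.

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset; all column names are hypothetical.
raw = pd.DataFrame({
    "video_id": ["a", "a", "b", "c", "d"],
    "country": ["US", "US", "US", "JP", "JP"],
    "snapshot_date": ["2024-01-01", "2024-01-01", "2024-01-02",
                      "2024-01-01", "2024-01-02"],
    "view_count": [1000, 1000, 5000, None, 200],
    "like_count": [100, 100, 250, 10, 20],
    "comment_count": [10, 10, 50, 1, 0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and rows without a view count, then derive features."""
    out = df.drop_duplicates().dropna(subset=["view_count"]).copy()
    out["snapshot_date"] = pd.to_datetime(out["snapshot_date"])
    # Assumed engagement metric: (likes + comments) per view.
    out["engagement_rate"] = (out["like_count"] + out["comment_count"]) / out["view_count"]
    return out


videos = clean(raw)

# Distribution across countries: median views per country.
by_country = videos.groupby("country")["view_count"].median()

# Distribution over time: total views per snapshot date.
daily_views = videos.groupby("snapshot_date")["view_count"].sum()

print(by_country)
print(daily_views)
```

On real data, rows with a view count of zero would need separate handling before computing a per-view rate.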
Supervised Learning (Trend Prediction):
Built classification models (Random Forest and Logistic Regression) to predict whether a video will become trending based on features such as view count, like count, comment count, and movement in rankings. Achieved up to 93% accuracy with the Random Forest model.
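A minimal sketch of this comparison with scikit-learn is shown below. It trains on synthetic data: the feature distributions and the rule that generates the "trending" label are invented for illustration, so the scores it prints say nothing about the project's reported 93% accuracy.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for view count, like count, comment count, and rank movement.
views = rng.lognormal(mean=10, sigma=1, size=n)
likes = views * rng.uniform(0.01, 0.1, size=n)
comments = likes * rng.uniform(0.05, 0.3, size=n)
rank_change = rng.integers(-20, 21, size=n)

# Log-transform the heavy-tailed counts before modeling.
X = np.column_stack([np.log1p(views), np.log1p(likes), np.log1p(comments), rank_change])

# Hypothetical label rule, used only to give the models something to learn.
log_likes = np.log1p(likes)
y = ((rank_change > 0) & (log_likes > np.median(log_likes))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Logistic regression benefits from scaled inputs, hence the pipeline.
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Accuracy alone can be misleading when trending videos are a minority of the data, so precision and recall are worth checking alongside it.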
Unsupervised Learning (Content Clustering and Topic Modeling):
Used KMeans on TF-IDF vectors of video titles to cluster videos and identify content styles with higher trend potential.
Applied LDA topic modeling on video tags to extract popular content themes like comedy, music, and sports, and analyzed which topics are more likely to trend.
Cross-referenced cluster assignments with prediction outcomes to check that the content groupings are meaningful.