DataProjects-2-youtube

YouTube Trending Video Analysis and Prediction

Tools: Python, pandas, numpy, seaborn, matplotlib, scikit-learn, gensim + pyLDAvis, TF-IDF, KMeans

Final Project

GitHub link (viewing in Google Colab is recommended).

This project analyzes over 3 million trending YouTube videos from 113 countries to explore the factors behind video virality. It consists of three main parts:

Exploratory Data Analysis (EDA):
- Cleaned and explored a large-scale dataset to understand view counts, engagement metrics, and distribution patterns over time and across countries.
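
The kind of aggregation described in the EDA step can be sketched with pandas as follows. This is only an illustration on a tiny made-up table: the column names (`country`, `snapshot_date`, `view_count`, `like_count`) and all values are assumptions, not the project's actual schema or data.

```python
import pandas as pd

# Tiny made-up sample; column names and values are hypothetical.
df = pd.DataFrame({
    "country": ["US", "US", "IN", "IN", "BR", "BR"],
    "snapshot_date": pd.to_datetime([
        "2024-01-01", "2024-01-02",
        "2024-01-01", "2024-01-02",
        "2024-01-01", "2024-01-02",
    ]),
    "view_count": [1000, 3000, 500, 1500, 200, None],
    "like_count": [100, 150, 50, 300, 10, 5],
})

# Cleaning: drop rows with a missing view count.
clean = df.dropna(subset=["view_count"]).copy()

# Engagement metric: likes per view.
clean["like_rate"] = clean["like_count"] / clean["view_count"]

# Distribution across countries and over time.
median_views_by_country = clean.groupby("country")["view_count"].median()
total_views_by_day = clean.groupby("snapshot_date")["view_count"].sum()

print(median_views_by_country)
print(total_views_by_day)
```

On a real dataset of this size, the same `groupby` calls would typically be followed by seaborn or matplotlib plots of the resulting series.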
Supervised Learning (Trend Prediction):
- Built classification models (Random Forest and Logistic Regression) to predict whether a video will become trending based on features such as view count, like count, comment count, and movement in rankings. Achieved up to 93% accuracy with the Random Forest model.
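
A minimal sketch of this two-model comparison with scikit-learn is shown below. It runs on synthetic data, so the feature names, the rule that generates the labels, and the hyperparameters are all assumptions made for illustration; the accuracy it prints says nothing about the 93% figure reported for the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features (hypothetical names, not the project's schema).
rng = np.random.default_rng(0)
n = 2000
view_count = rng.lognormal(10, 1.5, n)
like_count = rng.lognormal(7, 1.5, n)
comment_count = rng.lognormal(5, 1.5, n)
rank_movement = rng.integers(-20, 21, n)

# Hypothetical label rule: videos with a high noisy score count as trending.
score = np.log1p(view_count) + 0.5 * rank_movement / 20 + rng.normal(0, 0.5, n)
y = (score > np.median(score)).astype(int)

# Log-transform the heavy-tailed counts; keep rank movement as-is.
X = np.column_stack([
    np.log1p(view_count),
    np.log1p(like_count),
    np.log1p(comment_count),
    rank_movement,
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    # Scaling helps logistic regression converge; trees do not need it.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {results[name]:.3f}")
```

Holding out a stratified test split keeps the class balance the same in training and evaluation, which makes the two accuracy numbers directly comparable.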
Unsupervised Learning (Content Clustering and Topic Modeling):
- Used KMeans on TF-IDF vectors of video titles to group them into clusters and identify content styles with higher trend potential.
- Applied LDA topic modeling on video tags to extract popular content themes like comedy, music, and sports, and analyzed which topics are more likely to trend.
- Combined clustering results with prediction outcomes to validate meaningful content groupings.