YouTube Trending Video Analysis and Prediction

YouTube Video Trend Analysis and Prediction

Tools: Python, pandas, numpy, seaborn, matplotlib, scikit-learn, gensim + pyLDAvis, TF-IDF, KMeans

Final Project

GitHub link (viewing in Google Colab is recommended).

This project analyzes over 3 million trending YouTube videos from 113 countries to explore the factors behind video virality. It consists of three main parts:

  • Exploratory Data Analysis (EDA):

    • Cleaned and explored a large-scale dataset to understand view counts, engagement metrics, and distribution patterns over time and across countries.

  • Supervised Learning (Trend Prediction):

    • Built classification models (Random Forest and Logistic Regression) to predict whether a video will become trending based on features such as view count, like count, comment count, and movement in rankings. Achieved up to 93% accuracy with the Random Forest model.

  • Unsupervised Learning (Content Clustering and Topic Modeling):

    • Used KMeans on TF-IDF vectors of video titles to cluster them and identify content styles with higher trend potential.

    • Applied LDA topic modeling on video tags to extract popular content themes like comedy, music, and sports, and analyzed which topics are more likely to trend.

    • Combined clustering results with prediction outcomes to validate meaningful content groupings.
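
The trend-prediction step described above can be illustrated with a minimal sketch. It is not the project's actual pipeline: the data is synthetic, the labeling rule is invented for illustration, and the variable names (view_count, rank_movement, is_trending) are assumptions chosen to mirror the features listed in the summary. It only shows how a Random Forest and a Logistic Regression might be trained and compared on the same split with scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-ins for the features named in the summary (NOT the real dataset).
view_count = rng.lognormal(mean=10, sigma=1.5, size=n)
like_count = view_count * rng.uniform(0.01, 0.10, size=n)
comment_count = view_count * rng.uniform(0.001, 0.02, size=n)
rank_movement = rng.integers(-20, 21, size=n)

# Hypothetical labeling rule, invented so the example has a learnable signal.
score = np.log(view_count) + 0.1 * rank_movement + rng.normal(0, 1, size=n)
is_trending = (score > np.median(score)).astype(int)

# Log-transform the heavy-tailed count features before modeling.
X = np.column_stack([
    np.log1p(view_count),
    np.log1p(like_count),
    np.log1p(comment_count),
    rank_movement,
])
X_train, X_test, y_train, y_test = train_test_split(
    X, is_trending, test_size=0.25, random_state=42, stratify=is_trending
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Logistic regression benefits from scaled inputs, hence the pipeline.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Fit both models on the same split and record held-out accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {accuracies[name]:.3f}")
```

The accuracies printed here reflect only the synthetic data and say nothing about the 93% figure reported for the project.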