
YouTube Trending Video Analysis and Prediction
Tools: Python, Random Forest, Logistic Regression, KMeans, LDA Topic Modeling
Final Project
Can we use machine learning to predict which YouTube videos will become trending?
In this project, I explored over 3 million trending video records from 113 countries to understand what makes a video go viral (dataset: Trending Youtube Video Statistics). I combined supervised learning methods (Random Forest and Logistic Regression) with unsupervised techniques (KMeans clustering and LDA topic modeling) to analyze both the engagement metrics and the content patterns behind popular videos.
By analyzing everything from view counts and posting dates to the language in video titles and tags, I tried to understand not just which videos might trend, but also what content styles are more likely to go viral.
Supervised Learning: Predicting Trending Videos
To predict whether a video would become trending, I trained classification models using both video metadata and engagement signals.
I used features like view count, like count, comment count, and changes in ranking. Categorical variables like country, weekday, and channel were encoded appropriately.
To avoid data leakage, the label “is_trend” was created based on weekly top tags, not video titles.
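The labeling step could look roughly like the sketch below. It is only an illustration under stated assumptions: the column names `snapshot_date` and `tags`, the pipe-separated tag format, and the `top_n` cutoff are hypothetical and are not taken from the project itself.

```python
from collections import Counter, defaultdict

import pandas as pd


def add_is_trend(df: pd.DataFrame, top_n: int = 20) -> pd.DataFrame:
    """Mark a record 1 if it carries any of its week's top_n most frequent tags."""
    out = df.copy()
    # Bucket each record into a calendar week (assumed column: snapshot_date).
    out["week"] = pd.to_datetime(out["snapshot_date"]).dt.to_period("W")
    # Split the assumed pipe-separated tag string; drop empties and repeats.
    tag_lists = [
        list(dict.fromkeys(t for t in str(s).lower().split("|") if t))
        for s in out["tags"].fillna("")
    ]
    # Count how many records per week carry each tag.
    weekly_counts = defaultdict(Counter)
    for week, tags in zip(out["week"], tag_lists):
        weekly_counts[week].update(tags)
    top_by_week = {
        week: {tag for tag, _ in counts.most_common(top_n)}
        for week, counts in weekly_counts.items()
    }
    # Titles are never consulted: the label depends on tags only.
    out["is_trend"] = [
        int(any(tag in top_by_week[week] for tag in tags))
        for week, tags in zip(out["week"], tag_lists)
    ]
    return out
```

Calling `add_is_trend(df, top_n=1)` on a frame with those two columns returns a copy with added `week` and `is_trend` columns. The cutoff controls how strict the label is, so it would need tuning against the real data.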
I tested two models:
Random Forest performed better overall, reaching an F1 score of 0.65. It highlighted that view count, likes, and ranking movement are strong indicators of virality.
Logistic Regression gave more interpretable results, showing how different features positively or negatively affect the likelihood of trending.
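A comparison like this can be sketched with scikit-learn. The feature column names, the 75/25 split, and the hyperparameters below are assumptions for illustration, not the project's actual configuration. The channel feature is left out here because one-hot encoding it would produce a very wide matrix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names standing in for the features described above.
NUMERIC = ["view_count", "like_count", "comment_count", "rank_change"]
CATEGORICAL = ["country", "weekday"]


def build_pipeline(model) -> Pipeline:
    """Scale numeric columns, one-hot encode categorical ones, then fit the model."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


def compare_models(df: pd.DataFrame, label: str = "is_trend", seed: int = 0) -> dict:
    """Fit both classifiers on one stratified split and return test-set F1 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[NUMERIC + CATEGORICAL], df[label],
        test_size=0.25, stratify=df[label], random_state=seed,
    )
    models = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        pipe = build_pipeline(model).fit(X_train, y_train)
        scores[name] = f1_score(y_test, pipe.predict(X_test))
    return scores
```

After fitting, the Random Forest exposes `feature_importances_` and the Logistic Regression exposes signed `coef_` values. These are the usual ways to read off feature influence and the direction of each feature's effect.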
Key insight:
Some trending videos with high views and likes were still misclassified. Many of them had low momentum (slow movement in rankings), suggesting that engagement alone is not always enough to signal a trend.
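Ranking momentum can be made concrete as the change in chart position between consecutive snapshots of the same video. The sketch below assumes hypothetical columns `video_id`, `snapshot_date`, and `daily_rank` (where 1 is the top of the chart). The project's actual momentum definition may differ.

```python
import pandas as pd


def add_rank_momentum(df: pd.DataFrame) -> pd.DataFrame:
    """Add rank_change: positions gained since the video's previous snapshot."""
    out = df.sort_values(["video_id", "snapshot_date"]).copy()
    # diff() is current minus previous rank; negate it so that moving up
    # the chart (a smaller rank number) gives a positive value.
    out["rank_change"] = -out.groupby("video_id")["daily_rank"].diff()
    # A video's first snapshot has no previous rank, so treat it as no movement.
    out["rank_change"] = out["rank_change"].fillna(0.0)
    return out
```

Under this definition, a video with high views and likes whose `rank_change` stays near zero is the low-momentum case described above.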
Unsupervised Learning: Discovering Content Patterns
To understand the kinds of content that tend to perform well, I applied unsupervised learning methods to group similar videos.
Using KMeans clustering on video titles (TF-IDF vectors), I discovered clusters that share common themes.
One cluster, for example, focused on product demos and gadgets, and had a very high trend rate.
I also applied LDA topic modeling on video tags to extract common themes.
One topic that stood out contained tags like funny, prank, short, family, and was strongly linked to trending videos.
Key insight:
Videos that are funny, light, family-friendly, or product-based tend to have a much higher chance of becoming trending. These content types seem to match platform and user preferences across countries.
Findings
Through this project, I found that different types of machine learning methods can reveal different aspects of how trending videos emerge.
Supervised learning showed that while metrics like view count and like count are useful, videos that actually become trending often show a clear upward movement in ranking. In other words, both the rate of engagement growth and the raw interaction numbers matter for a video to become trending.
Unsupervised learning helped me analyze the structure of trending content. By clustering video titles and modeling tag topics, I discovered that videos that are short, entertaining, and emotionally engaging appear more often in trending groups, and this pattern is consistent across countries and content types.
This also helps explain why short-form videos have become more popular today: they connect with emotions more easily, fit better with recommendation systems, and are more likely to spread quickly.