
Music Popularity: Decision Tree Model
Tools: Python, Pandas, NumPy, scikit-learn, Matplotlib, Seaborn, Graphviz, NLTK
Final Project
This project investigates how well a song’s popularity on Spotify can be predicted from its audio and metadata features. Using a dataset of over 114,000 tracks across 125 genres, it applies supervised machine learning models to classify songs as "popular" or "not popular" based on characteristics such as tempo, energy, danceability, and acousticness. The project covers both predictive modeling and the interpretation of the features that contribute most to a song’s popularity, and serves as a practical implementation of tree-based classification techniques.
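The labeling step can be sketched roughly as follows. This is only an illustration on a toy table: the column names (`track_genre`, `popularity`), the cutoff of 50, and the rule for the genre flag are assumptions, since the write-up does not state the actual threshold or how the genre groups were formed.

```python
import pandas as pd

# Toy stand-in for the Spotify tracks table; all values are made up.
tracks = pd.DataFrame({
    "track_genre": ["pop", "pop", "jazz", "jazz", "metal", "metal"],
    "popularity": [80, 65, 30, 45, 20, 55],
})

# Assumed cutoff: the project's real threshold is not given in the text.
POPULARITY_THRESHOLD = 50

# Binary target: 1 = "popular", 0 = "not popular".
tracks["is_popular"] = (tracks["popularity"] >= POPULARITY_THRESHOLD).astype(int)

# Hypothetical genre flag: 1 if the genre's mean popularity is above the
# overall mean. In a real pipeline this should be computed on the training
# split only, otherwise the feature leaks information about the target.
genre_mean = tracks.groupby("track_genre")["popularity"].transform("mean")
tracks["high_popularity_genre"] = (genre_mean > tracks["popularity"].mean()).astype(int)

print(tracks)
```

With a fixed cutoff like this, the class balance depends entirely on the chosen threshold, which is worth keeping in mind when reading accuracy figures.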
Models
Decision Tree Classifier
A decision tree classifier was used to predict whether a song is popular from its audio and metadata features. After data preprocessing and feature selection, the model reached 72.7% accuracy on the test set. Adjusting hyperparameters such as the maximum tree depth or the splitting criterion did not significantly improve performance. Feature importance analysis indicated that genre popularity (whether the genre belongs to a group with high average popularity), acousticness (the likelihood that the track is acoustic rather than electronic), valence (how positive the music sounds), duration, and danceability (suitability for dancing) had relatively high influence in the model.
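A minimal scikit-learn sketch of this kind of workflow is shown below. It runs on synthetic data from `make_classification`, not the Spotify dataset, so the accuracy and importances it prints say nothing about the project's results. The feature names are borrowed from the text purely as labels, and `max_depth=10` and the 80/20 split are assumed values.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labels only: the synthetic columns do not carry these meanings.
feature_names = ["genre_popularity", "acousticness", "valence",
                 "duration_ms", "danceability"]

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# criterion and max_depth are the hyperparameters the text mentions tuning;
# the values here are assumptions.
tree = DecisionTreeClassifier(criterion="gini", max_depth=10, random_state=0)
tree.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"test accuracy: {test_accuracy:.3f}")

# Impurity-based importances, sorted from most to least influential.
importances = sorted(zip(feature_names, tree.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)
for name, score in importances:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances are normalized to sum to 1 and describe how the fitted tree used each feature, not a causal effect on popularity.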
Random Forest Classifier
To improve model stability and performance, a random forest classifier was introduced. After hyperparameter tuning (e.g., the entropy criterion, 200 trees, and a maximum depth of 30), the model reached 78.4% accuracy on the test set. Compared with a single decision tree, a random forest generally gives more robust predictions and spreads feature importance more evenly across features. The feature importance analysis indicated that valence, acousticness, track duration, tempo, loudness, and danceability had relatively high influence in the model.
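The configuration mentioned above (entropy criterion, 200 trees, maximum depth 30) could be set up along these lines in scikit-learn. The data is again synthetic and the split and random seeds are assumptions, so the printed numbers are not the project's results; a single tree with the same criterion and depth is fitted only for a side-by-side look.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Spotify tracks.
X, y = make_classification(n_samples=1500, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hyperparameters taken from the text: entropy, 200 trees, max depth 30.
forest = RandomForestClassifier(n_estimators=200, criterion="entropy",
                                max_depth=30, random_state=0)
forest.fit(X_train, y_train)
forest_accuracy = forest.score(X_test, y_test)

# Single tree with the same criterion and depth, for comparison only.
single_tree = DecisionTreeClassifier(criterion="entropy", max_depth=30,
                                     random_state=0)
single_tree.fit(X_train, y_train)
tree_accuracy = single_tree.score(X_test, y_test)

print(f"random forest accuracy: {forest_accuracy:.3f}")
print(f"single tree accuracy:   {tree_accuracy:.3f}")

# Forest importances are averaged over all trees.
forest_importances = forest.feature_importances_
print(forest_importances.round(3))
```

Because each tree sees a bootstrap sample and a random subset of features at each split, averaging over the ensemble tends to reduce variance relative to one deep tree.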
Decision Tree Regressor
For the regression task, a decision tree regressor was used to predict each track’s actual popularity score (0–100). The model achieved an R² of approximately 0.10 on the test set, meaning it explained only roughly 10% of the variance in popularity scores. This suggests that while classifying popularity is achievable to some extent, predicting precise scores is considerably harder, possibly because user-behavior or contextual data are missing.
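To make the R² figure concrete, the sketch below fits a decision tree regressor on synthetic data and computes R² both by hand, as 1 - SS_res / SS_tot, and with `r2_score`. The data generator (a mostly noisy target with a weak dependence on one feature) and `max_depth=4` are assumptions chosen for illustration; they are not meant to reproduce the project's score.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic features and a noisy 0-100 target that depends weakly on column 0.
X = rng.normal(size=(2000, 5))
y = np.clip(50 + 5 * X[:, 0] + rng.normal(scale=15, size=2000), 0, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

regressor = DecisionTreeRegressor(max_depth=4, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# R^2 = 1 - SS_res / SS_tot: the improvement over always predicting the mean.
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = r2_score(y_test, y_pred)
print(f"manual R^2:  {r2_manual:.3f}")
print(f"sklearn R^2: {r2_sklearn:.3f}")
```

An R² near 0 means the model does little better than predicting the mean score for every track, and a negative value means it does worse.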