Spotify’s API lets you query any song and retrieve its relative popularity, a score from 0 to 100. It also exposes song attributes such as length, release date, danceability, time signature, and whether the lyrics are explicit.
Coupling this API with a homemade random song querier, we constructed a dataset of over 3000 songs to answer our research question: can we predict a song’s popularity on Spotify from the song’s characteristics?
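As a rough illustration of the kind of lookup involved, the sketch below flattens one track into a single row. It is not the code from our notebook. It assumes you already hold a valid OAuth access token (the `token` argument), and that the Web API's `tracks/{id}` and `audio-features/{id}` endpoints are available to your app. The helper names `fetch_json`, `song_row`, and `fetch_song` are made up for this example.

```python
import json
import urllib.request

API_BASE = "https://api.spotify.com/v1"


def fetch_json(path, token):
    """GET an API path with a bearer token and decode the JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}/{path}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def song_row(track, features):
    """Flatten a track object and its audio-features object into one row."""
    return {
        "id": track["id"],
        "name": track["name"],
        "popularity": track["popularity"],  # score from 0 to 100
        "length_ms": track["duration_ms"],
        "explicit": track["explicit"],
        "release_date": track["album"]["release_date"],
        "danceability": features["danceability"],
        "time_signature": features["time_signature"],
    }


def fetch_song(track_id, token):
    """Fetch both objects for one track ID and merge them (needs network)."""
    track = fetch_json(f"tracks/{track_id}", token)
    features = fetch_json(f"audio-features/{track_id}", token)
    return song_row(track, features)
```

Keeping `song_row` separate from the network calls means the flattening step can be checked against saved responses without hitting the API.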
The dataset we constructed contains many variables describing particular attributes of a song, such as danceability. As seen below, there appears to be a positive relationship between danceability and popularity.
To build the best model for this data, the pre-processing steps in the recipe include removing predictors with near-zero variance, encoding a date variable, one-hot encoding categorical variables, adding interactions, and normalizing all predictors.
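To make these steps concrete, here is a simplified standard-library Python sketch of the same ideas. It is an illustration, not our actual recipe. It assumes the date is encoded as a release year, that time signature is the categorical variable, and that the one interaction is danceability by year. It also uses a plain constant-column filter as a stand-in for a real near-zero-variance rule. All function and column names are hypothetical.

```python
from statistics import mean, pstdev


def date_to_year(text):
    """Encode a release date such as '2019-05-17' or '2019' as its year."""
    return int(text[:4])


def one_hot(values):
    """Return one 0/1 column per distinct category, keyed by category."""
    return {level: [1.0 if v == level else 0.0 for v in values]
            for level in sorted(set(values))}


def interaction(a, b):
    """Element-wise product of two numeric columns."""
    return [x * y for x, y in zip(a, b)]


def drop_constant(columns, tolerance=1e-12):
    """Drop columns whose standard deviation is effectively zero
    (a simplified stand-in for a near-zero-variance filter)."""
    return {name: col for name, col in columns.items()
            if pstdev(col) > tolerance}


def normalize(columns):
    """Center each column to mean 0 and scale by its population
    standard deviation."""
    out = {}
    for name, col in columns.items():
        mu, sigma = mean(col), pstdev(col)
        out[name] = [(x - mu) / sigma for x in col]
    return out


def preprocess(rows):
    """Apply the steps in order: encode, expand, filter, then normalize."""
    columns = {
        "danceability": [r["danceability"] for r in rows],
        "release_year": [float(date_to_year(r["release_date"])) for r in rows],
    }
    signatures = [r["time_signature"] for r in rows]
    for level, col in one_hot(signatures).items():
        columns[f"time_signature_{level}"] = col
    columns["danceability_x_year"] = interaction(
        columns["danceability"], columns["release_year"])
    return normalize(drop_constant(columns))
```

Filtering constant columns before normalizing matters here, because scaling a column with zero standard deviation would divide by zero.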
We fit and tuned random forest, boosted tree, MARS, and SVM models, and also created an ensemble model. The ensemble model achieved the lowest RMSE and was therefore selected.
The ensemble model performed fairly well on the testing data, achieving an RMSE of about 18. Since popularity ranges from 0 to 100, a root mean squared error of about 18 indicates that our model is somewhat useful at predicting a song’s popularity.
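For readers unfamiliar with the metric, the sketch below shows how RMSE is computed, along with one possible reference point: the RMSE obtained by always predicting the mean popularity. The `baseline_rmse` helper is our own illustrative addition and was not part of the original analysis. For simplicity it uses the mean of the values it is given rather than a training-set mean.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def baseline_rmse(actual):
    """RMSE of a model that always predicts the mean of `actual`."""
    average = sum(actual) / len(actual)
    return rmse(actual, [average] * len(actual))
```

Comparing a model's RMSE against a constant-prediction baseline like this gives a sense of how much of the spread in popularity the predictors actually explain.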
Our model could likely be improved with more training data and more variables. In particular, we think that variables describing the artist’s attributes might improve prediction.
In the future, we would like to extend this project to investigate whether predictors of a song’s popularity vary by country or region.
Our data source is the Spotify Web API. Documentation and information on the API can be found at https://developer.spotify.com/documentation/web-api/guides/
The Python notebook that we used to scrape our data can be found in this repository as data_scrape.ipynb.